FASTA parameter description



5.  Options

     Command line options are available to change the scoring
parameters and output display. Command line options must precede
other program arguments, such as the query and library file
names.

5.1.  Command line options

-a   (fasta3, ssearch3 only) show both sequences in their
     entirety.

-A   force Smith-Waterman alignments for fasta3 DNA sequences.
     By default, only fasta3 protein sequence comparisons use
     Smith-Waterman alignments.

-B   Show normalized score as a z-score, rather than a bit-score
     in the list of best scores.

-b # Number of sequence scores to be shown on output.  In the
     absence of this option, fasta (and tfasta and ssearch)
     display all library sequences obtaining similarity scores
     with expectations less than 10.0 if optimized scores are
     used, or 2.0 if they are not. The -b option can limit the
     display further, but it will not cause additional sequences
     to be displayed.

-c # Threshold score for optimization (OPTCUT).  Set "-c 1" to
     optimize every sequence in a database.

-E # Limit the number of scores and alignments shown based on the
     expected number of scores.  Used to override the expectation
     value of 10.0 used by default.  When used with -Q, -E 2.0
     will show all library sequences with scores with an
     expectation value <= 2.0.

-d # Maximum number of alignments to be displayed.  Ignored if
     "-Q" is not used.

-F # Limit the number of scores and alignments shown based on the
     expected number of scores. "-E #" sets the highest E()-value
     shown; "-F #" sets the lowest E()-value. Thus, "-F 0.0001"
     will not show any matches or alignments with E() < 0.0001.
     This allows one to skip over close relationships in searches
     for more distant relationships.
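
The way "-E" and "-F" bracket the reported scores can be sketched
in a few lines of Python.  This is only an illustration of the
rule described above, not the program's own code; the function
name, the (name, E()) tuples, and the example values are all
hypothetical.

```python
def select_hits(hits, e_max=10.0, e_min=None):
    """Keep hits with e_min <= E() <= e_max, best (lowest) E() first.

    e_max plays the role of "-E #", e_min the role of "-F #".
    """
    kept = [h for h in hits
            if h[1] <= e_max and (e_min is None or h[1] >= e_min)]
    return sorted(kept, key=lambda h: h[1])


# Hypothetical (name, E()) pairs, for illustration only.
hits = [("seqA", 1e-30), ("seqB", 0.003), ("seqC", 1.5), ("seqD", 25.0)]

# Equivalent in spirit to "-E 2.0 -F 0.0001": seqA is too close a
# relative to be shown, seqD is not significant enough.
window = select_hits(hits, e_max=2.0, e_min=0.0001)
print(window)
```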

-f   Penalty for the first residue in a gap (-12 by default for
     proteins, -16 for DNA, -15 for FAST[XY]/TFAST[XY]).

-g   Penalty for additional residues in a gap (-2 by default for
     proteins, -4 for DNA, -3 for FAST[XY]/TFAST[XY]).
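
Taken together, "-f" and "-g" describe an affine gap penalty: the
first residue in a gap costs the "-f" value and each additional
residue costs the "-g" value.  A minimal Python sketch of that
arithmetic, using the protein defaults quoted above (the function
name is hypothetical, and returning 0 for an empty gap is an
assumption made for convenience):

```python
def gap_penalty(length, first=-12, additional=-2):
    """Total penalty for a gap of `length` residues.

    first      - penalty for the first residue in the gap ("-f")
    additional - penalty for each further residue ("-g")
    """
    if length <= 0:
        return 0
    return first + (length - 1) * additional


print(gap_penalty(1))                            # -12
print(gap_penalty(3))                            # -16
print(gap_penalty(3, first=-16, additional=-4))  # -24 (DNA defaults)
```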

-h   Penalty for frameshift (fastx3/y3, tfastx3/y3 only).

-H   Omit histogram.

-i   Invert (reverse complement) the query sequence if it is DNA.
     For tfasta3/x3/y3, search the reverse complement of the
     library sequence only.
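
For readers unfamiliar with the term, the reverse complement that
"-i" refers to can be sketched as follows.  This is an
illustration only: it handles just A, C, G and T (in either case)
and leaves every other character, such as N, unchanged, which is
a simplification rather than a description of what the programs
do with ambiguity codes.

```python
# Translation table for the four unambiguous nucleotides.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the whole sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("AACGT"))  # ACGTT
```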

-j # Penalty for frameshift within a codon (fasty3/tfasty3 only).

-l file
     Location of library menu file (FASTLIBS).

-L   Display more information about the library sequence in the
     alignment.

-m # Specify alignment type: 0, 1, 2, 3, 4, 5, 6, 9, 10

           -m 0          -m 1          -m 2          -m 3        -m 4

         MWRTCGPPYT   MWRTCGPPYT    MWRTCGPPYT                 MWRTCGPPYT
         ::..:: :::     xx  X       ..KS..Y...    MWKSCGYPYT   ----------
         MWKSCGYPYT   MWKSCGYPYT


In addition, -m 10 is a new, parseable format for use with other
programs.  See the file "readme.v20u4" for a more complete
description.

-m 5 provides a combination of -m 4 and -m 0. -m 6 provides -m 5
plus HTML formatting. -m 9 provides percent identity and coordinates
with the initial list of high scores, as well as conventional -m 0
alignments.

-M low-high
     Include library sequences (proteins only) with lengths
     between low and high.

-n   Force the query sequence to be treated as a DNA sequence.
     This is particularly useful for query sequences that contain
     a large number of ambiguous residues, e.g. transcription
     factor binding sites.

-O filename
     Send a copy of results to "filename".  Helpful for
     environments without STDOUT (mostly for the Macintosh).

-o   Turn off default optimization of all scores greater than
     OPTCUT. Sort results by "initn" scores (reduces the accuracy
     of statistical estimates).

-p   Force query to be treated as protein sequence.

-Q,-q
     Quiet - does not prompt for any input.  Writes scores and
     alignments to the terminal or standard output file.

-r   Specify match/mismatch scores for DNA comparisons.  The
     default is "+5/-4". "+3/-2" can perform better in some
     cases.
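
To make the effect of the "-r" values concrete, the sketch below
scores two equal-length DNA strings position by position with a
given match/mismatch pair.  It ignores gaps entirely and is not
how the programs search a library; the function name and the
example sequences are hypothetical.

```python
def ungapped_score(a, b, match=5, mismatch=-4):
    """Sum match/mismatch scores over two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


# Seven identical positions and one mismatch.
print(ungapped_score("ACGTACGT", "ACGTTCGT"))                        # 31
print(ungapped_score("ACGTACGT", "ACGTTCGT", match=3, mismatch=-2))  # 19
```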

-R file
     Save a results summary line for every sequence in the
     sequence library.  The summary line includes the sequence
     identifier, superfamily number (if available) position in
     the library, and the similarity scores calculated.  This
     option can be used to evaluate the sensitivity and
     selectivity of different search strategies (Pearson, 1995;
     Pearson, 1998).

-s file
     Specify the scoring matrix file.  fasta3 uses the same
     scoring matrices as Blast1.4/2.0.  Several scoring matrix
     files are included in the standard distribution.  For
     protein sequences: codaa.mat - based on minimum mutation
     matrix; idnaa.mat - identity matrix; pam250.mat - the PAM250
     matrix developed by Dayhoff et al. (Dayhoff et al., 1978);
     pam120.mat - a PAM120 matrix.  The default scoring matrix is
     BLOSUM50 ("-s BL50"). Other matrices available from within
     the program are: PAM250/"-s P250", PAM120/"-s P120",
     PAM40/"-s P40", PAM20/"-s P20", MDM10 - MDM40/"-s M10 - M40"
     (MDM are modern PAM matrices from Jones et al., 1992), and
     BLOSUM50, 62, and 80/"-s BL50", "-s BL62", "-s BL80".

-S   Treat lower-case characters in the query or library
     sequences as "low-complexity" ("seg"-ed) residues.
     Traditionally, the "seg" program (Wootton and
     Federhen, 1993) is used to remove low complexity regions in
     protein sequences by replacing the residues with an "X".  When
     the "-S" option is used, the FASTA33 programs provide a
     potentially more informative approach.  With "-S", lower
     case characters in the query or database sequences are
     treated as "X"'s during the initial scan, but are treated as
     normal residues during the final alignment display.  Since
     statistical significance is calculated from the similarity
     score calculated during the library search, when the lower
     case residues are "X"'s, low complexity regions will not
     produce statistically significant matches.  However, if a
     significant alignment contains low complexity regions, their
     alignment is shown.  With "-S", lower case characters may be
     included in the alignment to indicate low complexity
     regions, and the final alignment score may be higher than
     the score obtained during the search.

     The pseg program can be used to produce databases (or query
     sequences) with lower case residues indicating low
     complexity regions using the command:

         pseg database.fasta -z 1 -q  > database.lc_seg

     (seg can also be used with some post processing, see
     readme.v33tx.)
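
The two views of a lower-case-masked sequence described under
"-S" can be sketched as below: one copy with lower-case residues
replaced by "X" for the initial scan, and the original residues
kept for the alignment display.  This is a simplified
illustration under that reading of the text, not the programs'
internal representation; the function name and the example
sequence are hypothetical.

```python
def mask_for_scan(seq, mask_char="X"):
    """Replace lower-case (low-complexity) residues with mask_char."""
    return "".join(mask_char if c.islower() else c for c in seq)


query = "MWRTcgppytKS"          # lower case marks a low-complexity run
print(mask_for_scan(query))     # MWRTXXXXXXKS  (used while scanning)
print(query)                    # MWRTcgppytKS  (used for display)
```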

-w # Line length (width) = number (<200)

-x # Specify the penalty for a match to an 'X', independently of the
     PAM matrix.  Particularly useful for fastx3/fasty3, where
     termination codons are encoded as 'X'.

-X   Specifies offsets for the beginning of the query and library
     sequence.  For example, if you are comparing upstream
     regions for two genes, and the first sequence contains 500
     nt of upstream sequence while the second contains 300 nt of
     upstream sequence, you might try:

         fasta -X "-500 -300" seq1.nt seq2.nt

     If the -X option is not used, FASTA assumes numbering starts with
     1.  (You should double check to be certain the negative numbering
     works properly.)

-y   Set the width of the band used for calculating "optimized"
     scores.  For proteins and ktup=2, the width is 16.  For
     proteins with ktup=1, the width is 32 by default.  For DNA
     the width is 16.

-z -1,0,1,2,3,4,5
     -z -1 turns off statistical calculations. -z 0 estimates the
     significance of the match from the mean and standard
     deviation of the library scores, without correcting for
     library sequence length.  -z 1 (the default) uses a weighted
     regression of average score vs library sequence length; -z 2
     uses maximum likelihood estimates of Lambda and K; -z 3 uses
     Altschul-Gish parameters (Altschul and Gish, 1996); -z 4 and
     -z 5 use two variations on the -z 1 strategy. -z 1 and -z 2
     are the best methods, in general.
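
The idea behind "-z 0" (significance from the mean and standard
deviation of the library scores, with no length correction) can
be sketched as a plain z-score calculation.  This is an
illustration of the general technique only: the use of the
population standard deviation, the function name, and the example
scores are assumptions, and the programs' actual normalization
may differ.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Return (score - mean) / sd for every library score."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population sd; an assumption of this sketch
    return [(s - mu) / sigma for s in scores]


# Hypothetical similarity scores; the last one is an obvious outlier.
library_scores = [38, 42, 45, 47, 51, 53, 120]
zs = z_scores(library_scores)
best = max(range(len(zs)), key=lambda i: zs[i])
print(best, round(zs[best], 2))
```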

-z 11,12,14,15
     estimate the statistical parameters from shuffled copies of
     each library sequence.  This doubles the time required for a
     search, but allows accurate statistics to be estimated for
     libraries comprised of a single protein family.

-Z db_size 
     set the apparent size of the database to be used when calculating
     expectation E() values.  If you searched a database with 1,000
     sequences, but would like to have the E()-values calculated in
     the context of a 100,000 sequence database, use '-Z 100000'.
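
As a rough guide to what "-Z" changes, the sketch below rescales
an E()-value under the assumption that E() is directly
proportional to the number of library sequences.  That
proportionality is an assumption of this example, and the
function name is hypothetical; the programs compute E() from
their own statistical estimates.

```python
def rescale_evalue(e_value, searched_size, apparent_size):
    """Scale an E()-value from one database size to another,
    assuming E() grows linearly with the number of sequences."""
    return e_value * apparent_size / searched_size


# A match with E() = 0.001 in a 1,000 sequence search, viewed in the
# context of a 100,000 sequence database (as with "-Z 100000"), is
# about 100 times less significant.
scaled = rescale_evalue(0.001, 1000, 100000)
print(scaled)
```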

-1   sort output by init1 score (for compatibility with FASTP -
     do not use).

-3   translate only three forward frames

For example:

    fasta -w 80 -a seq1.aa seq2.aa

would compare the sequence in seq1.aa to that in seq2.aa and
display the results with 80 residues on an output line, showing
all of the residues in both sequences.  Be sure to enter the
options before entering the file names, or just enter the options
on the command line, and the program will prompt for the file
names.
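
When the programs are driven from a script, the "options first,
file names last" rule is easy to enforce by building the argument
list in one place.  The helper below is a hypothetical sketch
that only assembles the argument list and does not run anything;
it uses no options beyond those described above.

```python
def build_command(program, options, query, library):
    """Return an argument list with every option before the file names.

    options is a list of (flag, value) pairs; value is None for
    flags that take no argument.
    """
    argv = [program]
    for flag, value in options:
        argv.append(flag)
        if value is not None:
            argv.append(str(value))
    argv += [query, library]
    return argv


cmd = build_command("fasta", [("-w", 80), ("-a", None)],
                    "seq1.aa", "seq2.aa")
print(" ".join(cmd))  # fasta -w 80 -a seq1.aa seq2.aa
```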