FASTA parameter description
5. Options
Command line options are available to change the scoring
parameters and output display. Command line options must precede
other program arguments, such as the query and library file
names.
5.1. Command line options
-a (fasta3, ssearch3 only) show both sequences in their
entirety.
-A force Smith-Waterman alignments for fasta3 DNA sequences.
By default, only fasta3 protein sequence comparisons use
Smith-Waterman alignments.
-B Show normalized score as a z-score, rather than a bit-score
in the list of best scores.
-b # Number of sequence scores to be shown on output. In the
absence of this option, fasta (and tfasta and ssearch)
display all library sequences obtaining similarity scores
with expectations less than 10.0 if optimized scores are
used, or 2.0 if they are not. The -b option can limit the
display further, but it will not cause additional sequences
to be displayed.
-c # Threshold score for optimization (OPTCUT). Set "-c 1" to
optimize every sequence in a database.
-E # Limit the number of scores and alignments shown based on the
expected number of scores. Used to override the expectation
value of 10.0 used by default. When used with -Q, -E 2.0
will show all library sequences with scores with an
expectation value <= 2.0.
-d # Maximum number of alignments to be displayed. Ignored if
"-Q" is not used.
-F # Limit the number of scores and alignments shown based on the
expected number of scores. "-E #" sets the highest E()-value
shown; "-F #" sets the lowest E()-value. Thus, "-F 0.0001"
will not show any matches or alignments with E() < 0.0001.
This allows one to skip over close relationships in searches
for more distant relationships.
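The combined effect of "-E #" and "-F #" can be pictured as a window on
the E()-value. The Python sketch below is illustrative only and is not
part of the FASTA programs; the function name, its parameter names, and
the sample hits are assumptions made for the example.

```python
def within_display_window(evalue, e_max=10.0, e_min=None):
    """Return True if a hit with this E()-value would be listed.

    e_max stands in for "-E #" (highest E() shown) and e_min for
    "-F #" (lowest E() shown). Both names are hypothetical.
    """
    if evalue > e_max:
        return False
    if e_min is not None and evalue < e_min:
        return False
    return True


# Hypothetical hits: (library sequence name, E()-value).
hits = [
    ("close_homolog", 1e-40),
    ("distant_homolog", 0.003),
    ("noise", 7.5),
    ("unrelated", 42.0),
]

# Similar in spirit to running with "-E 2.0 -F 0.0001".
shown = [name for name, e in hits if within_display_window(e, e_max=2.0, e_min=0.0001)]
print(shown)
```

With these made-up values only the distant homolog falls inside the
window, which is the situation the -F option is meant for.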
-f Penalty for the first residue in a gap (-12 by default for
proteins, -16 for DNA, -15 for FAST[XY]/TFAST[XY]).
-g Penalty for additional residues in a gap (-2 by default for
proteins, -4 for DNA, -3 for FAST[XY]/TFAST[XY]).
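Read together, -f and -g describe a gap cost that depends on gap
length. The Python sketch below shows the arithmetic under that
reading; the function name is hypothetical and the code is not taken
from the FASTA source.

```python
def gap_cost(length, first=-12, additional=-2):
    """Total penalty for a gap of `length` residues.

    `first` corresponds to -f (first residue in the gap) and
    `additional` to -g (each further residue). The defaults are the
    protein values quoted in the text.
    """
    if length < 1:
        raise ValueError("a gap must contain at least one residue")
    return first + (length - 1) * additional


# Protein defaults: a 3-residue gap costs -12 + 2 * (-2).
print(gap_cost(3))

# DNA defaults quoted in the text (-16 and -4).
print(gap_cost(3, first=-16, additional=-4))
```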
-h Penalty for frameshift (fastx3/y3, tfastx3/y3 only).
-H Omit histogram.
-i Invert (reverse complement) the query sequence if it is DNA.
For tfasta3/x3/y3, search the reverse complement of the
library sequence only.
-j # Penalty for frameshift within a codon (fasty3/tfasty3 only).
-l file
Location of library menu file (FASTLIBS).
-L Display more information about the library sequence in the
alignment.
-M low-high
Range of library sequence lengths to be included in the
search (protein sequences only). Only library sequences with
lengths between low and high are compared.
-m # Specify alignment type: 0, 1, 2, 3, 4, 5, 6, 9, 10
   -m 0          -m 1          -m 2          -m 3          -m 4
MWRTCGPPYT    MWRTCGPPYT    MWRTCGPPYT                  MWRTCGPPYT
::..:: :::      xx  X       ..KS..Y...    MWKSCGYPYT    ----------
MWKSCGYPYT    MWKSCGYPYT
In addition, -m 10 is a new, parseable format for use with other
programs. See the file "readme.v20u4" for a more complete
description.
-m 5 provides a combination of -m 4 and -m 0. -m 6 provides -m 5
plus HTML formatting. -m 9 provides percent identity and coordinates
with the initial list of high scores, as well as conventional -m 0
alignments.
-n Force the query sequence to be treated as a DNA sequence.
This is particularly useful for query sequences that contain
a large number of ambiguous residues, e.g. transcription
factor binding sites.
-O filename
Send a copy of the results to "filename". Helpful for
environments without STDOUT (mostly for the Macintosh).
-o Turn off default optimization of all scores greater than
OPTCUT. Sort results by "initn" scores (reduces the accuracy
of statistical estimates).
-p Force query to be treated as protein sequence.
-Q,-q
Quiet - does not prompt for any input. Writes scores and
alignments to the terminal or standard output file.
-r Specify match/mismatch scores for DNA comparisons. The
default is "+5/-4". "+3/-2" can perform better in some
cases.
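To see how the choice of match/mismatch values changes a score, the
Python sketch below scores an ungapped DNA alignment under both
settings mentioned above. It is a simplified illustration: the function
name and the sequences are made up, and ambiguity codes and gaps are
not handled.

```python
def ungapped_dna_score(a, b, match=5, mismatch=-4):
    """Sum match/mismatch values over two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(match if x == y else mismatch for x, y in zip(a.upper(), b.upper()))


seq_a = "ACGTACGT"
seq_b = "ACGTTCGT"  # differs from seq_a at one position

print(ungapped_dna_score(seq_a, seq_b))                        # "+5/-4"
print(ungapped_dna_score(seq_a, seq_b, match=3, mismatch=-2))  # "+3/-2"
```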
-R file
Save a results summary line for every sequence in the
sequence library. The summary line includes the sequence
identifier, superfamily number (if available), position in
the library, and the similarity scores calculated. This
option can be used to evaluate the sensitivity and
selectivity of different search strategies (Pearson, 1995;
Pearson, 1998).
-s file
Specify the scoring matrix file. fasta3 uses the same
scoring matrices as Blast1.4/2.0. Several scoring matrix
files are included in the standard distribution. For
protein sequences: codaa.mat - based on minimum mutation
matrix; idnaa.mat - identity matrix; pam250.mat - the PAM250
matrix developed by Dayhoff et al. (Dayhoff et al., 1978);
pam120.mat - a PAM120 matrix. The default scoring matrix is
BLOSUM50 ("-s BL50"). Other matrices available from within
the program are: PAM250 ("-s P250"), PAM120 ("-s P120"),
PAM40 ("-s P40"), PAM20 ("-s P20"), MDM10 - MDM40 ("-s M10 - M40";
the MDM matrices are modern PAM matrices from Jones et al., 1992),
and BLOSUM50, 62, and 80 ("-s BL50", "-s BL62", "-s BL80").
-S Treat lower-case characters in the query or library
sequences as "low-complexity" ("seg"-ed) residues.
Traditionally, the "seg" program (Wootton and
Federhen, 1993) is used to remove low complexity regions in
protein sequences by replacing the residues with an "X". When
the "-S" option is used, the FASTA33 programs provide a
potentially more informative approach. With "-S", lower
case characters in the query or database sequences are
treated as "X"'s during the initial scan, but are treated as
normal residues during the final alignment display. Since
statistical significance is calculated from the similarity
score calculated during the library search, when the lower
case residues are "X"'s, low complexity regions will not
produce statistically significant matches. However, if a
significant alignment contains low complexity regions, their
alignment is shown. With "-S", lower case characters may be
included in the alignment to indicate low complexity
regions, and the final alignment score may be higher than
the score obtained during the search.
The pseg program can be used to produce databases (or query
sequences) with lower case residues indicating low
complexity regions using the command:
pseg database.fasta -z 1 -q > database.lc_seg
(seg can also be used with some post processing, see
readme.v33tx.)
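The two views of a lower-case-masked sequence described above can be
sketched in Python as follows. This is only an illustration of the
idea, not the FASTA implementation; the function name and the example
sequence are hypothetical.

```python
def scan_view(seq):
    """Sequence as seen during the initial scan with -S.

    Lower-case (low-complexity) residues are replaced by "X", so they
    cannot contribute to a statistically significant score.
    """
    return "".join("X" if ch.islower() else ch for ch in seq)


# Hypothetical protein with a lower-case low-complexity stretch.
masked = "MKVLAA" + "gsssssssgs" + "TPLE"

print(scan_view(masked))  # what the search scores against
print(masked)             # what the final alignment display can show
```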
-w # Number of residues shown on each output line (width); must be
less than 200.
-x # Specify the penalty for a match to an 'X', independently of the
PAM matrix. Particularly useful for fastx3/fasty3, where
termination codons are encoded as 'X'.
-X Specifies offsets for the beginning of the query and library
sequence. For example, if you are comparing upstream
regions for two genes, and the first sequence contains 500
nt of upstream sequence while the second contains 300 nt of
upstream sequence, you might try:
fasta -X "-500 -300" seq1.nt seq2.nt
If the -X option is not used, FASTA assumes numbering starts with
1. (You should double check to be certain the negative numbering
works properly.)
-y Set the width of the band used for calculating "optimized"
scores. For proteins and ktup=2, the width is 16. For
proteins with ktup=1, the width is 32 by default. For DNA
the width is 16.
-z -1,0,1,2,3,4,5
-z -1 turns off statistical calculations. -z 0 estimates the
significance of the match from the mean and standard
deviation of the library scores, without correcting for
library sequence length. -z 1 (the default) uses a weighted
regression of average score vs library sequence length; -z 2
uses maximum likelihood estimates of Lambda and K; -z 3 uses
Altschul-Gish parameters (Altschul and Gish, 1996); -z 4 and -z 5
use two variations on the -z 1 strategy. -z 1 and -z 2 are
the best methods, in general.
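The idea behind -z 0 (mean and standard deviation of the library
scores, with no length correction) can be sketched in Python as below.
This is a simplified illustration rather than the calculation the
programs perform: the function name and scores are made up, the use of
the population standard deviation is an assumption, and any scaling
FASTA applies to the z-scores it reports is not reproduced.

```python
import statistics


def simple_z_scores(scores):
    """z = (score - mean) / sd over all library scores, ignoring length."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # population sd: an assumption of this sketch
    if sd == 0:
        raise ValueError("all scores are identical; z-scores are undefined")
    return [(s - mean) / sd for s in scores]


# Hypothetical similarity scores; the last one stands out from the rest.
library_scores = [32, 35, 38, 40, 41, 44, 47, 120]
for score, z in zip(library_scores, simple_z_scores(library_scores)):
    print(f"score={score:4d}  z={z:6.2f}")
```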
-z 11,12,14,15
Estimate the statistical parameters from shuffled copies of
each library sequence. This doubles the time required for a
search, but allows accurate statistics to be estimated for
libraries composed of a single protein family.
-Z db_size
set the apparent size of the database to be used when calculating
expectation E() values. If you searched a database with 1,000
sequences, but would like to have the E()-values calculated in
the context of a 100,000 sequence database, use '-Z 100000'.
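The effect of -Z can be approximated by hand if one assumes that the
E()-value is proportional to the number of library sequences. The
Python sketch below makes that assumption explicit; the function name
is hypothetical, and the result is a rough guide rather than the exact
value the programs would report.

```python
def rescale_evalue(evalue, searched_size, apparent_size):
    """Rescale an E()-value to a different apparent database size.

    Assumes E() grows linearly with the number of sequences searched.
    """
    if searched_size <= 0 or apparent_size <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * apparent_size / searched_size


# A hit found in a 1,000-sequence search, viewed as if the database
# held 100,000 sequences (the situation "-Z 100000" addresses).
print(rescale_evalue(0.001, searched_size=1000, apparent_size=100000))
```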
-1 sort output by init1 score (for compatibility with FASTP -
do not use).
-3 translate only three forward frames
For example:
fasta -w 80 -a seq1.aa seq2.aa
would compare the sequence in seq1.aa to that in seq2.aa and
display the results with 80 residues on an output line, showing
all of the residues in both sequences. Be sure to enter the
options before entering the file names, or just enter the options
on the command line, and the program will prompt for the file
names.
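When the programs are driven from a script, the ordering rule is easy
to enforce by building the argument list in one place. The Python
sketch below only constructs and prints the command line; it does not
run anything. The helper name is hypothetical, and the program name
"fasta" is taken from the example above.

```python
import shlex


def build_fasta_command(query, library, options=(), program="fasta"):
    """Return an argument list with all options ahead of the file names."""
    return [program, *options, query, library]


cmd = build_fasta_command("seq1.aa", "seq2.aa", options=["-w", "80", "-a"])
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run() on a system where
the programs are installed.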