BLAST

BLAST searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST can produce gapped alignments for the matches it finds.

DESCRIPTION

BLAST, or Basic Local Alignment Search Tool, uses the method of Altschul et al. (J. Mol. Biol. 215: 403-410 (1990)) to search for similarities between a query sequence and all the sequences in a database.

This release of BLAST implements version 2 of BLAST from the National Center for Biotechnology Information (NCBI) described in Altschul et al. (Nucleic Acids Res. 25(17): 3389-3402 (1997)). This version is known as "gapped BLAST" because, in addition to offering a three-fold speedup over the original BLAST, it generates gapped alignments between query and database sequences.

You can specify any number of query sequences to BLAST, and they may be in any combination of protein or nucleic acid sequences. You can also specify any number of databases to BLAST, as long as all of the databases are of the same type (protein or nucleic acid). In the current release, if you want to specify multiple databases you must do so on the command line. In other words, you cannot specify more than one database from the interactive menu. For example:

% blast -INfile2=PIR,SWPLUS

You can also specify multiple queries using any valid multiple sequence specification. For example:

% blast -INfile1=hsp70.msf{*}

The GCG Wisconsin Package BLAST program supports five different programs in the BLAST family:

BLASTP, Protein Query Searching a Protein Database
BLASTX, Translated Nucleotide Query Searching a Protein Database
BLASTN, Nucleotide Query Searching a Nucleotide Database
TBLASTN, Protein Query Searching a Translated Nucleotide Database
TBLASTX, Translated Nucleotide Query Searching a Translated Nucleotide Database

Normally, BLAST decides which BLAST program you want to use simply by looking at the type (protein or nucleic acid) of your query sequence and the database you have selected. In the case of nucleotide-nucleotide searches, there are two programs that can do the search. By default, BLASTN is used. To search using TBLASTX instead, use -TBLASTX (but remember that gapped alignments are not available when using TBLASTX).
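The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the decision table, not the program's actual code: the function name choose_program and the "P"/"N" type codes are assumptions, and raising an error when -TBLASTX is combined with a protein query or database is a guess, since this document does not say what happens in that case.

```python
# Hypothetical sketch of how the BLAST family member is chosen from the
# query type and database type ("P" = protein, "N" = nucleic acid).
PROGRAMS = {
    ("P", "P"): "BLASTP",
    ("N", "P"): "BLASTX",
    ("N", "N"): "BLASTN",   # default for nucleotide-nucleotide searches
    ("P", "N"): "TBLASTN",
}

def choose_program(query_type, db_type, tblastx=False):
    """Return the name of the program that would run the search."""
    key = (query_type.upper(), db_type.upper())
    if tblastx:
        # Assumption: -TBLASTX only makes sense for nucleotide vs. nucleotide.
        if key != ("N", "N"):
            raise ValueError("-TBLASTX needs a nucleotide query and database")
        return "TBLASTX"
    return PROGRAMS[key]

if __name__ == "__main__":
    print(choose_program("P", "P"))                # protein query, protein database
    print(choose_program("N", "N", tblastx=True))  # translated search on both sides
```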

BLAST performs only local searches: it searches databases maintained at your institution. Local searches can consume significant computing resources and require diligent maintenance of local databases. An alternative to running searches locally is to use NetBLAST, which sends your query sequences over the internet to a server at NCBI in Bethesda, MD. Keep in mind, however, that NCBI imposes some limitations on NetBLAST searches, such as restricting the number of searches that a user is permitted to run in a single day and prohibiting TBLASTX searches against the NR database. Additionally, NetBLAST does not support as many search options as are available with BLAST.

BLAST is a statistically driven search method that finds regions of similarity between your query and database sequences and produces gapped alignments of these regions. Within these aligned regions, the sum of the scoring matrix values of their constituent symbol pairs is higher than some level that you would expect to occur by chance alone.

You are prompted to set an expectation level for the entire search. The expectation of a sequence is the number of times you would expect the current search to find a sequence with as good a score by chance alone. Therefore, setting the maximum expectation level to 10.0, the default, limits the reported sequences to those with scores high enough to have been found by chance only ten or fewer times.

EXAMPLE

Here is a session using BLAST to find the sequences in PIR with similarities to a myoglobin gene:

% blast

 BLAST with what query sequence(s) ?  mywhp.pep

                  Begin (* 1 *) ?
                End (*   153 *) ?

 Search for query in what sequence database:

   1) pir     p Protein Information Resource
   2) swplus  p SWISS-PROT+SP-TREMBL
   3) genembl n GenBank+EMBL
   4) est     n Expressed Sequence Tags
   5) sts     n Sequence Tagged Sites
   6) gss     n Genome Survey Sequences

 Please choose one (* 1 *):

 Ignore hits expected to occur by chance more than (* 10.0 *) times?

 Limit the number of sequences in my output to (* 500 *) ?

 What should I call the output file (* mywhp.blastp *) ?

 1 Searching pir with query pir1:mywhp...done.

   CPU time (sec): 10.4
      Output file: mywhp.blastp

 Number of query sequences searched: 1
                     CPU time (sec): 10.4

%

OUTPUT

Below is part of the output from the search in the example session:

The output has four parts: 1) an introduction that tells where the search occurred and what database and query were compared; 2) a list of the sequences in the database containing HSPs (high-scoring segment pairs) whose scores were least likely to have occurred by chance (the entries in this list have begin and end ranges on them if -FRAGments is specified); 3) a display of the alignments of the HSPs showing identical and similar residues; and 4) a complete list of the parameter settings used for the search.

By default, BLAST looks for alignments that contain gaps. If you only look for alignments that do not contain gaps, there will often be more than one segment pair associated with each database sequence.

///////////////////////////////////////////////////////////////////////////////

BLASTP 2.0.5 [May-5-1998]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.
Query= PIR1:MYWHP
         (153 letters)

Database: pir
           107,076 sequences; 34,138,851 total letters

Searching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
///////////////////////////////////////////////////////////////////////////////
                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value
 ..
PIR1:MYWHP  Begin: 1 End: 153 !myoglobin - sperm whale                313  1e-85
PIR1:MYWHW  Begin: 1 End: 153 !myoglobin - dwarf sperm whale          305  3e-83
///////////////////////////////////////////////////////////////////////////////

PIR1:MYSLG  Begin: 2 End: 153 !myoglobin - gray seal                  270  1e-72
///////////////////////////////////////////////////////////////////////////////

PIR1:MYAQ  Begin: 3 End: 154 !myoglobin - American alligator          206  2e-53
\\End of List
>PIR1:MYWHP myoglobin - sperm whale
           Length = 153

 Score =  313 bits (794), Expect = 1e-85
 Identities = 153/153 (100%), Positives = 153/153 (100%)

Query: 1   VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED 60
           VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED
Sbjct: 1   VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED 60

Query: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120
           LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
Sbjct: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120

Query: 121 GDFGADAQGAMNKALELFRKDIAAKYKELGYQG 153
           GDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Sbjct: 121 GDFGADAQGAMNKALELFRKDIAAKYKELGYQG 153

>PIR1:MYWHW myoglobin - dwarf sperm whale
           Length = 153

 Score =  305 bits (773), Expect = 3e-83
 Identities = 148/153 (96%), Positives = 151/153 (97%)

Query: 1   VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED 60
           VLSEGEWQLVLHVWAKVEAD+AGHGQDILIRLFK HPETLEKFDRFKHLK+EAEMKASED
Sbjct: 1   VLSEGEWQLVLHVWAKVEADIAGHGQDILIRLFKHHPETLEKFDRFKHLKSEAEMKASED 60

Query: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120
           LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
Sbjct: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120

Query: 121 GDFGADAQGAMNKALELFRKDIAAKYKELGYQG 153
            DFGADAQGAM+KALELFRKDIAAKYKELGYQG
Sbjct: 121 ADFGADAQGAMSKALELFRKDIAAKYKELGYQG 153

///////////////////////////////////////////////////////////////////////////////
>PIR1:MYSLG myoglobin - gray seal
           Length = 153

 Score =  270 bits (682), Expect = 1e-72
 Identities = 127/152 (83%), Positives = 140/152 (91%)

Query: 2   LSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDL 61
           LS+GEW LVL+VW KVE D+AGHGQ++LIRLFKSHPETLEKFD+FKHLK+E +M+ SEDL
Sbjct: 2   LSDGEWHLVLNVWGKVETDLAGHGQEVLIRLFKSHPETLEKFDKFKHLKSEDDMRRSEDL 61

Query: 62  KKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPG 121
           +KHG TVLTALG ILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHS+HP
Sbjct: 62  RKHGNTVLTALGGILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSKHPA 121

Query: 122 DFGADAQGAMNKALELFRKDIAAKYKELGYQG 153
           +FGADAQ AM KALELFR DIAAKYKELG+ G
Sbjct: 122 EFGADAQAAMKKALELFRNDIAAKYKELGFHG 153

///////////////////////////////////////////////////////////////////////////////
>PIR1:MYAQ myoglobin - American alligator
           Length = 154

 Score =  206 bits (519), Expect = 2e-53
 Identities = 94/152 (61%), Positives = 122/152 (79%)

Query: 2   LSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDL 61
           LS+ EW+ VL +W KVE+ +  HG +++IRL + HPET E+F++FKH+KT  EMK+SE +
Sbjct: 3   LSDQEWKHVLDIWTKVESKLPEHGHEVIIRLLQEHPETQERFEKFKHMKTADEMKSSEKM 62

Query: 62  KKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPG 121
           K+HG TV TALG ILK+KG+H   LKPLA+SHA +HKIP+KYLEFISE I+ V+  ++P
Sbjct: 63  KQHGNTVFTALGNILKQKGNHAEVLKPLAKSHALEHKIPVKYLEFISEIIVKVIAEKYPA 122

Query: 122 DFGADAQGAMNKALELFRKDIAAKYKELGYQG 153
           DFGAD+Q AM KALELFR D+A+KYKE GYQG
Sbjct: 123 DFGADSQAAMRKALELFRNDMASKYKEFGYQG 154

///////////////////////////////////////////////////////////////////////////////
  Database: pir
    Posted date:  Jul 8, 1998  4:54 PM
  Number of letters in database: 34,138,851
  Number of sequences in database:  107,076

Lambda     K      H
   0.318    0.135    0.392

Gapped
Lambda     K      H
   0.270   0.0470    0.230

Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Hits to DB: 10131438
Number of Sequences: 107076
Number of extensions: 379101
Number of successful extensions: 1983
Number of sequences better than 10: 645
Number of HSP's better than 10.0 without gapping: 572
Number of HSP's successfully gapped in prelim test: 73
Number of HSP's that attempted gapping in prelim test: 1152
Number of HSP's gapped (non-prelim): 652
length of query: 153
length of database: 34138851
effective HSP length: 50
effective length of query: 103
effective length of database: 28785051
effective search space: 2964860253
T: 11
A: 40
X1: 16 ( 7.3 bits)
X2: 38 (14.8 bits)
X3: 64 (24.9 bits)
S1: 41 (21.7 bits)
S2: 61 (28.2 bits)

The BLAST output is a list file that is suitable for input to any GCG program that allows indirect file specifications. For information about indirect file specification, see Chapter 2 of the User's Guide, Using Sequence Files and Databases.
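If you want to post-process the list portion of the output yourself, a small parser is often enough. The sketch below assumes the entry layout shown in the example output (name, Begin and End ranges, a description introduced by "!", then the bit score and E value); the regular expression and the function name parse_hit are illustrative assumptions, not part of the package.

```python
import re

# Assumed layout of one list entry, taken from the example output:
# NAME  Begin: B End: E !description   BITS  EVALUE
HIT_LINE = re.compile(
    r"^(\S+)\s+Begin:\s*(\d+)\s+End:\s*(\d+)\s+!(.*?)\s+(\d+)\s+(\S+)\s*$"
)

def parse_hit(line):
    """Return a dict for one list entry, or None if the line is not an entry."""
    match = HIT_LINE.match(line)
    if match is None:
        return None
    name, begin, end, description, bits, evalue = match.groups()
    return {
        "name": name,
        "begin": int(begin),
        "end": int(end),
        "description": description,
        "bits": int(bits),
        "evalue": float(evalue),
    }

if __name__ == "__main__":
    sample = "PIR1:MYSLG  Begin: 2 End: 153 !myoglobin - gray seal                  270  1e-72"
    print(parse_hit(sample))
```

Lines that do not match, such as the rows of slashes that mark omitted output, simply return None.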

INTERPRETING OUTPUT

Bit Score
E Value
N
BLAST Parameters
http://www.ncbi.nlm.nih.gov/BLAST/newblast.html.

INPUT FILES

BLAST accepts any number of protein or nucleic acid sequences as input. The search set is a specially formatted database. See the GCGToBLAST entry in the Program Manual for information on how to create a local database that BLAST can search from a set of sequences in GCG format.

The function of BLAST depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

PSIBLAST iteratively searches one or more protein databases for sequences similar to one or more protein query sequences. PSIBLAST is similar to BLAST except that it uses position-specific scoring matrices derived during the search.

NetBLAST searches for sequences similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. NetBLAST can search only databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA.

GCGToBLAST combines any set of GCG sequences into a database that you can search with BLAST.

FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST.

TFastA does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

FastX does a Pearson and Lipman search for similarity between a nucleotide query sequence and a group of protein sequences, taking frameshifts into account. FastX translates both strands of the nucleic sequence before performing the comparison. It is designed to answer the question, "What implied protein sequences in my nucleic acid sequence are similar to sequences in a protein database?"

TFastX does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences, taking frameshifts into account. It is designed to be a replacement for TFastA, and like TFastA, it is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

SSearch does a rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). This may be the most sensitive method available for similarity searches. Compared to BLAST and FastA, it can be very slow.

FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.

WordSearch identifies sequences in the database that share large numbers of common words in the same register of comparison with your query sequence. The output of WordSearch can be displayed with Segments.

ProfileSearch and MotifSearch use a profile (derived from a set of aligned sequences) instead of a query sequence to search a collection of sequences.

HmmerSearch uses a profile hidden Markov model as a query to search a sequence database to find sequences similar to the family from which the profile HMM was built. Profile HMMs can be created using HmmerBuild.

FindPatterns uses a pattern described by a regular expression to search a collection of sequences. Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds.

RESTRICTIONS

Because of the way BLAST must estimate certain statistical parameters (see the ALGORITHM topic elsewhere in this document), the number of scoring matrices available for use with BLAST is limited. Currently, valid choices for the -MATRix parameter are BLOSUM62 (the default), BLOSUM45, BLOSUM80, PAM30, and PAM70.

Gap creation and gap extension penalties are supported in limited combinations depending upon which scoring matrix is in use. The following table shows the allowed combinations for amino acids. The first values listed are the defaults for each scoring matrix.

 Scoring Matrix    Gap Opening Penalty    Gap Extension Penalty

   BLOSUM62                 11                       1
                             7                       2
                             8                       2
                             9                       2
                            10                       1
                            12                       1

   BLOSUM80                 10                       1
                             6                       2
                             7                       2
                             8                       2
                             9                       1
                            11                       1

   BLOSUM45                 14                       2
                            10                       3
                            11                       3
                            12                       3
                            13                       3
                            12                       2
                            13                       2
                            15                       2
                            16                       1
                            17                       1
                            18                       1
                            19                       1

    PAM30                    9                       1
                             5                       2
                             6                       2
                             7                       2
                             8                       1
                            10                       1

    PAM70                   10                       1
                             6                       2
                             7                       2
                             8                       2
                             9                       1
                            11                       1
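
If you build BLAST command lines from a script, it can help to check a matrix and gap penalty combination against the table before submitting the search. The following is a minimal sketch that copies the table into a Python dictionary; the names ALLOWED_GAPS, default_gaps, and is_allowed are hypothetical and do not correspond to anything in the package.

```python
# Allowed (gap opening, gap extension) pairs per scoring matrix, copied
# from the table in this section. The first pair in each list is the default.
ALLOWED_GAPS = {
    "BLOSUM62": [(11, 1), (7, 2), (8, 2), (9, 2), (10, 1), (12, 1)],
    "BLOSUM80": [(10, 1), (6, 2), (7, 2), (8, 2), (9, 1), (11, 1)],
    "BLOSUM45": [(14, 2), (10, 3), (11, 3), (12, 3), (13, 3), (12, 2),
                 (13, 2), (15, 2), (16, 1), (17, 1), (18, 1), (19, 1)],
    "PAM30": [(9, 1), (5, 2), (6, 2), (7, 2), (8, 1), (10, 1)],
    "PAM70": [(10, 1), (6, 2), (7, 2), (8, 2), (9, 1), (11, 1)],
}

def default_gaps(matrix):
    """Return the default (opening, extension) penalties for a matrix."""
    return ALLOWED_GAPS[matrix.upper()][0]

def is_allowed(matrix, opening, extension):
    """True if the combination appears in the table for that matrix."""
    return (opening, extension) in ALLOWED_GAPS.get(matrix.upper(), [])

if __name__ == "__main__":
    print(default_gaps("blosum62"))
    print(is_allowed("PAM30", 9, 2))
```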

Gapped alignments are not an option when running TBLASTX.

You may choose multiple query sequences, any of which may be either nucleic acid or protein. You may also choose multiple databases against which to search; however, each of these must be of the same type.

If you used GCGToBLAST to create your BLAST databases from any source other than a GCG-formatted database (such as from arbitrary sequence files, an MSF or RSF file, etc.), then BLAST's list file output won't be a functional list file. If you want to take full advantage of BLAST's list file output, make sure that you generate your BLAST databases from a GCG-formatted database. You can use DataSet to generate such databases from any set of sequences in GCG format.

CHOOSING SEARCH SETS

BLAST can search only a specially compressed form of the data. Therefore, you can search only those databases that are available in this form, and you must search them in their entirety. If you want to restrict the search to a specific set of sequences, use the program GCGToBLAST to create a specially compressed database consisting of just those sequences.

To name a searchable database interactively, choose the number of the database of interest from the menu. On the command line, use a parameter like -INfile2=genbank to name the database you want to search.

If a nucleic acid and a protein database share the same name, BLAST cannot be sure which one of them you mean when you specify one of them using the -INfile2 parameter. If the database you want to search cannot be named unambiguously with the -INfile2 parameter, add either -DBNucleotideonly or -DBProteinonly to the command line.

ALGORITHM

BLAST is a client for an implementation of gapped BLAST (Altschul et al., Nucleic Acids Research 25: 3389-3402 (1997)), a heuristic algorithm for searching protein and nucleic acid databases for similarities to query sequences.

The above example demonstrates BLASTP, which searches for similarities between protein queries and protein databases, as a prototype for BLAST. However, the ideas are immediately applicable to comparisons involving conceptual translations of query sequences and databases, and extend to similarity searches between nucleic acid sequences as well.

BLAST compares a query sequence with a database sequence by first locating two non-overlapping sequence segments in common within a certain distance of each other, and then attempting to extend these putative "hits" into locally optimal alignments between the sequences being compared. A more detailed description is provided below.

Preliminaries

BLAST uses a substitution matrix (such as the BLOSUM or PAM matrices) to assign a score to the alignment of any pair of amino acids. An aggregate score for an alignment segment can be computed by summing the scores of each amino acid pair in that segment. When given two sequences to compare, the original (ungapped) BLAST algorithm searches for arbitrary but equal length segments within each sequence that have a maximal aggregate score which meets or exceeds some threshold or cutoff score. BLAST looks for locally optimal alignments between the two sequences whose scores cannot be improved either by extending or trimming. Such locally optimal alignments are called "high-scoring segment pairs," or HSPs.

If you assume a simple protein model in which amino acids occur randomly at all positions and in proportion to the frequencies at which they are found within the database and query sequences, then you can compute a normalized score (expressed in units called bits) from the nominal score of an HSP. Such normalized scores allow direct statistical comparison of results regardless of the scoring system used (see "Generating Gapped Extensions" for a caveat to this). Furthermore, the normalized score can be used to compute an expect value, or E-value, which is the number of distinct HSPs having at least that normalized score expected to occur by chance. This theory has not been proved for gapped local alignments and their associated scores, but there are indications that it remains valid (Altschul et al., 1997).
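
As a worked example, the numbers printed at the end of the example output can be used to reproduce the bit score and E-value of the top hit approximately. The sketch below uses the standard relations from Altschul et al. (1997), S' = (lambda * S - ln K) / ln 2 and E = m * n * 2^(-S'), which are not spelled out in this document. The effective-length arithmetic (subtracting the effective HSP length once from the query and once per database sequence) is inferred from the printed parameter values. Because the printed Lambda and K are rounded, the results only approximate the reported figures.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalized score in bits: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expect_value(bits, search_space):
    """Expected number of chance HSPs with at least this bit score."""
    return search_space * 2.0 ** (-bits)

# Values copied from the parameter listing of the example output.
db_letters, db_sequences = 34_138_851, 107_076
query_length, effective_hsp_length = 153, 50

effective_query = query_length - effective_hsp_length
effective_db = db_letters - db_sequences * effective_hsp_length
search_space = effective_query * effective_db

# Top hit: raw score 794, gapped Lambda = 0.270, gapped K = 0.0470.
bits = bit_score(794, 0.270, 0.0470)
e_value = expect_value(bits, search_space)

if __name__ == "__main__":
    print(effective_query, effective_db, search_space)
    print(round(bits, 1), "%.1e" % e_value)
```

The effective lengths and search space computed this way agree with the values in the listing (103, 28785051, and 2964860253), and the E-value comes out on the order of 1e-85, in line with the reported value.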

Turning Hits Into HSPs

The central idea of the BLAST algorithm is that any statistically significant alignment between two sequences is likely to contain a high-scoring pair of aligned words. A word is simply a sequence segment of specified length (usually 3 for protein sequences). BLAST begins its comparison of a query sequence to a database by scanning the database for words that score at least the threshold score T when aligned with some word within the query sequence. Any word pair satisfying this condition is called a hit. The diagonal of a hit involving words starting at positions (x, y) of the database and query sequences is defined as x-y. The distance between two hits on the same diagonal is defined as the difference between their first coordinates.

Once a hit is found, BLAST determines whether the hit lies within an alignment having an aggregate score high enough to be reported. It does this by extending the hit in both directions until the running alignment's score has dropped more than some quantity X below the maximum score yet attained. This extension step is quite costly, taking upwards of 90% of BLAST's execution time under most circumstances.
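
The extension rule can be illustrated with a short sketch that extends in one direction only and stops once the running score has dropped more than X below the best score seen so far. For simplicity it scores with a fixed match reward and mismatch penalty (+1/-3) rather than a substitution matrix; the function name extend_right and its parameters are assumptions made for this example, not the program's internals.

```python
def extend_right(query, subject, q_start, s_start, x_drop, match=1, mismatch=-3):
    """Ungapped extension to the right with an X-drop cutoff.

    Returns (best_score, best_length): the highest running score reached
    and the number of aligned positions that achieve it.
    """
    score = best_score = 0
    best_length = 0
    offset = 0
    while q_start + offset < len(query) and s_start + offset < len(subject):
        same = query[q_start + offset] == subject[s_start + offset]
        score += match if same else mismatch
        offset += 1
        if score > best_score:
            best_score, best_length = score, offset
        elif best_score - score > x_drop:
            break  # fell more than X below the maximum: stop extending
    return best_score, best_length

if __name__ == "__main__":
    print(extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, x_drop=5))
```

A full extension would run the same loop leftward from the hit and add the two results.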

In order to reduce the number of extensions it has to perform, BLAST takes advantage of the fact that an interesting HSP is typically much longer than a single hit. In fact, it is likely to contain multiple hits on the same diagonal within a relatively short distance of one another. Therefore, BLAST chooses a length A and invokes an ungapped extension if and only if two non-overlapping hits are found on the same diagonal within distance A of one another. (Any hit that overlaps the most recent one is ignored.)
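
The two-hit rule can be sketched as follows. Hits are given as (x, y) word start positions in the database and query sequences, the diagonal is x - y, and an extension is triggered when a hit lies within distance A of the most recent non-overlapping hit on the same diagonal. The function name, the default word size of 3, the window of 40 (the value of A shown in the example output), and the use of "less than or equal to" for "within" are assumptions of this sketch.

```python
def two_hit_triggers(hits, word_size=3, window=40):
    """Return pairs of hits that would trigger an ungapped extension.

    hits: iterable of (x, y) start positions of matching words in the
    database sequence (x) and the query sequence (y).
    """
    last_on_diagonal = {}  # diagonal -> most recent hit kept on it
    triggers = []
    for x, y in sorted(hits):
        diagonal = x - y
        previous = last_on_diagonal.get(diagonal)
        if previous is not None:
            distance = x - previous[0]
            if distance < word_size:
                continue  # overlaps the most recent hit: ignored
            if distance <= window:
                triggers.append((previous, (x, y)))
        last_on_diagonal[diagonal] = (x, y)
    return triggers

if __name__ == "__main__":
    example_hits = [(10, 5), (11, 6), (20, 15), (100, 95), (30, 10)]
    print(two_hit_triggers(example_hits))
```

In the example, (11, 6) overlaps (10, 5) and is ignored, (20, 15) triggers an extension, (100, 95) is too far from the previous hit, and (30, 10) is alone on its diagonal.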

Generating Gapped Extensions

Gapped extensions allow BLAST to maintain its sensitivity while tolerating a much higher chance of missing any single moderately scoring HSP. However, gapped extensions take about 500 times longer to execute than ungapped extensions. Therefore, BLAST triggers a gapped extension for an HSP only when its score exceeds a moderate score (Sg) specifically chosen so that no more than about one gapped extension is invoked per 50 database sequences.

To generate the gapped local alignment, BLAST uses a standard dynamic programming algorithm for pairwise sequence alignment which traverses the cells of a path graph, the dimensions of which are the lengths of the two sequences being compared, performing a fixed amount of computation per cell. Starting from a single aligned pair of residues, called the seed, the dynamic programming proceeds both forward and backward through the path graph, considering only those cells for which the optimal local alignment score falls no more than X below the best alignment score yet found. (This description is a generalization of BLAST's method for constructing HSPs.) The region of the path graph explored adapts to the alignment being produced.

The seed for the dynamic programming is the central residue pair of the length-11 segment of the HSP having the highest alignment score. If the HSP itself is shorter than 11 residues in length, its central pair of residues is chosen.
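
The seed choice can be sketched with a sliding window over the per-position scores of an HSP. The input list pair_scores (one substitution score per aligned residue pair) and the function name are hypothetical. Taking the first window when several tie, and the left-of-center pair for an even-length short HSP, are assumptions this document does not address.

```python
def seed_index(pair_scores, window=11):
    """Return the 0-based index, within the HSP, of the seed residue pair."""
    n = len(pair_scores)
    if n < window:
        return (n - 1) // 2  # HSP shorter than the window: use its center
    current = sum(pair_scores[:window])
    best_score, best_start = current, 0
    for start in range(1, n - window + 1):
        # Slide the window one position to the right.
        current += pair_scores[start + window - 1] - pair_scores[start - 1]
        if current > best_score:
            best_score, best_start = current, start
    return best_start + window // 2

if __name__ == "__main__":
    scores = [1] * 5 + [5] * 11 + [1] * 5
    print(seed_index(scores))
```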

The resulting gapped alignment is reported only if it has an E-value low enough to be of interest. For any alignment actually reported, BLAST performs a gapped extension that records "traceback" information (Sankoff and Kruskal, 1983) using a substantially larger X parameter than that employed during the search stage to increase the accuracy of the alignment.

Because BLAST produces gapped alignments only for those few database sequences likely to be related to the query, it cannot estimate the parameters necessary to compute normalized scores on the fly. Instead, BLAST must rely on estimates of these parameters generated beforehand by random simulation. For this reason, BLAST cannot use a scoring system for which no simulation has been performed and still produce accurate estimates of statistical significance.

CONSIDERATIONS

Bit Scores and the Size of the Search
Using BLAST for Nucleotide Searches
Increasing Program Speed Using Multithreading
When Blastall Produces No Output
Using PSI-TBLASTN

SUGGESTIONS

List Size Limit
Segment Pair Alignment Limit
Sensitivity
Batch Queue
Relationship to FastA

FILTERING OUT LOW COMPLEXITY SEQUENCES

BLAST filters out regions of low complexity from query sequences by default. You can turn filtering off by using the -NOFILter parameter. Searches against a nucleotide database with nucleotide queries (blastn) employ the DUST filter program (Hancock and Armstrong, Comput. Appl. Biosci. 10: 67-70 (1994); Tatusov and Lipman, unpublished). All other searches employ the SEG filter program (Wootton and Federhen, Computers in Chemistry 17: 149-163 (1993); Wootton and Federhen, Methods in Enzymology 266: 554-571 (1996)). For a general discussion of the role of filtering in search strategies, see Altschul et al., Nature Genetics 6: 119-129 (1994).

Short repeats and low complexity sequences, such as glutamine-rich regions, confound most database searching methods. For BLAST, the random model against which the significance of segment pair scores is evaluated assumes that at each position, each residue has a probability of occurring which is proportional to its composition in the database as a whole. Low complexity or highly repetitive sequences are inconsistent with this assumption.

Low complexity sequence found by the filter program is substituted using the letter N in nucleotide sequence and the letter X in amino acid sequence. Here is an example of a sequence aligned to a filtered copy of itself to show which parts are filtered out:

  1 MAAKIFCLIMXXXXXXXXXXXXIFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ 60
  1 MAAKIFCLIMLLGLSASAATASIFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ 60

 61 AIAAGIXXXXXXXXXXXXXXXXXXXXXXXXXXNIRXXXXXXXXXXXXXXYSQQQQFLPFN 120
 61 AIAAGILPLSPLFLQQSSALLQQLPLVHLLAQNIRAQQLQQLVLANLAAYSQQQQFLPFN 120

121 QXXXXXXXXXXXXXXXXPFSQLAAAYPRQFLPFNQLAALNSHAYVXXXXXXPFSQLAAVS 180
121 QLAALNSAAYLQQQQLLPFSQLAAAYPRQFLPFNQLAALNSHAYVQQQQLLPFSQLAAVS 180

181 PAAFLTQQQLLPFYLHTAPNVGTXXXXXXXXXXXXXXXTNPAAFYQQPIIGGALF 235
181 PAAFLTQQQLLPFYLHTAPNVGTLLQLQQLLPFDQLALTNPAAFYQQPIIGGALF 235
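
The substitution step itself is simple once the low complexity regions are known. The sketch below does not implement SEG or DUST; it only shows how regions reported by a filter, given here as hypothetical 1-based (begin, end) ranges, would be replaced by X in a protein or N in a nucleotide sequence. The region 11-22 corresponds to the first masked stretch in the example above.

```python
def mask_regions(sequence, regions, protein=True):
    """Replace each 1-based inclusive (begin, end) region with X or N."""
    mask_char = "X" if protein else "N"
    residues = list(sequence)
    for begin, end in regions:
        for index in range(begin - 1, end):
            residues[index] = mask_char
    return "".join(residues)

if __name__ == "__main__":
    original = "MAAKIFCLIMLLGLSASAATASIFPQ"
    print(mask_regions(original, [(11, 22)]))
```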

AMINO ACID SCORING

BLAST normally uses the BLOSUM62 scoring matrix from Henikoff and Henikoff (Proc. Natl. Acad. Sci. USA 89: 10915-10919 (1992)) whenever the sequences being compared are proteins (including cases where nucleotide databases or query sequences are translated into protein sequences before comparison). You can use the other BLOSUM matrices (BLOSUM45 and BLOSUM80) or the more traditional PAM70 and PAM30 scoring matrices with -MATRix, for example -MATRix=PAM70. Each matrix is most sensitive for finding homologs at the corresponding PAM distance. The seminal paper on this subject is Stephen Altschul's "Amino acid substitution matrices from an information theoretic perspective" (J. Mol. Biol. 219: 555-565 (1991)). If you are new to this literature, an easier place to start reading might be Altschul et al., "Issues in searching molecular sequence databases" (Nature Genetics 6: 119-129 (1994)).

NUCLEOTIDE SCORING

There is no external scoring matrix for nucleotide-nucleotide searches (that is, searches where both the query and the database are nucleotide sequences and where you have not used -TBLASTX). However, you can specify a nucleotide-nucleotide scoring matrix for any PAM distance by changing the match/mismatch ratio. The default ratio is +1/-3. You can change the ratio by specifying a new value for the numerator using -MATch or for the denominator using -MISmatch.
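
To get a feel for what a given match/mismatch ratio implies, you can compute the fraction of identical positions at which an ungapped alignment scores exactly zero. This is plain arithmetic on the reward and penalty, offered as a rough guide rather than a statement about PAM distances; the function names are hypothetical.

```python
def ungapped_score(seq_a, seq_b, match=1, mismatch=-3):
    """Score two equal-length nucleotide segments position by position."""
    return sum(match if a == b else mismatch for a, b in zip(seq_a, seq_b))

def break_even_identity(match=1, mismatch=-3):
    """Fraction of identities p at which p*match + (1-p)*mismatch equals 0."""
    return -mismatch / (match - mismatch)

if __name__ == "__main__":
    print(break_even_identity())         # default +1/-3 ratio
    print(break_even_identity(1, -1))    # a more permissive ratio
    print(ungapped_score("ACGT", "ACGA"))
```

With the default +1/-3 ratio, a segment must be more than 75% identical to keep a positive score, so lowering the penalty relative to the reward lets more divergent segments score positively.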

ALTERNATIVE GENETIC CODES

BLAST normally uses the standard genetic code if either the query or the database sequences require translation. If your query comes from a system where this genetic code is inappropriate, you can select any of these alternative codes by the numbers given in the following table:

     1 Standard or Universal
     2 Vertebrate Mitochondrial
     3 Yeast Mitochondrial
     4 Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma
     5 Invertebrate Mitochondrial
     6 Ciliate Macronuclear
     7 [Do not use this index]
     8 [Do not use this index]
     9 Echinodermate Mitochondrial
    10 Alternative Ciliate(Euplotid) Macronuclear
    11 Eubacterial
    12 Alternative Yeast
    13 Ascidian Mitochondrial
    14 Flatworm Mitochondrial
    15 Alternate Ciliate (Blepharisma) Nuclear
    16 Chlorophycean Mitochondrial
    21 Trematode Mitochondrial

You can specify the genetic code for the query and the database independently. Use -TRANSlate=2 to tell BLAST to use the vertebrate mitochondrial code to translate the query. Use -DBTRANSlate=3 to tell BLAST to use the yeast mitochondrial code to translate the database. (Note that most of the genes in GenBank will be translated inappropriately if you select a nonstandard genetic code for database translation.)
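
To see why the choice of genetic code matters, the sketch below translates a short sequence with the standard code (1) and the vertebrate mitochondrial code (2). It is an independent illustration, not the translation code used by BLAST. The four codon reassignments shown for code 2 (AGA and AGG become stops, ATA becomes Met, TGA becomes Trp) are the well-known differences from the standard code; the other codes in the table are not covered.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    "".join(codon): amino
    for codon, amino in zip(product("TCAG", repeat=3), _AMINO_ACIDS)
}

# Vertebrate mitochondrial code: the standard code with four reassignments.
VERTEBRATE_MITO = dict(STANDARD, AGA="*", AGG="*", ATA="M", TGA="W")

CODES = {1: STANDARD, 2: VERTEBRATE_MITO}

def translate(dna, code=1):
    """Translate frame 1 of a DNA string; '*' marks a stop codon."""
    table = CODES[code]
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))

if __name__ == "__main__":
    print(translate("ATGTGAATAAGA", code=1))
    print(translate("ATGTGAATAAGA", code=2))
```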

NETWORK CONSIDERATIONS

BLAST searches only local databases. See the NetBLAST entry in the Program Manual for information on how to run BLAST searches remotely.

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

Minimal Syntax: % blast [-INfile1=]pir:mywhp  -Default

Prompted Parameters:

-BEGin=1 -END=153        sets the range of interest in query sequences
[-INfile2=]pir           specifies database(s) to search
-EXPect=10.0             ignores scores that would occur by chance
                           more than 10 times
-LIStsize=500            sets maximum number of sequences listed in the output
[-OUTfile=]mywhp.blastp  names the output file

Local Data Files:

[-DATa2=blast.ldbs]      names the list of available local databases
[-DATa3=blast.sdbs]      names the list of available site-specific databases

Optional Parameters:

-PROCessors=1            sets the number of processors to use
-TBLASTX                 if query and database are both nucleotide,
                           translates both and does protein comparisons
-DBNucleotideonly        searches only nucleic databases
-DBProteinonly           searches only protein databases
-WORdsize=0              sets word size (0 selects program default)
-MATch=1                 sets nucleotide match reward
-MISmatch=-3             sets nucleotide mismatch penalty
-MATRix=blosum62         assigns the scoring matrix for proteins
-GAPweight=0             sets gap creation penalty
-LENgthweight=0          sets gap extension penalty
-HITEXTTHRESHold=0       sets minimum score to extend hits
-NOFILter                suppresses filtering of low complexity segments
                           out of nucleotide and protein query sequences
-TRANSlate=1             names genetic code for translating query
-DBTRANSlate=1           names genetic code for translating database
-EFFdbsize=0             sets effective database size (0 uses real size)
-NOFRAgments             suppresses showing list file entries as fragments
-ALIgnments=250          sets number of sequences for which to show
                           alignments
-VIEW=0                  selects alignment view type (0-8 allowed)
-NOGAPS                  suppresses gapped alignments
-XDRopoff=0              sets X dropoff value for gapped alignments (X2)
-MEGAblast               uses MegaBLAST algorithm for search
-REStorecheckpoint[=mywp.chk] reads checkpoint file and runs PSI-TBLASTN
-LOWercasemask           filters lower case characters in query sequence
-HITWindow=40            sets multiple hits window size (A)
-BESthits=0              sets number of best hits from a region to keep (K)
-HTML                    uses HTML for output format
-NATive                  produces unmodified BLAST2 output
-APPend="string"         appends "string" to pass-through command line
-BATch                   submits program to batch queue
-DBReport                lists valid databases then exits
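When BLAST is run from a script, the prompted parameters above can be assembled into a single command line. The following is a minimal sketch: the function name and its arguments are hypothetical, only parameters listed in the summary are used, and multiple databases are joined with commas as in the -INfile2=PIR,SWPLUS example. It only builds the command string and does not run the program.

```python
def blast_command(query, databases, outfile, expect=10.0, listsize=500):
    """Build a GCG blast command line from the prompted parameters.

    `databases` is a list of database names; all must be of the same type
    (protein or nucleic acid). This helper is illustrative only.
    """
    parts = [
        "blast",
        f"-INfile1={query}",
        f"-INfile2={','.join(databases)}",
        f"-EXPect={expect}",
        f"-LIStsize={listsize}",
        f"-OUTfile={outfile}",
        "-Default",  # accept defaults for everything not given here
    ]
    return " ".join(parts)


print(blast_command("pir:mywhp", ["pir"], "mywhp.blastp"))
print(blast_command("pir:mywhp", ["PIR", "SWPLUS"], "mywhp.blastp", expect=0.001))
```

Building the command as a list of separate options keeps each parameter visible and makes it easy to append optional parameters, such as -NOGAPS or -HTML, before joining.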

CITING BLAST

The original paper describing BLAST is Altschul, Stephen F., Gish, Warren, Miller, Webb, Myers, Eugene W., and Lipman, David J. (1990). Basic local alignment search tool. J. Mol. Biol. 215: 403-410. Gapped BLAST is described in Altschul, Stephen F., Madden, Thomas L., Schaffer, Alejandro A., Zhang, Jinghui, Zhang, Zheng, Miller, Webb, and Lipman, David J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17): 3389-3402.

ACKNOWLEDGEMENT

BLAST was written by Warren Gish, formerly of the National Center for Biotechnology Information (NCBI), in collaboration with Stephen Altschul, Webb Miller, Eugene Myers, David Lipman, and David States. The document you are now reading was written by John Devereux, with some modifications for BLAST2 by Ted Slater.

Blastall (NCBI's implementation of BLAST 2.0) was written for NCBI by Tom Madden. GCG's client for blastall was written by Ted Slater for distribution with the Wisconsin Package Version 10.0; some portions were taken from GCG's original BLAST client written by Scott Rose. The output post-processor for release 10.0 was written by Ron Stewart.

We are extremely grateful to Stephen Altschul and Warren Gish for their careful and original work on BLAST and for their critical comments on GCG's BLAST documentation, and we are very grateful to NCBI for making these programs and services available to the molecular biology community.

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information, see Chapter 4, Using Data Files in the User's Guide.

BLAST reads two files: blast.ldbs (local databases) and blast.sdbs (site-specific databases). Together, these files list the search sets shown in the menu. We update blast.ldbs when we send database updates to your institution. If you have sequences of local interest that you would like to search with BLAST, read the documentation for GCGToBLAST to see how to create local BLAST-searchable databases. Then fetch the file blast.sdbs and add the name of the local search set so that it appears in the menu.
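The lookup order for a data file, as described at the start of this section, can be sketched as follows. The function name and the public directory path are hypothetical. The manual does not say which alternative wins when a file is both named on the command line and present in the working directory; this sketch assumes the command-line name takes precedence.

```python
import os
import tempfile


def resolve_data_file(name, command_line=None, cwd=".",
                      public_dir="/hypothetical/gcg/data"):
    """Return the path a data file such as 'blast.sdbs' would be read from.

    Assumed order: a file named on the command line (for example with
    -DATa3=...), then a file of the same name in the working directory,
    then the copy in the public data directory.
    """
    if command_line:
        return command_line
    local = os.path.join(cwd, name)
    if os.path.isfile(local):
        return local
    return os.path.join(public_dir, name)


with tempfile.TemporaryDirectory() as workdir:
    # No local copy yet: the public file is used.
    print(resolve_data_file("blast.sdbs", cwd=workdir))
    # After a local copy exists, it takes the place of the public file.
    open(os.path.join(workdir, "blast.sdbs"), "w").close()
    print(resolve_data_file("blast.sdbs", cwd=workdir))
```

This is why editing a fetched copy of blast.sdbs in your working directory changes the menu for your own searches without affecting the site-wide file.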

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

Following some of the optional parameters described below is a letter or short expression in parentheses. These are the names of the corresponding parameters at the bottom of your BLAST output.

-EXPect=10.0
-LIStsize=500
-PROCessors=2
-TBLASTX

The search set menu can scroll off your screen if it contains all of the searchable databases supported locally on your computer. The next two parameters can reduce the size of that menu.

-DBNucleotideonly
-DBProteinonly
-WORdsize=0
-MATch=1
-MISmatch=-3
-MATRix=BLOSUM62
-GAPweight=11
-LENgthweight=1
-HITEXTTHRESHold=0
-NOFILter
-TRANSlate=1
-DBTRANSlate=1
-EFFdbsize=0
-NOFRAgments
-ALIgnments=250
-VIEW=0
-NOGAPS
-XDRopoff=0 (X2)
-MEGAblast
-REStorecheckpoint[=mywp.chk]
-LOWercasemask
-HITWindow=40
-BESthits=0
-HTML
-NATive
-APPend="string"
-BATch
-DBReport

The release notes for BLAST 2.0 can be found at

http://www.ncbi.nlm.nih.gov/BLAST/newblast.html.

Printed: January 9, 2002 13:45 (1162)

Technical Support: support-us@accelrys.com
or support-eu@accelrys.com

Licenses and Trademarks: Wisconsin Package is a trademark and GCG and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.