This appendix contains descriptions of the following types of data files used by Wisconsin Package programs:
Restriction Enzymes
Scoring Matrices
Proteolytic Enzymes and Reagents
Protein Analysis Data Files
Transcription Factor Database (TFD)
Codon Frequency Tables
Translation Tables
PROSITE
Profiles
Version 2.0 Profiles
Most Wisconsin Package programs analyze nucleic acid or protein sequences stored in files or in sequence databases. Additionally, many programs require nonsequence information, or data files, which they use to analyze the sequences. For example, the nucleic acid mapping programs require two data files: enzyme.dat, which contains restriction enzyme names and their corresponding recognition sites; and translate.txt, which associates codons with their corresponding amino acids.
All programs that require a data file have a default file they use, so as a new user, you don't need to worry about supplying one. These default files are public data files. Public data files are located in the public directory with the logical name GenRunData and may be accessed by everyone who uses the Package. When you run a program that requires a data file, it automatically finds the appropriate default file in this directory without you having to specify the directory and file name.
The Wisconsin Package also supplies alternative public data files you can have a program use instead of the default. These files are located in the directory with the logical name GenMoreData. There may be times when you want to use an alternative public data file rather than the default file. For example, if you're using the CodonPreference program to analyze a Drosophila sequence, you may want to use the alternative codon frequency table drosophila_high.cod, rather than the default table, ecohigh.cod, which is more appropriate for bacterial sequences.
In each of the following data file descriptions, we provide the names of the default data files used by programs as well as alternative public data files you can specify separately. You will find the following subtopics in each data file's description:
Default data file. You can find all default public data files in the directory with the logical name GenRunData.
Alternative data file. You can find alternative public data files in the directory with the logical name GenMoreData.
You also can create your own data file or personalize a public data file by copying it to your working directory and modifying it. These files are known as local data files. For instance, you could copy the restriction enzyme data file called enzyme.dat to your directory and delete all of the enzymes in it that are not available in your laboratory. Or, let's say you're working with the FindPatterns program and you create a data file of patterns specific to your research. This personal data file, then, would be available only to you.
To view a public data file online, use the TypeData program, for example % typedata enzyme.dat. To copy a public data file to your directory, use the Fetch program, for example, % fetch enzyme.dat. Then, open the file in the text editor of your choice to view or modify the file to your needs. For information on how to use an alternative data file with a program, see Chapter 4, Using Data Files in the User's Guide.
Nucleotide mapping programs read the list of available restriction
enzymes along with their recognition sites, cut positions, and overhangs
from an enzyme data file.
None.
Heading: An enzyme data file consists of an optional documentary
heading. A divider of two adjacent periods (..) separates the heading from
the enzymes.
Name: The first field on each line contains the name of the restriction enzyme; the name should have no more than 132 characters. Only one enzyme should appear per line.
Offset: The name is followed by an offset number, which tells the mapping programs where to cut the top strand when the recognition site is found.
Recognition site: The offset is followed by the enzyme recognition sequence. Nucleic acid recognition sequences, like all nucleotide sequences, are represented in 5' to 3' orientation. The recognition sequences should be shorter than 350 characters. They may contain any IUPAC-IUB alphabetic nucleotide character. See Appendix III of the Program Manual for a complete list of supported sequence symbols.
Nonsequence characters in the recognition site: Mapping programs read the offset and overhang fields to find out where each enzyme actually cuts, but the recognition sequences contain non-sequence characters to help humans see the cut points. An apostrophe (') indicates the cut point on the top strand; an underscore ( _ ) indicates the cut point on the bottom strand (when the enzyme does not leave a blunt end). These apostrophes and underscores are ignored by mapping programs and may therefore be absent.
Overhang: The fourth field in the list of enzymes tells the number of bases (positive or negative) from the cut point on the top strand to the point where the bottom strand is cut. A 0 (zero) would leave a blunt end; a 3 would give a 5' overhang of 3 bases; a -3 would leave a 3' overhang of 3 bases. If the recognition site is a palindrome, the overhang field is ignored. If the overhang field is absent or is a non-numeric character (? or . are most often used), the bottom strand is not searched.
Display of isoschizomers: The public file has a semicolon in front of all but one member of each family of isoschizomers. (Isoschizomers, in this context, are restriction endonucleases with the same recognition sequence.) Mapping programs normally ignore isoschizomers whose names are preceded by a semicolon. These isoschizomers are available if you select them individually by name or if you type ** in response to the enzyme prompt.
Isoschizomers, suppliers, and literature: Any information on the line to the right of an exclamation point (!) is documentary and is ignored by mapping programs. The documentary information on each record of the public file contains the names of other isoschizomers, if any are known, along with the commercial suppliers and literature references for the enzyme. (See "Restriction Enzyme Suppliers" and "Restriction Enzyme Literature" below.)
Format requirements: The exact column for each field on a line does not matter; only the order of the fields is important. Each field should be separated from all other fields on the same line by at least one blank space. Blank lines are tolerated. Most Wisconsin Package programs ignore information to the right of an exclamation mark (!) so you can add comments to the data file.
Asymmetric recognition sequences: If the forward and reverse recognition sites are not the same, then there are two records, one showing the forward and the other the reverse strand. These records must be adjacent to one another in the enzyme file. (See BcgI for an example.) You can give several recognition sites with the same name, but you must put all entries with the same name on adjacent lines of the enzyme data file.
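The field rules above can be sketched as a small reader. This is a minimal illustration, not the code used by the mapping programs. It assumes that the file contains the two-period divider, that the offset is counted from the first base of the recognition site, and that the bottom-strand cut lies at the top-strand cut plus the overhang. The names EnzymeRecord, parse_enzyme_file, and cut_positions are hypothetical, and the sample records are written for this example rather than copied from enzyme.dat.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class EnzymeRecord:
    name: str
    offset: int
    site: str                 # recognition sequence with ' and _ removed
    overhang: Optional[int]   # None when the field is absent or non-numeric
    hidden: bool              # True when the name is preceded by a semicolon
    comment: str              # documentary text to the right of '!'


def parse_enzyme_file(lines: Iterable[str]) -> List[EnzymeRecord]:
    records = []
    in_data = False
    for raw in lines:
        if not in_data:
            # Everything up to and including the '..' line is heading.
            in_data = ".." in raw
            continue
        body, _, comment = raw.partition("!")
        body = body.strip()
        hidden = body.startswith(";")
        fields = body.lstrip(";").split()
        if len(fields) < 3:
            continue  # blank line or comment-only line
        try:
            overhang = int(fields[3])
        except (IndexError, ValueError):
            overhang = None
        site = fields[2].replace("'", "").replace("_", "")
        records.append(EnzymeRecord(fields[0], int(fields[1]), site,
                                    overhang, hidden, comment.strip()))
    return records


def cut_positions(record: EnzymeRecord, site_start: int) -> Tuple[int, Optional[int]]:
    # Assumption: both cuts are measured from the start of the matched site.
    top = site_start + record.offset
    bottom = None if record.overhang is None else top + record.overhang
    return top, bottom


SAMPLE = """\
Illustrative heading, not copied from the public file ..

EcoRI    1 G'AATT_C  4 ! example comment
;HiddenI 3 CCC'GGG   0 ! made-up hidden isoschizomer
NoOvhI   2 GA'TC     ?
"""

records = parse_enzyme_file(SAMPLE.splitlines())
for rec in records:
    print(rec.name, rec.offset, rec.site, rec.overhang, rec.hidden)
```

Because only the order of the fields matters, the sketch splits on whitespace instead of reading fixed columns.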
You can put semicolons in front of the enzymes to which you do not
have access so that they are not displayed when you create restriction
enzyme maps.
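If you keep a list of the enzymes available in your laboratory, you could add the semicolons with a short script instead of editing each line by hand. The sketch below is one possible approach, applied to a local copy of the file. The function name hide_unavailable and the sample lines are hypothetical, and the sketch assumes the file contains the two-period divider.

```python
def hide_unavailable(lines, available):
    """Prefix ';' to every enzyme record whose name is not in `available`."""
    result = []
    in_data = False
    for line in lines:
        if not in_data:
            # Copy the heading unchanged, up to and including the '..' line.
            in_data = ".." in line
            result.append(line)
            continue
        fields = line.split("!", 1)[0].split()
        if fields and not fields[0].startswith(";") and fields[0] not in available:
            line = ";" + line
        result.append(line)
    return result


local_copy = [
    "Local enzyme list (illustrative) ..",
    "",
    "BamHI  1 G'GATC_C 4 !",
    "EcoRI  1 G'AATT_C 4 !",
    ";OldI  2 AA'TT    0 ! already hidden, made-up record",
]
edited = hide_unavailable(local_copy, available={"EcoRI"})
print("\n".join(edited))
```

Records that already begin with a semicolon, blank lines, and comment-only lines are left unchanged.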
Because Wisconsin Package mapping programs using the default data file display only one member of each family of isoschizomers, these programs find all possible recognition sites but not all possible cut points. If you find an enzyme displayed near a point of interest, you might want to examine the enzyme file to see if another cut point is available.
Many of the restriction enzymes displayed by the Wisconsin Package
mapping programs are available commercially. The file enz_sources.txt shows
the main suppliers of restriction enzymes together with the enzymes they
make available. This file is for your information only; it is not read
by any Wisconsin Package program.
You can use Fetch to copy this file to your working directory and then search it with a text editor.
Most of the restriction enzymes displayed by the Wisconsin Package
mapping programs are described in the scientific literature. The citations
for each enzyme are the numbers that appear last on each record of the
enzyme data file enzyme.dat. You can find
these citations in the file enz_refs.txt.
This file is for your information only; it is not read by any Wisconsin
Package program.
You can use Fetch to copy this file to your working directory and then search it with a text editor.
Dr. Richard Roberts at New England Biolabs developed and maintains
REBASE, the restriction enzyme database from which the enzyme data in the
Wisconsin Package are drawn.
Peptide mapping programs read enzyme and reagent names, recognition
patterns, and cut positions from an enzyme data file.
PeptideMap, MapSort,
MapPlot, and PeptideSort.
Program                            Data file
PeptideMap, MapSort, and MapPlot   proenzyme.dat
PeptideSort                        proenzall.dat

Note: proenzall.dat is a more complete list of proteolytic agents, containing several agents that cut at the same place.
None.
Heading: An enzyme data file consists of an optional documentary
heading. A divider of two adjacent periods (..) separates the heading from
the enzymes.
Name: The first field on each line contains the name of the enzyme. Only one enzyme should appear per line.
Offset: The name is followed by an offset number, which tells the mapping programs where to cut the peptide when the recognition pattern is found.
Cleavage site: The offset is followed by the enzyme recognition sequence. Recognition sequences, like all peptide sequences, are represented in amino -> carboxyl orientation. They may contain any standard amino acid character, but no ambiguity characters (B and Z). See Appendix III of the Program Manual for a complete list of supported sequence symbols.
Nonsequence characters in the recognition site: Mapping programs read the offset field to find out where each enzyme actually cleaves, but the recognition sequences contain non-sequence characters to help humans see the cleavage points. An apostrophe (') indicates the cut point. These apostrophes are ignored by mapping programs and may therefore be absent.
Overhang: The fourth field is the overhang which is used in nucleotide restriction enzyme data files. It has no function for proteolytic reagents.
Display of isoschizomers: Mapping programs normally ignore enzymes whose names are preceded by a semicolon (;). These enzymes are available if you select them individually by name or if you type ** in response to the enzyme prompt.
Documentation: Any information on the line to the right of an exclamation point (!) is documentary and is ignored by mapping programs.
Format requirements: The exact column for each field on a line does not matter; only the order of the fields is important. Each field should be separated from all other fields on the same line by at least one blank space. Blank lines are tolerated. Most Wisconsin Package programs ignore information to the right of an exclamation mark (!) so you can use these marks to create comments within the data.
Multiple specificities: You may include more than one occurrence of an enzyme name if the enzyme has more than one specificity. All records with the same name must appear on adjacent lines of the enzyme data file. If you want to distinguish specificities (for instance, trypsin cutting when arginines are blocked), you can create a unique name that distinguishes trypsin cutting at lysine from trypsin cutting at arginine.
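To see how several records with the same name combine, consider the following sketch. It is a simplified illustration that treats each recognition pattern as a literal string and counts the offset from the start of the match. The real pattern syntax in proenzyme.dat may be richer than this, and the function name cleave and the rules shown are hypothetical.

```python
def cleave(peptide, rules):
    """Cut `peptide` at every match of every (pattern, offset) rule.

    Assumption: patterns are literal strings, and the offset is counted
    from the first residue of the match.
    """
    cuts = set()
    for pattern, offset in rules:
        start = peptide.find(pattern)
        while start != -1:
            cut = start + offset
            if 0 < cut < len(peptide):
                cuts.add(cut)
            start = peptide.find(pattern, start + 1)
    fragments = []
    previous = 0
    for cut in sorted(cuts):
        fragments.append(peptide[previous:cut])
        previous = cut
    fragments.append(peptide[previous:])
    return fragments


# Two records sharing one name: cut after lysine and after arginine.
both = [("K", 1), ("R", 1)]
# A separately named record for the case where arginines are blocked.
lysine_only = [("K", 1)]

print(cleave("MKTAYRAKQ", both))
print(cleave("MKTAYRAKQ", lysine_only))
```

Giving the lysine-only rule its own name, as suggested above, lets you select it without also selecting the arginine rule.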
You can put semicolons in front of all the enzymes and reagents
that you do not have access to or that you do not want to use. Wisconsin
Package programs will ignore those enzymes and reagents.
The Wisconsin Package programs PeptideMap, MapSort, and MapPlot search for every point of specific cleavage but not every cleavage pattern. PeptideSort tries to identify each known single-digest cleavage pattern. Send us suggestions for other specificities and cleavage patterns that you think these files should include.
This data file provides a list of the recognition sequences for
eukaryotic sequence-specific transcription factors from the Transcription
Factor Database (TFD).
FindPatterns. (Map, MapSort,
and MapPlot can also read this file.)
None.
Heading: tfsites.dat consists
of an optional documentary heading. A divider of two adjacent periods (..)
separates the heading from the transcription sites.
Name: The first field on each line contains the name of the site; the name should have no more than 132 characters. Only one site should appear per line.
Offset: The name is followed by an offset number, which tells programs where to mark the top strand when the recognition site is found.
Recognition site: The offset is followed by the recognition sequence. Nucleic acid recognition sequences are represented in 5' to 3' orientation. The recognition sequences should be shorter than 350 characters. They may contain any IUPAC-IUB alphabetic nucleotide character. See Appendix III of the Program Manual for a complete list of supported sequence symbols.
Overhang: The fourth field should be set to zero to signal that both strands should be searched.
Display of isoschizomers: The public file has a semicolon (;) in front of frequently found sites. Mapping programs normally do not display sites whose names are preceded by a semicolon. If you want to use any of these sites, use the Fetch program to copy tfsites.dat to your working directory and use a text editor to remove the semicolons you want.
Literature: Any information on the line to the right of an exclamation point (!) is documentary and is ignored by mapping programs. The documentary information on each record of the public file contains a common name as well as a literature reference for the site.
Format requirements: The exact column for each field on a line does not matter; only the order of the fields is important. Each field should be separated from all other fields on the same line by at least one blank space. Blank lines are tolerated. Most Wisconsin Package programs ignore information to the right of an exclamation mark (!) so you can add comments to the data file.
You can use Fetch to copy tfsites.dat
to your working directory and then rename it pattern.dat. FindPatterns
will then read it automatically and use it as the default data file.
Also note that you should always search both strands (FindPatterns does this by default) as most transcription factor sites are strand specific.
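The reason for searching both strands can be shown with a short sketch. It is not the FindPatterns algorithm; it simply expands IUPAC-IUB ambiguity symbols into a regular expression and searches for the site and for its reverse complement. The function names and the example sequence are hypothetical.

```python
import re

# IUPAC-IUB nucleotide symbols expanded to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "H": "[ACT]", "B": "[CGT]",
    "V": "[ACG]", "D": "[AGT]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYMKSWHBVDN", "TGCAYRKMSWDVBHN")


def reverse_complement(site):
    return site.upper().translate(COMPLEMENT)[::-1]


def find_sites(sequence, site):
    """Return sorted (position, strand) pairs; positions are 0-based."""
    sequence = sequence.upper()
    site = site.upper()
    patterns = [("+", site)]
    if reverse_complement(site) != site:
        # A palindromic site would only repeat the top-strand hits.
        patterns.append(("-", reverse_complement(site)))
    hits = []
    for strand, pattern in patterns:
        # The lookahead lets overlapping matches be reported.
        regex = re.compile("(?=" + "".join(IUPAC[c] for c in pattern) + ")")
        hits.extend((m.start(), strand) for m in regex.finditer(sequence))
    return sorted(hits)


print(find_sites("TTGGGCGGAACCGCCCTT", "GGGCGG"))
```

In this example a search of the top strand alone would report only the first of the two occurrences.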
Dr. David Ghosh developed and maintains TFD.
Codon frequency tables reflect the known codon preferences of an
organism.
BackTranslate, CodonPreference,
and Frames.
Heading: A codon frequency table consists of an optional
documentary heading. A divider of two adjacent periods (..) separates
the heading from the table. For example:

AmAcid  Codon  Number  /1000  Fraction ..
Gly     GGG    13.00   1.89   0.02
Codon: The second field contains an unambiguous codon for that amino acid.
Number: The third field lists the number of occurrences of that codon in the genes from which the table is compiled.
/1000: The fourth field lists the expected number of occurrences of that codon per 1,000 codons in genes whose codon usage is identical to that compiled in the codon frequency table.
Fraction: The last field contains the fraction of occurrences of the codon in its synonymous codon family.
Each field of information is separated from every other field by at least one blank space.
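The relationship between the Number, /1000, and Fraction fields can be written out as a short calculation. This is a sketch of the arithmetic described above, not the CodonFrequency program itself. The function name complete_codon_table is hypothetical, the counts are invented for the example, and rounding to two decimal places is an assumption based on the example line.

```python
from collections import defaultdict


def complete_codon_table(counts):
    """counts: iterable of (amino_acid, codon, number) tuples.

    Returns rows of (amino_acid, codon, number, per_thousand, fraction),
    ordered by amino acid and then by codon.
    """
    counts = list(counts)
    total = sum(number for _, _, number in counts)
    family_totals = defaultdict(float)
    for amino_acid, _, number in counts:
        family_totals[amino_acid] += number
    rows = []
    for amino_acid, codon, number in sorted(counts, key=lambda r: (r[0], r[1])):
        per_thousand = 1000.0 * number / total        # occurrences per 1,000 codons
        fraction = number / family_totals[amino_acid]  # share within synonymous family
        rows.append((amino_acid, codon, number,
                     round(per_thousand, 2), round(fraction, 2)))
    return rows


# Invented counts, for illustration only.
table = complete_codon_table([
    ("Gly", "GGG", 10),
    ("Gly", "GGC", 30),
    ("Ala", "GCT", 60),
])
for row in table:
    print("%-4s %-4s %8.2f %8.2f %6.2f" % row)
```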
You can use the CodonFrequency
program to create a codon frequency table from a set of input nucleotide
sequences and/or existing codon frequency tables. You also can create or
modify a codon frequency table with a text editor. If you choose to use
a text editor, you need provide only the first three fields of information
on each line of the table. The lines can be in any order; only codons whose
use is greater than zero need be present. You should then generate the
complete codon usage table -- five fields of information, one line for
each codon, and all lines ordered by amino acid -- by using the table you
created as the input to the CodonFrequency
program.
Translation tables are used by Wisconsin Package programs for three
purposes:
BackTranslate, CodonFrequency,
CodonPreference, Diverge,
Frames, Map, MapPlot,
MapSort, PepData, Publish,
Reformat, and Translate.
Data file             Function
transmitodros.txt     drosophila mitochondrial translation table
transl_table_02.txt   vertebrate mitochondrial translation table
transl_table_03.txt   yeast mitochondrial translation table
transl_table_04.txt   mold, protozoan, and coelenterate mitochondrial and mycoplasma/spiroplasma translation table
transl_table_05.txt   invertebrate mitochondrial translation table
transl_table_06.txt   ciliate, dasycladacean, and hexamita translation table
transl_table_09.txt   echinoderm mitochondrial translation table
transl_table_10.txt   euplotid translation table
transl_table_11.txt   bacterial translation table
transl_table_12.txt   alternative yeast translation table
transl_table_13.txt   ascidian mitochondrial translation table
transl_table_14.txt   flatworm mitochondrial translation table
transl_table_15.txt   blepharisma mitochondrial translation table
transl_table_16.txt   chlorophycean mitochondrial translation table
transl_table_21.txt   trematode mitochondrial translation table
transl_table_22.txt   scenedesmus obliquus mitochondrial translation table
transl_table_23.txt   thraustochytrium aureum mitochondrial translation table

To specify an alternative translation data file, add the parameter -TRANSlate=filename.txt on the command line.
Heading: A translation table consists of an optional documentary
heading. A divider of two adjacent periods (..) separates the heading
from the table. For example
3-letter: The second field is the three-letter amino acid code for that sequence symbol.
Codons: The third field must contain a list of all unambiguous codons for the amino acid; this list must come before the exclamation point (!).
!IUPAC: In the fourth field, the exclamation point delimits where the unambiguous codons end and where the ambiguous codons start. The ambiguous codons are provided for documentary purposes only and are completely ignored by Wisconsin Package programs. Each field is separated from every other field by at least one blank space. Any of the 31 GCG sequence symbols (see Appendix III of the Program Manual) may be associated with a three-letter code and one or more unambiguous codons. Each codon and each sequence symbol may be used only once.
Potential start codons are written only in lowercase letters. Stop
codons are translated as the asterisk (*) symbol.
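The field layout described above can be read with a few lines of code. The sketch below is an illustration only. The sample table is invented for this example and is not a copy of translate.txt, the function names are hypothetical, and the sketch assumes that a start codon is marked by writing it entirely in lowercase and that untabulated codons translate as X.

```python
def parse_translation_table(lines):
    """Return (codon_to_symbol, start_codons) from translation table lines."""
    codon_to_symbol = {}
    start_codons = set()
    in_data = False
    for raw in lines:
        if not in_data:
            in_data = ".." in raw   # heading ends at the divider
            continue
        # Ambiguous codons to the right of '!' are documentary only.
        fields = raw.split("!", 1)[0].split()
        if len(fields) < 3:
            continue
        symbol, codons = fields[0], fields[2:]   # fields[1] is the 3-letter code
        for codon in codons:
            key = codon.upper()
            if key in codon_to_symbol:
                raise ValueError("codon %s is listed more than once" % key)
            codon_to_symbol[key] = symbol
            if codon.islower():
                start_codons.add(key)
    return codon_to_symbol, start_codons


def translate(dna, codon_to_symbol, unknown="X"):
    dna = dna.upper()
    return "".join(codon_to_symbol.get(dna[i:i + 3], unknown)
                   for i in range(0, len(dna) - 2, 3))


SAMPLE_TABLE = """\
Symbol 3-letter Codons !IUPAC ..

M  Met  atg              !ATG
A  Ala  GCT GCC GCA GCG  !GCN
*  ***  TAA TAG TGA      !TAR TRA
"""

codes, starts = parse_translation_table(SAMPLE_TABLE.splitlines())
print(translate("ATGGCTTAA", codes))
```

The check for repeated codons reflects the rule that each codon may be used only once.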
(formerly Symbol Comparison Tables)
Many sequence comparison programs make comparisons between pairs
of sequence symbols by looking up a value in a scoring matrix. The matrix
assigns an integer value for the match quality of every possible pair of
symbols. If you are comparing nucleotides, the matrix might contain 1's
for matching symbols and 0's (zeros) for mismatching symbols. However,
if you are comparing amino acids, a number could be assigned that is based
on chemical similarity or evolutionary distance. The number might be negative
if two residues were very dissimilar.
BestFit, Compare,
FastA, FrameAlign,
FrameSearch, Gap,
GapShow, GelMerge,
PileUp, PlotSimilarity, Pretty,
Prime, ProfileMake,
Repeat, Segments, StemLoop,
TFastA, and the Consensus
operation (in the Edit menu) in the Editor mode of SeqLab.
For nucleotides:
Program          Default data file
BestFit          swgapdna.cmp
Compare          compardna.cmp
FastA            fastadna.cmp
Gap              nwsgapdna.cmp
GapShow          swgapdna.cmp or nwsgapdna.cmp
GelMerge         gelmergedna.cmp and gelmergelocaldna.cmp
PileUp           pileupdna.cmp
PlotSimilarity   plotsimdna.cmp
Pretty           prettydna.cmp
Prime            prime.cmp
ProfileMake      profiledna.cmp
Repeat           repeatdna.cmp
Segments         segdna.cmp
StemLoop         stemloop.cmp
All analysis programs, except FastA and
TFastA, use blosum62.cmp
as the default data file. FastA and TFastA
use blosum50.cmp. The Consensus
operation (in the Edit menu) in the Editor mode of SeqLab
uses identpep.cmp.
To specify an alternative scoring matrix file, add the parameter
-MATRix=filename.txt on the command line.
By default, Segments creates local
alignments, analogous to those created by BestFit.
You can direct Segments to create global
alignments, analogous to those created by Gap, by
using the command-line parameter -WHOle. Segments
then uses the scoring matrix seggapdna.cmp,
containing no negative values for mismatches.
ProfileGap and ProfileSegments can be directed to create global alignments by using the command-line parameter -GLObal. If you want to create global alignments using these programs, you might want to create the profile in ProfileMake using the alternative scoring matrix profilegapdna.cmp.
This matrix is most appropriate for programs creating local
alignments (BestFit, Segments, ProfileGap,
and ProfileSegments). Since all mismatches
between IUPAC-IUB nucleotide symbols are given a value of -3 and all matches
are given a value of +10, local alignments created using this matrix
will be extended further than those created with any of the default scoring
matrices for these programs.
To specify an alternative scoring matrix file, add the parameter
-MATRix=filename.txt on the command line.
The Wisconsin Package provides a set of BLOSUM matrices for the
comparison of peptide sequences, derived from substitutions observed in
more than 2,000 blocks of aligned sequences (Henikoff, S. and Henikoff,
J. G. (1992). Amino acid substitution matrices from protein blocks.
Proceedings of the National Academy of Sciences USA 89; 10915-10919).
These matrices are provided as alternative peptide scoring matrices in
the files blosum30.cmp, blosum35.cmp, blosum40.cmp,
blosum45.cmp, blosum55.cmp,
blosum60.cmp, blosum65.cmp,
blosum70.cmp, blosum75.cmp,
blosum80.cmp, blosum85.cmp,
blosum90.cmp, and blosum100.cmp.
To complete this set, blosum50.cmp and
blosum62.cmp are also provided as the
default scoring matrices for some analysis programs in the Wisconsin Package.
These matrices are the log odds form of the mutation data matrix
for 120 PAMs and 250 PAMs (Dayhoff, M. O., Schwartz, R. M., and Orcutt,
B. C. [1979] in Atlas of Protein Sequence and Structure, and Dayhoff, M.
O. Ed, pp. 345-352 (Figure 84), National Biomedical Research Foundation,
Washington D.C., respectively).
This matrix, described by Risler, et al. (Journal of Molecular Biology
204; 1019-1029), is derived from an analysis of amino acid substitutions
after superposition of homologous protein structures. To construct this
matrix the authors converted only substitutions whose alpha carbon atoms
are very close to one another after superposition of the structures. Based
on results from test alignments using Gap and BestFit,
the authors suggest that this scoring matrix may prove superior to others
in finding weak similarities in distantly related proteins.
An alternative peptide scoring matrix in the file oldpep.cmp
can be provided to Wisconsin Package programs as a local data file. This
matrix was derived from the default peptide scoring matrix in Version 8
of the Wisconsin Package. Each value in the Version 8 matrix of floating
point values was multiplied by 10 and rounded to the nearest integer to
determine the comparison values in oldpep.cmp.
Perfect matches in oldpep.cmp have a
comparison value of 15, and no matches in the matrix have a higher value
than perfect matches.
A scoring matrix file consists of a documentary heading, a dividing
line with two adjacent periods (..), an optional auxiliary data block that
specifies the default gap creation and extension penalties associated with
the scoring matrix, and the matrix itself. GCG nucleotide and amino acid
symbols are described in Appendix III of
the Program Manual.
Wisconsin Package programs can use two different types of scoring matrices: BLAST format and GCG format.
BLAST-format scoring matrices
Rectangular scoring matrices. The rectangular form organizes
the sequence symbols along an x axis (columns) and y axis (rows), where
each symbol along the x axis is compared with each symbol along the y axis.
There is a row and column for every sequence symbol that has at least one
non-zero comparison value. The value of each pair of symbols compared is
placed at the intersection of the appropriate row and column. All relationships
that are not explicitly defined in the matrix are assigned a value of 0.
Every comparison value is separated from every other value by at least
one blank space. Blank lines are tolerated.
Consider the example below:
    A   B   C   D   E   F   G   H  ...
A   4  -2   0  -2  -1  -2   0  -2
B  -2   6  -3   6   2  -3  -1  -1
C   0  -3   9  -3  -4  -2  -3  -3
D  -2   6  -3   6   2  -3  -1  -1
E  -1   2  -4   2   5  -3  -2   0
F  -2  -3  -2  -3  -3   6  -3  -1
G   0  -1  -3  -1  -2  -3   6  -2
H  -2  -1  -3  -1   0  -1  -2   8
...
Notice that the values are identical at the C-D comparison and at
the D-C comparison: -3. Previous versions of the Package supported only
triangular forms of scoring matrices to eliminate this repetition. However,
to make publicly available scoring matrices, which are in a rectangular
format, easier to use, the Wisconsin Package now supports only rectangular-format
scoring matrices. See "Converting Scoring Matrices" later in this section
for converting pre-Version 9 scoring matrices to the new format.
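A rectangular matrix maps naturally onto a lookup table keyed by pairs of symbols. The sketch below reads only the matrix body, that is, the line of column symbols and the rows that follow it. It is an illustration rather than the reader used by the Package, and the function names are hypothetical. The values are the first four rows and columns of the example above.

```python
def parse_rectangular(lines):
    """Return {(row_symbol, column_symbol): value} from the matrix body."""
    columns = None
    scores = {}
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue                    # blank lines are tolerated
        if columns is None:
            columns = fields            # first non-blank line lists the symbols
            continue
        row, values = fields[0], fields[1:]
        for column, value in zip(columns, values):
            scores[(row, column)] = int(value)
    return scores


def pair_score(scores, a, b):
    # Relationships not explicitly defined in the matrix are assigned 0.
    return scores.get((a, b), 0)


EXCERPT = """\
    A   B   C   D
A   4  -2   0  -2
B  -2   6  -3   6
C   0  -3   9  -3
D  -2   6  -3   6
"""

matrix = parse_rectangular(EXCERPT.splitlines())
print(pair_score(matrix, "C", "D"), pair_score(matrix, "D", "C"))
```

The two lookups return the same value, which is the repetition that the older triangular form avoided.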
Equals-form scoring matrices. The second form of GCG-format scoring
matrix supported is "equals" form, so named because within the matrix,
each pairwise comparison equals a value. For instance, in the example below,
an A-A symbol comparison is assigned, or equals, a value of 4.
AA= 4 AB= -2 AD= -2 AE= -1 AF= -2 AH= -2 AI= -1 AK= -1 AL= -1 AM= -1 AN= -2 AP= -1 AQ= -1 AR= -1 AS= 1 AW= -3 AX= -1 AY= -2 AZ= -1 BB= 6 BC= -1 BD= 6 BE= 2 BF= -3 BG= -1 ...
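The "equals" entries can be read with a regular expression. The sketch below is an illustration, not the Reformat program. It assumes that each entry is two symbols, an equals sign, and an integer, and that each pair is listed in one direction only, as in the example above, so the value applies to both orderings. The function names are hypothetical.

```python
import re

# Two sequence symbols, '=', optional blanks, then a signed integer.
PAIR = re.compile(r"([A-Za-z*])([A-Za-z*])=\s*(-?\d+)")


def parse_equals(text):
    """Return {(symbol, symbol): value}, filled in for both orderings."""
    scores = {}
    for a, b, value in PAIR.findall(text):
        scores[(a, b)] = int(value)
        scores[(b, a)] = int(value)
    return scores


def to_equals(scores):
    """Write each unordered pair once, in alphabetical order."""
    pairs = sorted((a, b) for (a, b) in scores if a <= b)
    return " ".join("%s%s= %d" % (a, b, scores[(a, b)]) for a, b in pairs)


entries = parse_equals("AA= 4 AB= -2 AD= -2 AE= -1 BB= 6")
print(entries[("B", "A")])
print(to_equals(entries))
```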
You can specify gap creation and gap extension penalties within
a scoring matrix to ensure that programs reading the scoring matrix use
those values as defaults. If you do not specify these penalties, the program
calculates reasonable defaults based on the values in the matrix.
Gap creation and gap extension penalties must
follow a specific format within a scoring matrix. These penalties must
appear in an auxiliary data block, which appears after the dividing line
with the two adjacent periods (..) and before the line of sequence symbols
in the scoring matrix, as shown below:
..
{
GAP_CREATE 12
GAP_EXTEND 4
}
   A  B  C  D  E  F  G  H ...
Note that even though gap creation and extension penalties may be
set within a scoring matrix, you can override them on the command line.
To do so, use the parameters -GAPweight and -LENgthweight on the command
line when you run a program that uses scoring matrices.
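The order of precedence between the auxiliary data block and the command line can be sketched as follows. This is an illustration under stated assumptions: the block is taken to be the first pair of braces after the two-period divider, the penalties are taken to be integers, and the function names are hypothetical. When neither source supplies a value the sketch returns None, at which point a program would calculate its own default.

```python
import re


def read_gap_penalties(matrix_text):
    """Return (gap_create, gap_extend); a value is None if not specified."""
    _, divider, after = matrix_text.partition("..")
    if not divider:
        return None, None
    block = re.search(r"\{(.*?)\}", after, re.DOTALL)
    if block is None:
        return None, None
    tokens = block.group(1).split()
    values = dict(zip(tokens[0::2], tokens[1::2]))   # keyword/value pairs
    create = values.get("GAP_CREATE")
    extend = values.get("GAP_EXTEND")
    return (int(create) if create is not None else None,
            int(extend) if extend is not None else None)


def effective_penalties(matrix_text, gapweight=None, lengthweight=None):
    """Command-line values, when given, override those in the matrix file."""
    create, extend = read_gap_penalties(matrix_text)
    return (gapweight if gapweight is not None else create,
            lengthweight if lengthweight is not None else extend)


SAMPLE_MATRIX = """\
Illustrative heading ..
{
GAP_CREATE 12
GAP_EXTEND 4
}
   A  B  C  D
"""

print(effective_penalties(SAMPLE_MATRIX))
print(effective_penalties(SAMPLE_MATRIX, gapweight=8))
```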
Use the CompTable program to create
scoring matrices. You also can use a text editor to create a scoring matrix;
if you do so, use the Reformat program with
the command-line parameter -COMparison to rewrite the file into GCG format.
Both CompTable and Reformat
round the values in the matrix to the nearest integer.
Several programs may use the same default scoring matrix. However,
although the matrices may be identical, the default matrix for each program
is contained in a separate file. This allows you to modify a local version
of the matrix for one program without affecting the matrix used by another
program.
If you make modifications to a matrix, use the Reformat program with the command-line parameter -COMparison to rewrite your scoring matrix data file into GCG format.
In Version 9 all scoring matrices provided with the Package in GenRunData
and GenMoreData are already converted to the new format. However, you must
convert all of the scoring matrices in your personal directories, including
your personal directory with the logical name MyData, to the new rectangular
format. When you do so, you will need to specify the scoring matrix as
either nucleotide or protein. Wisconsin Package programs will not accept
pre-Version 9 scoring matrices, and they will display the following error
message if you try to use one:
*** ERROR, READSCOREMAT cannot read the scoring matrix in the file "filename"! If this is a scoring matrix created before Version 9, try converting it with "% reformat /OLDCMPformat /PROtein" or "% reformat /OLDCMPformat /NUCleotide"
% Reformat -OLDCMPformat -NUCleotide scoring_matrix
or
% Reformat -OLDCMPformat -PROtein scoring_matrix
Wisconsin Package programs can accept two forms of GCG-format scoring
matrix files: rectangular and "equals." There is no difference in analysis
or performance between the forms. However, some people find "equals" format
easier to read, and the Package provides a way to convert between the two
forms.
To convert rectangular scoring matrices to the more readable "equals" format, type
% Reformat -COMParison -EQUALSformat scoring_matrix
To convert "equals" format scoring matrices to rectangular format, type
% Reformat -COMParison scoring_matrix
The Wisconsin Package also works with native BLAST-formatted scoring
matrices. Although converting BLAST-formatted scoring matrices to GCG-format
is unnecessary, you may find it useful to do so. GCG-formatted scoring
matrices allow you to specify gap creation and extension penalties within
the scoring matrix file.
To convert BLAST-formatted scoring matrices to GCG-format, type
% Reformat -COMParison -NUCleotide scoring_matrix
or
% Reformat -COMParison -PROtein scoring_matrix
These data files enable programs to locate motifs in protein sequences
and to make predictions about peptide isolation, secondary structure, hydrophobicity,
and antigenicity.
PeptideSort, Isoelectric,
PepPlot, HelicalWheel,
CoilScan, SPScan,
and HTHScan.
Program        Default data file   Function
PeptideSort    aminoacid.dat       amino acid residue properties
               extinctcoef.dat     extinction coefficients for amino acids
               isoelectric.dat     residue-specific pK values for the prediction of a peptide's isoelectric point
Isoelectric    isoelectric.dat     residue-specific pK values for the prediction of a peptide's isoelectric point
PepPlot        pepplot.dat         residue-specific values for the prediction of protein secondary structure, hydrophobicity, and helical hydrophobic moment
               ges.dat             residue-specific values for identifying nonpolar transbilayer helices
               garnier.dat         residue-specific values for secondary structure prediction using the method of Garnier
HelicalWheel   helicalwheel.dat    residue-specific attributes for the display of a peptide sequence as a helical wheel
CoilScan       mtidkcoils.dat      weight matrix of amino acid coiled-coil propensities
SPScan         speuk.dat           weight matrix for eukaryotic signal peptides
               spgpos.dat          weight matrix for Gram-positive bacterial signal peptides
               spgneg.dat          weight matrix for Gram-negative bacterial signal peptides
HTHScan        htharac.dat         weight matrix for AraC family H-T-Hs
               hthlysr.dat         weight matrix for LysR family H-T-Hs
               hthhomeobox.dat     weight matrix for Homeobox family H-T-Hs
CoilScan mtkcoils.dat
All data files consist of an optional documentary heading, a dividing
line with two adjacent periods (..), and the data. The exact column for
each field on a line does not matter; only the order of the fields is important.
Each field should be separated from all other fields on the same line by
at least one blank space.
You can search protein sequences for motifs that are represented
in the PROSITE Dictionary of Protein Sites and Patterns.
None.
The format of Wisconsin Package pattern files is described in the
documentation for programs that use these files.
The exact column for each field on a line does not matter; only the order of the fields is important. Each field should be separated from the other fields on the same line by at least one blank space. Blank lines are tolerated. Most Wisconsin Package programs ignore information to the right of an exclamation mark (!), so you can use these marks to create comments within the data. You cannot edit prosite.patterns unless your text editor can handle very large records.
Heading: This data file has an optional documentary heading, followed by a dividing line with two adjacent periods (..).
Name: The first field on each line contains the name of the motif; the name should have no more than 132 characters. Motifs prefixed by a semicolon (;) are short patterns that are expected to occur in most protein sequences by chance alone. Such frequently found patterns are not displayed by the Motifs program unless you run Motifs with the command-line parameter -FREquent. Only one motif should appear per line.
Offset: The name is followed by an offset number, which tells Motifs where to mark the sequence when the motif expression is found.
Pattern: Patterns should be shorter than 350 characters. They may contain any alphabetic amino acid character. See Appendix III of the Program Manual for a complete list of supported sequence symbols.
For a complete description of the syntax in which motifs are represented, see the topic DEFINING PATTERNS in Motifs in the Program Manual.
Note that some motifs require multiple patterns to identify them. If this is so, these patterns will have the same name and must appear on adjacent lines.
PDoc: The fourth field gives the name of the PROSITE abstract for the pattern. You can copy the abstract to your directory with the Fetch command, or you can display it with the TypeData command.
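The field order described above (Name, Offset, Pattern, PDoc) can be illustrated with a small Python sketch. It is not the parser that Motifs uses. The names MotifLine and parse_pattern_line are hypothetical, the sample line is invented, and the sketch assumes that the offset is an integer, that a pattern contains no blank spaces, and that the PDoc field may be absent.

```python
from typing import NamedTuple, Optional


class MotifLine(NamedTuple):
    name: str
    offset: int
    pattern: str
    pdoc: Optional[str]
    frequent: bool  # True when the line was prefixed by a semicolon


def parse_pattern_line(line):
    """Parse one data line; return None for blank or comment-only lines."""
    # Information to the right of an exclamation mark is a comment.
    text = line.split("!", 1)[0].strip()
    if not text:
        return None
    # A leading semicolon marks a frequently found pattern.
    frequent = text.startswith(";")
    fields = text.lstrip(";").split()
    if len(fields) < 3:
        raise ValueError("expected at least Name, Offset and Pattern: %r" % line)
    name, offset, pattern = fields[0], int(fields[1]), fields[2]
    pdoc = fields[3] if len(fields) > 3 else None
    return MotifLine(name, offset, pattern, pdoc, frequent)


# Invented line, for illustration only; not taken from prosite.patterns.
record = parse_pattern_line(";DemoMotif  1  NGS  PDOC99999  ! a comment")
```

Lines that share a name and sit next to each other can then be grouped into one motif, for example with itertools.groupby keyed on the name field.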
prosite.seqcat contains a short
description of each motif in prosite.patterns.
Use the Fetch command to copy the prosite.seqcat
file to your directory or use the TypeData command to view the file online.
The use of Motifs is so straightforward that there are few occasions when you will need to modify this file.
Dr. Amos Bairoch of the University of Geneva publishes and maintains
the PROSITE Dictionary of Protein Sites and Patterns. PROSITE is
distributed by the European Bioinformatics Institute in Cambridge, England.
This database contains validated profiles derived from the motifs
in the PROSITE Dictionary of Protein Sites and Patterns.
Heading: The optional heading documents the contents of each
column. A divider of two adjacent periods (..) separates the heading from
the profiles.
Name: The first column contains the location and name of each profile (see SUGGESTIONS below). These names correspond to the names of the patterns in the prosite.patterns file. The profile name must contain fewer than 255 characters.
High and Intrst: By default, ProfileScan reports only alignments with normalized scores greater than the HIGH value. If you add the -INTEResting parameter to the command line, ProfileScan will report alignments that score higher than the INTRST value.
Gap and Len: These values specify, respectively, the gap creation and extension penalties used to align the motif profile to the query sequence.
A, B, C, AVE, and SD: These values specify the parameters for length-dependent normalization of the alignment scores. See ProfileSearch in the Program Manual for a description of the derivation of these values and their use in normalizing the alignment scores.
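The reporting rule tied to the High and Intrst values above comes down to choosing one of two thresholds. The following Python sketch restates that rule only. The function name is_reported and the numeric values are invented, and it does not reproduce how ProfileScan computes or normalizes scores.

```python
def is_reported(normalized_score, high, intrst, interesting=False):
    """Return True if an alignment would be reported under the rule above.

    By default the score must exceed the HIGH value; with the
    -INTEResting parameter the INTRST value is used instead.
    """
    threshold = intrst if interesting else high
    return normalized_score > threshold


# Invented thresholds and score, for illustration only.
default_run = is_reported(7.0, high=8.5, intrst=6.5)
interesting_run = is_reported(7.0, high=8.5, intrst=6.5, interesting=True)
```

With these invented numbers the same alignment is dropped in a default run and kept when -INTEResting is given.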
Individual profile files are maintained in the directory with the
logical name ProfileDir. To view a profile's documentation, use the Fetch
command to copy a profile file to your directory (for example,
% fetch apple.prf), or use the TypeData command to view the file online.
Dr. Michael Gribskov of the San Diego Supercomputing Center prepared
and validated these profiles. Dr. Amos Bairoch of the University of Geneva
publishes and maintains the PROSITE Dictionary of Protein Sites and
Patterns.
This database contains validated profiles derived from the motifs
in the PROSITE Dictionary of Protein Sites and Patterns. Profiles are a
special kind of scoring matrix used by several different programs. The
addition of MEME and MotifSearch
to the Wisconsin Package required the introduction of a new format of profile
that allows multiple profiles to be kept in one file.
MEME generates version 2.0 profiles, while
MotifSearch is intended to process them.
ProfileSearch, ProfileGap and ProfileSegments
can all read ONLY THE FIRST profile from a version 2.0 file.
Not Applicable
Not Applicable
Heading: The file should begin with a line containing either
"!!AAPROFILE 2.0" or "!!NAPROFILE 2.0". Thereafter, you may include any
information you like, concluding the heading section with a divider of
two adjacent periods (..).
Auxiliary Data Block: The ADB begins with a line having nothing but a "{", and ends with a line having
nothing but a "}".
The ADB must contain four parsable data lines. The first gives the
length of the profile (sometimes thought of as its width), in the form
"Length: <value>". The next two lines control the gap creation and extension
penalties for the profile, and the fourth gives the labels of the columns
used in the profiles. The column labels should be separated by blank spaces.
The first label should always be "Cons" (for Consensus),
and it should appear at the beginning of the line, with no indentation.
Here is an example of a simple ADB, with some of the column labels replaced by an ellipsis:
{
Length: 9
Gap: 1.00 Len: 1.00
GapRatio: 0.0 LenRatio: 0.0
Cons A C D E F . . . W Y Gap Len
}
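As a rough illustration of the ADB rules above, the following Python sketch reads the block from a list of lines. It is not the reader used by MEME or MotifSearch. The function name parse_adb is hypothetical, and the sketch assumes that the braces sit on lines of their own, that every parameter is written as "Key: value", and that the label line starts with "Cons" at the left margin. The example uses a toy four-symbol alphabet rather than the full set of column labels.

```python
import re


def parse_adb(lines):
    """Parse an Auxiliary Data Block given as a list of text lines."""
    it = iter(lines)
    # Skip everything up to the line holding nothing but "{".
    for line in it:
        if line.strip() == "{":
            break
    else:
        raise ValueError("no line containing only '{' was found")

    params, labels = {}, None
    for line in it:
        if line.strip() == "}":
            break
        if line.startswith("Cons"):
            # Column labels are separated by blank spaces.
            labels = line.split()
        else:
            # Collect every "Key: value" pair on the line (assumed form).
            for key, value in re.findall(r"(\w+):\s*(\S+)", line):
                params[key] = float(value)
    else:
        raise ValueError("no line containing only '}' was found")

    if labels is None or "Length" not in params:
        raise ValueError("ADB needs a Length line and a Cons label line")
    length = int(params.pop("Length"))
    return {"length": length, "params": params, "labels": labels}


# Toy block with a shortened, hypothetical alphabet.
adb = parse_adb([
    "{",
    "Length: 9",
    "Gap: 1.00 Len: 1.00",
    "GapRatio: 0.0 LenRatio: 0.0",
    "Cons A C D E Gap Len",
    "}",
])
```

The returned dictionary keeps the profile length apart from the penalty parameters, so a later step can check the number of profile rows against it.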
Profile: The profile itself is made up of rows of log-odds
values, with each row corresponding to a position in the profile and (with
three exceptions) each column corresponding to a valid symbol for that
position. The exceptions are the first column (which contains a letter
identifying the consensus symbol for the row) and the last two columns,
which give the multiplying factor for the gap creation and extension penalties
for the row. (Note that MEME's output profiles are always ungapped, and
thus always have 100, the maximum value, in the last two columns.)
The last row in a profile does NOT correspond to a position in the profile.
Instead it contains counts for the number of appearances of each letter
at any position in the sequences from which the profile was derived. This
information is not used by any programs at this time, but it nonetheless
must be there. Note that this dummy row is NOT included in the Length count
given in the Auxiliary Data Block.
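The row layout described above can be sketched in Python as follows. This is an illustration only. The function name parse_profile_rows is hypothetical, the values are invented, and the sketch assumes that the numbers can be read as floats and that blank lines can be skipped. Because the exact layout of the final counts row is not spelled out here, the sketch returns that row as unparsed fields.

```python
def parse_profile_rows(row_lines, labels, length):
    """Split profile rows into per-position records plus the final counts row.

    labels is the ADB column-label list: "Cons", then the symbols, then the
    two gap columns. length is the ADB Length value, which does not count
    the trailing counts row.
    """
    rows = [line.split() for line in row_lines if line.strip()]
    if len(rows) != length + 1:
        raise ValueError("expected %d rows, found %d" % (length + 1, len(rows)))

    symbols = labels[1:-2]  # drop "Cons" and the two gap columns
    positions = []
    for fields in rows[:length]:
        if len(fields) != len(labels):
            raise ValueError("row has %d fields, expected %d"
                             % (len(fields), len(labels)))
        values = [float(x) for x in fields[1:]]
        positions.append({
            "cons": fields[0],                          # consensus symbol
            "scores": dict(zip(symbols, values[:-2])),  # log-odds per symbol
            "gap": values[-2],  # multiplier for the gap creation penalty
            "len": values[-1],  # multiplier for the gap extension penalty
        })
    counts_row = rows[length]  # left unparsed; its layout is an assumption
    return positions, counts_row


# Invented two-position profile over a toy alphabet of A and C.
positions, counts_row = parse_profile_rows(
    ["A   5  -3  100 100", "C  -2   7  100 100", "*  10  12  100 100"],
    labels=["Cons", "A", "C", "Gap", "Len"],
    length=2,
)
```

Checking the row count against Length + 1 catches files in which the counts row has been left out.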
Technical Support: support-us@accelrys.com
or support-eu@accelrys.com
Copyright (c) 1982-2002 Accelrys Inc. A subsidiary of Pharmacopeia, Inc. All rights reserved.
Licenses and Trademarks Wisconsin Package is a trademark and GCG and the GCG logo are registered trademarks of Accelrys Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.