APPENDIX VII

Table of Contents

Data Files

Overview

Viewing or Modifying Data Files

RESTRICTION ENZYMES

PROTEOLYTIC ENZYMES AND REAGENTS

TRANSCRIPTION FACTOR DATABASE (TFD)

CODON FREQUENCY TABLES

TRANSLATION TABLES

SCORING MATRICES

PROTEIN ANALYSIS DATA FILES

PROSITE

PROFILES

VERSION 2.0 PROFILES

Data Files

This appendix contains descriptions of the following types of data files used by Wisconsin Package programs:

    Restriction Enzymes                     Scoring Matrices
    Proteolytic Enzymes and Reagents        Protein Analysis Data Files
    Transcription Factor Database (TFD)     Codon Frequency Tables
    Translation Tables                      PROSITE
    Profiles                                Version 2.0 Profiles

Overview

Most Wisconsin Package programs analyze nucleic acid or protein sequences stored in files or in sequence databases. Additionally, many programs require nonsequence information, or data files, which they use to analyze the sequences. For example, the nucleic acid mapping programs require two data files: enzyme.dat, which contains restriction enzyme names and their corresponding recognition sites; and translate.txt, which associates codons with their corresponding amino acids.

All programs that require a data file have a default file they use, so as a new user, you don't need to worry about supplying one. These default files are public data files. Public data files are located in the public directory with the logical name GenRunData and may be accessed by everyone who uses the Package. When you run a program that requires a data file, it automatically finds the appropriate default file in this directory without you having to specify the directory and file name.

The Wisconsin Package also supplies alternative public data files you can have a program use instead of the default. These files are located in the directory with the logical name GenMoreData. There may be times when you want to use an alternative public data file rather than the default file. For example, if you're using the CodonPreference program to analyze a Drosophila sequence, you may want to use the alternative codon frequency table drosophila_high.cod, rather than the default table, ecohigh.cod, which is more appropriate for bacterial sequences.

In each of the following data file descriptions, we provide the names of the default data files used by programs as well as alternative public data files you can specify separately. You will find the following subtopics in each data file's description:

Default data file. You can find all default public data files in the directory with the logical name GenRunData.

Alternative data file. You can find alternative public data files in the directory with the logical name GenMoreData.

You also can create your own data file or personalize a public data file by copying it to your working directory and modifying it. These files are known as local data files. For instance, you could copy the restriction enzyme data file called enzyme.dat to your directory and delete all of the enzymes in it that are not available in your laboratory. Or, let's say you're working with the FindPatterns program and you create a data file of patterns specific to your research. This personal data file, then, would be available only to you.
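The idea of a local copy standing in for a public file can be pictured as a lookup over an ordered list of directories. The following Python sketch is an illustration only, not the Package's own lookup code. It assumes that a copy in your working directory takes precedence over the public one, which the text above implies but does not state. The function name `find_data_file` and the directory paths are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_data_file(name: str, search_dirs: Iterable[str]) -> Optional[Path]:
    """Return the first existing file called `name` in `search_dirs`, or None."""
    for directory in search_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


# Hypothetical paths: "." is your working directory, and the second entry
# stands in for wherever the logical name GenRunData points at your site.
SEARCH_ORDER = [".", "/path/to/genrundata"]
match = find_data_file("enzyme.dat", SEARCH_ORDER)
print(match if match else "enzyme.dat not found in the search path")
```

Because the working directory comes first in the list, a personalized enzyme.dat would be returned before the public one under this assumed ordering.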

Viewing or Modifying Data Files

To view a public data file online, use the TypeData program, for example, % typedata enzyme.dat. To copy a public data file to your directory, use the Fetch program, for example, % fetch enzyme.dat. Then open the copy in the text editor of your choice to view it or modify it to suit your needs. For information on how to use an alternative data file with a program, see Chapter 4, Using Data Files in the User's Guide.

RESTRICTION ENZYMES

Function
Programs that use this file
Default data file
Alternative data files
Format
Suggestions
Restriction Enzyme Suppliers
Restriction Enzyme Literature
Acknowledgments

PROTEOLYTIC ENZYMES AND REAGENTS

Function
Programs that use this file
Default data files
Alternative data files
Format
Suggestions

TRANSCRIPTION FACTOR DATABASE (TFD)

Function
Programs that use this file
Default data file
Alternative data files
Format
Suggestions
Acknowledgments

CODON FREQUENCY TABLES

Function
Programs that use these tables
Default data file
Alternative data files: drosophila_high.cod; human_high.cod; maize_high.cod; yeast_high.cod; celegans_high.cod; celegans_low.cod
Format

AmAcid  Codon  Number     /1000     Fraction  ..

Gly     GGG    13.00       1.89      0.02
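If you need to read rows in this layout outside the Package, a short Python sketch along the following lines may help. It assumes the columns are separated by whitespace and that the line ending in ".." closes the header, which is inferred from the example above rather than from a full format specification. The names `CodonRow` and `parse_codon_table` are hypothetical.

```python
from typing import List, NamedTuple


class CodonRow(NamedTuple):
    amino_acid: str
    codon: str
    number: float
    per_thousand: float
    fraction: float


def parse_codon_table(text: str) -> List[CodonRow]:
    """Parse the data rows that follow the header line ending in '..'."""
    rows = []
    in_data = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_data:
            # Assumption: the line ending in ".." is the last header line.
            in_data = stripped.endswith("..")
            continue
        fields = stripped.split()
        if len(fields) != 5:
            continue  # skip blank lines and anything that is not a 5-column row
        aa, codon, number, per_k, frac = fields
        rows.append(CodonRow(aa, codon, float(number), float(per_k), float(frac)))
    return rows


SAMPLE = """\
AmAcid  Codon  Number     /1000     Fraction  ..

Gly     GGG    13.00       1.89      0.02
"""

rows = parse_codon_table(SAMPLE)
print(rows[0].codon, rows[0].fraction)
```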

Suggestions

TRANSLATION TABLES

Function
Programs that use these tables
Default data file
Alternative data files
Format

Symbol 3-letter  Codons                 !IUPAC  ..

A      Ala       GCT GCC GCA GCG        !GCX
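As an illustration of how a row in this layout associates codons with an amino acid symbol, here is a minimal Python sketch. It assumes that the text after "!" on each row is a comment giving the IUPAC summary, that the line ending in ".." closes the header, and that fields are separated by whitespace. The sample row is illustrative, and the function name `parse_translation_table` is hypothetical.

```python
from typing import Dict


def parse_translation_table(text: str) -> Dict[str, str]:
    """Map each codon to its one-letter symbol."""
    codon_to_symbol: Dict[str, str] = {}
    in_data = False
    for line in text.splitlines():
        if not in_data:
            # Assumption: the line ending in ".." is the last header line.
            in_data = line.rstrip().endswith("..")
            continue
        body = line.split("!", 1)[0]  # drop the trailing !IUPAC comment
        fields = body.split()
        if len(fields) < 3:
            continue  # need a symbol, a 3-letter name and at least one codon
        symbol, _three_letter, *codons = fields
        for codon in codons:
            codon_to_symbol[codon.upper()] = symbol
    return codon_to_symbol


SAMPLE = """\
Symbol 3-letter  Codons                 !IUPAC  ..

A      Ala       GCT GCC GCA GCG        !GCX
"""

table = parse_translation_table(SAMPLE)
print(sorted(table))
```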

Output

SCORING MATRICES

(formerly Symbol Comparison Tables)

Function
Programs that use these files
Default data files
Alternative data files for nucleotides
Alternative data files for proteins
Format

   A  B  C  D  E  F  G  H ...
A  4 -2  0 -2 -1 -2  0 -2
B -2  6 -3  6  2 -3 -1 -1
C  0 -3  9 -3 -4 -2 -3 -3
D -2  6 -3  6  2 -3 -1 -1
E -1  2 -4  2  5 -3 -2  0
F -2 -3 -2 -3 -3  6 -3 -1
G  0 -1 -3 -1 -2 -3  6 -2
H -2 -1 -3 -1  0 -1 -2  8
 ...

The intersection of row D with column D has a value of 6, which is the score for an identical D-D match. Comparisons between non-identical symbols usually are given a lower value; for example, a C-D comparison scores -3.

Notice that the values are identical at the C-D comparison and at the D-C comparison: -3. Previous versions of the Package supported only triangular forms of scoring matrices to eliminate this repetition. However, to make publicly available scoring matrices, which are in a rectangular format, easier to use, the Wisconsin Package now supports only rectangular-format scoring matrices. See "Converting Scoring Matrices" later in this section for converting pre-Version 9 scoring matrices to the new format.
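To see the symmetry concretely, the following Python sketch reads the rectangular excerpt shown earlier into a lookup table and checks that every pair scores the same in both orders. It is a sketch only: it assumes a header row of column symbols followed by rows that each begin with their row symbol, and it leaves out the "..." continuation markers and any header text or auxiliary data that a complete file contains. The function names are hypothetical.

```python
from typing import Dict, Tuple

Scores = Dict[Tuple[str, str], int]

EXCERPT = """\
   A  B  C  D  E  F  G  H
A  4 -2  0 -2 -1 -2  0 -2
B -2  6 -3  6  2 -3 -1 -1
C  0 -3  9 -3 -4 -2 -3 -3
D -2  6 -3  6  2 -3 -1 -1
E -1  2 -4  2  5 -3 -2  0
F -2 -3 -2 -3 -3  6 -3 -1
G  0 -1 -3 -1 -2 -3  6 -2
H -2 -1 -3 -1  0 -1 -2  8
"""


def parse_rectangular(text: str) -> Scores:
    """Read a header row of symbols and one scored row per symbol."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    columns = lines[0].split()
    scores: Scores = {}
    for line in lines[1:]:
        row_symbol, *values = line.split()
        for col_symbol, value in zip(columns, values):
            scores[(row_symbol, col_symbol)] = int(value)
    return scores


def is_symmetric(scores: Scores) -> bool:
    """True if every (a, b) entry has an equal (b, a) entry."""
    return all(scores.get((b, a)) == v for (a, b), v in scores.items())


scores = parse_rectangular(EXCERPT)
print(scores[("C", "D")], scores[("D", "C")], is_symmetric(scores))
```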

Equals-form scoring matrices. The second form of GCG-format scoring matrix supported is "equals" form, so named because within the matrix, each pairwise comparison equals a value. For instance, in the example below, an A-A symbol comparison is assigned, or equals, a value of 4.

AA=      4      AB=     -2      AD=     -2      AE=     -1      AF=     -2
AH=     -2      AI=     -1      AK=     -1      AL=     -1      AM=     -1
AN=     -2      AP=     -1      AQ=     -1      AR=     -1      AS=      1
AW=     -3      AX=     -1      AY=     -2      AZ=     -1      BB=      6
BC=     -1      BD=      6      BE=      2      BF=     -3      BG=     -1
 ...
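Equals-form entries are easy to pick apart with a regular expression. The Python sketch below is a minimal illustration that assumes each entry is two uppercase symbols, an equals sign, optional spaces, and an integer. It also assumes, based only on the excerpt above, that each pair may be listed in one order, so the lookup tries both. The function names are hypothetical.

```python
import re
from typing import Dict, Tuple

# Two symbols, "=", optional whitespace, then a signed integer.
PAIR = re.compile(r"([A-Z])([A-Z])=\s*(-?\d+)")

EXCERPT = """\
AA=      4      AB=     -2      AD=     -2      AE=     -1      AF=     -2
AH=     -2      AI=     -1      AK=     -1      AL=     -1      AM=     -1
AN=     -2      AP=     -1      AQ=     -1      AR=     -1      AS=      1
AW=     -3      AX=     -1      AY=     -2      AZ=     -1      BB=      6
BC=     -1      BD=      6      BE=      2      BF=     -3      BG=     -1
"""


def parse_equals_form(text: str) -> Dict[Tuple[str, str], int]:
    """Collect every 'XY= n' entry into a {(X, Y): n} table."""
    return {(a, b): int(value) for a, b, value in PAIR.findall(text)}


def lookup(scores: Dict[Tuple[str, str], int], a: str, b: str) -> int:
    """Return the score for a pair, trying both symbol orders."""
    if (a, b) in scores:
        return scores[(a, b)]
    return scores[(b, a)]


equals_scores = parse_equals_form(EXCERPT)
print(lookup(equals_scores, "A", "A"), lookup(equals_scores, "B", "A"))
```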

Auxiliary Data Block: Setting Gap Creation and Extension Penalties

 ..

 {
 GAP_CREATE 12
 GAP_EXTEND 4
 }

   A  B  C  D  E  F  G  H ...

If you create your own scoring matrix, or if you modify an existing one, you must maintain this format for specifying gap creation and extension penalties.
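For a script that needs the penalties from such a block, a minimal Python sketch follows. It assumes the block is delimited by lines containing only "{" and "}" and that each line inside holds a keyword followed by an integer, as in the example above. Only the GAP_CREATE and GAP_EXTEND keywords shown there are known from this section, and the function name `parse_gap_penalties` is hypothetical.

```python
from typing import Dict

SAMPLE = """\
 ..

 {
 GAP_CREATE 12
 GAP_EXTEND 4
 }

   A  B  C  D  E  F  G  H
"""


def parse_gap_penalties(text: str) -> Dict[str, int]:
    """Collect 'KEYWORD value' pairs found between the '{' and '}' lines."""
    penalties: Dict[str, int] = {}
    inside = False
    for line in text.splitlines():
        token = line.strip()
        if token == "{":
            inside = True
        elif token == "}":
            inside = False
        elif inside and token:
            keyword, value = token.split(None, 1)
            penalties[keyword] = int(value)
    return penalties


penalties = parse_gap_penalties(SAMPLE)
print(penalties)
```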
Suggestions

*** ERROR, READSCOREMAT cannot read the scoring matrix in the file
 "filename"!

If this is a scoring matrix created before Version 9,
try converting it with "% reformat /OLDCMPformat /PROtein" or
                       "% reformat /OLDCMPformat /NUCleotide"

Converting scoring matrices to make them more readable

Converting BLAST-format scoring matrices to GCG-format

PROTEIN ANALYSIS DATA FILES

Function
Programs that use these tables
Default data file
Alternative data files
Format

PROSITE

Function
Programs that use this file
Default data file
Alternative data files
Format
Suggestions
Acknowledgments

PROFILES

Function
Programs that use this file
Default data file
Alternative data file
Format
Suggestions
Acknowledgments

VERSION 2.0 PROFILES

Function
Programs that use these files
Default data file
Alternative data file
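Format: The Auxiliary Data Block (ADB) opens with a line containing nothing but a "{" and closes with a line containing nothing but a "}". These MUST appear in the first column of their respective lines.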

{
  Length: 9
  Gap: 1.00              Len: 1.00
  GapRatio: 0.0          LenRatio: 0.0
Cons   A      C      D      E      F    . . .      W      Y   Gap  Len
}

The ADB may contain any number of "Comment" lines, indicated by a "!" in the first column.
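The block can be read with a few lines of Python. This sketch assumes the "{" and "}" delimiters sit alone in the first column, that comment lines start with "!" in the first column, and that every setting is written as a name, a colon, and a number, possibly with several settings on one line, as in the example above. The comment line in the sample is added for illustration, and the function name `parse_adb` is hypothetical.

```python
import re
from typing import Dict

# A setting is a name, a colon, optional spaces, then a number.
FIELD = re.compile(r"([A-Za-z]+):\s*(-?\d+(?:\.\d+)?)")

SAMPLE = """\
{
! Width: 99  (illustrative comment line, must be ignored)
  Length: 9
  Gap: 1.00              Len: 1.00
  GapRatio: 0.0          LenRatio: 0.0
}
"""


def parse_adb(text: str) -> Dict[str, float]:
    """Collect 'Name: value' settings between the '{' and '}' lines."""
    settings: Dict[str, float] = {}
    inside = False
    for line in text.splitlines():
        if line.rstrip() == "{":
            inside = True
        elif line.rstrip() == "}":
            inside = False
        elif inside and not line.startswith("!"):
            for name, value in FIELD.findall(line):
                settings[name] = float(value)
    return settings


adb = parse_adb(SAMPLE)
print(adb)
```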

Profile. The profile itself is made up of rows of log-odds values, with each row corresponding to a position in the profile and (with three exceptions) each column corresponding to a valid symbol for that position. The exceptions are the first column, which contains a letter identifying the consensus symbol for the row, and the last two columns, which give the multiplying factors for the gap creation and extension penalties for the row. (Note that MEME's output profiles are always ungapped, and thus will always have 100, the maximum value, in the last two columns.) The last row in a profile does NOT correspond to a position in the profile. Instead, it contains counts of the number of appearances of each letter at any position in the sequences from which the profile was derived. This information is not used by any programs at this time, but it nonetheless must be there. Note that this dummy row is NOT included in the Length count given in the Auxiliary Data Block.
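The row layout described above can be sketched in Python as follows. The sample profile is invented for illustration: it uses a two-symbol alphabet, made-up scores, and a "*" placeholder in the first column of the final counts row, none of which comes from a real profile file. The sketch assumes whitespace-separated columns under a "Cons ... Gap Len" header, and the function name `parse_profile_rows` is hypothetical.

```python
from typing import Dict, List, Tuple

Position = Tuple[str, Dict[str, float], float, float]

# Invented sample: three positions plus the trailing counts row.
SAMPLE = """\
Cons   A      C    Gap  Len
A      12     -8   100  100
C      -5     14   100  100
A       9     -3   100  100
*       7      4
"""


def parse_profile_rows(text: str) -> Tuple[List[Position], List[str]]:
    """Split a profile into per-position rows and the final counts row."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, *rows = lines
    symbols = header[1:-2]          # columns between "Cons" and "Gap Len"
    *position_rows, counts_row = rows  # the last row is not a position
    positions: List[Position] = []
    for fields in position_rows:
        consensus = fields[0]
        scores = dict(zip(symbols, map(float, fields[1:-2])))
        gap, length = float(fields[-2]), float(fields[-1])
        positions.append((consensus, scores, gap, length))
    return positions, counts_row


positions, counts_row = parse_profile_rows(SAMPLE)
# The number of positions is what the Length setting should equal.
print(len(positions), [p[0] for p in positions])
```

Comparing `len(positions)` with the Length value from the Auxiliary Data Block is one way to confirm that the counts row was excluded, as the description above requires.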

Printed: January 9, 2002 13:45 (1162)

Technical Support: support-us@accelrys.com
or support-eu@accelrys.com

Licenses and Trademarks Wisconsin Package is a trademark and GCG and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.