ASSIGNMENT #6
Text Processing / Regular Expressions
Reference material
General guidelines
The data files that you need for this assignment are located inside a directory named
/home/students/sources/ass6_sources.
(From your home directory, this can also be accessed as ../sources/ass6_sources).
These files may also be obtained from here.
We recommend that you use a separate directory for result files (which are
not HTML files), so that they will not mix with your programs.
NOTE: Assume that case-insensitivity is required wherever applicable.
Obligatory
- The file ass6_sources/kinases_map contains mapping information on human protein kinase genes, taken from the OMIM Gene Map database.
Each line contains information on one gene, where fields are separated by a vertical bar (|).
Field descriptions:
- Gene symbol(s).
- Gene name.
- Date of entry to the database.
- Cytogenetic location (chromosome number followed by cytogenetic band(s)).
- Accession number in OMIM.
- Write a program that reads this file and prints (to another file) the gene symbol, gene name
and cytogenetic location, in tab-delimited format.
- Modify your program to print only tyrosine kinase genes (i.e. only
where the gene name contains the word tyrosine).
Note: /[Tt][Yy][Rr][Oo][Ss][Ii][Nn][Ee]/ is NOT acceptable as an answer.
- Modify your program to print only genes located in chromosome 9.
- Modify your program so that it asks the user for a year, and then prints only genes
entered into the database AFTER that year.
In this case, print the year of entry as the first field of each line.
Assume that dates up to 2025 are possible.
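The first three steps of this task can be sketched as follows. This is only an illustration under stated assumptions, not a model answer: the two records are invented (they are not taken from kinases_map), the date field is a placeholder because its real format is not described here, and the chromosome test assumes that the location field starts with the chromosome number (e.g. 9q34.1).

```perl
use strict;
use warnings;

# Invented records in the five-field layout described above.
# "DATE" is a placeholder; the real date format is not shown here.
my @records = (
    "ABC1, XYZ|Example protein-TYROSINE kinase 1|DATE|9q34.1|000001",
    "DEF2|Example serine/threonine kinase 2|DATE|19p13|000002",
);

my @selected;
for my $record (@records) {
    # The vertical bar is a regex metacharacter, so it must be escaped.
    my ($symbol, $name, $date, $location, $accession) = split /\|/, $record;

    # Case-insensitive match via the /i modifier.
    next unless $name =~ /\btyrosine\b/i;

    # Assumption: location begins with the chromosome number.
    # The negative lookahead keeps "9" from matching "90" or "91".
    next unless $location =~ /^9(?!\d)/;

    push @selected, join("\t", $symbol, $name, $location);
}

print "$_\n" for @selected;
```

The /i modifier is what makes the character-class pattern shown in the note above unnecessary.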
- Assume you want to clone the genomic region coding for the P53 gene, including all relevant introns. The file ass6_sources/p53_seq
contains the full genomic sequence of the P53 gene, in FASTA format (read about this format
here). The part of the sequence that is translated
to protein (including several introns) is between nucleotides 11717 - 18680.
- Start by storing the start and end positions of the coding region in
variables.
Then use those variables to extract the coding sequence.
- Read the sequence from the file and store it in a scalar variable.
- Extract the part of the sequence that is translated to protein (including the introns) and store it in another variable.
- Validate that the coding sequence starts with an ATG codon and ends with a stop codon (either TAA, TAG or TGA) using *one* regular expression.
- Check whether the coding sequence contains a restriction site for
BamHI (cuts at GGATCC).
- Check whether the coding sequence contains a restriction site for
BstSFI (cuts at CGryCG, where r is either G or A, and y is either C or T).
- Check whether the coding sequence contains a restriction site for
DrdI (cuts at a sequence containing GAC, then 6 nucleotides of any type, then
GTC). If it does, determine the actual sequence in P53 that is recognized by this enzyme and its position (counted from the start of the coding sequence).
- Extract the gene's gi number from the first line and print it
before every output in a descriptive manner. E.g., "The gene gi35213 has a
DrdI restriction site".
Tip: have a look at the full GenBank entry for this gene, where the positions on the sequence are indicated.
A hint for determining where the DrdI site is located: when a regular expression matches successfully, Perl sets three special variables: $&, $` and $', which hold (respectively) the part of the string that was matched, whatever was before it, and whatever was after it. Therefore "$`$&$'" equals the original string, and the (1-based) position of the match is length($`)+1. (Other methods exist, e.g. using split, but NOTE: methods involving loops are unacceptable.)
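The steps of this task can be tried out on a small example first. The sketch below uses an invented toy sequence (it is NOT the P53 sequence) with assumed start and end positions 5 and 31, and it leaves out reading the FASTA file and extracting the gi number.

```perl
use strict;
use warnings;

# Toy sequence invented for illustration; NOT the P53 sequence.
my $genomic = "ccccATGAAAGACTTGACAGTCCGACCGTAGgggg";

# 1-based start and end of the coding region (assumed for the toy sequence).
my $cds_start = 5;
my $cds_end   = 31;

# substr() takes a 0-based offset and a length.
my $cds = substr($genomic, $cds_start - 1, $cds_end - $cds_start + 1);

# One regular expression for both the start codon and the stop codon.
my $starts_and_stops = $cds =~ /^ATG.*(?:TAA|TAG|TGA)$/i ? 1 : 0;

# Restriction sites; r = G or A, y = C or T.
my $has_bamhi  = $cds =~ /GGATCC/i       ? 1 : 0;
my $has_bstsfi = $cds =~ /CG[GA][CT]CG/i ? 1 : 0;

# DrdI: GAC, six nucleotides of any type, GTC.
my ($drdi_site, $drdi_pos);
if ($cds =~ /GAC[ACGT]{6}GTC/i) {
    $drdi_site = $&;               # the matched text
    $drdi_pos  = length($`) + 1;   # 1-based position of the match
}

print "Valid start/stop codons: $starts_and_stops\n";
print "BamHI site: $has_bamhi, BstSFI site: $has_bstsfi\n";
print "DrdI site $drdi_site at position $drdi_pos\n" if defined $drdi_site;
```

On the toy sequence the DrdI pattern matches GACTTGACAGTC at position 7 of the coding region. A real sequence read from a file will also contain newlines, which need to be removed before patterns such as these can match across line boundaries.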
- Write a short program that can receive a cDNA sequence
from either STDIN or a file, validates that it contains
only valid nucleotides, and prints it as an mRNA sequence (replace every T with U).
Examples of valid sequences: "TTTTAATTAAACGTAAAAAGGCAGG",
"tcaccttcgacgacgcttagagcagatagacgat", and "tTacG".
Example of an invalid sequence:
"GTXCTTXXAAGGCNNNTACTTTYTCCRAGCC".
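One possible shape for the validation and conversion step is sketched below. The subroutine name cdna_to_mrna is hypothetical, and the sketch assumes that only A, C, G and T (in either case) count as valid nucleotides and that whitespace inside the input may be ignored.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the mRNA string, or undef when the
# input contains anything other than A, C, G or T (either case).
sub cdna_to_mrna {
    my ($seq) = @_;
    $seq =~ s/\s+//g;                     # ignore spaces and newlines
    return unless $seq =~ /^[ACGT]+$/i;   # validate the whole string
    (my $mrna = $seq) =~ tr/Tt/Uu/;       # T -> U, keeping the case
    return $mrna;
}

for my $seq ("tTacG", "GTXCTTXXAAGGCNNNTACTTTYTCCRAGCC") {
    my $mrna = cdna_to_mrna($seq);
    print defined($mrna) ? "$seq -> $mrna\n" : "$seq -> invalid sequence\n";
}
```

For the input itself, Perl's <> operator reads from the files named on the command line, or from STDIN when no file is given, so one read loop can serve both cases.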
- Write a program that reads in a sequence in GenBank format
(e.g. as in ass6_sources/genbank_seq),
removes all spaces and line numbers, and prints it to another file.
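The clean-up itself can be done with a single substitution. The sketch below works on two invented lines laid out in the usual GenBank sequence style (a position number followed by blocks of ten bases) and assumes that the input contains sequence lines only; any header lines or terminator in the real file would have to be skipped separately.

```perl
use strict;
use warnings;

# Invented sequence lines in GenBank layout; not taken from genbank_seq.
my @lines = (
    "        1 gatcctccat atacaacggt atctccacct caggtttaga\n",
    "       41 tctcaacaac ggaaccattg\n",
);

my $sequence = join '', @lines;

# Remove every run of whitespace (spaces, newlines) and digits.
$sequence =~ s/[\s\d]+//g;

print "$sequence\n";
print "Length: ", length($sequence), "\n";
```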
Optional
Write a general program that reads information from a file containing
a Swiss-Prot entry (e.g.
ass6_sources/ACM1_HUMAN,
ass6_sources/SSBP_HUMAN), extracts
the information that seems most relevant to you and prints it as
a nicely formatted text (or HTML) file.
Consult the
Key to Swiss-Prot field names, or the entire
Swiss-Prot user manual.
Example fields of interest:
- Protein symbol and no. of amino acids, from the ID line.
- Protein definition. Notice that the DE field may contain more than one
line.
- Gene name.
- References.
- Protein functions, taken from the CC field.
- Positions of transmembrane domains or disulfide bonds, taken from the
FT (feature table) field.
- Protein molecular weight, from the SQ field.
- Protein sequence.
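A possible starting point is to group the lines of the entry by their two-letter line code, so that multi-line fields such as DE can be joined afterwards. The sketch below uses invented lines (TEST_HUMAN is not a real entry) and assumes the layout of a two-letter code, three spaces and then the content, with the ID line ending in the sequence length followed by "AA."; check this against the actual files.

```perl
use strict;
use warnings;

# Invented lines for illustration; not a real Swiss-Prot entry.
my @entry = (
    "ID   TEST_HUMAN     STANDARD;      PRT;   123 AA.",
    "DE   HYPOTHETICAL EXAMPLE PROTEIN",
    "DE   (ILLUSTRATION ONLY).",
    "GN   TEST1.",
);

# Collect the content of every line under its two-letter code.
my %field;
for my $line (@entry) {
    next unless $line =~ /^([A-Z]{2})\s{3}(.*)$/;
    push @{ $field{$1} }, $2;
}

# Entry name and number of amino acids from the ID line.
my ($symbol, $length) = $field{ID}[0] =~ /^(\S+).*?(\d+)\s+AA\.$/;

# The DE field may span several lines; join them with spaces.
my $definition = join ' ', @{ $field{DE} };

print "Protein:    $symbol ($length amino acids)\n";
print "Definition: $definition\n";
```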