ASSIGNMENT #4
File Input / Output and Basic Text Processing
The data files you need for this assignment are located in the directory
/home/students/sources/ass4_sources
(from your home directory, this can also be reached as ../sources/ass4_sources).
These files may also be obtained from here.
If you copy them to your own directory, make sure that their lines are not wrapped.
Exercise
The ass4_sources subdirectory contains four text files
from the UniGene database. Each file describes one gene.
Write a subroutine that receives a gene name, opens the file for that
gene, extracts the list of tissues in which the gene is expressed,
and writes the list to a file named after the gene, with an extension of
your choice (e.g. TGM1.express).
The output file should have the following format (example is for the
TGM1 gene):
TGM1
1. Esophagus
2. Germ Cell
3. Larynx
4. Pancreas
5. Uterus
6. colon
7. head_neck
8. uterus
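The output format above could be produced along the following lines. This is only a minimal sketch: the subroutine name write_tissue_list, its argument order, and the choice to pass the extension as a parameter are assumptions made for illustration, not part of the assignment.

```perl
use strict;
use warnings;

# Hypothetical helper: write the gene name followed by a numbered tissue list.
# Arguments: gene name, file extension (without the dot), list of tissues.
sub write_tissue_list {
    my ($gene, $extension, @tissues) = @_;
    my $out_file = "$gene.$extension";

    open(my $out, '>', $out_file) or die "Cannot write $out_file: $!";
    print $out "$gene\n";

    my $count = 0;
    foreach my $tissue (@tissues) {
        $count++;                          # numbering starts at 1
        print $out "$count. $tissue\n";
    }

    close($out) or die "Cannot close $out_file: $!";
    return $out_file;                      # name of the file that was written
}

# Example call; creates TGM1.express in the current directory.
write_tissue_list('TGM1', 'express', 'Esophagus', 'Germ Cell', 'Larynx');
```

Incrementing the counter before printing is one simple way to make sure the first tissue is numbered 1 rather than 0.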
Then create a main program that calls this subroutine with the
following gene names: 'ADH2', 'CEACAM4', 'TGM1', 'GLDC'.
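The main program can be little more than a loop over the gene names. The sketch below uses a placeholder subroutine so that it runs on its own; the name extract_tissues and its return value are hypothetical, and the real subroutine would do the file reading and writing described above.

```perl
use strict;
use warnings;

# Placeholder standing in for the real subroutine (hypothetical name).
# Here it only returns the name of the file the real one would create.
sub extract_tissues {
    my ($gene) = @_;
    return "$gene.express";
}

# The four gene names given in the assignment.
my @genes = ('ADH2', 'CEACAM4', 'TGM1', 'GLDC');

foreach my $gene (@genes) {
    my $out_file = extract_tissues($gene);
    print "Processed $gene -> $out_file\n";
}
```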
Notes:
- The tissue list in each UniGene file appears after the keyword 'EXPRESS'.
- Before submitting the assignment, check every output file against its source file.
Make sure that tissues are numbered starting from 1 and that the list contains
no empty values.
- Sometimes the same tissue may appear twice in the list, once capitalized and
another time in all-lowercase (e.g. Uterus and uterus in TGM1).
This is due to a "bug" in UniGene. You may disregard it.
- Use the substr and split functions. Don't use regular expressions.
- Don't forget to use 'strict', 'warnings', proper indentation and comments!
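As an illustration of the notes above, one EXPRESS line could be taken apart with index, substr and split roughly as follows. This is a sketch under assumptions: the subroutine name parse_express_line is hypothetical, and the code assumes that the line starts with the keyword and that tissues are separated by semicolons and padded only with spaces. Check the actual files before relying on either assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one EXPRESS line into a list of tissue names.
# Assumes the tissues follow the keyword and are separated by semicolons.
sub parse_express_line {
    my ($line) = @_;
    chomp $line;

    my $keyword = 'EXPRESS';
    # Only lines that start with the keyword are of interest.
    return () unless index($line, $keyword) == 0;

    # Everything after the keyword is the tissue list.
    my $list = substr($line, length($keyword));

    my @tissues;
    foreach my $raw (split(';', $list)) {
        my $field = $raw;
        # Trim leading and trailing spaces with substr instead of a regex.
        $field = substr($field, 1)
            while length($field) && substr($field, 0, 1) eq ' ';
        $field = substr($field, 0, -1)
            while length($field) && substr($field, -1) eq ' ';
        push @tissues, $field if length($field);   # skip empty values
    }
    return @tissues;
}

# Made-up sample line with an empty value between two semicolons.
my @tissues = parse_express_line("EXPRESS     Esophagus; Germ Cell;; Larynx\n");
print join('|', @tissues), "\n";    # prints Esophagus|Germ Cell|Larynx
```

Skipping zero-length fields after trimming is what keeps empty values out of the numbered list.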