Sequence Bias in PDB Proteins:
Comparison of Dipeptidyl Fragment Counts vs. the Residue Composition
Felder, C.1, Einav, U.1, Segal, D.1,
Sussman, J.1, Silman, I.2, Beckmann, J.3 and
Yakir, B.4
1 Dept. of Structural Biology, Weizmann Institute of Science
2 Dept. of Neurobiology, Weizmann Institute of Science
3 Dept. of Cellular Genetics, Weizmann Institute of Science
4 Dept. of Biological Statistics, Hebrew University of Jerusalem
Abstract:
Examination of the frequencies of dipeptidyl sequences in PDB proteins reveals
a bias toward certain sequences relative to what would be expected from random
assembly given the residue composition. A database of frequency counts for all
400 possible dipeptidyl fragment sequences in PDB proteins was constructed,
using the PDB_select list of Hobohm and Sander at 90% homology to eliminate
redundant entries. A parallel database of residue composition was also made.
From these data we compared the observed probability of each dipeptidyl
sequence, i.e. the raw count divided by the total number of dipeptide fragments
in all proteins, with the probability expected from a random combination of
residues based on the residue composition. We noted a clear bias in favor of
certain dipeptides, such as CH, MM, HP, YW, YC and QQ, and against others,
such as LW, MW, EC, EP, ES, CV and GP. The ratio of observed to expected
probability ranges from 0.7 to 1.5, with an average near 1.0 and a standard
deviation of 1.12. The results suggest that certain combinations of residues may be
preferred to facilitate proper folding and function of the proteins.
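
The observed-over-expected ratio described above can be sketched in a few lines
of Python. This is a minimal illustration under stated assumptions, not the
authors' actual pipeline: the function name dipeptide_ratios is hypothetical,
the expected probability of a dipeptide XY is taken to be p(X) * p(Y) from the
pooled residue composition, non-standard residue codes are simply skipped, and
the input sequences below are toy strings rather than PDB_select entries.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, one-letter codes (20 x 20 = 400 dipeptides).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STANDARD = set(AMINO_ACIDS)


def dipeptide_ratios(sequences):
    """Return {dipeptide: observed probability / expected probability}.

    Hypothetical helper: residues outside the 20 standard codes are
    ignored, which is an assumption of this sketch.
    """
    residue_counts = Counter()
    pair_counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Residue composition, pooled over all sequences.
        residue_counts.update(r for r in seq if r in STANDARD)
        # Overlapping adjacent pairs; pairs never span two sequences.
        pair_counts.update(
            a + b
            for a, b in zip(seq, seq[1:])
            if a in STANDARD and b in STANDARD
        )

    total_residues = sum(residue_counts.values())
    total_pairs = sum(pair_counts.values())
    if total_pairs == 0:
        return {}

    ratios = {}
    for a, b in product(AMINO_ACIDS, repeat=2):
        # Expected probability under independent (random) pairing.
        expected = (residue_counts[a] / total_residues) * (
            residue_counts[b] / total_residues
        )
        if expected > 0:
            observed = pair_counts[a + b] / total_pairs
            ratios[a + b] = observed / expected
    return ratios


if __name__ == "__main__":
    # Toy input only; real use would read the non-redundant sequence set.
    toy = ["MKCHHPLLWQQ", "ACDEFGHIKLMM"]
    ranked = sorted(dipeptide_ratios(toy).items(), key=lambda kv: kv[1])
    print("most under-represented:", ranked[:3])
    print("most over-represented:", ranked[-3:])
```

A ratio above 1 marks a dipeptide seen more often than random pairing would
predict, and a ratio below 1 marks one seen less often. Dipeptides whose
constituent residues are absent from the input are omitted rather than given an
undefined ratio.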