Biases and Complex Patterns in the Residues Flanking Protein N-glycosylation Sites

Rubin, E.1, Ben-Dor, S.1 and Sharon, N.2
1 Bioinformatics and Biological Computing Unit, Biological Services, Weizmann Institute of Science
2 Department of Biological Chemistry, Weizmann Institute of Science

Protein glycosylation, in particular of asparagine residues (N-glycosylation), is the most common and most complex reaction that occurs during protein biosynthesis, and it often markedly affects the physicochemical and biological properties of proteins. It has been estimated that over half of the proteins in nature are glycoproteins (Apweiler et al., 1999). The consensus for N-glycosylation, also known as the sequon, is NXS/T; it is abundant in proteins, but only two thirds of these sequons are glycosylated. The lack of glycosylation of some sequons may result, at least in part, from the presence or absence of specific residues at or near the sequon. For example, a proline (Pro) at the X position was reported to prohibit glycosylation. Little is known, however, about the influence of other residues at this position, or of those flanking the sequon, on the efficiency of N-glycosylation (Shakin-Eshleman, 1996).

We extended traditional approaches of sequence analysis to glycosylation sites in several ways. We used the current version of SWISSPROT, in which 602 well-characterized, non-redundant N-glycoproteins have been deposited. The analyzed pattern was extended from the traditional 3-mer sequon NXS/T to a 7-mer sequon M2M1NXS/TP1P2, which adds the two residues preceding the Asn (M2, M1) and the two residues following the Ser/Thr (P1, P2). Based on experimental information on the N-glycosylation of specific asparagines deposited in SWISSPROT, 1186 glycosylated and 717 non-glycosylated 7-mer sequons were analyzed. A supervised learning approach was used to identify complex patterns that separate glycosylated and non-glycosylated sequons.
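The extension from the 3-mer to the 7-mer sequon can be illustrated with a minimal sketch. The function name `extract_7mer_sequons` and the regular expression are assumptions made for illustration and are not taken from the original analysis; the sketch assumes a protein sequence in one-letter code without line breaks and skips sequons that lack two flanking residues on either side.

```python
import re

# Assumed layout of the 7-mer, following the text: M2 M1 N X S/T P1 P2.
# A lookahead is used so that overlapping sequons are all reported.
SEQUON_7MER = re.compile(r"(?=(..N.[ST]..))")

def extract_7mer_sequons(sequence):
    """Return (asn_position, seven_mer) pairs; asn_position is 1-based."""
    hits = []
    for match in SEQUON_7MER.finditer(sequence):
        start = match.start(1)           # 0-based index of M2
        hits.append((start + 3, match.group(1)))  # Asn sits two residues after M2
    return hits

if __name__ == "__main__":
    # Toy sequence, not a real protein.
    for position, seven_mer in extract_7mer_sequons("MKNNSTAAG"):
        print(position, seven_mer)
```

Each extracted 7-mer would then be labelled as glycosylated or non-glycosylated from the database annotation of the corresponding asparagine.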

Analysis of the amino acid distribution at each position of the 7-mer sequon revealed biases at all positions. Glycosylated sequons showed over-representation of Gly in the X position and of Leu in the P1 position, and under-representation of Pro in both positions. For non-glycosylated sequons, over-representation of Ser was found in M2, Asp in M1, Lys and Pro in X, Tyr in P1 and Gly in P2; under-representation of Leu was observed in position P1.
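A per-position comparison of this kind might be sketched as follows. The statistic used here, a log2 ratio of pseudocount-smoothed residue frequencies between the two classes, is an assumption chosen for illustration; the abstract does not state which measure of over- or under-representation was used. The names `position_counts` and `log2_enrichment` are hypothetical.

```python
from collections import Counter
import math

POSITIONS = ["M2", "M1", "N", "X", "S/T", "P1", "P2"]

def position_counts(sequons):
    """Count residues at each of the seven positions."""
    counts = [Counter() for _ in POSITIONS]
    for sequon in sequons:
        for i, residue in enumerate(sequon):
            counts[i][residue] += 1
    return counts

def log2_enrichment(glyco, non_glyco, position, residue, pseudocount=1.0):
    """log2 of the smoothed frequency ratio (glycosylated / non-glycosylated).

    Positive values suggest over-representation in glycosylated sequons,
    negative values over-representation in non-glycosylated sequons.
    The pseudocount (an assumption) avoids division by zero for unseen residues.
    """
    i = POSITIONS.index(position)
    g = position_counts(glyco)[i]
    n = position_counts(non_glyco)[i]
    freq_g = (g[residue] + pseudocount) / (len(glyco) + 20 * pseudocount)
    freq_n = (n[residue] + pseudocount) / (len(non_glyco) + 20 * pseudocount)
    return math.log2(freq_g / freq_n)

if __name__ == "__main__":
    # Toy data, not the SWISSPROT set analyzed in the text.
    glyco = ["AANGTLT", "SENGSLA"]
    non_glyco = ["SDNKSYG", "SDNPTYG"]
    print(log2_enrichment(glyco, non_glyco, "X", "G"))
```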

Supervised learning identified two complex patterns. The data-mining tool WizWhy (WizSoft, Israel) was used to analyze the 7-mer sequons, by describing each position as a separate attribute, and providing the glycosylation state of each sequon as the dependent variable. WizWhy identifies complex "rules" or patterns by first identifying biases in single sites, and merging "rules" that together better explain the dependent attribute.
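The general idea of starting from single-site biases and merging them into compound rules can be sketched as below. This is not WizWhy's algorithm, whose internals are not described here; it is a simplified association-rule illustration under stated assumptions: a rule is a set of (position, residue) conditions, its confidence is the fraction of matching sequons that are glycosylated, and two single-site rules are merged only when the pair is more confident than either alone. All names and thresholds are hypothetical.

```python
from itertools import combinations

# Flanking positions of the 7-mer (0-based): M2, M1, X, P1, P2.
FLANK = (0, 1, 3, 5, 6)

def rule_stats(rule, records):
    """rule: tuple of (index, residue); records: list of (sequon, is_glycosylated)."""
    matched = [label for sequon, label in records
               if all(sequon[i] == residue for i, residue in rule)]
    support = len(matched)
    confidence = sum(matched) / support if support else 0.0
    return support, confidence

def find_rules(records, min_support=2, min_confidence=0.8):
    """Return single-site rule confidences and merged two-site rules."""
    conditions = sorted({(i, sequon[i]) for sequon, _ in records for i in FLANK})
    single_rules = {}
    for condition in conditions:
        support, confidence = rule_stats((condition,), records)
        if support >= min_support:
            single_rules[condition] = confidence
    merged = []
    for a, b in combinations(sorted(single_rules), 2):
        if a[0] == b[0]:
            continue  # two residues at the same position can never co-occur
        support, confidence = rule_stats((a, b), records)
        if (support >= min_support and confidence >= min_confidence
                and confidence > max(single_rules[a], single_rules[b])):
            merged.append(((a, b), support, confidence))
    return single_rules, merged
```

On real data the same procedure would be run with the non-glycosylated state as the target as well, and merged rules could be extended to three or more conditions.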

In glycosylated sequons, several sub-patterns were identified, all matching the consensus D/ESNGTLT. Each sub-pattern matched 2-3 amino acids in positions M1-2, X, and P1-2 of the consensus. Scanning SWISSPROT for sequences matching any of the sub-patterns identified an abundance of sequons, with a strong over-representation of yeast proteins. Interestingly, only 3 sequons in SWISSPROT match the complete consensus.

In non-glycosylated sequons, several sub-patterns were also identified, all converging to the consensus SDNKS/TYG. Each sub-pattern also matched 2-3 amino acids in positions M1-2, X, and P1-2 of the consensus, and the perfect consensus was found only twice in SWISSPROT. The sub-patterns of this consensus were also found in abundance in SWISSPROT, but no species biases were observed.
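Matching a 7-mer against the two consensus patterns can be illustrated with a short sketch. The position-by-position reading of the consensus strings (D/E at M2 for the glycosylated pattern, S/T at the fifth position for the non-glycosylated pattern) is an assumption based on the 7-mer layout M2M1NXS/TP1P2, and the function names are hypothetical. The actual sub-patterns found by the analysis are specific residue combinations; here any 2 or more matching flanking positions are counted, which is a simplification.

```python
# Allowed residues per position, in the order M2 M1 N X S/T P1 P2.
GLYCOSYLATED = ["DE", "S", "N", "G", "T", "L", "T"]       # D/ESNGTLT
NON_GLYCOSYLATED = ["S", "D", "N", "K", "ST", "Y", "G"]   # SDNKS/TYG
FLANK = (0, 1, 3, 5, 6)  # M2, M1, X, P1, P2

def flank_matches(sequon, consensus):
    """Number of flanking positions at which the sequon agrees with the consensus."""
    return sum(sequon[i] in consensus[i] for i in FLANK)

def matches_subpattern(sequon, consensus, min_matches=2):
    """True if the 7-mer is a valid sequon and shares enough flanking residues."""
    is_sequon = sequon[2] == "N" and sequon[4] in "ST"
    return is_sequon and flank_matches(sequon, consensus) >= min_matches

if __name__ == "__main__":
    for seven_mer in ["AANGSLA", "SDNKSYG"]:
        print(seven_mer,
              flank_matches(seven_mer, GLYCOSYLATED),
              flank_matches(seven_mer, NON_GLYCOSYLATED))
```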

To conclude, patterns were identified in the residues flanking N-glycosylation sequons: both simple biases at single positions and complex patterns spanning the entire 7-mer sequon that was analyzed. Our results support some observations made in the past on flanking residues, such as the lack of Pro at position X. However, our results failed to support other suggested preferences, such as the favorable effect of Lys, Arg and Ser on glycosylation, or the inhibitory effect of Trp. We also propose complex patterns that may play a role in the specificity of N-glycosylation.