All experimental phosphorylation site data
The experimentally validated phosphorylation sites were extracted from Phospho.ELM, PhosphoSitePlus and UniProtKB/Swiss-Prot. Because this study focuses on disease-related abnormal phosphorylation, we retained only human phosphorylation sites. After removing proteins that were redundant among the three databases, we collected 128122 phosphorylation sites within 32148 proteins, comprising 69315 serine (S), 30398 threonine (T) and 28409 tyrosine (Y) sites.
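The merging step described above can be illustrated with a minimal Python sketch. It assumes each database has already been parsed into (accession, position, residue) tuples; the function names, the record layout and the toy records are hypothetical, and the actual redundancy-removal procedure used for this dataset may differ (for example, it may operate at the protein-sequence level rather than on site tuples).

```python
from collections import Counter

def merge_sites(*sources):
    """Return the union of (accession, position, residue) tuples across sources.

    Assumption: a site is identified by its protein accession, its
    1-based position and its residue; duplicates across databases collapse.
    """
    merged = set()
    for records in sources:
        for accession, position, residue in records:
            if residue in {"S", "T", "Y"}:  # keep only Ser/Thr/Tyr sites
                merged.add((accession, position, residue))
    return merged

def summarize(sites):
    """Count total sites, distinct proteins, and sites per residue type."""
    proteins = {accession for accession, _, _ in sites}
    by_residue = Counter(residue for _, _, residue in sites)
    return len(sites), len(proteins), dict(by_residue)

# Toy records standing in for the three parsed databases (illustrative only).
phospho_elm = [("P04637", 15, "S"), ("P04637", 18, "T")]
phosphositeplus = [("P04637", 15, "S"), ("P00533", 1092, "Y")]
swissprot = [("P00533", 1092, "Y"), ("P04637", 20, "S")]

sites = merge_sites(phospho_elm, phosphositeplus, swissprot)
total, n_proteins, counts = summarize(sites)
print(total, n_proteins, counts)
```

Storing sites in a set makes the deduplication implicit, so a site reported by more than one database is counted once.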
Kinase-specific phosphorylation data
To construct a kinase-specific phosphorylation site predictor, we retained only those substrate proteins for which the exact positions of the residues are experimentally verified to be phosphorylated by a given kinase. In total, we collected 8033 kinase-specific phosphorylation entries. Prediction was performed in a kinase-specific manner: the known phosphorylation sites of each single kinase, kinase family and kinase group were extracted separately. At each of these three levels of the kinase hierarchy, only classes with at least 50 experimental phosphorylation sites were used in this study.
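The grouping and the 50-site threshold described above could be sketched as follows. This is only an illustration under stated assumptions: the entry fields (`accession`, `position`, `kinase`, `family`, `group`), the function name `sites_by_level` and the toy entries are hypothetical and are not taken from the original pipeline.

```python
from collections import defaultdict

MIN_SITES = 50  # threshold stated in the text

def sites_by_level(entries, min_sites=MIN_SITES):
    """Group sites by kinase, family and group; keep classes with enough sites."""
    levels = {
        "kinase": defaultdict(set),
        "family": defaultdict(set),
        "group": defaultdict(set),
    }
    for entry in entries:
        site = (entry["accession"], entry["position"])
        for level, classes in levels.items():
            label = entry.get(level)
            if label:  # skip entries lacking an annotation at this level
                classes[label].add(site)
    # Apply the minimum-site threshold independently at each level.
    return {
        level: {label: s for label, s in classes.items() if len(s) >= min_sites}
        for level, classes in levels.items()
    }

# Toy entries (illustrative only): 60 sites for one kinase, 10 for another,
# both assumed to belong to the same kinase group.
entries = [
    {"accession": f"P{i:05d}", "position": 10,
     "kinase": "PKACA", "family": "PKA", "group": "AGC"}
    for i in range(60)
] + [
    {"accession": f"Q{i:05d}", "position": 5,
     "kinase": "AKT1", "family": "AKT", "group": "AGC"}
    for i in range(10)
]

result = sites_by_level(entries)
print({level: sorted(classes) for level, classes in result.items()})
```

In this toy example the smaller kinase is dropped at the kinase and family levels, but its sites still contribute to the group-level class, which is why the three levels are filtered separately.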
Independent data
All known human phosphorylation sites with annotated kinases in the Phospho.ELM, PhosphoSitePlus and UniProtKB/Swiss-Prot databases were used in the positive training set. Because these three databases cover almost all experimentally verified phosphorylation sites, it is hard to find another set of human data to serve as an independent test set. Phosphorylation mechanisms are known to be conserved across eukaryotic species [1-3]. We therefore collected the nonhuman phosphorylation sites of the CDK, CK1, CK2, MAPK, PKA, PKC and Src kinase families from these three databases as the positive independent test set.
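The species-based split described above can be sketched in a few lines. The field names (`species`, `family`), the function name and the toy entries are assumptions made for illustration; how species and family are actually annotated depends on the source database.

```python
# Kinase families named in the text for the independent test set.
INDEPENDENT_FAMILIES = {"CDK", "CK1", "CK2", "MAPK", "PKA", "PKC", "Src"}

def split_by_species(entries, human="Homo sapiens"):
    """Send human entries to training; nonhuman entries of the listed
    families go to the independent test set; everything else is dropped."""
    training, independent = [], []
    for entry in entries:
        if entry["species"] == human:
            training.append(entry)
        elif entry["family"] in INDEPENDENT_FAMILIES:
            independent.append(entry)
    return training, independent

# Toy entries (illustrative only).
entries = [
    {"accession": "A1", "position": 12, "species": "Homo sapiens", "family": "PKA"},
    {"accession": "B2", "position": 30, "species": "Mus musculus", "family": "PKA"},
    {"accession": "C3", "position": 7, "species": "Mus musculus", "family": "AKT"},
    {"accession": "D4", "position": 99, "species": "Rattus norvegicus", "family": "Src"},
]

training, independent = split_by_species(entries)
print(len(training), len(independent))
```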
Disease-related phosphorylation data
Information about abnormal phosphorylation that could cause severe diseases was obtained from PhosphoSitePlus, an online systems biology resource that provides comprehensive information and tools for the study of protein post-translational modifications (PTMs), as well as MS/MS records for sets of modification sites observed in specified diseases, cell lines, and tissues. We collected 320 human proteins with abnormal phosphorylation, which contain 806 abnormal phosphorylation terms. After collecting all the data, we further consulted the SwissVariant, UniProtKB/Swiss-Prot and PubMed databases for the effects of these variations and for references on the variations, phosphorylation information and related diseases of the corresponding proteins.
[1]. N. Blom, T. Sicheritz-Ponten, R. Gupta, S. Gammeltoft, S. Brunak, Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics 4, 1633 (Jun, 2004).
[2]. L. J. Jensen, D. W. Ussery, S. Brunak, Functionality of system components: Conservation of protein function in protein feature space. Genome Res 13, 2444 (Nov, 2003).
[3]. Y. V. Budovskaya, J. S. Stephan, S. J. Deminoff, P. K. Herman, An evolutionary proteomics approach identifies substrates of the cAMP-dependent protein kinase. Proc Natl Acad Sci USA 102, 13933 (Sep, 2005).
Copyright © 2013 Jian Ding Qiu's Lab. NanChang University.