Sunday, Dec 22, 2013, 10:27 PM, updated by Xiang Chen

Subcellular Phosphorylation diagram

About SubPhos

Although most subcellular compartments (SCs) in a cell are genetically identical, the biochemistry of each is optimized to fulfill its unique functional roles, with important consequences for human health and disease. Despite sharing identical genomes and overlapping transcription profiles, SCs exhibit diverse functions. Each SC’s unique function requires tightly regulated gene and protein expression coordinated by specialized, phosphorylation-dependent intracellular signaling. This specialization arises via variable protein expression and differential post-translational modifications that tune the activity of ubiquitous proteins to each SC’s needs. The resulting biochemical idiosyncrasies can account for SC-specific disease and drug resistance [1-4]. Thus, although transcriptome and proteome profiling uncover functional differences among SCs due to differential protein abundance [5], they do not address SC-specific effects of post-translational regulation. Phosphorylation is one of the most important post-translational modifications (PTMs) in human cells. To better understand the role of phosphorylation in maintaining functional differences among SCs, we collected phosphorylation data for different SCs into an online database (SubPhosDB) and performed proteomic characterizations of eight human SCs. Comparing protein abundances and phosphorylation levels revealed specialized, interconnected phosphorylation networks within each SC.

Most current predictors focus on organism-specific or kinase-specific phosphorylation sites. However, such predictors cannot account for the specialization of individual SCs. Hence, a large-scale, multi-SC survey of protein phosphorylation abundance, combined with phosphorylation site identification, is a critical first step toward understanding phosphorylation-dependent signaling pathways and delineating the key proteins and pathways underlying specific SC functions.

Here, we present SubPhosPred, a novel web tool specifically designed for the prediction of compartment-specific phosphorylation sites. Within the SubPhos platform, we have trained compartment-specific phosphorylation prediction models for eight SCs (Cell membrane, Nucleus, Cytoplasm, Mitochondrion, Golgi apparatus, Endoplasmic reticulum, Secreted, Lysosome). The models were trained with a Support Vector Machine (SVM) approach that integrates a novel strategy based on the discrete wavelet transform (DWT). Cross-validation tests show that SubPhosPred achieves satisfactory performance.

Taken together, this platform allows us to gain better insight into subcellular functions that depend on phosphorylation, while predictive analyses will prove helpful for further experimental investigation.

About SubPhosDB

SubPhosDB aims to compile known phosphorylation sites in the subcellular proteome to provide a framework that enables experimental or computational analysis at various scales. In its first release, SubPhosDB provides the subcellular phosphoproteome of eight different SCs for human proteins. The data are drawn from five public databases: UniProt/Swiss-Prot, Phospho.ELM, PhosphoSitePlus, PHOSIDA and HPRD (see Figure).

SubPhosDB is presented as a protein-based, searchable database with an interactive web interface, providing a subcellular phosphoproteome of nearly 137153 sites in 17297 proteins.

About SubPhosPred

Here, we present a first account of the emerging field of subcellular phosphoproteomics, in which a Support Vector Machine (SVM) approach is combined with a novel strategy based on the discrete wavelet transform (DWT) to facilitate the identification of compartment-specific phosphorylation sites and to unravel the intricate regulation of protein phosphorylation. The method is implemented in a novel web tool termed SubPhosPred, which currently provides eight compartment-specific models. Cross-validation tests show that SubPhosPred achieves satisfactory performance.

Discrete Wavelet Transform (DWT) -- The most attractive characteristic of the DWT is its ability to capture spectral and temporal information simultaneously, which is particularly helpful in detecting subtle, time-localized changes [6]. The coefficients of the DWT can be divided into two parts: the approximation coefficients, which represent the high-scale, low-frequency components of the signal, and the detail coefficients, which represent the low-scale, high-frequency components [7]. Experimental and theoretical progress in protein dynamics has made it clear that low-frequency internal motions do exist in protein and DNA molecules and play a significant role in biological functions [8]. Using the low-frequency wavelet coefficients to represent a protein sample can therefore better reflect its overall sequence-order effect. In this work, a digital signal of the protein sequence, obtained from similarity scores and amino acid pair compositions, was decomposed by the DWT into j scales, giving details from scale 1 to scale j and an approximation at scale j, so that wavelet coefficients for (j+1) sub-bands were obtained. As the decomposition level j increases, more features of the signal can be observed. To reduce the dimensionality of the extracted feature vectors, statistics over the set of wavelet coefficients were used [9]. The following statistical features, calculated from the approximation and detail coefficients, were used for the classification of subcompartments: (i) the maximum of the wavelet coefficients in each sub-band, (ii) the mean of the wavelet coefficients in each sub-band, (iii) the minimum of the wavelet coefficients in each sub-band, and (iv) the standard deviation of the wavelet coefficients in each sub-band. A protein sequence can thus be characterized as a 4(j+1)-dimensional feature vector.
In this study, decomposition level 4 was chosen, and the resulting 20-dimensional feature vectors were then input to the SVM for classification.
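The feature construction described above can be illustrated with a minimal sketch. It is not the SubPhosPred implementation: to stay dependency-free it uses a hand-written Haar transform (one of the wavelets listed on this page), pads odd-length signals by repeating the last sample, and uses the population standard deviation. These three choices, as well as the names haar_step and dwt_features and the example signal, are assumptions made for illustration only.

```python
import math
import statistics

def haar_step(signal):
    """One level of the orthonormal Haar DWT: returns (approximation, detail)."""
    if len(signal) % 2:                      # assumption: pad odd lengths by repeating the last sample
        signal = signal + [signal[-1]]
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def dwt_features(signal, level=4):
    """Return a 4*(level+1) vector: max, mean, min, std of each sub-band."""
    bands = []
    approx = list(signal)
    for _ in range(level):
        approx, detail = haar_step(approx)
        bands.append(detail)                 # detail coefficients at scales 1..level
    bands.append(approx)                     # approximation coefficients at the last scale
    features = []
    for band in bands:
        features += [max(band), statistics.fmean(band),
                     min(band), statistics.pstdev(band)]
    return features

# Hypothetical numerical series standing in for an encoded 13-residue peptide.
signal = [0.1, 0.4, 0.3, 0.9, 0.2, 0.5, 0.7, 0.6, 0.8, 0.3, 0.1, 0.2, 0.4]
features = dwt_features(signal, level=4)
print(len(features))  # 20
```

With level=4 there are four detail sub-bands plus one approximation sub-band, and four statistics per sub-band give the 20 values mentioned above. In practice a wavelet library would be used so that other wavelet families can be tried.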

Choosing wavelet functions -- The wavelet transform (WT) is based on the idea of mapping a signal onto a set of basis functions. Wavelet functions are grouped into families according to their basis functions; each family has properties that suit different signals and therefore gives different results [10]. Because the characteristics of the analyzing wavelet control the performance of the WT, the better the analyzing wavelet function matches the underlying structure of the signal, the more concise and sparse the WT representation. The selection of the wavelet function is therefore an important step toward optimal performance in signal processing. We tested eleven wavelet functions in this work: bior1.5, bior2.4, bior3.1, bior4.4, coif2, coif3, db3, db4, haar, sym2 and sym4.

Selection of optimal decomposition scale -- A WT decomposes a signal into several vectors of coefficients. Because of the properties of wavelet decomposition, different decomposition scales give different results when analyzing protein sequences. On the one hand, decomposing a shorter sequence with too high a decomposition scale introduces unavoidable redundancy into the decomposition [11]. On the other hand, decomposing a longer sequence with too low a decomposition level omits much detailed information [11]. To obtain the highest predictive accuracy, an appropriate decomposition scale must be selected. Considering the dimensionality of the sequence features, scales of 3–6 were each used to decompose the test sequences.

Local sequence clusters (LSC) often exist around phosphorylation sites because substrate sites of the same kinase or kinase family usually share similar patterns in their local sequences. Additionally, amino acid pair compositions (AAPC) can reflect the characteristics of the residues surrounding phosphorylation sites and have been used successfully for predicting phosphorylation sites. We therefore used the similarity scores and amino acid pair compositions of the phosphorylated sequences to convert the training datasets into numerical series. After the numerical sequences of the training data were obtained, the wavelet coefficient features of each query sequence were extracted using the DWT.
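As an illustration of the amino acid pair composition, the sketch below counts adjacent residue pairs in a peptide and normalizes them to frequencies. The exact AAPC definition used by SubPhosPred (for example, whether gapped pairs are included or how padding characters are handled) is not stated on this page, so the adjacent-pair definition, the function name aapc and the example peptide are assumptions. The similarity-score part of the encoding is not shown.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids, in a fixed order.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def aapc(peptide):
    """Frequency of each adjacent amino acid pair in a peptide (400 values)."""
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for a, b in zip(peptide, peptide[1:]):
        pair = a + b
        if pair in counts:                   # assumption: skip pairs with "*" or non-standard letters
            counts[pair] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in PAIRS]

# Hypothetical 13-residue peptide with a central serine.
vector = aapc("MKRLSASPEELTK")
print(len(vector))  # 400
```

The resulting vector is sparse for short peptides, since a 13-residue window contains at most 12 adjacent pairs.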

Phosphorylation data for Homo sapiens were selected from the SubPhosDB database. The data contain 218013 experimentally verified phosphorylation sites within 17297 phosphoproteins, of which 10384 phosphoproteins have experimentally verified information on their subcellular localization. The subcellular localization data were extracted from the UniProt/Swiss-Prot database released on 9-Oct-2012. Sequences annotated with ambiguous or uncertain subcellular localization terms, such as “potential”, “probable”, “probably”, “maybe”, or “by similarity”, were excluded. In addition, the experimentally verified localization information of the corresponding kinases was also extracted.

We developed subcellular phosphorylation prediction models using the unique phosphoproteome of each subcellular localization. Homo sapiens phosphorylation data for the eight subcellular localizations that contained more than 50 known phosphoproteins (Cell membrane, Nucleus, Cytoplasm, Mitochondrion, Golgi apparatus, Endoplasmic reticulum, Secreted, Lysosome) were used to construct the training datasets of the models. In this study, phosphorylation sites (positive training data) were represented as peptides of length 13, with the phosphorylated residue in the center and six amino acids on either side. When a phosphorylated residue was too close to the beginning or end of the protein to have six residues on either side, the missing residues were represented by “*” characters. Following Musite [12], we used residues of the same type (serine/threonine or tyrosine) that are not known phosphorylation sites as the non-phosphorylation sites (negative training data).
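The window construction described above can be sketched as follows. This is an illustrative sketch rather than the code used to build the training sets; the function names extract_windows and split_sites, the toy sequence and the known-site positions are hypothetical.

```python
def extract_windows(sequence, residues="STY", flank=6, pad="*"):
    """Yield (1-based position, residue, window) for every candidate residue.

    Each window has length 2*flank + 1 with the candidate residue in the
    center; positions beyond the protein termini are filled with `pad`.
    """
    padded = pad * flank + sequence + pad * flank
    for i, aa in enumerate(sequence):
        if aa in residues:
            yield i + 1, aa, padded[i:i + 2 * flank + 1]

def split_sites(sequence, known_positions, residues="STY"):
    """Split candidate windows into positives (known sites) and negatives."""
    positives, negatives = [], []
    for pos, _aa, window in extract_windows(sequence, residues):
        (positives if pos in known_positions else negatives).append(window)
    return positives, negatives

# Toy example: serine at position 2 is a known site, threonine at 4 is not.
pos_windows, neg_windows = split_sites("ASAT", known_positions={2})
print(pos_windows)  # ['*****ASAT****']
print(neg_windows)  # ['***ASAT******']
```

To keep positives and negatives of the same residue type, as described above, the function can be called once with residues="ST" and once with residues="Y".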

For each of the eight subcellular localizations, after combining the positive and negative data, protein sequences with high similarities were removed to build a non-redundant (NR) protein data set using CD-HIT with a sequence identity threshold of 30%.

In machine-learning problems, imbalanced datasets occur when one class has a significantly different number of instances than another class, and the imbalance can significantly affect the accuracy of some learning methods [13]. In the context of phosphorylation site prediction, positive phosphorylation sites are vastly outnumbered by negative sites. To correct this imbalance, for each subcellular localization and for each site type (serine/threonine or tyrosine), the number of positive sites was determined, and an equal number of negative sites was randomly chosen from the negative training data. For example, if 1000 positive serine sites were available in Nucleus, then 1000 corresponding negative serine sites were chosen.
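The balancing step amounts to random undersampling of the negative class. A minimal sketch, assuming the positive and negative windows are held in plain Python lists, is given below; the function name balance, the fixed seed and the placeholder data are assumptions made for illustration.

```python
import random

def balance(positives, negatives, seed=0):
    """Randomly undersample negatives to match the number of positives.

    Raises ValueError if there are fewer negatives than positives, since
    an equal-sized sample cannot then be drawn without replacement.
    """
    if len(negatives) < len(positives):
        raise ValueError("not enough negative sites to match the positives")
    rng = random.Random(seed)                # fixed seed only for reproducibility
    return list(positives), rng.sample(list(negatives), len(positives))

# Placeholder data: 1000 positive and 5000 negative site identifiers.
positives = [f"pos_{i}" for i in range(1000)]
negatives = [f"neg_{i}" for i in range(5000)]
pos_set, neg_set = balance(positives, negatives)
print(len(pos_set), len(neg_set))  # 1000 1000
```

The function would be applied separately to each combination of compartment and site type, so that every model is trained on its own balanced set.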

How To Use SubPhosDB?

As the first large-scale public database providing this kind of information, SubPhosDB should, we believe, enable both computational and molecular biology laboratories to further their research in many ways and at various scales. For convenience, a detailed illustration is given below:

First, the interface provides a search panel and a browse panel (see Figure):

In the search panel, kinase search and substrate search are provided for a specific kinase or substrate. You can select an ID type from the drop-down list (fig.1) and enter a query (fig.2) to search the data. You can also browse data by choosing a subcellular compartment of the kinase and substrate from a drop-down list (fig.4); another drop-down list (fig.3) is used to filter for data in unique or non-unique SCs.

Second, the entries returned from SubPhosDB for a query are listed in a searchable table, as follows:

A searchable table lists all records for your query. You can select the number of results shown per page and use keywords to search further within the result table (fig.1). Summary statistics for the query results are shown in the table footer (fig.2). The table header shows the items relevant to the specific query (fig.3). Note that the items shown differ depending on the type of search or browse.

Finally, SubPhosDB also provides file downloads, including the kinase data and substrate data for the eight SCs used as training sets in SubPhosPred.

How To Start SubPhosPred?

SubPhosPred was developed to predict phosphorylation sites in the subcellular proteome and is an important part of the SubPhos platform. We provide eight subcellular models in its first release. The prediction process is illustrated below:

First, the prediction panel is shown in the following figure:

For prediction, you can choose one or more subcellular models in the multiple-selection drop-down list (fig.1) and paste one or more protein sequences in FASTA format into the text box (fig.2).

Second, the prediction result for the submitted form is shown in the following figure:

For each subcellular compartment model, the results are displayed separately in the result panel. Detailed explanations of each prediction item are given on the service page. Note that the prediction results can be downloaded via a hyperlink.


We acknowledge with thanks the following software or web servers:

Blast2GO     CD-HIT      DAVID     Libsvm     STRING     Cytoscape

We also acknowledge with thanks the following public databases:

UniProt/Swiss-Prot     PhosphoSitePlus     Phospho.ELM     PHOSIDA     HPRD

This work was supported by the Program for New Century Excellent Talents in University (NCET-11-1002) and the National Natural Science Foundation of China (20605010 and 21175064).


Your feedback is very important in helping us improve SubPhos. Please feel free to contact us if you have any concerns. Thank you very much!

Contact Information: Department of Chemistry, Nanchang University, Nanchang 330031, China. E-mail address:


1. Dhaunsi GS: Molecular Mechanisms of Organelle Biogenesis and Related Metabolic Diseases. Med Prin Pract 2005, 14:49-57.
2. Chan DC: Mitochondria: Dynamic organelles in disease, aging, and development. Cell 2006, 125(7):1241-1252.
3. Schirmer EC, Florens L, Guan TL, Yates JR, Gerace L: Nuclear membrane proteins with potential disease links found by subtractive proteomics. Science 2003, 301(5638):1380-1382.
4. Nigg EA, Raff JW: Centrioles, Centrosomes, and Cilia in Health and Disease. Cell 2009, 139(4):663-678.
5. Kislinger T, Cox B, Kannan A, Chung C, Hu PZ, Ignatchenko A, Scott MS, Gramolini AO, Morris Q, Hallett MT et al: Global survey of organ and organelle protein expression in mouse: Combined proteomic and transcriptomic profiling. Cell 2006, 125(1):173-186.
6. Mori K, Kasashima N, Yoshioka T, Ueno Y: Prediction of spalling on a ball bearing by applying the discrete wavelet transform to vibration signals. Wear 1996, 195(1-2):162-168.
9. Kandaswamy A, Kumar CS, Ramanathan RP, Jayaraman S, Malmurugan N: Neural classification of lung sounds using wavelet coefficients. Comput Biol Med 2004, 34(6):523-537.
10. Grunbaum FA: Ten lectures on wavelets-Daubechies, I. Science 1992, 257(5071):821-822.
11. Wen ZN, Wang KL, Li ML, Nie FS, Yang Y: Analyzing functional similarity of protein sequences
12. Gao JJ, Thelen JJ, Dunker AK, Xu D: Musite, a Tool for Global Prediction of General and Kinase-specific Phosphorylation Sites. Mol Cell Proteomics 2010, 9(12):2586-2600.
13. Japkowicz N, Stephen S: The class imbalance problem: a systematic study. Intelligent Data Analysis 2002, 6(5):429-449.