Computational Prediction of Ubiquitylation sites in Eukaryotic Proteins  
  About UBIPROBER  
  What is UbiProber?  
  How does UbiProber predict ubiquitylation?  
  UbiProber Related Resources  
  Using UBIPROBER  
  How to start?  
  How to find the protein ID/FASTA that I am interested in?  
  Result panel  
  Related sites  
  LIBSVM (a Library for Support Vector Machines)  
  CD-HIT (Biological Sequence Clustering and Comparison)  
  Several ubiquitylation site prediction tools  
  Other matters  
  Acknowledgments  
  Visit lab web page  
  Feedback  
Troubleshooting
  The software of UbiProber seems broken  
     
  What is UbiProber?
Systematic dissection of the ubiquitylation proteome is emerging as an appealing but challenging research topic because ubiquitylation plays significant roles not only in protein degradation but also in many other cellular functions. Because ubiquitylation is rapid and reversible, identifying ubiquitylation sites with conventional experimental approaches is time-consuming and labor-intensive. To discover lysine ubiquitylation sites efficiently, a highly specific in silico predictor that works for any individual organism is needed to guide experimental design. Here we present UbiProber, a protein ubiquitylation prediction tool built on support vector machines that integrates local sequence similarity to known ubiquitylation sites, physicochemical properties and amino acid composition. We used information gain to identify the key positions and amino acids and so optimize the prediction model. Although the amino acid sequences around ubiquitin conjugation sites do not contain conserved motifs, the cross-validation results indicate that integrating the key-position and key-amino-acid features of ubiquitylation sequences improves predictive performance. UbiProber offers four models: Homo sapiens, Mus musculus, Saccharomyces cerevisiae and Combined. In an independent test with a 1:1 ratio of positive to negative samples, the area under the ROC curve (AUC) of the Combined model reached 83.36%. Cross-validation tests also show that UbiProber achieves some improvement over existing tools in predicting species-specific ubiquitylation sites.
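As a rough illustration of the information gain idea mentioned above, the following Python sketch scores one position of a set of labelled fragments by how much knowing the residue at that position reduces the entropy of the ubiquitylated/non-ubiquitylated label. The function names and the toy fragments are hypothetical and are not taken from the UbiProber code; the tool itself may compute information gain differently.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(fragments, labels, position):
    """Entropy of the labels minus the entropy remaining after splitting
    the fragments by the residue found at `position` (0-based)."""
    groups = defaultdict(list)
    for fragment, label in zip(fragments, labels):
        groups[fragment[position]].append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data (invented): 1 = ubiquitylated, 0 = not ubiquitylated.
toy_fragments = ["AKC", "AKD", "GKC", "GKD"]
toy_labels = [1, 1, 0, 0]
gains = [information_gain(toy_fragments, toy_labels, p) for p in range(3)]
```

In this toy set only the first position separates the two classes, so it receives the highest gain; positions with a higher gain would be the candidates for "key positions".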
 
     
   How does UbiProber predict ubiquitylation?
Reliable, large-scale experimental ubiquitin proteomics data from multiple species were collected from several sources and used to train ubiquitylation site prediction models, with negative samples randomly selected 10 times to match the positive samples. Three sets of features (K Nearest Neighbor (KNN) feature, physicochemical properties and amino acid composition) were extracted from the training data and combined with a support vector machine (SVM) to make predictions. KNN features capture local sequence similarity around sites ubiquitylated by the same enzyme or enzyme family, whether or not the enzyme-substrate interactions are known. Physicochemical properties and amino acid compositions reflect the biochemical environment of the regions surrounding ubiquitylation sites, and they play various roles in signaling and regulation. Additionally, to extract meaningful information and improve the overall accuracy of the predictor, the information gain (IG) method is first used to select key positions and amino acids, which are then used to optimize each feature set. On this basis, we developed a new tool named UbiProber.
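To make two of the feature sets more concrete, here is a minimal Python sketch of an amino acid composition vector and a KNN score for a single fragment. It rests on several assumptions that are not stated in this document: the distance is a plain Hamming distance (a stand-in, since the similarity measure used by UbiProber is not described here), k is arbitrary, and all names and fragments are hypothetical. The resulting numbers would be concatenated with the other features and passed to an SVM such as LIBSVM, which is not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(fragment):
    """Frequency of each of the 20 standard residues in the fragment.
    Characters outside the standard alphabet (e.g. padding) are ignored."""
    residues = [r for r in fragment if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]


def hamming_distance(a, b):
    """Fraction of positions at which two equal-length fragments differ."""
    if len(a) != len(b):
        raise ValueError("fragments must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def knn_score(query, positives, negatives, k=3):
    """Fraction of the k nearest training fragments that are positive."""
    labelled = [(hamming_distance(query, f), 1) for f in positives]
    labelled += [(hamming_distance(query, f), 0) for f in negatives]
    if not labelled:
        raise ValueError("no training fragments given")
    # sorted() is stable, so ties keep their original order
    nearest = sorted(labelled, key=lambda pair: pair[0])[:k]
    return sum(label for _, label in nearest) / len(nearest)


# Toy training fragments (invented for illustration).
toy_positives = ["AAKAA", "AAKAC"]
toy_negatives = ["GGKGG", "GGKGC"]
score_like_positive = knn_score("AAKAA", toy_positives, toy_negatives, k=2)
score_like_negative = knn_score("GGKGG", toy_positives, toy_negatives, k=2)
composition = amino_acid_composition("AAKAA")
```

A score near 1 means the query fragment resembles known ubiquitylated fragments more than non-ubiquitylated ones; a score near 0 means the opposite.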
 
     
  UbiProber Related Resources
Dataset
Positive examples of ubiquitylation sites were extracted from two large-scale proteomics databases (UniProt/SwissProt, Dec. 23, 2012; Phosphosite, Dec. 23, 2012), UbiProt and a literature search (Kim, et al., 2011; Radivojac, et al., 2010). These lysine ubiquitylation sites were present in 8044, 3355 and 208 proteins from H. sapiens, M. musculus and S. cerevisiae, respectively. From these proteins, we extracted ubiquitylated (positive) fragments, each containing up to 13 upstream and downstream residues around the central lysine residue. The set of non-ubiquitylated (negative) fragments was extracted from the same proteins. To obtain a non-redundant dataset, no two fragments within the positive or negative datasets, or across the two datasets, were allowed to share 40% sequence identity. When a positive and a negative example formed a similar pair, the negative site was always removed as the less reliably labeled one. The sequence identity cutoff of 40% lies well below the levels that provide accurate functional inference by homology transfer (Rost, et al., 2003), thus allowing us to consider our dataset non-redundant.
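The fragment extraction described above can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and padding short windows with "-" so that all fragments have the same length is an assumption made here, since the text only says "up to 13" residues on each side.

```python
def extract_lysine_windows(sequence, flank=13, pad="-"):
    """Return (position, fragment) pairs for every lysine in the sequence.

    position is 1-based. Each fragment has length 2 * flank + 1 and is
    padded with `pad` where the window runs past either end of the protein
    (the padding is an assumption of this sketch).
    """
    windows = []
    for i, residue in enumerate(sequence):
        if residue != "K":
            continue
        left = sequence[max(0, i - flank):i]
        right = sequence[i + 1:i + 1 + flank]
        fragment = left.rjust(flank, pad) + "K" + right.ljust(flank, pad)
        windows.append((i + 1, fragment))
    return windows


# Toy sequence (invented), with a small flank so the result is easy to read.
toy_windows = extract_lysine_windows("MKAAK", flank=2)
```

Fragments centered on experimentally verified sites would form the positive set, and fragments around the remaining lysines of the same proteins would be candidates for the negative set.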

Datasets can be downloaded from here.


Software

UbiProber is freely available for academic research only. For commercial use, please contact us.

UbiProber executable files are available for download:

Windows (64-bit)   UbiProber_64.zip
Windows (32-bit)   UbiProber_32.zip

The installation guide can be downloaded here.
 
     
 How to start?
This is the home page of UbiProber:
  

There are two ways on this page to submit a query protein whose ubiquitylation sites you want to predict. In the first, you specify the query protein by its UniProt accession number or entry name. For example, if you are interested in the ubiquitylation sites of a ubiquitin protein, fill in the "Protein ID" field with your protein ID:
  
and then click the "Load" button
 
In this case, we use the default protein IDs A2RU67,Q9UIC8. If you do not know the protein ID (or protein sequence) of the proteins of interest, the section on how to find the protein ID/FASTA may be helpful. The second way is similar to the process above, except that it requires a protein sequence in FASTA format. Which one to use, a protein ID or a protein sequence, is up to you.
Prediction options
 Before submitting a query protein, you will see the prediction options of UbiProber like this:      
 
This panel consists of two regions: species type selection and threshold setting. The left region provides four models for predicting ubiquitylation sites, while the right region controls the stringency of the prediction.
     
  How to find the protein ID/FASTA that I am interested in?
Sometimes you only know the protein name (e.g. cysteine desulfurase) and have no appropriate protein ID or protein sequence at hand. You may also be unfamiliar with the ID type used in UbiProber (UniProt accession number) or with the FASTA format. Here we provide a simple method to get the protein ID/FASTA. Note that a protein name may match many protein IDs/sequences, and you must choose the most appropriate one yourself.

First, go to the home page of UniProt (shown below), which offers a powerful keyword search across the whole UniProtKB database.

All we need to do is enter the keyword. Here we use "Ubiquitin-like protein" as an example. Then click the "Search" button.

This is the results page. In this example, Q9SCA9 is the UniProt accession number and Q9SCA9_SOLLC is the entry name. Both are valid inputs for UbiProber.

Furthermore, if you want the FASTA of this sequence, please click the entry name,

then you will see an entry page like this. Click the "FASTA format" link indicated by the arrow to retrieve the FASTA.
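If you want to check a retrieved FASTA record before submitting it, a small parser like the Python sketch below can help. It assumes the usual UniProt header layout (database, accession and entry name separated by "|"); the function names are hypothetical, and the sequence in the example is an invented placeholder, not the real sequence of Q9SCA9.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def uniprot_ids(header):
    """Return (accession, entry_name) from a UniProt-style header,
    or None if the header does not follow the 'db|accession|name' layout."""
    parts = header.split("|")
    if len(parts) < 3:
        return None
    return parts[1], parts[2].split()[0]


# Placeholder record: the sequence is invented for illustration only.
example = """>tr|Q9SCA9|Q9SCA9_SOLLC placeholder description
MKTAYIAKQR
QISFVKSHFS
"""
records = parse_fasta(example)
ids = uniprot_ids(records[0][0])
```

The accession or entry name recovered this way corresponds to what the "Protein ID" field expects, while the joined sequence is what the FASTA input carries.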
 
     
  Result panel      

For normal queries, the prediction results consist of three parts. The first part lists basic information about the prediction. The second part provides summary statistics for the predicted ubiquitylation sites. The third part shows the detailed results for each predicted ubiquitylation site.
 
     
  LIBSVM
Chang CC, Lin CJ: LIBSVM: a Library for Support Vector Machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html
 
     
  CD-HIT
Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22(13):1658-1659. Website: http://weizhong-lab.ucsd.edu/cdhit_suite/cgi-bin/index.cgi?cmd=cd-hit
 
     
  Several ubiquitylation site prediction tools
To date, several ubiquitylation site prediction tools have been developed.
UbiPred. Tung and Ho (2008) developed a predictor using a Support Vector Machine (SVM) with 31 informative physicochemical features selected from the published amino acid indices.
UbPred. Radivojac et al. proposed a random forest-based predictor in which 586 sequence attributes were used as the input feature vector.
CKSAAP_UbSite. Zhang et al. presented a method that uses the composition of k-spaced amino acid pairs surrounding a query site together with a support vector machine.
 
     
  Acknowledgment
This work was supported by the Program for New Century Excellent Talents in University (NCET-11-1002) and the National Natural Science Foundation of China (20605010 and 21175064).
 
     
  Visit lab web page
This lab focuses on research into biological information theories and their application to complicated biological problems. Our long-term research interest is to develop effective computational algorithms and methods that select useful information from the mass of data in public datasets, which can in turn provide useful insights for medical science and biology. Webpage: http://bioinfo.ncu.edu.cn/default.aspx
 
     
  Feedback
Your feedback is very important in helping us improve UbiProber. Please feel free to contact us if you have any concerns. Thanks very much! Contact Information: Department of Chemistry, Nanchang University, Nanchang 330031, China. E-mail address: jdqiu@ncu.edu.cn
 
     
  The software of UbiProber seems broken
For a better experience, we have implemented an easy-to-use Windows Forms application of UbiProber, which runs on the Windows operating system. The software is written in C# as a Windows application for the .NET 4.0 framework, and the project provides an open platform for developing machine learning-based applications that predict protein ubiquitylation sites. Here we provide some checkpoints to help people who are unfamiliar with these software technologies.
Run the software
   Please read the Manual first; it can be found in the UbiProber software package.
.NET 4.0 framework
   You will first need to install the .NET 4.0 framework, which can be found in the UbiProber software package.
Tested environments
   The following environments have been tested for compatibility:
   Windows XP - .NET 4.0 framework
   Windows 7  - .NET 4.0 framework
   Windows 8  - .NET 4.0 framework

We would appreciate it if you could tell us about any other environment in which UbiProber runs well. Likewise, please feel free to contact us if you cannot use UbiProber in a specific environment, and we will try to test UbiProber in it.
 
     
  Copyright © 2012 Jian Ding Qiu's Lab. NanChang University.