1. Musite: Prediction of Protein
Phosphorylation Sites
Jianjiong Gao
University of Missouri, Columbia, Missouri
http://musite.sourceforge.net/
2. Background:
Protein Phosphorylation
Protein phosphorylation is one of the most
important post-translational modifications.
It was estimated that up to 50% of proteins are
phosphorylated in some cellular state
Abnormality in phosphorylation is a cause or
consequence of many diseases
Cancer
Diabetes
Parkinson’s disease
Hepatitis B
…
3. Background:
Protein Phosphorylation
Phosphorylation-dephosphorylation is a
biochemical switch system regulating
various cellular processes.
Phosphorylation is catalyzed by specific protein
kinases.
[Diagram: a protein toggles between ON and OFF states; labels: Kinase, Phosphatase]
4. Phosphorylation Site Prediction
Problem Formulation
Phosphorylation site: a phosphorylated amino acid
in a protein (determined by protein sequence)
General phosphorylation site prediction: to predict
whether an amino acid can be phosphorylated
Kinase-specific phosphorylation site prediction: to
predict whether an amino acid can be
phosphorylated by a specific kinase
Based on protein sequence only
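The formulation above treats every candidate residue in a sequence as one prediction instance. A minimal sketch of that enumeration follows. It assumes that Ser, Thr, and Tyr are the candidate residues and that each site is described by a fixed-size local window; the function name `candidate_sites`, the `flank` size, and the `-` padding character are hypothetical choices that do not come from the slides.

```python
from typing import Iterator, Tuple

# Assumption: Ser (S), Thr (T), and Tyr (Y) are the candidate residues.
CANDIDATE_RESIDUES = frozenset("STY")


def candidate_sites(
    sequence: str, flank: int = 6, pad: str = "-"
) -> Iterator[Tuple[int, str, str]]:
    """Yield (1-based position, residue, local window) per candidate residue.

    The window holds `flank` residues on each side of the site. Positions
    beyond either end of the protein are filled with `pad`.
    """
    sequence = sequence.upper()
    padded = pad * flank + sequence + pad * flank
    for i, residue in enumerate(sequence):
        if residue in CANDIDATE_RESIDUES:
            # In `padded` the site sits at index i + flank, so the slice
            # [i, i + 2 * flank + 1) is centered on it.
            yield i + 1, residue, padded[i:i + 2 * flank + 1]


# Toy sequence, for illustration only.
sites = list(candidate_sites("MKSPTAY", flank=2))
for position, residue, window in sites:
    print(position, residue, window)
```

Each yielded window is what a sequence-only predictor would see for that site; a general model scores every window, while a kinase-specific model scores it for one kinase.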
5. Limitations of Current Methods
Current prediction tools have
limitations when applied to whole
proteomes
Prediction accuracy could be improved
Most were released as web servers and
restrict the data that users can upload
Training data are out of date
Stringency adjustment is not fully
supported
6. Our tool Musite is unique
Novel method with better accuracy
First open-source tool in the field that meets the
OSI Open Standards Requirement
Standalone program designed for proteome-scale
prediction
Supports both general and kinase-specific
phosphorylation site prediction
Supports customized model training
Supports continuous stringency adjustment
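One way to picture continuous stringency adjustment is as a score cutoff chosen to reach a target specificity on control (non-phosphorylated) sites. The sketch below illustrates that idea only and is not taken from the Musite code; the function names and the toy control scores are hypothetical.

```python
import math
from typing import Sequence


def threshold_for_specificity(
    control_scores: Sequence[float], specificity: float
) -> float:
    """Return a cutoff with at least `specificity` of control scores <= cutoff.

    Sites scoring strictly above the cutoff are called positive.
    """
    if not control_scores:
        raise ValueError("control_scores must not be empty")
    if not 0.0 <= specificity <= 1.0:
        raise ValueError("specificity must be in [0, 1]")
    ordered = sorted(control_scores)
    # Number of control sites that must fall at or below the cutoff.
    # round() guards against float noise such as 0.7 * 10 = 7.000000000000001.
    needed = math.ceil(round(specificity * len(ordered), 9))
    if needed == 0:
        return float("-inf")
    return ordered[needed - 1]


def observed_specificity(control_scores: Sequence[float], cutoff: float) -> float:
    """Fraction of control sites that are not called positive at `cutoff`."""
    return sum(score <= cutoff for score in control_scores) / len(control_scores)


# Toy control scores, for illustration only.
control = [i / 10 for i in range(1, 11)]
cutoff = threshold_for_specificity(control, 0.9)
print(cutoff, observed_specificity(control, cutoff))
```

Because the cutoff is read off the sorted control scores, any target specificity between 0 and 1 maps to a threshold, which is what makes the adjustment continuous rather than a choice among a few presets.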
7. Phosphorylation Site Prediction
Flowchart
Data collection from high-quality sources, such as UniProt/Swiss-Prot, Phospho.ELM, PhosphoPep, and PhosPhAt
Non-redundant datasets built by BLASTClust
Phosphorylation sites and non-phosphorylation sites
Feature extraction: KNN scores, disorder scores, amino acid frequencies
Features from positive set and features from negative set
Training data
Bootstrap: bootstrap sample 1 ... bootstrap sample m
Training: classifier 1 ... classifier m
Aggregating
Control data: specificity estimation
Phosphorylation prediction model
Making predictions on new data
8. Phosphorylation Site Prediction
Data Extraction
Data collection from high-quality sources, such as UniProt/Swiss-Prot, Phospho.ELM, PhosphoPep, and PhosPhAt
Non-redundant datasets built by BLASTClust
Phosphorylation sites and non-phosphorylation sites
9. Phosphorylation Site Prediction
Feature Extraction
Feature extraction
KNN scores
Disorder scores
Amino acid frequencies
Features from positive set and features from negative set
11. KNN Features
Motivation
Rationale for using k-nearest neighbor (KNN) features: local
sequence clusters exist around
phosphorylation sites, since
Each phosphorylation site is a substrate of a specific
protein kinase
Substrates of the same kinase or kinase family
usually share similar patterns in local sequences
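The clustering rationale above suggests a simple feature: among the labeled windows most similar to a query window, what fraction are known phosphosites? The sketch below computes such a score. The slides do not say how sequence similarity is measured, so Hamming distance over equal-length windows is used here purely as a stand-in, and the function names and toy windows are hypothetical.

```python
from typing import Sequence, Tuple


def hamming_distance(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length windows."""
    if len(a) != len(b):
        raise ValueError("windows must have the same length")
    return sum(x != y for x, y in zip(a, b))


def knn_score(
    query: str, labeled_windows: Sequence[Tuple[str, bool]], k: int
) -> float:
    """Fraction of the k nearest labeled windows that are phosphosites."""
    if not 1 <= k <= len(labeled_windows):
        raise ValueError("k must be between 1 and the number of labeled windows")
    # sorted() is stable, so windows at equal distance keep their input order.
    nearest = sorted(
        labeled_windows, key=lambda item: hamming_distance(query, item[0])
    )[:k]
    return sum(is_phospho for _, is_phospho in nearest) / k


# Toy windows, for illustration only: (window, is_phosphosite).
labeled = [
    ("RRASV", True),
    ("RKASL", True),
    ("GGSGG", False),
    ("LLTLL", False),
]
print(knn_score("RRASL", labeled, k=2))
print(knn_score("GGSGA", labeled, k=1))
```

With equal numbers of positive and negative windows, a query with no local similarity to either class would score near 0.5, while a query resembling known substrates would score higher. In practice `k` could be set as a percentage of the sample size.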
12. KNN Features
Result
Overall, phosphosites have larger KNN scores than non-phosphosites
Average KNN scores
0.7~0.8 for phosphosites
≈0.5 for non-phosphosites
[Figure (A): Boxplot of KNN features (Human Ser/Thr); series: Phospho, Nonphospho; x-axis: size of nearest neighbors (% of sample size) at 0.25, 0.5, 1, 2, 4; y-axis: KNN score]
13. Disorder Features
Concept & Rationale
Disordered region (structure)
Some parts of a protein have a rigid structure,
such as α-helix and β-sheet.
Other parts, disordered regions, do not have
well-defined conformations
The conformational flexibility of disordered
regions may facilitate protein phosphorylation
[Dunker, 2008]: protein phosphorylation sites
are frequently located within disordered regions
14. Disorder Features
Result
For phosphosites
Occurrence increases exponentially when disorder score increases
For non-phosphosites
Significantly different distribution
Disorder score > 0.5
Phosphosites: ~91%
Non-phosphosites: ~55%
Phosphosites are significantly over-represented in disordered regions
[Figure: Histogram of disorder features (Human Ser/Thr); panel (A): Phospho-S/T in H. sapiens; panel (B): Non-phospho-S/T in H. sapiens; x-axis: disorder score, 0 to 1; y-axis: occurrence]
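The "disorder score > 0.5" comparison can be reproduced for any dataset once each residue has a predicted disorder score. The sketch below assumes a per-residue disorder profile is already available from an external predictor, which the slides do not name; the function names and the toy profile are hypothetical.

```python
from typing import List, Sequence


def site_disorder_scores(
    disorder: Sequence[float], positions: Sequence[int]
) -> List[float]:
    """Pick the per-residue disorder score at each 1-based site position."""
    for position in positions:
        if not 1 <= position <= len(disorder):
            raise ValueError(f"position {position} is outside the protein")
    return [disorder[position - 1] for position in positions]


def fraction_disordered(scores: Sequence[float], cutoff: float = 0.5) -> float:
    """Fraction of sites whose disorder score is strictly above `cutoff`."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(score > cutoff for score in scores) / len(scores)


# Illustrative values only; a real profile would come from a disorder predictor.
profile = [0.1, 0.2, 0.7, 0.9, 0.8, 0.4, 0.6]
phospho_scores = site_disorder_scores(profile, [3, 4, 5])
nonphospho_scores = site_disorder_scores(profile, [2, 6, 7])
print(fraction_disordered(phospho_scores))
print(fraction_disordered(nonphospho_scores))
```

Applied to real phosphosite and non-phosphosite positions, the two fractions correspond to the ~91% and ~55% figures reported on this slide.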
15. Amino Acid Frequencies
Result
[Figure: Log2(ratio of frequency) per amino acid, ordered P R D E S K G A Q N V T H L M I F Y W C; series: H. sapiens (S/T), M. musculus (S/T), D. melanogaster (S/T), C. elegans (S/T), S. cerevisiae (S/T), A. thaliana (S/T)]
P, R, D, E, S, K, and G are enriched around
phosphosites
C, W, Y, F, I, M, L, H, T, and V are depleted
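The enrichment and depletion pattern is read from a log2 ratio of amino acid frequencies around phosphosites versus non-phosphosites. The sketch below shows one way to compute such a ratio from local windows. The slides do not give the window size or any smoothing, so excluding the central residue and adding a pseudocount are assumptions made here, and the function names and toy windows are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_frequencies(
    windows: Iterable[str], pseudocount: float = 1.0
) -> Dict[str, float]:
    """Smoothed amino acid frequencies over the flanking positions of windows."""
    counts: Counter = Counter()
    for window in windows:
        center = len(window) // 2
        for i, residue in enumerate(window):
            # Assumption: skip the central residue (the candidate site itself)
            # and anything that is not a standard amino acid, such as padding.
            if i != center and residue in AMINO_ACIDS:
                counts[residue] += 1
    total = sum(counts[aa] for aa in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def log2_frequency_ratio(
    phospho_windows: Iterable[str],
    nonphospho_windows: Iterable[str],
    pseudocount: float = 1.0,
) -> Dict[str, float]:
    """log2(frequency around phosphosites / frequency around non-phosphosites)."""
    positive = residue_frequencies(phospho_windows, pseudocount)
    negative = residue_frequencies(nonphospho_windows, pseudocount)
    return {aa: math.log2(positive[aa] / negative[aa]) for aa in AMINO_ACIDS}


# Toy windows, for illustration only.
ratios = log2_frequency_ratio(["RRSPE"], ["LLSLL"])
for aa in "RPL":
    print(aa, round(ratios[aa], 3))
```

A positive value marks a residue as enriched around phosphosites and a negative value marks it as depleted, which is how the plot on this slide is read.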
16. Phosphorylation Site Prediction
Classifier Training
Training data
Bootstrap: bootstrap sample 1 ... bootstrap sample m
Training: classifier 1 ... classifier m
Aggregating
Phosphorylation prediction model
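The training stage draws m bootstrap samples, trains one classifier per sample, and aggregates the results. The sketch below illustrates that bootstrap-aggregating pattern only. The slides do not name the base classifier, so a nearest-centroid scorer is used as a stand-in; resampling each class separately and averaging the scores are also assumptions, and all names and feature values are hypothetical.

```python
import math
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Example = Tuple[List[float], int]  # (feature vector, label: 1 = phosphosite)
Scorer = Callable[[Sequence[float]], float]


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of feature vectors."""
    return [mean(column) for column in zip(*vectors)]


def train_centroid_scorer(examples: Sequence[Example]) -> Scorer:
    """Stand-in base learner: a score near 1 means closer to the positive centroid."""
    positive = centroid([x for x, label in examples if label == 1])
    negative = centroid([x for x, label in examples if label == 0])

    def score(x: Sequence[float]) -> float:
        d_pos = math.dist(x, positive)
        d_neg = math.dist(x, negative)
        total = d_pos + d_neg
        return 0.5 if total == 0 else d_neg / total

    return score


def bootstrap_sample(examples: Sequence[Example], rng: random.Random) -> List[Example]:
    """Resample with replacement within each class so both classes stay present."""
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    return rng.choices(positives, k=len(positives)) + rng.choices(
        negatives, k=len(negatives)
    )


def train_bagged_model(examples: Sequence[Example], m: int, seed: int = 0) -> Scorer:
    """Train m scorers on bootstrap samples and average their scores."""
    rng = random.Random(seed)
    scorers = [train_centroid_scorer(bootstrap_sample(examples, rng)) for _ in range(m)]
    return lambda x: mean(scorer(x) for scorer in scorers)


# Toy features [KNN score, disorder score], for illustration only.
data: List[Example] = [
    ([0.8, 0.9], 1), ([0.7, 0.8], 1), ([0.9, 0.7], 1),
    ([0.5, 0.4], 0), ([0.4, 0.6], 0), ([0.5, 0.3], 0),
]
model = train_bagged_model(data, m=5)
print(model([0.85, 0.9]) > 0.5, model([0.45, 0.35]) < 0.5)
```

Averaging over several resampled classifiers reduces the variance of any single one. The aggregated score is also a continuous value, so a cutoff can be applied to it afterwards.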
20. Phosphorylation Site Prediction
Software Implementation-Musite
Open Source
License: GNU General Public License (GPL)
http://musite.sourceforge.net/
Stand-alone application
Based on Java
Supports Windows, Linux, and Mac OS X
A web server is also being developed
http://musite.net/
22. Implementation
Customized Model Training
A unique utility for users to train
prediction models from their own data
Take advantage of the latest data
Train disease-specific models
Train organ-specific models
Integrate into the experimental procedure in an
iterative way
23. Summary
Musite predicts general and kinase-specific
phosphosites with improved accuracy
Musite is an open-source standalone program
capable of performing proteome-wide
predictions
24. Acknowledgements
Dr. Dong Xu (University of Missouri)
Dr. Jay Thelen (University of Missouri)
Dr. Keith Dunker (Indiana University)
Curtis Bollinger (University of Missouri)
Funding
NSF [# DBI-0604439]
NIH [# R21/R33 GM078601]
Visit us at
http://musite.sourceforge.net
http://musite.net
Poster R09 at ISMB