1. Spectrum clustering of PRIDE MS/MS data
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-EBI
Hinxton, Cambridge, UK
2. Juan A. Vizcaíno
juan@ebi.ac.uk
Seminar
20 June 2016
Overview
• Brief introduction to PRIDE
• PRIDE Cluster: motivation and concept, first implementation
• PRIDE Cluster second implementation
4. PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
5. ProteomeXchange: A global, distributed proteomics database
[Figure: ProteomeXchange repositories PASSEL (SRM data), PRIDE (MS/MS data) and MassIVE (MS/MS data); data types shown: Raw, ID/Q, Meta]
• 155 datasets/month since July 2015
• Mandatory raw data deposition since July 2015
7. Motivation
• Data is stored in PRIDE as originally analysed by the submitters (no data reprocessing is done)
• Quality is heterogeneous, which makes the data difficult to compare
• Enable assessment of (published) proteomics data
• Prerequisite for data reuse (e.g. in other bioinformatics resources such as UniProt)
8. PRIDE Cluster - Concept
Griss et al., Nat Methods, 2013
[Figure: originally submitted identified spectra (e.g. NMMAACDPR, PPECPDFDPPR) are grouped into clusters, each represented by a consensus spectrum]
Threshold: at least 10 spectra in a cluster and ratio >70%.
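The threshold on this slide can be illustrated with a minimal Python sketch. It assumes that "ratio" means the share of spectra in a cluster assigned to its most frequent peptide sequence, and it represents a cluster as a plain list of sequence strings. The function names are hypothetical and are not part of the PRIDE Cluster code.

```python
from collections import Counter


def top_sequence_ratio(sequences):
    """Return (most frequent sequence, its share of the cluster)."""
    if not sequences:
        raise ValueError("cluster is empty")
    sequence, count = Counter(sequences).most_common(1)[0]
    return sequence, count / len(sequences)


def is_reliable(sequences, min_size=10, min_ratio=0.70):
    """Slide threshold: at least min_size spectra and ratio > min_ratio.

    Assumption: 'ratio' is the share of the most frequent sequence.
    """
    if len(sequences) < min_size:
        return False
    _, ratio = top_sequence_ratio(sequences)
    return ratio > min_ratio


# Hypothetical cluster: 8 spectra identified as NMMAACDPR, 2 as PPECPDFDPPR.
cluster = ["NMMAACDPR"] * 8 + ["PPECPDFDPPR"] * 2
print(top_sequence_ratio(cluster))  # ('NMMAACDPR', 0.8)
print(is_reliable(cluster))         # True
```

Calling is_reliable(cluster, min_size=3) would apply the lower size threshold quoted for the second implementation.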
10. PRIDE Cluster: Implementation
• Griss et al., Nat. Methods 2013
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
14. PRIDE Cluster: Second Implementation
• Griss et al., Nat. Methods 2016, in press
• Clustered all spectra that were public in PRIDE as of April 2015
• Apache Hadoop
• Starting with 256 M spectra:
• 190 M unidentified spectra (filtered down to 111 M spectra that are likely to represent a peptide)
• 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
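The figures on this slide can be cross-checked with a few lines of Python. This is only an arithmetic sketch: it assumes that the clustering input was the 111 M filtered unidentified spectra plus the 66 M identified spectra, and that all 340 cores were available for the full 5 days, so the core-day figure is an upper bound and not a measured value.

```python
# All spectrum and cluster counts are in millions, copied from the slide.
UNIDENTIFIED_ALL = 190
UNIDENTIFIED_FILTERED = 111
IDENTIFIED = 66
CLUSTERS = 28
CPU_CORES = 340
CALENDAR_DAYS = 5

# The two categories should add up to the stated starting total of 256 M.
total_spectra = UNIDENTIFIED_ALL + IDENTIFIED

# Assumption: only the filtered unidentified spectra entered the clustering.
clustering_input = UNIDENTIFIED_FILTERED + IDENTIFIED

# Average number of spectra per cluster under that assumption.
mean_cluster_size = clustering_input / CLUSTERS

# Upper bound on compute used, assuming every core ran for the whole period.
core_days_upper_bound = CPU_CORES * CALENDAR_DAYS

print(f"total spectra: {total_spectra} M")
print(f"clustering input: {clustering_input} M")
print(f"mean spectra per cluster: {mean_cluster_size:.1f}")
print(f"core-days (upper bound): {core_days_upper_bound}")
```

Under these assumptions the totals are consistent with the slide (190 M + 66 M = 256 M), and a cluster holds a little over six spectra on average.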
15. PRIDE Cluster - Concept
Griss et al., Nat Methods, 2016
[Figure: originally submitted identified spectra (e.g. NMMAACDPR, PPECPDFDPPR) are grouped into clusters, each represented by a consensus spectrum]
Threshold: at least 3 spectra in a cluster and ratio >70%.
19. Output of the analysis
• 1. Inconsistent spectrum clusters
• 2. Clusters including identified and unidentified spectra
• 3. Clusters just containing unidentified spectra
20. Output 1: Re-analysis of inconsistent clusters
[Figure: originally submitted identified spectra with differing sequences (NMMAACDPR, IGGIGTVPVGR, PPECPDFDPPR, VFDEFKPLVEEPQNLIK) grouped into one cluster with a single consensus spectrum]
No sequence has a proportion in the cluster >50%.
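The >50% criterion for an inconsistent cluster can be sketched as follows. This is an illustration only: the cluster contents are hypothetical (the sequences are borrowed from the figure, the counts are invented), and the helper names are not taken from the PRIDE Cluster code.

```python
from collections import Counter


def sequence_proportions(sequences):
    """Map each peptide sequence to its share of the cluster's spectra."""
    if not sequences:
        raise ValueError("cluster is empty")
    total = len(sequences)
    return {seq: n / total for seq, n in Counter(sequences).items()}


def is_inconsistent(sequences, majority=0.50):
    """True when no sequence has a proportion above the majority cutoff."""
    return max(sequence_proportions(sequences).values()) <= majority


# Hypothetical cluster of six identified spectra with four distinct sequences.
mixed = (["NMMAACDPR"] * 2 + ["IGGIGTVPVGR"] * 2
         + ["PPECPDFDPPR", "VFDEFKPLVEEPQNLIK"])
print(is_inconsistent(mixed))  # True: the largest share is 2/6

# Hypothetical cluster dominated by one sequence.
dominated = ["NMMAACDPR"] * 3 + ["IGGIGTVPVGR"]
print(is_inconsistent(dominated))  # False: NMMAACDPR has 3/4
```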
21. Output 1: Re-analysis of inconsistent clusters (continued)
• Re-analysed 3,997 large (>100 spectra), inconsistent clusters with PepNovo, SpectraST and X!Tandem.
• 453 clusters (11%) were identified as peptides originating from keratins, trypsin, albumin and hemoglobin.
• In these cases, it is likely that a contaminant database was not used in the search.
28. Output 2: Inferring identifications for originally unidentified spectra
• 9.1 M unidentified spectra were contained in clusters with a reliable identification.
• These are candidate new identifications (that need to be confirmed), often missed due to search engine settings.
• Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
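The idea of transferring a cluster's reliable identification to its unidentified members can be sketched as below. This is a hedged illustration, not the PRIDE Cluster implementation: the Spectrum class and function name are hypothetical, and the sketch assumes that the size and ratio thresholds (3 spectra, >70%) are evaluated on the identified spectra only.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Spectrum:
    spectrum_id: str
    sequence: Optional[str] = None  # None marks an unidentified spectrum


def candidate_identifications(cluster, min_identified=3, min_ratio=0.70):
    """Propose the cluster's dominant sequence for its unidentified spectra.

    Returns {spectrum_id: sequence}. The result is empty when the cluster
    does not pass the (assumed) reliability thresholds.
    """
    identified = [s.sequence for s in cluster if s.sequence is not None]
    if len(identified) < min_identified:
        return {}
    top_sequence, top_count = Counter(identified).most_common(1)[0]
    if top_count / len(identified) <= min_ratio:
        return {}
    # Candidates only: as the slide notes, they still need to be confirmed.
    return {s.spectrum_id: top_sequence for s in cluster if s.sequence is None}


# Hypothetical cluster: four identified spectra and two unidentified ones.
cluster = [Spectrum(f"id{i}", "NMMAACDPR") for i in range(4)]
cluster += [Spectrum("u1"), Spectrum("u2")]
print(candidate_identifications(cluster))
```

A follow-up search with adjusted settings (for example, adding the relevant modification) would be one way to confirm such candidates.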
29. Output 3: Consistently unidentified clusters
• 19 M clusters contain only unidentified spectra.
• 41,155 of these clusters have more than 100 spectra (= 12 M spectra).
• Most are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides.
• With various methods, we found likely identifications for about 20%.
• Vast amount of data mining remains to be done.
31. PRIDE Cluster as a Public Data Mining Resource
• http://www.ebi.ac.uk/pride/cluster
• Spectral libraries for 16 species.
• All clustering results, as well as specific subsets of interest, are available.
• Source code (open source) and Java API
33. Applications of spectrum clustering…
• In individual datasets or small groups of “similar” datasets:
• Can be used to target spectra that are “consistently” unidentified.
• Unidentified spectra could represent PTMs or sequence variants.
• Try “more expensive” computational analysis methods (e.g. spectral searches, de novo).
• When mixing identified and unidentified spectra from different experiments, if PTMs that were not found initially are identified, one could modify the initial search parameters.
• For quantification purposes, could the alignment of different runs be improved by clustering the spectra first?
34. Acknowledgements
Johannes Griss
Rui Wang
Yasset Perez-Riverol
Steve Lewis
Henning Hermjakob
OpenMS team (led by O. Kohlbacher)
David Tabb
The rest of the PRIDE team, especially Noemi del Toro and Jose A. Dianes