Pride Cluster 062016 Update

Spectrum clustering of PRIDE MS/MS data
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-EBI
Hinxton, Cambridge, UK

Juan A. Vizcaíno
juan@ebi.ac.uk
Seminar
20 June 2016
Overview
• Brief introduction to PRIDE
• PRIDE Cluster: motivation and concept, first implementation
• PRIDE Cluster second implementation

PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride
• PRIDE stores mass spectrometry (MS)-based proteomics data:
  • Peptide and protein expression data (identification and quantification)
  • Post-translational modifications
  • Mass spectra (raw data and peak lists)
  • Technical and biological metadata
  • Any other related information
• Full support for tandem MS approaches
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016

ProteomeXchange: A global, distributed proteomics database
• Member repositories: PASSEL (SRM data), PRIDE (MS/MS data), MassIVE (MS/MS data)
• Data types shown in the diagram: Raw, ID/Q, Meta
• 155 datasets/month since July 2015
• Mandatory raw data deposition since July 2015

Overview
• Brief introduction to PRIDE
• PRIDE Cluster: motivation and concept, first implementation
• PRIDE Cluster second implementation

Motivation
• Data is stored in PRIDE as originally analysed by the submitters (no data reprocessing is done)
• Heterogeneous quality, difficult to make the data comparable
• Enable assessment of (published) proteomics data
• Pre-requisite for data reuse (e.g. in other bioinformatics resources such as UniProt)

PRIDE Cluster - Concept
Griss et al., Nat Methods, 2013
[Figure: originally submitted identified spectra (peptide labels NMMAACDPR and PPECPDFDPPR) grouped into a cluster with a consensus spectrum]
Threshold: At least 10 spectra in a cluster and ratio >70%.

PRIDE Cluster: Implementation
• Griss et al., Nat Methods, 2013
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets

PRIDE Archive
• World-leading repository for MS/MS-based proteomics data
• Founding member of ProteomeXchange

PRIDE Cluster
[Diagram labels: sequence-based search engines; spectrum clustering; incorrectly identified or unidentified spectra]

PRIDE Cluster: Second Implementation
First implementation (Griss et al., Nat Methods, 2013):
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets

Second implementation (Griss et al., Nat Methods, 2016, in press):
• Clustered all public spectra in PRIDE by April 2015
• Apache Hadoop
• Starting with 256 M spectra:
  • 190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide)
  • 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
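The figures on this slide can be cross-checked with a little arithmetic. The sketch below is only an illustration: it assumes that the 111 M filtered unidentified spectra plus the 66 M identified spectra were the ones clustered, and that the Hadoop run kept all 340 cores busy for the full 5 days (so the CPU-day figure is an upper bound and the throughput a lower bound). The function name `summarize` is hypothetical.

```python
def summarize():
    # Figures quoted on the slide, in millions of spectra
    total, unidentified, identified = 256, 190, 66
    unidentified_kept, clusters = 111, 28

    # Assumption: only the filtered unidentified spectra were clustered
    clustered = unidentified_kept + identified

    # First implementation: 20.7 M spectra in 610 CPU days
    first_rate = 20.7e6 / 610

    # Second implementation: 5 days x 340 cores is an upper bound on CPU days
    second_cpu_days = 5 * 340
    second_rate = clustered * 1e6 / second_cpu_days

    return {
        "parts_add_up": unidentified + identified == total,
        "clustered_millions": clustered,
        # Mean size if every clustered spectrum sits in one of the 28 M clusters
        "mean_spectra_per_cluster": clustered / clusters,
        "first_spectra_per_cpu_day": first_rate,
        "second_spectra_per_cpu_day_lower_bound": second_rate,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(key, value)
```

Under these assumptions the second run handled roughly three times as many spectra per CPU day as the first, on a much larger input.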

Griss et al., Nat Methods, 2016
[Figure: spectra (peptide labels NMMAACDPR and PPECPDFDPPR) grouped into a cluster with a consensus spectrum]
Threshold: At least 3 spectra in a cluster and ratio >70%.
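The threshold above, and the stricter one quoted for the first implementation (at least 10 spectra), can be sketched as a single parameterized check. This is a minimal sketch, not the PRIDE Cluster code: it assumes that "ratio" means the share of the most frequent peptide sequence among the identified spectra in the cluster, and that cluster size counts all spectra. The name `is_reliable` and the use of `None` for unidentified spectra are illustrative choices.

```python
from collections import Counter


def is_reliable(peptides, min_size=3, min_ratio=0.7):
    """Return True if a cluster passes the size and ratio thresholds.

    peptides: one entry per spectrum in the cluster, holding the assigned
    peptide sequence, or None for an unidentified spectrum (assumption).
    """
    identified = [p for p in peptides if p is not None]
    if len(peptides) < min_size or not identified:
        return False
    # Count of the most frequent peptide sequence in the cluster
    top_count = Counter(identified).most_common(1)[0][1]
    return top_count / len(identified) > min_ratio


# Illustrative cluster: 5 spectra of one peptide, 2 of another (ratio 5/7)
example = ["NMMAACDPR"] * 5 + ["PPECPDFDPPR"] * 2
print(is_reliable(example))               # second-implementation threshold
print(is_reliable(example, min_size=10))  # first-implementation threshold
```

With these illustrative counts the cluster passes the 3-spectrum threshold but is too small for the 10-spectrum one.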

Output of the analysis
• 1. Inconsistent spectrum clusters
• 2. Clusters including identified and unidentified spectra
• 3. Clusters just containing unidentified spectra

1. Re-analysis of inconsistent clusters
[Figure: a cluster whose spectra carry conflicting identifications (NMMAACDPR, IGGIGTVPVGR, PPECPDFDPPR, VFDEFKPLVEEPQNLIK), with its consensus spectrum]
No sequence has a proportion in the cluster >50%.
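The inconsistency criterion on this slide (no sequence with a proportion above 50%) can be sketched as follows. This is an illustrative check rather than the published implementation; the function name `is_inconsistent` and the spectrum counts in the example are assumptions, and only the peptide sequences come from the slide.

```python
from collections import Counter


def is_inconsistent(peptides, max_share=0.5):
    """Return True if no peptide sequence exceeds max_share of the cluster.

    peptides: the peptide sequence assigned to each identified spectrum.
    """
    if not peptides:
        raise ValueError("cluster has no identified spectra")
    # Count of the most frequent peptide sequence in the cluster
    top_count = Counter(peptides).most_common(1)[0][1]
    return top_count / len(peptides) <= max_share


# Hypothetical counts for the four sequences shown in the figure
cluster = (
    ["NMMAACDPR"] * 4
    + ["IGGIGTVPVGR"] * 2
    + ["PPECPDFDPPR"] * 2
    + ["VFDEFKPLVEEPQNLIK"] * 2
)
print(is_inconsistent(cluster))  # top sequence covers 4 of 10 spectra
```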

1. Re-analysis of inconsistent clusters
• Re-analysed 3,997 large (>100 spectra), inconsistent clusters with PepNovo, SpectraST and X!Tandem.
• 453 clusters (11%) were identified as peptides originating from keratins, trypsin, albumin, and hemoglobin.
• In these cases, it is likely that a contaminants database was not used in the original search.

Validation

2. Inferring identifications for originally unidentified spectra
• 9.1 M unidentified spectra were contained in clusters with a reliable identification.
• These are candidate new identifications (that need to be confirmed), often missed due to search engine settings.
• Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
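The inference step described above can be sketched as propagating the dominant peptide of a reliable cluster to its unidentified members. This is a minimal sketch under stated assumptions, not the PRIDE Cluster code: "reliable" is taken to mean at least 3 spectra and a dominant-sequence ratio above 70% among the identified spectra, and the function name, the `(spectrum_id, peptide)` tuples and the spectrum IDs are hypothetical.

```python
from collections import Counter


def candidate_identifications(cluster, min_size=3, min_ratio=0.7):
    """Map each unidentified spectrum to the cluster's dominant peptide.

    cluster: list of (spectrum_id, peptide) pairs, with peptide set to None
    for unidentified spectra. Returns an empty dict for unreliable clusters.
    """
    identified = [pep for _, pep in cluster if pep is not None]
    if len(cluster) < min_size or not identified:
        return {}
    peptide, count = Counter(identified).most_common(1)[0]
    if count / len(identified) <= min_ratio:
        return {}
    # Candidates only: each still needs independent confirmation
    return {sid: peptide for sid, pep in cluster if pep is None}


cluster = [
    ("s1", "NMMAACDPR"),
    ("s2", "NMMAACDPR"),
    ("s3", "NMMAACDPR"),
    ("s4", None),
    ("s5", None),
]
print(candidate_identifications(cluster))
```

The returned mapping is a list of candidates to confirm, for example by re-searching the spectra with adjusted search engine settings.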

3. Consistently unidentified clusters
• 19 M clusters contain only unidentified spectra.
• 41,155 of these clusters have more than 100 spectra (= 12 M spectra).
• Most are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides.
• With various methods, we found likely identifications for about 20%.
• A vast amount of data mining remains to be done.

PRIDE Cluster as a Public Data Mining Resource
• http://www.ebi.ac.uk/pride/cluster
• Spectral libraries for 16 species.
• All clustering results, as well as specific subsets of interest, are available.
• Source code (open source) and Java API

Applications of spectrum clustering…
• In individual datasets or small groups of “similar” datasets:
  • Can be used to target spectra that are “consistently” unidentified.
  • Unidentified spectra could represent PTMs or sequence variants.
  • Try “more-expensive” computational analysis methods (e.g. spectral searches, de novo).
• When mixing identified and unidentified spectra from different experiments, if PTMs that were not found initially are identified, one could modify the initial search parameters.
• For quantification purposes, could the alignment of different runs be improved by clustering the spectra first?

Acknowledgements
Johannes Griss
Rui Wang
Yasset Perez-Riverol
Steve Lewis
Henning Hermjakob
OpenMS team (led by O. Kohlbacher)
David Tabb
The rest of the PRIDE team, especially Noemi del Toro and Jose A. Dianes

Questions?
