Pride cluster presentation

Update to the PRIDE Cluster project
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK

Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride/archive
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016

PRIDE Cluster: Initial Motivation
• Provide a QC-filtered peptide-centric view of PRIDE.
• Data is stored in PRIDE Archive as originally analysed by the
submitters (no data reprocessing is done).
• Heterogeneous quality, difficult to make the data comparable.
• Enable assessment of (published) proteomics data. Prerequisite for
data reuse (e.g. in UniProt).

PRIDE Cluster - Concept
Griss et al., Nat Methods, 2016
[Figure: originally submitted identified spectra (example peptides NMMAACDPR and PPECPDFDPPR) are grouped by spectrum clustering, and each cluster is summarised by a consensus spectrum.]
Threshold: at least 3 spectra in a cluster and ratio >70%.
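The threshold on this slide can be sketched in a few lines of Python. This is only an illustration: it assumes that "ratio" means the share of a cluster's identified spectra that carry the most common peptide identification, and the function and constant names are hypothetical rather than taken from the PRIDE Cluster code.

```python
from collections import Counter

MIN_SPECTRA = 3    # "at least 3 spectra in a cluster"
MIN_RATIO = 0.70   # "ratio >70%" (interpretation assumed, see lead-in)

def cluster_ratio(peptide_ids):
    """Return (most common peptide, its share of the spectra in the cluster).

    Expects a non-empty list of peptide identifications.
    """
    counts = Counter(peptide_ids)
    top_peptide, top_count = counts.most_common(1)[0]
    return top_peptide, top_count / len(peptide_ids)

def is_reliable(peptide_ids):
    """Apply both conditions of the threshold to one cluster."""
    if len(peptide_ids) < MIN_SPECTRA:
        return False
    _, ratio = cluster_ratio(peptide_ids)
    return ratio > MIN_RATIO

# Toy cluster using the peptide labels shown in the figure
cluster = ["NMMAACDPR", "NMMAACDPR", "PPECPDFDPPR", "NMMAACDPR"]
print(cluster_ratio(cluster))  # ('NMMAACDPR', 0.75)
print(is_reliable(cluster))    # True
```

With three of four spectra agreeing, the toy cluster passes; a cluster of three spectra with only two agreeing (ratio 0.67) would not.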

PRIDE Cluster: Second Implementation
• Griss et al., Nat. Methods, 2013
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
• Griss et al., Nat. Methods, 2016
• Clustered all public spectra in PRIDE by April 2015
• Apache Hadoop
• Starting with 256 M spectra
• 190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide)
• 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
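The figures on this slide can be cross-checked with some simple arithmetic. The sketch below uses only the numbers quoted above; two points are assumptions rather than statements from the slide: that the clustered input is the 111 M filtered unidentified spectra plus the 66 M identified spectra, and that 340 cores for 5 days is an upper bound on the compute used.

```python
# Figures quoted on the slide (M = million)
first_cpu_days = 610
first_calendar_days = 14        # "two calendar weeks"
second_cores = 340
second_calendar_days = 5

total_spectra_m = 256
unidentified_m = 190
identified_m = 66
unidentified_filtered_m = 111
clusters_m = 28

# Average number of cores kept busy during the first run
avg_cores_first = first_cpu_days / first_calendar_days

# Upper bound on core-days available to the second run
max_core_days_second = second_cores * second_calendar_days

# Assumed clustering input: filtered unidentified + identified spectra
clustered_input_m = unidentified_filtered_m + identified_m
spectra_per_cluster = clustered_input_m / clusters_m

print(round(avg_cores_first, 1))
print(max_core_days_second)
print(clustered_input_m, round(spectra_per_cluster, 1))
```

Under these assumptions the second run handled roughly 177 M spectra, about 6 spectra per cluster on average.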

Parallelizing Spectrum Clustering: Hadoop
• Optimizes work distribution among machines.
• Hadoop is an open-source framework for parallel processing, based on
the MapReduce programming model introduced by Google.
• Solves many general issues of large parallel jobs:
• Scheduling
• Inter-job communication
• Failure handling
https://hadoop.apache.org/
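To make the MapReduce idea concrete, here is a toy, single-machine imitation in plain Python. It does not use the Hadoop API. Partitioning spectra by precursor m/z bin is an assumption chosen for illustration (the slide does not say how the real implementation splits the work), and `BIN_WIDTH`, the field names and the reduce step are all hypothetical.

```python
from collections import defaultdict

BIN_WIDTH = 4.0  # hypothetical precursor m/z window width, for illustration only

def map_phase(spectra):
    """Map: emit (key, spectrum) pairs; the key is the precursor m/z bin."""
    for spectrum in spectra:
        key = int(spectrum["precursor_mz"] // BIN_WIDTH)
        yield key, spectrum

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, spectra):
    """Reduce: placeholder for real clustering; each bin simply becomes one group."""
    return {"bin": key, "spectrum_ids": sorted(s["id"] for s in spectra)}

spectra = [
    {"id": "s1", "precursor_mz": 500.2},
    {"id": "s2", "precursor_mz": 501.1},
    {"id": "s3", "precursor_mz": 742.9},
]

grouped = shuffle(map_phase(spectra))
results = [reduce_phase(key, values) for key, values in sorted(grouped.items())]
print(results)
```

The point of the pattern is that each key's group can be reduced independently, so a framework such as Hadoop can spread the groups across machines and rerun any that fail.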

PRIDE Cluster Home page
http://www.ebi.ac.uk/pride/cluster/#/

PRIDE Cluster: result of searches
http://www.ebi.ac.uk/pride/cluster/#/
A couple of examples …

Examples: one perfect cluster
- 880 PSMs give the same peptide ID
- 4 species
- 28 datasets
- Same instruments

Examples: one perfect cluster (2)

Output of the analysis
• 1. Inconsistent spectrum clusters
• 2. Clusters including identified and unidentified spectra.
• 3. Clusters just containing unidentified spectra.

2. Inferring identifications for originally unidentified spectra
• 9.1 M unidentified spectra were contained in clusters with a reliable
identification.
• These are candidate new identifications (that need to be confirmed),
often missed due to search engine settings
• Example: 49,263 reliable clusters (containing 560,000 identified and
130,000 unidentified spectra) contained phosphorylated peptides,
many of them from non-enriched studies.
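The inference step can be sketched as follows. This is a minimal illustration, not the project's code: it reuses the threshold from the concept slide (at least 3 spectra, ratio >70%), assumes both conditions are evaluated on the identified spectra only, and uses a hypothetical data layout (a dict from spectrum ID to peptide, with `None` for unidentified spectra).

```python
from collections import Counter

def candidate_identifications(cluster, min_spectra=3, min_ratio=0.70):
    """Return {spectrum_id: peptide} for unidentified spectra in a reliable cluster.

    cluster maps spectrum IDs to a peptide string, or None if unidentified.
    An empty dict is returned when the cluster is not reliable.
    """
    identified = [p for p in cluster.values() if p is not None]
    if len(identified) < min_spectra:
        return {}
    peptide, count = Counter(identified).most_common(1)[0]
    if count / len(identified) <= min_ratio:
        return {}
    # Candidates only: as the slide notes, these still need to be confirmed
    return {sid: peptide for sid, p in cluster.items() if p is None}

cluster = {
    "s1": "NMMAACDPR",
    "s2": "NMMAACDPR",
    "s3": "NMMAACDPR",
    "s4": None,
    "s5": None,
}
print(candidate_identifications(cluster))
```

Here the two unidentified spectra inherit NMMAACDPR as a candidate because all three identified spectra in the cluster agree.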

3. Consistently unidentified clusters
• 19 M clusters contain only unidentified spectra.
• 41,155 of these clusters contain more than 100 spectra (= 12 M spectra).
• Most of them are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides.
• With various methods, we found likely identifications for about 20%.
• Vast amount of data mining remains to be done.
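Selecting the large, consistently unidentified clusters is a simple filter. The sketch below assumes a hypothetical in-memory layout (cluster ID mapped to a list of peptide identifications, with `None` for unidentified spectra); the 100-spectra cut-off comes from the slide.

```python
def large_unidentified_clusters(clusters, min_size=100):
    """Yield (cluster_id, size) for clusters that contain only unidentified
    spectra and hold more than min_size spectra."""
    for cluster_id, peptides in clusters.items():
        if len(peptides) > min_size and all(p is None for p in peptides):
            yield cluster_id, len(peptides)

# Toy data: only "c1" is both large and entirely unidentified
clusters = {
    "c1": [None] * 150,
    "c2": [None] * 40,
    "c3": [None] * 120 + ["NMMAACDPR"],
}
selected = dict(large_unidentified_clusters(clusters))
print(selected)

# Average size implied by the slide's figures (12 M spectra in 41,155 clusters)
print(round(12_000_000 / 41_155))
```

The figures on the slide imply that these large clusters hold roughly 290 spectra each on average.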

PRIDE Cluster as a Public Data Mining Resource
• http://www.ebi.ac.uk/pride/cluster
• Spectral libraries for 16 species.
• All clustering results, as well as specific subsets of interest, are available.
• Source code (open source) and Java API

Consistently unidentified clusters
• We provide the results split per species in MGF and mzML format.
• We are very interested in having people work on these clusters.
• Available for several species (largest clusters only, at present).
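For anyone wanting to explore the MGF files, a minimal reader might look like the sketch below. It relies only on the general MGF layout (`BEGIN IONS` / `END IONS` blocks, `KEY=value` header lines, then m/z and intensity pairs). The example record is invented for illustration, and the actual PRIDE Cluster files may carry additional fields.

```python
import io

def read_mgf(handle):
    """Yield one dict per spectrum: header params plus a list of (m/z, intensity)."""
    spectrum = None
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                spectrum["params"][key] = value
            else:
                # Peak line: m/z followed by intensity
                mz, intensity = line.split()[:2]
                spectrum["peaks"].append((float(mz), float(intensity)))

# Invented example record, not taken from a real PRIDE Cluster file
example = io.StringIO("""BEGIN IONS
TITLE=hypothetical_cluster_1
PEPMASS=500.25
CHARGE=2+
100.1 20.0
200.2 35.5
END IONS
""")

spectra = list(read_mgf(example))
print(spectra[0]["params"]["TITLE"], len(spectra[0]["peaks"]))
```

Because the reader is a generator, it can be pointed at an open file handle and will process a large per-species file one spectrum at a time.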

Acknowledgements: People
Attila Csordas
Tobias Ternent
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob
Acknowledgements: The PRIDE Team
All data submitters !!!

Questions?
