Pride cluster presentation
1. Update to the PRIDE Cluster project
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK
2. Juan A. Vizcaíno
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride/archive
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
3. PRIDE Cluster: Initial Motivation
⢠Provide a QC-filtered peptide-centric view of PRIDE.
⢠Data is stored in PRIDE Archive as originally analysed by the
submitters (no data reprocessing is done).
⢠Heterogeneous quality, difficult to make the data comparable.
⢠Enable assessment of (published) proteomics data. Pre-
requisite for data reuse (e.g. in UniProt).
4. PRIDE Cluster - Concept
Griss et al., Nat Methods, 2016
[Figure: originally submitted identified spectra (peptide identifications such as NMMAACDPR and PPECPDFDPPR) are grouped by spectrum clustering; each cluster is represented by a consensus spectrum.]
Threshold: at least 3 spectra in a cluster and ratio >70%.
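The threshold on this slide can be sketched in a few lines of Python. This is only an illustration, not the PRIDE Cluster code: it assumes that "ratio" means the fraction of spectra in a cluster that share the most common peptide identification, and the function name and the example cluster are made up for the sketch.

```python
from collections import Counter

MIN_SPECTRA = 3    # from the slide: at least 3 spectra in a cluster
MIN_RATIO = 0.70   # from the slide: ratio > 70%


def is_reliable(peptide_ids):
    """Return (reliable, top_peptide, ratio) for one cluster.

    peptide_ids: one peptide sequence per identified spectrum in the cluster.
    Assumption: ratio = share of spectra carrying the most common peptide.
    """
    if not peptide_ids:
        return False, None, 0.0
    top_peptide, top_count = Counter(peptide_ids).most_common(1)[0]
    ratio = top_count / len(peptide_ids)
    reliable = len(peptide_ids) >= MIN_SPECTRA and ratio > MIN_RATIO
    return reliable, top_peptide, ratio


# Hypothetical cluster: six spectra agree, two disagree (ratio 6/8 = 0.75).
cluster = ["NMMAACDPR"] * 6 + ["PPECPDFDPPR"] * 2
print(is_reliable(cluster))  # (True, 'NMMAACDPR', 0.75)
```

A cluster with two agreeing spectra out of three (ratio about 0.67) would fail the check, as would any cluster with fewer than three spectra.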
5. PRIDE Cluster: Second Implementation
⢠Griss et al., Nat. Methods, 2013
⢠Clustered all public, identified
spectra in PRIDE
⢠EBI compute farm, LSF
⢠20.7 M identified spectra
⢠610 CPU days, two
calendar weeks
⢠Validation, calibration
⢠Feedback into PRIDE datasets
⢠EBI farm, LSF
⢠Griss et al., Nat. Methods, 2016
⢠Clustered all public spectra in
PRIDE by April 2015
⢠Apache Hadoop.
⢠Starting with 256 M spectra.
⢠190 M unidentified spectra (they
were filtered to 111 M for spectra
that are likely to represent a
peptide).
⢠66 M identified spectra
⢠Result: 28 M clusters
⢠5 calendar days on 30 node
Hadoop cluster, 340 CPU cores
6. Parallelizing Spectrum Clustering: Hadoop
⢠Optimizes work distribution among machines.
⢠Hadoop is a (open source) Framework for parallelism using
the Map-Reduce algorithm by Google.
⢠Solves many general issues of large parallel jobs:
⢠Scheduling
⢠inter-job communication
⢠failure
https://hadoop.apache.org/
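The map, shuffle and reduce steps can be illustrated in plain Python, without Hadoop. This is a sketch of the general pattern only: binning spectra by precursor m/z is an assumption made for the example (the slides do not say how the work is partitioned), and the function names and spectrum records are hypothetical.

```python
from collections import defaultdict


def map_spectrum(spectrum, bin_width=1.0):
    """Map step: emit a (precursor m/z bin, spectrum id) pair."""
    return int(spectrum["precursor_mz"] // bin_width), spectrum["id"]


def shuffle(pairs):
    """Shuffle step: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_bin(key, spectrum_ids):
    """Reduce step: placeholder where per-bin clustering would run."""
    return key, sorted(spectrum_ids)


# Hypothetical input records.
spectra = [
    {"id": "s1", "precursor_mz": 500.2},
    {"id": "s2", "precursor_mz": 500.7},
    {"id": "s3", "precursor_mz": 742.4},
]
pairs = [map_spectrum(s) for s in spectra]
result = dict(reduce_bin(k, v) for k, v in shuffle(pairs).items())
print(result)  # {500: ['s1', 's2'], 742: ['s3']}
```

In a framework such as Hadoop, the map and reduce calls run on different machines, and the framework takes care of the grouping, scheduling and retries that the plain loops above do locally.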
9. Examples: one perfect cluster
- 880 PSMs give the same peptide ID
- 4 species
- 28 datasets
- Same instruments
11. Output of the analysis
⢠1. Inconsistent spectrum clusters
⢠2. Clusters including identified and unidentified spectra.
⢠3. Clusters just containing unidentified spectra.
13. Output 2: Inferring identifications for originally unidentified spectra
⢠9.1 M unidentified spectra were contained in clusters with a reliable
identification.
⢠These are candidate new identifications (that need to be confirmed),
often missed due to search engine settings
⢠Example: 49,263 reliable clusters (containing 560,000 identified and
130,000 unidentified spectra) contained phosphorylated peptides,
many of them from non-enriched studies.
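A minimal sketch of how candidate identifications could be propagated inside a cluster is given here. It is an assumption-laden illustration, not the published method: it assumes the reliability thresholds (at least 3 spectra, ratio above 70%) are evaluated on the identified spectra only, and the function name and input layout are hypothetical.

```python
from collections import Counter


def infer_candidates(cluster, min_spectra=3, min_ratio=0.70):
    """Return {spectrum_id: peptide} for unidentified spectra.

    cluster: list of (spectrum_id, peptide) tuples; peptide is None for
    unidentified spectra. An empty dict means the cluster is not reliable
    under the assumed thresholds.
    """
    identified = [pep for _, pep in cluster if pep is not None]
    if len(identified) < min_spectra:
        return {}
    top_peptide, top_count = Counter(identified).most_common(1)[0]
    if top_count / len(identified) <= min_ratio:
        return {}
    # Candidates only: these still need to be confirmed independently.
    return {sid: top_peptide for sid, pep in cluster if pep is None}


# Hypothetical cluster with three identified and two unidentified spectra.
example = [
    ("a", "PEPTIDEK"), ("b", "PEPTIDEK"), ("c", "PEPTIDEK"),
    ("d", None), ("e", None),
]
print(infer_candidates(example))  # {'d': 'PEPTIDEK', 'e': 'PEPTIDEK'}
```

The returned peptides are candidates in the sense used on the slide: they inherit the dominant identification of the cluster and would still need to be validated.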
15. Output 3: Consistently unidentified clusters
⢠19 M clusters contain only unidentified spectra.
⢠41,155 of these spectra have more than 100 spectra (= 12 M spectra).
⢠Most of them are likely to be derived from peptides.
⢠They could correspond to PTMs or variant peptides.
⢠With various methods, we found likely identifications for about 20%.
⢠Vast amount of data mining remains to be done.
18. PRIDE Cluster as a Public Data Mining Resource
⢠http://www.ebi.ac.uk/pride/cluster
⢠Spectral libraries for 16 species.
⢠All clustering results, as well as specific subsets of interest available.
⢠Source code (open source) and Java API
19. Consistently unidentified clusters
⢠We provide the results split per species in MGF and mzML format.
⢠Very interested in getting people trying to work in those.
⢠Available for several species (Largest clusters at present).
21. Acknowledgements: People
Attila Csordas
Tobias Ternent
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Former team members,
especially Rui Wang, Florian
Reisinger, Noemi del Toro, Jose
A. Dianes & Henning Hermjakob
Acknowledgements: The PRIDE Team
All data submitters!