Mining the hidden proteome using hundreds of public proteomics datasets
1. Mining the hidden proteome using
hundreds of public proteomics datasets
Dr. Juan Antonio Vizcaíno
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK
2. Juan A. Vizcaíno
juan@ebi.ac.uk
BePA Conference
Ghent, 17 November 2016
Overview
• PRIDE Archive and ProteomeXchange
• Reuse of public proteomics data
• PRIDE Cluster iteration 1
• PRIDE Cluster iteration 2 -> Mining the hidden proteome
3. What is a proteomics publication in 2016?
• Proteomics studies generate potentially large amounts of
data and results.
• Ideally, a proteomics publication needs to:
• Summarize the results of the study
• Provide supporting information for reliability of any
results reported
• Information in a publication:
• Manuscript
• Supplementary material
• Associated data submitted to a public repository
4. PRIDE (PRoteomics IDEntifications) Archive
http://www.ebi.ac.uk/pride/archive
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
• Any type of data can be stored.
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
5. ProteomeXchange: A global, distributed proteomics database
http://www.proteomexchange.org
• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data), jPOST (MS/MS data; new in 2016)
• Data types shown: Raw, ID/Q, Meta
• Mandatory raw data deposition since July 2015
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017, in press
6. ProteomeXchange data workflow
[Figure: ProteomeXchange data workflow. Research groups submit researcher's results, raw data and metadata as datasets to the receiving repositories: PRIDE, MassIVE and jPOST for MS/MS data (as complete submissions) and any other workflow (mainly partial submissions), PASSEL for SRM data. Further labels: ProteomeCentral (metadata / manuscript), journals, PeptideAtlas, reanalysis of datasets, reprocessed results (MassIVE), raw data, results.]
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017, in press
7. ProteomeCentral: Centralised portal for all PX datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset
9. ProteomeXchange data workflow
[Figure: ProteomeXchange data workflow, including downstream resources. Research groups submit researcher's results, raw data and metadata as datasets to the receiving repositories: PRIDE, MassIVE and jPOST for MS/MS data (as complete submissions) and any other workflow (mainly partial submissions), PASSEL for SRM data. Further labels: ProteomeCentral (metadata / manuscript), journals, UniProt/neXtProt, PeptideAtlas, GPMDB, proteomicsDB, other DBs, OmicsDI (integration with other omics datasets), reanalysis of datasets, reprocessed results (MassIVE), raw data, results.]
10. PRIDE Archive – over 5,000 datasets from over 51 countries and 2,000 groups
Datasets per country:
• USA – 814
• Germany – 528
• UK – 338
• China – 328
• France – 222
• Netherlands – 175
• Canada – 137
Data volume:
• Total: ~275 TB
• Number of all files: ~560,000
• PXD000320-324: ~4 TB
• PXD002319-26: ~2.4 TB
• PXD001471: ~1.6 TB
Data access:
• 1,973 datasets, i.e. 52% of all, are publicly accessible
• ~90% of all ProteomeXchange datasets
[Figure: PRIDE Archive growth. Submissions per year, all submissions vs. complete submissions.]
In the last 12 months: ~165 submitted datasets per month
Top species studied by at least 100 datasets:
• 2,010 Homo sapiens
• 604 Mus musculus
• 191 Saccharomyces cerevisiae
• 140 Arabidopsis thaliana
• 127 Rattus norvegicus
>900 reported taxa in total
11. Overview
• PRIDE Archive and ProteomeXchange
• Reuse of public proteomics data
• PRIDE Cluster iteration 1
• PRIDE Cluster iteration 2
12. Datasets are being reused more and more…
Vaudel et al., Proteomics, 2016
Data download volume for
PRIDE Archive in 2015: 198 TB
[Figure: downloads in TBs per year, 2013 to 2016.]
15. Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014
• Around 60% of the data used for the analysis comes from previous experiments, most of them stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE.
• They complement that data with “exotic” tissues.
17. Examples of repurposing datasets: proteogenomics
Data in public resources can be used for genome annotation purposes
18. Public datasets from different omics: OmicsDI
http://www.ebi.ac.uk/Tools/omicsdi/
• Aims to integrate ‘omics’ datasets (proteomics, transcriptomics, metabolomics and genomics at present).
PRIDE
MassIVE
jPOST
PASSEL
GPMDB
ArrayExpress
Expression Atlas
MetaboLights
Metabolomics Workbench
GNPS
EGA
Perez-Riverol et al., Nat Biotechnol, in press
21. Overview
• PRIDE Archive and ProteomeXchange
• Reuse of public proteomics data
• PRIDE Cluster iteration 1
• PRIDE Cluster iteration 2 -> Mining the hidden proteome
23. PRIDE Cluster - Concept
• Provide a QC-filtered peptide-centric view of PRIDE Archive
• Threshold: At least 3 spectra in a cluster and ratio >70%.
[Figure: originally submitted identified spectra (labelled NMMAACDPR and PPECPDFDPPR) are grouped by spectrum clustering, and each cluster is represented by a consensus spectrum.]
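The threshold on this slide can be sketched in a few lines. This is a minimal illustration only: it assumes that "ratio" means the share of spectra in a cluster that carry the most frequent peptide sequence, and the function names and the example counts are hypothetical, not taken from the PRIDE Cluster code base.

```python
from collections import Counter


def top_peptide_share(peptides):
    """Return (most frequent peptide, its share of the cluster's spectra)."""
    peptide, count = Counter(peptides).most_common(1)[0]
    return peptide, count / len(peptides)


def is_reliable(peptides, min_spectra=3, min_ratio=0.7):
    """Apply the slide's threshold: at least 3 spectra and ratio > 70%."""
    if len(peptides) < min_spectra:
        return False
    _, share = top_peptide_share(peptides)
    return share > min_ratio


# Illustrative cluster: peptide labels from the slide, counts made up.
cluster = ["NMMAACDPR"] * 5 + ["PPECPDFDPPR"]
peptide, share = top_peptide_share(cluster)
print(peptide, round(share, 2), is_reliable(cluster))
```

With five of six spectra agreeing, the share is about 0.83, so this made-up cluster passes both conditions.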
24. One perfect cluster in PRIDE Cluster web
- 880 PSMs give the same peptide ID
- 4 species
- 28 datasets
- Same instruments
http://www.ebi.ac.uk/pride/cluster/
25. PRIDE Cluster: Implementation
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
• EBI farm, LSF
• Griss et al., Nat. Methods, 2013
26. Overview
• PRIDE Archive and ProteomeXchange
• Reuse of public proteomics data
• PRIDE Cluster iteration 1
• PRIDE Cluster iteration 2 -> Mining the hidden proteome
27. PRIDE Cluster Iteration 2: Why?
• PRIDE Archive has experienced a huge increase in data
since 2013.
• We wanted to develop an algorithm that could also work
with unidentified spectra.
[Figure: PRIDE Archive growth. Submissions per year, all submissions vs. complete submissions.]
28. PRIDE Cluster: Second Implementation
Iteration 1:
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
• EBI farm, LSF
• Griss et al., Nat. Methods, 2013
Iteration 2:
• Clustered all public spectra in PRIDE by April 2015.
• Apache Hadoop.
• Starting with 256 M spectra.
• 190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide).
• 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
• Griss et al., Nat. Methods, 2016
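The spectrum counts quoted for the second implementation can be cross-checked with a little arithmetic. The sketch below only uses the figures on the slide. Two things are assumptions rather than statements from the talk: that the clustering input was the filtered set (111 M unidentified plus 66 M identified spectra), and that all 340 cores were busy for the full 5 days, which makes the core-day figure an upper bound.

```python
M = 1_000_000

unidentified_raw = 190 * M   # unidentified spectra before filtering
unidentified_kept = 111 * M  # unidentified spectra likely to represent a peptide
identified = 66 * M
clusters = 28 * M

# 190 M + 66 M should reproduce the 256 M starting point.
total_raw = unidentified_raw + identified

# Assumption: the filtered set is what was actually clustered.
clustered_input = unidentified_kept + identified

kept_fraction = unidentified_kept / unidentified_raw
mean_cluster_size = clustered_input / clusters

# Upper bound on compute, assuming every core ran for all 5 calendar days.
core_days_upper_bound = 5 * 340

print(total_raw // M, clustered_input // M)
print(round(kept_fraction, 2), round(mean_cluster_size, 1))
print(core_days_upper_bound)
```

Under these assumptions, roughly 58% of the unidentified spectra survive the filter and a cluster holds a little over six spectra on average.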
31. Highlights of the output of the analysis
• 1. Inconsistent spectrum clusters
• 2. Clusters including identified and unidentified spectra.
• 3. Clusters just containing unidentified spectra.
32. Highlight 1: Re-analysis of inconsistent clusters
[Figure: originally submitted identified spectra carrying different peptide labels (NMMAACDPR, IGGIGTVPVGR, PPECPDFDPPR, VFDEFKPLVEEPQNLIK) fall into one cluster with a single consensus spectrum after spectrum clustering.]
• Inconsistent cluster: no sequence has a proportion in the cluster >50%
33. Highlight 1: Re-analysis of inconsistent clusters
• Re-analysed 3,997 large (>100 spectra), inconsistent clusters with PepNovo, SpectraST and X!Tandem.
• 453 clusters (11%) were identified as peptides originating from keratins, trypsin, albumin and hemoglobin.
• In these cases, it is likely that a contaminants DB was not used in the search.
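The selection of clusters for re-analysis can be sketched as a simple filter. This is an illustration under stated assumptions, not the published pipeline: a cluster is taken to be a list of the peptide sequences assigned to its spectra, "inconsistent" is read as "no sequence accounts for more than 50% of the spectra", and the cluster IDs and counts are made up.

```python
from collections import Counter


def top_share(peptides):
    """Share of the cluster's spectra that carry the most frequent sequence."""
    return Counter(peptides).most_common(1)[0][1] / len(peptides)


def needs_reanalysis(peptides, min_spectra=100, max_share=0.5):
    """Large cluster (>100 spectra) in which no sequence exceeds 50%."""
    return len(peptides) > min_spectra and top_share(peptides) <= max_share


# Hypothetical clusters: peptide labels from the slides, counts invented.
clusters = {
    "c1": ["NMMAACDPR"] * 60 + ["IGGIGTVPVGR"] * 50 + ["PPECPDFDPPR"] * 40,
    "c2": ["NMMAACDPR"] * 140 + ["PPECPDFDPPR"] * 10,  # large but consistent
    "c3": ["NMMAACDPR", "IGGIGTVPVGR"],                # inconsistent but small
}
selected = [cid for cid, peptides in clusters.items() if needs_reanalysis(peptides)]
print(selected)

# Cross-check of the reported share of contaminant clusters: 453 of 3,997.
contaminant_fraction = 453 / 3997
print(f"{contaminant_fraction:.1%}")
```

Only the first cluster is both large and inconsistent, and 453 of 3,997 comes to about 11.3%, in line with the 11% on the slide.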
38. Highlight 2: Inferring identifications for originally unidentified spectra
[Figure: originally submitted spectra, some identified as PPECPDFDPPR and some not identified, fall into one cluster with a consensus spectrum after spectrum clustering.]
• Identifications are inferred
39. Highlight 2: Inferring identifications for originally unidentified spectra
• 9.1 M unidentified spectra were contained in clusters with a reliable
identification.
• These are candidate new identifications (that need to be
confirmed), often missed due to search engine settings
• Example: 49,263 reliable clusters (containing 560,000 identified and
130,000 unidentified spectra) contained phosphorylated peptides,
many of them from non-enriched studies.
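The inference step can be sketched as follows. This is a simplified illustration, not the published method: it assumes that a cluster has a "reliable identification" when at least 3 identified spectra are present and more than 70% of them agree on one sequence, and the data layout (pairs of spectrum ID and peptide, with `None` for unidentified spectra) is hypothetical. As the slide notes, the results are candidates that still need to be confirmed.

```python
from collections import Counter


def infer_identifications(cluster, min_identified=3, min_ratio=0.7):
    """Propose the cluster's dominant peptide for its unidentified spectra.

    cluster: list of (spectrum_id, peptide or None) pairs.
    Returns {spectrum_id: candidate peptide}; empty if the cluster is not reliable.
    """
    identified = [peptide for _, peptide in cluster if peptide is not None]
    if len(identified) < min_identified:
        return {}
    top, count = Counter(identified).most_common(1)[0]
    if count / len(identified) <= min_ratio:
        return {}
    return {sid: top for sid, peptide in cluster if peptide is None}


# Mirrors the slide's picture: three identified and three unidentified spectra.
cluster = [
    ("s1", "PPECPDFDPPR"), ("s2", "PPECPDFDPPR"), ("s3", "PPECPDFDPPR"),
    ("s4", None), ("s5", None), ("s6", None),
]
candidates = infer_identifications(cluster)
print(candidates)
```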
40. Highlight 3: Consistently unidentified clusters
[Figure: originally submitted spectra that are all labelled "Not identified" fall into one cluster after spectrum clustering; the identity of its consensus spectrum is unknown (??).]
• Method to target commonly found unidentified spectra
41. PRIDE Cluster
[Figure labels: sequence-based search engines; spectrum clustering; incorrectly identified or unidentified spectra.]
42. Highlight 3: Consistently unidentified clusters
• 19 M clusters contain only unidentified spectra.
• 41,155 of these clusters have more than 100 spectra (=12 M spectra, 5%).
• Most of them are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides.
• With various methods, we found likely identifications for about 20%.
• Vast amount of data mining remains to be done.
45. PRIDE Cluster as a Public Data Mining Resource
• http://www.ebi.ac.uk/pride/cluster
• Spectral libraries for 16 species.
• All clustering results, as well as specific subsets of interest available.
• Source code (open source) and Java API
46. Consistently unidentified clusters
• We provide the results split per species in MGF and mzML format.
• We are very interested in people trying to work on these.
• Available for several species (largest clusters at present).
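For anyone who wants to start working with the MGF exports, a minimal reader might look like the sketch below. It relies only on the general MGF layout (`BEGIN IONS` / `END IONS` blocks, `KEY=value` parameter lines, and peak lines with m/z and intensity). The sample spectrum, its title and its peak values are invented for illustration, and the real PRIDE Cluster files may carry additional fields.

```python
import io


def read_mgf(handle):
    """Yield one dict per spectrum: {'params': {...}, 'peaks': [(mz, intensity), ...]}."""
    spectrum = None
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None:
            if "=" in line and not line[0].isdigit():
                # Parameter line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                spectrum["params"][key] = value
            else:
                # Peak line: m/z followed by an optional intensity.
                fields = line.split()
                intensity = float(fields[1]) if len(fields) > 1 else 0.0
                spectrum["peaks"].append((float(fields[0]), intensity))


# Hypothetical spectrum, not taken from any PRIDE Cluster file.
sample = """BEGIN IONS
TITLE=hypothetical_unidentified_cluster_1
PEPMASS=503.27
CHARGE=2+
175.119 1200.0
288.203 860.5
END IONS
"""
spectra = list(read_mgf(io.StringIO(sample)))
print(len(spectra), spectra[0]["params"]["TITLE"], len(spectra[0]["peaks"]))
```

Passing an open file handle instead of the `io.StringIO` wrapper reads a file on disk in the same way.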
47. Summary
• PRIDE is now receiving ~8 datasets per working day.
• A lot of possibilities open for reuse of this data.
• New OmicsDI resource.
• It is possible to detect spectra that are consistently
unidentified across hundreds of datasets (maybe peptide
variants, or peptides with PTMs not initially considered).
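The "~8 datasets per working day" figure is consistent with the ~165 submitted datasets per month quoted on the PRIDE Archive growth slide. A quick check, assuming roughly 250 working days per year (this number is an assumption, not from the talk):

```python
datasets_per_month = 165      # from the PRIDE Archive growth slide
working_days_per_year = 250   # assumption: about 50 weeks of 5 working days

datasets_per_year = datasets_per_month * 12
per_working_day = datasets_per_year / working_days_per_year

print(datasets_per_year, round(per_working_day, 1))
```

This gives just under 2,000 datasets per year, or close to 8 per working day.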
48. Acknowledgements: People
The PRIDE Team:
Attila Csordas
Tobias Ternent
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Enrique Perez
Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob
All data submitters!
@pride_ebi
@proteomexchange