Mining the hidden proteome using hundreds of public proteomics datasets

Dr. Juan Antonio Vizcaíno
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK

Juan A. Vizcaíno
juan@ebi.ac.uk
BePA Conference
Ghent, 17 November 2016
Overview
• PRIDE Archive and ProteomeXchange
• Reuse of public proteomics data
• PRIDE Cluster iteration 1
• PRIDE Cluster iteration 2 -> Mining the hidden proteome

What is a proteomics publication in 2016?
• Proteomics studies generate potentially large amounts of data and results.
• Ideally, a proteomics publication needs to:
• Summarize the results of the study
• Provide supporting information for reliability of any results reported
• Information in a publication:
• Manuscript
• Supplementary material
• Associated data submitted to a public repository

PRIDE (PRoteomics IDEntifications) Archive
http://www.ebi.ac.uk/pride/archive
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
• Any type of data can be stored.
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016

ProteomeXchange: A global, distributed proteomics database
• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), jPOST (MS/MS data), PASSEL (SRM data)
• Data: raw data, identification/quantification (ID/Q), metadata
• Mandatory raw data deposition since July 2015
• New in 2016
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017, in press

ProteomeXchange data workflow
[Figure: research groups submit datasets (raw data, metadata, researcher’s results) to the receiving repositories. PRIDE, MassIVE and jPOST receive MS/MS data (as complete submissions) and any other workflow (mainly partial submissions); PASSEL receives SRM data. Metadata and manuscript information go to ProteomeCentral and journals; Peptide Atlas and MassIVE reanalyse datasets and provide reprocessed results.]
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017, in press

ProteomeCentral: Centralised portal for all PX datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset
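
A dataset page on the portal can be addressed by its PX accession. The sketch below only builds such a link and does not contact the server. The base URL and the accession PXD001471 come from these slides; the `ID` query parameter and the "PXD plus six digits" accession pattern are assumptions made for illustration.

```python
import re
from urllib.parse import urlencode

# Base URL as shown on the slide.
BASE_URL = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"

# Assumption: accessions look like "PXD" followed by six digits (e.g. PXD001471).
PXD_PATTERN = re.compile(r"^PXD\d{6}$")


def dataset_url(accession):
    """Build a ProteomeCentral link for one accession.

    The "ID" query parameter is an assumption, not taken from the slides.
    """
    if not PXD_PATTERN.match(accession):
        raise ValueError("not a PXD accession: " + repr(accession))
    return BASE_URL + "?" + urlencode({"ID": accession})


if __name__ == "__main__":
    print(dataset_url("PXD001471"))
```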


[Figure: ProteomeXchange data workflow extended with resources that reuse the datasets: UniProt/neXtProt, Peptide Atlas, GPMDB, proteomicsDB and other DBs, plus OmicsDI for integration with other omics datasets.]

PRIDE Archive – over 5,000 datasets from over 51 countries and 2,000 groups
• USA – 814 datasets
• Germany – 528
• UK – 338
• China – 328
• France – 222
• Netherlands – 175
• Canada – 137
Data volume:
• Total: ~275 TB
• Number of all files: ~560,000
• PXD000320-324: ~4 TB
• PXD002319-26: ~2.4 TB
• PXD001471: ~1.6 TB
• 1,973 datasets, i.e. 52% of all, are publicly accessible
• ~90% of all ProteomeXchange datasets
[Chart: PRIDE Archive growth, submissions per year (all submissions vs. complete submissions)]
In the last 12 months: ~165 submitted datasets per month
Top species studied by at least 100 datasets:
2,010 Homo sapiens
604 Mus musculus
191 Saccharomyces cerevisiae
140 Arabidopsis thaliana
127 Rattus norvegicus
>900 reported taxa in total

Overview
• PRIDE Archive and ProteomeXchange
• Reuse of public proteomics data

Datasets are being reused more and more…
Vaudel et al., Proteomics, 2016
Data download volume for PRIDE Archive in 2015: 198 TB
[Chart: downloads in TB per year, 2013 to 2016]

Data sharing in Proteomics
Vaudel et al., Proteomics, 2016

Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014
• Around 60% of the data used for the analysis comes from previous experiments, most of it stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE.
• They complement that data with “exotic” tissues.

Examples of repurposing datasets: proteogenomics
Data in public resources can be used for genome annotation purposes

Public datasets from different omics: OmicsDI
http://www.ebi.ac.uk/Tools/omicsdi/
• Aims to integrate ‘omics’ datasets (proteomics, transcriptomics, metabolomics and genomics at present).
PRIDE
MassIVE
jPOST
PASSEL
GPMDB
ArrayExpress
Expression Atlas
MetaboLights
Metabolomics Workbench
GNPS
EGA
Perez-Riverol et al., Nat Biotechnol, in press

OmicsDI: Portal for omics datasets

PRIDE Cluster - Concept
• Provide a QC-filtered peptide-centric view of PRIDE Archive
• Threshold: At least 3 spectra in a cluster and ratio >70%.
[Figure: originally submitted identified spectra (e.g. NMMAACDPR, PPECPDFDPPR) are grouped by spectrum clustering, and each cluster is represented by a consensus spectrum]
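
The threshold on this slide can be sketched as a simple check. This is a minimal sketch, not the PRIDE Cluster code: it assumes that "ratio" means the share of a cluster's identified spectra that carry the most common peptide sequence, and the function names and the example cluster are hypothetical.

```python
from collections import Counter

MIN_SPECTRA = 3   # "At least 3 spectra in a cluster" (from the slide)
MIN_RATIO = 0.70  # "ratio >70%" (from the slide)


def max_ratio(peptides):
    """Return (most common peptide, its share of the spectra in the cluster)."""
    peptide, count = Counter(peptides).most_common(1)[0]
    return peptide, count / len(peptides)


def is_reliable(peptides):
    """Apply the slide's threshold; the meaning of 'ratio' is an assumption."""
    if len(peptides) < MIN_SPECTRA:
        return False
    _, ratio = max_ratio(peptides)
    return ratio > MIN_RATIO


if __name__ == "__main__":
    # Hypothetical cluster: five spectra agree, one disagrees.
    cluster = ["NMMAACDPR"] * 5 + ["PPECPDFDPPR"]
    print(max_ratio(cluster), is_reliable(cluster))
```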

One perfect cluster in PRIDE Cluster web
- 880 PSMs give the same peptide ID
- 4 species
- 28 datasets
- Same instruments
http://www.ebi.ac.uk/pride/cluster/

PRIDE Cluster: Implementation
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
• Griss et al., Nat. Methods, 2013

PRIDE Cluster Iteration 2: Why?
• PRIDE Archive has experienced a huge increase in data
since 2013.
• We wanted to develop an algorithm that could also work
with unidentified spectra.
[Chart: PRIDE Archive growth, submissions per year (all submissions vs. complete submissions)]

PRIDE Cluster: Second Implementation
Iteration 1:
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
• Griss et al., Nat. Methods, 2013
Iteration 2:
• Clustered all public spectra in PRIDE by April 2015.
• Apache Hadoop.
• Starting with 256 M spectra.
• 190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide).
• 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
• Griss et al., Nat. Methods, 2016

PRIDE Cluster - Concept

Examples: one perfect cluster (2)

Highlights of the output of the analysis
• 1. Inconsistent spectrum clusters
• 2. Clusters including identified and unidentified spectra.
• 3. Clusters just containing unidentified spectra.

1. Re-analysis of inconsistent clusters
• Inconsistent cluster: no sequence has a proportion in the cluster >50%
[Figure: originally submitted identified spectra with different sequences (NMMAACDPR, IGGIGTVPVGR, PPECPDFDPPR, VFDEFKPLVEEPQNLIK) are grouped into one cluster with a consensus spectrum by spectrum clustering]
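
The ">50%" rule can be written as a small predicate. This is a minimal sketch under stated assumptions: the proportion is taken over the identified spectra of one cluster, the function name is hypothetical, and the peptide counts in the example are illustrative rather than read from the archive.

```python
from collections import Counter


def is_inconsistent(peptides, cutoff=0.5):
    """True when no peptide sequence accounts for more than `cutoff` of the cluster."""
    if not peptides:
        return False  # nothing identified, so there is nothing to disagree about
    top_count = Counter(peptides).most_common(1)[0][1]
    return top_count / len(peptides) <= cutoff


if __name__ == "__main__":
    # Illustrative cluster: the most common sequence covers exactly half of the spectra.
    cluster = (["NMMAACDPR"] * 4 + ["IGGIGTVPVGR"] * 2
               + ["PPECPDFDPPR", "VFDEFKPLVEEPQNLIK"])
    print(is_inconsistent(cluster))
```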

1. Re-analysis of inconsistent clusters
• Re-analysed 3,997 large (>100 spectra), inconsistent clusters with PepNovo, SpectraST, X!Tandem.
• 453 clusters (11%) were identified as peptides originating from keratins, trypsin, albumin, and hemoglobin.
• In these cases, it is likely that a contaminants database was not used in the search.
Validation

2. Inferring identifications for originally unidentified spectra
• Identifications are inferred for the unidentified spectra in a cluster
[Figure: originally submitted spectra, some identified as PPECPDFDPPR and some not identified, are grouped by spectrum clustering into one cluster with a consensus spectrum]
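
The inference step can be sketched as follows. This is not the published implementation: it assumes a cluster is given as (spectrum id, peptide or None) pairs and that reliability is judged on the identified spectra with the thresholds from the concept slide (at least 3 spectra, ratio >70%). All names are hypothetical.

```python
from collections import Counter


def infer_identifications(cluster, min_spectra=3, min_ratio=0.70):
    """Map each unidentified spectrum id to the cluster's dominant peptide.

    `cluster` is a list of (spectrum_id, peptide_or_None) pairs. An empty dict
    is returned when the cluster does not pass the assumed reliability rule.
    """
    identified = [peptide for _, peptide in cluster if peptide is not None]
    if len(identified) < min_spectra:
        return {}
    peptide, count = Counter(identified).most_common(1)[0]
    if count / len(identified) <= min_ratio:
        return {}
    return {sid: peptide for sid, original in cluster if original is None}


if __name__ == "__main__":
    cluster = [("s1", "PPECPDFDPPR"), ("s2", "PPECPDFDPPR"), ("s3", "PPECPDFDPPR"),
               ("s4", None), ("s5", None)]
    print(infer_identifications(cluster))
```

The inferred sequences are candidates only and still need to be confirmed.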

2. Inferring identifications for originally unidentified spectra
• 9.1 M unidentified spectra were contained in clusters with a reliable
identification.
• These are candidate new identifications (that need to be
confirmed), often missed due to search engine settings
• Example: 49,263 reliable clusters (containing 560,000 identified and
130,000 unidentified spectra) contained phosphorylated peptides,
many of them from non-enriched studies.

3. Consistently unidentified clusters
• Method to target commonly found unidentified spectra
[Figure: originally submitted spectra, none of them identified, are grouped by spectrum clustering into one cluster whose consensus spectrum is unidentified (??)]

PRIDE Cluster
[Figure labels: sequence-based search engines; spectrum clustering; incorrectly identified or unidentified spectra]

• 19 M clusters contain only unidentified spectra.
• 41,155 of these clusters contain more than 100 spectra (= 12 M spectra, 5%).
• Most of them are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides.
• With various methods, we found likely identifications for about 20%.
• Vast amount of data mining remains to be done.
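
Selecting the large, unidentified-only clusters is a plain filter. The sketch below assumes clusters are held in memory as a mapping from cluster id to a list with one entry per spectrum (a peptide string, or None when unidentified). The data layout and names are hypothetical; only the ">100 spectra" cut-off comes from the slide.

```python
def large_unidentified_clusters(clusters, min_size=100):
    """Return {cluster_id: size} for unidentified-only clusters above min_size.

    The result is ordered from the largest cluster to the smallest.
    """
    selected = {
        cluster_id: len(spectra)
        for cluster_id, spectra in clusters.items()
        if len(spectra) > min_size and all(p is None for p in spectra)
    }
    return dict(sorted(selected.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    clusters = {
        "c1": [None] * 150,                       # large, unidentified only
        "c2": [None] * 40,                        # too small
        "c3": [None] * 200 + ["PPECPDFDPPR"],     # contains an identification
        "c4": [None] * 300,                       # large, unidentified only
    }
    print(large_unidentified_clusters(clusters))
```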

PRIDE Cluster as a Public Data Mining Resource
• http://www.ebi.ac.uk/pride/cluster
• Spectral libraries for 16 species.
• All clustering results, as well as specific subsets of interest available.
• Source code (open source) and Java API

Consistently unidentified clusters
• We provide the results split per species in MGF and mzML format.
• We are very interested in getting people to work on these.
• Available for several species (largest clusters at present).
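
MGF is a plain-text format, so the unidentified consensus spectra can be read with a few lines of code. The reader below is a minimal sketch using only the standard library. It handles the common BEGIN IONS / END IONS layout with KEY=VALUE headers and "m/z intensity" peak lines. The example record and its field values are hypothetical and not taken from the PRIDE Cluster files.

```python
def read_mgf(lines):
    """Yield one dict per spectrum from an iterable of MGF text lines."""
    spectrum = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                spectrum["params"][key.upper()] = value
            else:
                # Peak line: m/z followed by an optional intensity
                fields = line.split()
                intensity = float(fields[1]) if len(fields) > 1 else 0.0
                spectrum["peaks"].append((float(fields[0]), intensity))


# Hypothetical record for illustration only.
EXAMPLE = """BEGIN IONS
TITLE=hypothetical_cluster_1
PEPMASS=503.25
CHARGE=2+
175.119 1200.0
262.151 800.5
END IONS
"""

if __name__ == "__main__":
    for s in read_mgf(EXAMPLE.splitlines()):
        print(s["params"]["TITLE"], len(s["peaks"]))
```

For a real file, the same function can be called on an open file handle, since it only iterates over lines.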

Summary
• PRIDE is now receiving ~8 datasets per working day.
• A lot of possibilities open for reuse of this data.
• New OmicsDI resource.
• It is possible to detect spectra that are consistently
unidentified across hundreds of datasets (maybe peptide
variants, or peptides with PTMs not initially considered).

Acknowledgements: People
Attila Csordas
Tobias Ternent
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Enrique Perez
Former team members, especially
Rui Wang, Florian Reisinger, Noemi
del Toro, Jose A. Dianes & Henning
Hermjakob
Acknowledgements: The PRIDE Team
All data submitters !!!
@pride_ebi
@proteomexchange

Questions?
http://www.slideshare.net/JuanAntonioVizcaino
