Proteomics and the "big data" trend: challenges and new possibilities (Talk at ISAS Dortmund)
1. Proteomics and the “big data” trend: challenges and new possibilities
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK
2. Juan A. Vizcaíno
juan@ebi.ac.uk
Colloquium
Dortmund, 11 August 2016
Overview
• Intro: Concept of “Big data” in biology and proteomics
• PRIDE Archive and ProteomeXchange
• PRIDE tools
• Reuse of public proteomics data
• Working with “Big data”: PRIDE Cluster
“Big data” in biology: personalised medicine
• Aim: healthcare becomes patient-centric for the first time
• Personalized medicine
Slide from: http://vector.childrenshospital.org/wp-content/uploads/2016/01/What-is-precision-medicine.jpg
Data resources at EMBL-EBI
• Genes, genomes & variation: European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive, Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal
• Gene & protein expression: ArrayExpress, Expression Atlas, PRIDE
• Protein sequences, families & motifs: InterPro, Pfam, UniProt
• Chemical biology: ChEMBL, ChEBI
• Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
• Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
• Systems: BioModels, Enzyme Portal, BioSamples
• Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
PRIDE (PRoteomics IDEntifications) database
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
http://www.ebi.ac.uk/pride/archive
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
ProteomeXchange Consortium
• Goal: development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK), MassIVE (UCSD, San Diego) and, recently, jPOST (Japan).
• Common identifier space (PXD identifiers).
• Two supported data workflows: MS/MS and SRM.
• Main objective: make life easier for researchers.
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
What is a proteomics publication in 2016?
• Proteomics studies generate potentially large amounts of data and results.
• Ideally, a proteomics publication needs to:
• Summarize the results of the study
• Provide supporting information for the reliability of any results reported
• Information in a publication:
• Manuscript
• Supplementary material
• Associated data submitted to a public repository
PRIDE Archive – over 4,000 datasets, > 51 countries and 1,700 groups
• USA – 814 datasets
• Germany – 528
• UK – 338
• China – 328
• France – 222
• Netherlands – 175
• Canada – 137
Data volume:
• Total: ~225 TB
• Number of all files: ~560,000
• PXD000320-324: ~4 TB
• PXD002319-26: ~2.4 TB
• PXD001471: ~1.6 TB
• 1,973 datasets, i.e. 52% of all, are publicly accessible
• > 90% of all ProteomeXchange data
[Figure: PRIDE Archive growth (submissions per year; series: all submissions, complete)]
In the last year: >150 submitted datasets per month
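The data volume list uses shorthand such as PXD000320-324 for runs of consecutive datasets. A minimal Python sketch of how such shorthand could be expanded is given here. It assumes that a PXD accession is the prefix "PXD" followed by six digits and that the part after the hyphen replaces the trailing digits of the first accession; the function name is hypothetical and not part of any PRIDE tool.

```python
import re
from typing import List

# Assumption: a ProteomeXchange accession is "PXD" followed by six digits.
PXD_PATTERN = re.compile(r"^PXD(\d{6})$")


def expand_pxd_range(shorthand: str) -> List[str]:
    """Expand shorthand like 'PXD000320-324' into individual accessions."""
    start, _, suffix = shorthand.partition("-")
    match = PXD_PATTERN.match(start)
    if match is None:
        raise ValueError(f"not a PXD accession: {start!r}")
    if not suffix:
        return [start]  # a single accession, nothing to expand
    if not suffix.isdigit() or len(suffix) > 6:
        raise ValueError(f"invalid range suffix: {suffix!r}")
    first_digits = match.group(1)
    # The suffix overwrites the last len(suffix) digits of the first accession.
    last_digits = first_digits[: 6 - len(suffix)] + suffix
    first, last = int(first_digits), int(last_digits)
    if last < first:
        raise ValueError(f"range end precedes range start: {shorthand!r}")
    return [f"PXD{number:06d}" for number in range(first, last + 1)]


if __name__ == "__main__":
    print(expand_pxd_range("PXD000320-324"))
```

Under these assumptions, the two ranges on the slide cover five and eight datasets respectively.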
PRIDE: Source of MS proteomics data
• PRIDE Archive already provides or will soon provide MS proteomics data to other EMBL-EBI resources such as UniProt, Ensembl and the EBI Expression Atlas.
http://www.ebi.ac.uk/pride/archive
PRIDE Components: Data Submission Process
Tools: PRIDE Converter 2, PRIDE Inspector, PX Submission Tool
File formats: mzIdentML, PRIDE XML
In addition to PRIDE Archive, the PRIDE team develops and maintains different tools and software libraries to facilitate the handling and visualisation of MS proteomics data and the submission process.
PRIDE Inspector Toolsuite
Wang et al., Nat Biotechnol, 2012
Perez-Riverol et al., Bioinformatics, 2015
Perez-Riverol et al., MCP, 2016
• PRIDE Inspector: standalone tool to enable visualisation and validation of MS data.
• Built on top of ms-data-core-api: open source algorithms and libraries for computational proteomics.
• Supported file formats: mzIdentML, mzML, mzTab (PSI standards), and PRIDE XML.
• Broad functionality.
https://github.com/PRIDE-Utilities/ms-data-core-api
https://github.com/PRIDE-Toolsuite/pride-inspector
PRIDE Inspector Functionality
• Summary and QC charts (delta m/z, precursor charges, etc.)
• Spectra view: peptide spectra annotation and visualisation (fragmentation table, ion series annotation)
• Protein inference algorithm and protein groups visualisation
• Protein view containing protein inference information
• Quantification view
• Multiple export options (.mgf, protein/peptide tables, mzTab file)
• Direct access to PRIDE datasets
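To illustrate the quantity behind a delta m/z QC chart, here is a minimal Python sketch that compares an observed precursor m/z with the theoretical m/z of the identified peptide. It is not PRIDE Inspector code. It assumes unmodified peptides, standard monoisotopic residue masses, and delta m/z defined as observed minus theoretical; the function names are hypothetical.

```python
# Standard monoisotopic residue masses (Da) of the 20 common amino acids.
MONOISOTOPIC_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.010565   # one water is added for the peptide termini
PROTON_MASS = 1.007276   # mass of each charge-carrying proton


def theoretical_mz(sequence: str, charge: int) -> float:
    """Theoretical m/z of an unmodified peptide at the given charge state."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    neutral_mass = sum(MONOISOTOPIC_RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS
    return (neutral_mass + charge * PROTON_MASS) / charge


def delta_mz(observed_mz: float, sequence: str, charge: int) -> float:
    """Assumed definition: observed precursor m/z minus theoretical m/z."""
    return observed_mz - theoretical_mz(sequence, charge)


if __name__ == "__main__":
    print(round(delta_mz(400.70, "PEPTIDE", 2), 4))
```

Modifications such as carbamidomethylated cysteine would shift the theoretical value and are deliberately left out of this sketch.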
PX Submission Tool
Desktop application for data submissions to ProteomeXchange via PRIDE
• Implemented in Java 7
• Streamlines the submission process
• Captures mappings between files
• Retains metadata
• Fast file transfer with Aspera (FASP® transfer technology); FTP also available
• Command line option
[Figure: PX Submission Tool screenshot]
Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014; Kim et al., Nature, 2014
• Two independent groups claimed to have produced the first complete draft of the human proteome by MS.
• Some of their findings are controversial and need further validation, but they generated a lot of discussion and put proteomics in the spotlight.
• They used many different tissues.
[Figure: Nature cover, 29 May 2014]
Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014
• Around 60% of the data used for the analysis comes from previous experiments, most of them stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE.
• They complement that data with “exotic” tissues.
Challenges for data reuse in proteomics
• Insufficient technical and biological metadata.
• Large computational infrastructure may be needed (e.g. when analysing many datasets together).
• Shortage of expertise (people).
• Lack of standardisation in the field.
Summary of the talk so far
• PRIDE Archive and other ProteomeXchange resources make data sharing possible in the MS proteomics field.
• Data sharing is becoming the norm in the field.
• Standalone tools: PRIDE Inspector and PX Submission tool.
• Datasets are increasingly reused (many opportunities):
• Example of one of the drafts of the human proteome.
• But there are important challenges as well.
PRIDE Cluster: Initial Motivation
• Provide a QC-filtered peptide-centric view of PRIDE.
• Data is stored in PRIDE Archive as originally analysed by the submitters (no data reprocessing is done).
• Heterogeneous quality, difficult to make the data comparable.
• Enable assessment of (published) proteomics data, a prerequisite for data reuse (e.g. in UniProt).
PRIDE Cluster - Concept
Griss et al., Nat Methods, 2013
[Figure: originally submitted identified spectra (e.g. identified as NMMAACDPR or PPECPDFDPPR) are grouped by spectrum clustering; each cluster is summarised by a consensus spectrum.]
Threshold: at least 10 spectra in a cluster and ratio >70%.
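The reliability threshold on this slide can be sketched in a few lines of Python. This is an illustration only, not the PRIDE Cluster implementation. It assumes that "ratio" means the share of spectra in a cluster that carry the most frequent peptide sequence; the function names and the toy cluster are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def max_sequence_ratio(sequences: List[str]) -> Tuple[str, float]:
    """Return the most frequent sequence in a cluster and its share of the cluster."""
    top_sequence, top_count = Counter(sequences).most_common(1)[0]
    return top_sequence, top_count / len(sequences)


def is_reliable(sequences: List[str], min_size: int = 10, min_ratio: float = 0.7) -> bool:
    """Apply the slide's threshold: enough spectra and a dominant sequence."""
    if len(sequences) < min_size:
        return False
    _, ratio = max_sequence_ratio(sequences)
    return ratio > min_ratio


if __name__ == "__main__":
    # Toy cluster: nine spectra with one sequence, two with another.
    cluster = ["NMMAACDPR"] * 9 + ["PPECPDFDPPR"] * 2
    print(max_sequence_ratio(cluster), is_reliable(cluster))
```

The minimum size is a parameter, so the same check can be rerun with a different cut-off without changing the logic.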
PRIDE Cluster: Implementation
• Griss et al., Nat. Methods, 2013
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
PRIDE Cluster Iteration 2: Why?
• PRIDE Archive has experienced a huge increase in data since 2013.
• We wanted to develop an algorithm that could also work with unidentified spectra.
[Figure: PRIDE Archive growth (submissions per year; series: all submissions, complete)]
Parallelizing Spectrum Clustering: Hadoop
• Optimizes work distribution among machines.
• Hadoop is an open-source framework for parallel processing based on the MapReduce programming model introduced by Google.
• Solves many general issues of large parallel jobs:
• Scheduling
• Inter-job communication
• Failure handling
https://hadoop.apache.org/
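The map, shuffle and reduce steps that Hadoop distributes across machines can be mimicked in a single Python process. The sketch below is illustrative only. It does not use the Hadoop API, and keying spectra by precursor m/z bin is an assumption made for the example, not a description of the PRIDE Cluster job. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Spectrum = Tuple[str, float]  # (spectrum id, precursor m/z)


def map_phase(spectra: Iterable[Spectrum], bin_width: float = 1.0) -> Iterator[Tuple[int, str]]:
    """Map: emit one (key, value) pair per spectrum; the key is its m/z bin."""
    for spectrum_id, precursor_mz in spectra:
        yield int(precursor_mz // bin_width), spectrum_id


def shuffle(pairs: Iterable[Tuple[int, str]]) -> Dict[int, List[str]]:
    """Shuffle: group values by key (a real framework does this across nodes)."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[int, List[str]]) -> Dict[int, List[str]]:
    """Reduce: process each key independently; here the ids are only sorted.
    In a clustering job, the per-bin clustering would run at this point."""
    return {key: sorted(values) for key, values in groups.items()}


def run_job(spectra: Iterable[Spectrum], bin_width: float = 1.0) -> Dict[int, List[str]]:
    return reduce_phase(shuffle(map_phase(spectra, bin_width)))


if __name__ == "__main__":
    print(run_job([("s1", 500.2), ("s2", 500.7), ("s3", 742.4)]))
```

Because each key is reduced independently, the reduce step can run on different machines at once, which is where the framework's scheduling and failure handling come in.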
PRIDE Cluster: Second Implementation
First implementation:
• Griss et al., Nat. Methods, 2013
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
Second implementation:
• Griss et al., Nat. Methods, 2016
• Clustered all public spectra in PRIDE by April 2015
• Apache Hadoop
• Starting with 256 M spectra:
• 190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide)
• 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
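A rough throughput comparison can be derived from the figures on this slide. The Python sketch below is back-of-the-envelope arithmetic, not a benchmark. It assumes that the second run clustered the 111 M filtered unidentified plus 66 M identified spectra, and it treats 5 days × 340 cores as an upper bound on CPU time, so the second rate is a lower bound. Hardware differences between the two runs are ignored, and the variable names are hypothetical.

```python
def spectra_per_cpu_day(n_spectra: float, cpu_days: float) -> float:
    """Average number of spectra processed per CPU day."""
    return n_spectra / cpu_days


# First implementation: both figures are stated on the slide.
first_rate = spectra_per_cpu_day(20.7e6, 610)

# Second implementation: CPU time is not stated, so bound it from above
# by assuming all 340 cores were busy for the full 5 calendar days.
second_cpu_days_upper_bound = 5 * 340
# Assumption: the clustered input is the filtered unidentified spectra
# plus the identified spectra.
second_clustered_spectra = 111e6 + 66e6
second_rate_lower_bound = spectra_per_cpu_day(
    second_clustered_spectra, second_cpu_days_upper_bound
)

speedup_lower_bound = second_rate_lower_bound / first_rate

if __name__ == "__main__":
    print(f"first:  {first_rate:,.0f} spectra per CPU day")
    print(f"second: {second_rate_lower_bound:,.0f} spectra per CPU day (lower bound)")
    print(f"ratio:  {speedup_lower_bound:.2f}x")
```

Under these assumptions, the second implementation processed roughly three times as many spectra per CPU day as the first.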
PRIDE Cluster - Concept
Griss et al., Nat Methods, 2016
[Figure: originally submitted identified spectra (e.g. identified as NMMAACDPR or PPECPDFDPPR) are grouped by spectrum clustering; each cluster is summarised by a consensus spectrum.]
Threshold: at least 3 spectra in a cluster and ratio >70%.
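One simple way to picture a consensus spectrum is to bin the peaks of all member spectra by m/z, average the intensities and keep only the peaks seen in enough members. The Python sketch below illustrates that idea under those assumptions. It is not the consensus algorithm used by PRIDE Cluster, and the function name, bin width and minimum fraction are hypothetical choices.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def consensus_spectrum(
    spectra: List[List[Peak]], bin_width: float = 0.5, min_fraction: float = 0.5
) -> List[Peak]:
    """Average binned peaks over all member spectra of one cluster.

    A peak is kept only if it occurs in at least `min_fraction` of the spectra.
    Intensities are averaged over all spectra, so a missing peak counts as zero.
    """
    if not spectra:
        return []
    intensity_sum: Dict[int, float] = defaultdict(float)
    occurrences: Dict[int, int] = defaultdict(int)
    for peaks in spectra:
        bins_seen = set()
        for mz, intensity in peaks:
            bin_index = round(mz / bin_width)
            intensity_sum[bin_index] += intensity
            bins_seen.add(bin_index)
        for bin_index in bins_seen:
            occurrences[bin_index] += 1  # count each spectrum once per bin
    n_spectra = len(spectra)
    return [
        (bin_index * bin_width, intensity_sum[bin_index] / n_spectra)
        for bin_index in sorted(intensity_sum)
        if occurrences[bin_index] / n_spectra >= min_fraction
    ]


if __name__ == "__main__":
    members = [
        [(100.0, 10.0), (200.0, 4.0)],
        [(100.1, 20.0)],
        [(100.0, 30.0), (300.0, 6.0)],
    ]
    print(consensus_spectrum(members))
```

In the toy example, only the peak near m/z 100 appears in most member spectra, so it is the only one that survives.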
Output of the analysis
• 1. Inconsistent spectrum clusters
• 2. Clusters including identified and unidentified spectra.
• 3. Clusters just containing unidentified spectra.
1. Re-analysis of inconsistent clusters
[Figure: originally submitted identified spectra are grouped by spectrum clustering. The example cluster mixes spectra identified as NMMAACDPR, IGGIGTVPVGR, PPECPDFDPPR and VFDEFKPLVEEPQNLIK, and is summarised by a consensus spectrum.]
No sequence has a proportion in the cluster >50%.
1. Re-analysis of inconsistent clusters
• Re-analysed 3,997 large (>100 spectra), inconsistent clusters with PepNovo, SpectraST and X!Tandem.
• 453 clusters (11%) were identified as peptides originating from keratins, trypsin, albumin and hemoglobin.
• In these cases, it is likely that a contaminants database was not used in the search.
2. Inferring identifications for originally unidentified spectra
• 9.1 M unidentified spectra were contained in clusters with a reliable identification.
• These are candidate new identifications (that need to be confirmed), often missed due to search engine settings.
• Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
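The inference step described on this slide can be sketched as follows: within a reliable cluster, the dominant peptide sequence is proposed for every spectrum that has no identification. This Python sketch is illustrative and not the PRIDE Cluster code. It assumes the cluster size counts all spectra and the ratio is computed over identified spectra only; the function name is hypothetical, and the default thresholds (3 spectra, 70%) follow the second PRIDE Cluster iteration.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

ClusterMember = Tuple[str, Optional[str]]  # (spectrum id, sequence or None)


def candidate_identifications(
    cluster: List[ClusterMember], min_size: int = 3, min_ratio: float = 0.7
) -> Dict[str, str]:
    """Propose the dominant sequence for the unidentified spectra of a cluster.

    Returns an empty dict when the cluster is too small, has no identified
    spectra, or has no sufficiently dominant sequence.
    """
    identified = [sequence for _, sequence in cluster if sequence is not None]
    if len(cluster) < min_size or not identified:
        return {}
    top_sequence, top_count = Counter(identified).most_common(1)[0]
    if top_count / len(identified) <= min_ratio:
        return {}
    # Candidates only: they still need to be confirmed independently.
    return {
        spectrum_id: top_sequence
        for spectrum_id, sequence in cluster
        if sequence is None
    }


if __name__ == "__main__":
    cluster = [
        ("s1", "NMMAACDPR"),
        ("s2", "NMMAACDPR"),
        ("s3", None),
        ("s4", "NMMAACDPR"),
    ]
    print(candidate_identifications(cluster))
```

The result is a list of candidates rather than confirmed identifications, in line with the caveat on the slide.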
3. Consistently unidentified clusters
• 19 M clusters contain only unidentified spectra.
• 41,155 of these clusters have more than 100 spectra (= 12 M spectra).
• Most of them are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides.
• With various methods, we found likely identifications for about 20%.
• Vast amount of data mining remains to be done.
PRIDE Cluster as a Public Data Mining Resource
• http://www.ebi.ac.uk/pride/cluster
• Spectral libraries for 16 species.
• All clustering results, as well as specific subsets of interest available.
• Source code (open source) and Java API
Other Applications of spectrum clustering…
• In individual datasets or small groups of “similar” proteomics datasets:
• Can be used to target spectra that are “consistently” unidentified.
• Unidentified spectra could represent PTMs or sequence variants.
• Try “more-expensive” computational analysis methods (e.g. spectral searches, de novo).
• When mixing identified and unidentified spectra from different experiments, if PTMs that were not found initially are identified, one could modify the initial search parameters.
Summary part 2
• Using a big data approach we are able to get extra knowledge from all the public data in PRIDE Archive.
• Spectrum clustering enables QC in proteomics resources such as PRIDE Archive.
• It is possible to detect spectra that are consistently unidentified across hundreds of datasets (maybe peptide variants, or peptides with PTMs not initially considered).
• Spectrum clustering is applicable in the analysis of individual datasets (and not only for proteomics!).
Acknowledgements: People
Attila Csordas
Tobias Ternent
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Former team members,
especially Rui Wang, Florian
Reisinger, Noemi del Toro, Jose
A. Dianes & Henning Hermjakob
Acknowledgements: The PRIDE Team
All data submitters !!!