Proteomics and the “big data” trend:
challenges and new possibilities
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK
Juan A. Vizcaíno
juan@ebi.ac.uk
Colloquium
Dortmund, 11 August 2016
Overview
• Intro: Concept of “Big data” in biology and proteomics
• PRIDE Archive and ProteomeXchange
• PRIDE tools
• Reuse of public proteomics data
• Working with “Big data”: PRIDE Cluster
“Big data” jobs
http://www.indeed.co.uk/Big-Data-jobs
“Big data”: definition
Slide from: http://www.ibmbigdatahub.com/
“Big data” is everywhere…
“Big data” in biology
The term has been applied so far mainly to genomics
“Big data” in biology: personalised medicine
• Aim: Healthcare becomes patient-centric for the first time
• Personalised medicine
Slide from: http://vector.childrenshospital.org/wp-content/uploads/2016/01/What-is-precision-medicine.jpg
One slide intro to MS based proteomics
Hein et al., Handbook of Systems Biology, 2012
Overview
• Intro: Concept of “Big data” in biology and proteomics
• PRIDE Archive and ProteomeXchange
• PRIDE tools
• Reuse of public proteomics data
• Working with “Big data”: PRIDE Cluster
Data resources at EMBL-EBI
• Genes, genomes & variation: Ensembl, Ensembl Genomes, European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive, GWAS Catalog, Metagenomics portal
• Gene & protein expression: ArrayExpress, Expression Atlas, PRIDE
• Protein sequences, families & motifs: InterPro, Pfam, UniProt
• Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
• Chemical biology: ChEMBL, ChEBI
• Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
• Systems: BioModels, Enzyme Portal, BioSamples
• Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
PRIDE (PRoteomics IDEntifications) database
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
http://www.ebi.ac.uk/pride/archive
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
ProteomeXchange Consortium
• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK), MassIVE (UCSD, San Diego) and, recently, jPOST (Japan).
• Common identifier space (PXD identifiers)
• Two supported data workflows: MS/MS and SRM.
• Main objective: Make life easier for researchers
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
What is a proteomics publication in 2016?
• Proteomics studies generate potentially large amounts of data and results.
• Ideally, a proteomics publication needs to:
• Summarize the results of the study
• Provide supporting information for reliability of any results reported
• Information in a publication:
• Manuscript
• Supplementary material
• Associated data submitted to a public repository
PRIDE Archive – over 4,000 datasets, > 51 countries and 1,700 groups
• USA – 814 datasets
• Germany – 528
• UK – 338
• China – 328
• France – 222
• Netherlands – 175
• Canada – 137
Data volume:
• Total: ~225 TB
• Number of all files: ~560,000
• PXD000320-324: ~4 TB
• PXD002319-26: ~2.4 TB
• PXD001471: ~1.6 TB
• 1973 datasets, i.e. 52% of all, are publicly accessible
• > 90% of all ProteomeXchange data
[Figure: PRIDE Archive growth. Submissions per year, all submissions vs. complete submissions]
In the last year: >150 submitted datasets per month
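The data volume list above abbreviates runs of accessions, such as PXD000320-324. A minimal Python sketch of how such shorthand could be expanded into full PXD identifiers follows. It assumes that an accession is "PXD" followed by six digits and that the suffix after the hyphen replaces the trailing digits of the first accession. The function name `expand_pxd_range` is hypothetical and not part of any PRIDE tool.

```python
import re
from typing import List

# Assumption: a ProteomeXchange accession is "PXD" plus six digits.
PXD_PATTERN = re.compile(r"^PXD(\d{6})$")


def expand_pxd_range(shorthand: str) -> List[str]:
    """Expand shorthand such as 'PXD000320-324' into full accessions."""
    start, _, suffix = shorthand.partition("-")
    match = PXD_PATTERN.match(start)
    if match is None:
        raise ValueError(f"not a PXD accession: {start!r}")
    if not suffix:
        # A single accession without a range suffix.
        return [start]
    if not suffix.isdigit() or len(suffix) > 6:
        raise ValueError(f"invalid range suffix: {suffix!r}")
    digits = match.group(1)
    first = int(digits)
    # The suffix overwrites the trailing digits of the first accession.
    last = int(digits[: len(digits) - len(suffix)] + suffix)
    if last < first:
        raise ValueError(f"range end precedes start: {shorthand!r}")
    return [f"PXD{number:06d}" for number in range(first, last + 1)]


if __name__ == "__main__":
    for item in ("PXD000320-324", "PXD002319-26", "PXD001471"):
        print(item, "->", expand_pxd_range(item))
```

Read this way, the three entries in the list cover 5, 8 and 1 datasets respectively.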
PRIDE: Source of MS proteomics data
• PRIDE Archive already provides, or will soon provide, MS proteomics data to other EMBL-EBI resources such as UniProt, Ensembl and the EBI Expression Atlas.
http://www.ebi.ac.uk/pride/archive
Overview
• Intro: Concept of “Big data” in biology and proteomics
• PRIDE Archive and ProteomeXchange
• PRIDE tools
• Reuse of public proteomics data
• Working with “Big data”: PRIDE Cluster
PRIDE Components: Data Submission Process
[Diagram: PRIDE Converter 2, PRIDE Inspector and PX Submission Tool; file formats mzIdentML and PRIDE XML]
In addition to PRIDE Archive, the PRIDE team develops and maintains different tools and software libraries to facilitate the handling and visualisation of MS proteomics data and the submission process.
PRIDE Inspector Toolsuite
Wang et al., Nat. Biotechnology, 2012
Perez-Riverol et al., Bioinformatics, 2015
Perez-Riverol et al., MCP, 2016
• PRIDE Inspector - standalone tool to enable visualisation and validation of MS data.
• Built on top of ms-data-core-api - open source algorithms and libraries for computational proteomics.
• Supported file formats: mzIdentML, mzML, mzTab (PSI standards), and PRIDE XML.
• Broad functionality.
https://github.com/PRIDE-Utilities/ms-data-core-api
https://github.com/PRIDE-Toolsuite/pride-inspector
PRIDE Inspector Functionality
[Screenshots: summary and QC charts; peptide spectra annotation and visualisation; protein groups inference]
• Protein view containing protein inference information
• Quantification view
• Multiple export options (.mgf, protein/peptide tables, mzTab file)
• Direct access to PRIDE datasets
• Summary and QC charts (Delta m/z, precursor charges, etc.)
• Spectra view (fragmentation table, ion series annotation)
• Protein inference algorithm and protein groups visualisation
PX Submission Tool
• Desktop application for data submissions to ProteomeXchange via PRIDE
• Implemented in Java 7
• Streamlines the submission process
• Captures mappings between files
• Retains metadata
• Fast file transfer with Aspera (FASP® transfer technology); FTP also available
• Command line option
[Screenshot: PX Submission Tool]
Overview
• Intro: Concept of “Big data” in biology and proteomics
• PRIDE Archive and ProteomeXchange
• PRIDE tools
• Reuse of public proteomics data
• Working with “Big data”: PRIDE Cluster
Datasets are being reused more and more…
Data download volume in 2015: ~ 200 TB
Vaudel et al., Proteomics, 2016
Data sharing in Proteomics
Vaudel et al., Proteomics, 2016
Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014
Kim et al., Nature, 2014
• Two independent groups claimed to have produced the first complete draft of the human proteome by MS.
• Some of their findings are controversial and need further validation, but they generated a lot of discussion and put proteomics in the spotlight.
• They used many different tissues.
[Figure: Nature cover, 29 May 2014]
Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014
• Around 60% of the data used for the analysis comes from previous experiments, most of them stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE.
• They complement that data with “exotic” tissues.
Challenges for data reuse in proteomics
• Insufficient technical and biological metadata.
• Large computational infrastructure may be needed (e.g. when analysing many datasets together).
• Shortage of expertise (people).
• Lack of standardisation in the field.
Summary of the talk so far
• PRIDE Archive and other ProteomeXchange resources make data sharing possible in the MS proteomics field.
• Data sharing is becoming the norm in the field.
• Standalone tools: PRIDE Inspector and PX Submission tool.
• Datasets are increasingly reused (many opportunities):
• Example of one of the drafts of the human proteome.
• But there are important challenges as well.
Overview
• Intro: Concept of “Big data” in biology and proteomics
• PRIDE Archive and ProteomeXchange
• PRIDE tools
• Reuse of public proteomics data
• Working with Big data: PRIDE Cluster
PRIDE Cluster: Initial Motivation
• Provide a QC-filtered peptide-centric view of PRIDE.
• Data is stored in PRIDE Archive as originally analysed by the submitters (no data reprocessing is done).
• Heterogeneous quality, difficult to make the data comparable.
• Enable assessment of (published) proteomics data. Prerequisite for data reuse (e.g. in UniProt).
PRIDE Cluster - Concept
Griss et al., Nat Methods, 2013
[Figure: originally submitted identified spectra (peptides NMMAACDPR and PPECPDFDPPR) are grouped by spectrum clustering, and a consensus spectrum is built for each cluster]
Threshold: At least 10 spectra in a cluster and ratio >70%.
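The threshold on this slide can be sketched in a few lines of Python. The sketch assumes that "ratio" means the share of the most frequent peptide sequence among the spectra of a cluster; the slide does not spell this out. The function names `dominant_peptide` and `is_reliable` are hypothetical and not taken from the PRIDE Cluster code base.

```python
from collections import Counter
from typing import List, Optional, Tuple


def dominant_peptide(peptides: List[str]) -> Tuple[Optional[str], float]:
    """Return the most frequent peptide and its share of the cluster."""
    if not peptides:
        return None, 0.0
    sequence, count = Counter(peptides).most_common(1)[0]
    return sequence, count / len(peptides)


def is_reliable(peptides: List[str],
                min_size: int = 10,
                min_ratio: float = 0.7) -> bool:
    """Apply the slide's rule: enough spectra and a dominant sequence.

    Assumption: the ratio is computed over all spectra passed in.
    """
    _, ratio = dominant_peptide(peptides)
    return len(peptides) >= min_size and ratio > min_ratio


if __name__ == "__main__":
    cluster = ["NMMAACDPR"] * 9 + ["PPECPDFDPPR"] * 2
    print(dominant_peptide(cluster), is_reliable(cluster))
```

Keeping `min_size` and `min_ratio` as parameters makes it easy to try other cut-offs on the same clusters.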
PRIDE Cluster - Concept
PRIDE Cluster: Implementation
• Griss et al., Nat. Methods, 2013
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
PRIDE Cluster Iteration 2: Why?
• PRIDE Archive has experienced a huge increase in data since 2013.
• We wanted to develop an algorithm that could also work with unidentified spectra.
[Figure: PRIDE Archive growth. Submissions per year, all submissions vs. complete submissions]
Parallelizing Spectrum Clustering: Hadoop
• Optimizes work distribution among machines.
• Hadoop is an open-source framework for parallel processing, built around the MapReduce programming model introduced by Google.
• Solves many general issues of large parallel jobs:
• Scheduling
• Inter-job communication
• Failure handling
https://hadoop.apache.org/
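To make the map, shuffle and reduce steps concrete, here is a minimal single-process sketch in plain Python. It does not use the Hadoop API and is not the PRIDE Cluster implementation. Grouping spectra by precursor m/z bin is an assumption chosen purely for illustration, as are the `(spectrum id, precursor m/z)` tuple layout and the function names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record layout: (spectrum id, precursor m/z).
Spectrum = Tuple[str, float]


def map_phase(spectra: Iterable[Spectrum],
              bin_width: float = 1.0) -> Iterator[Tuple[int, str]]:
    """Map: emit one (key, value) pair per spectrum, keyed by m/z bin."""
    for spectrum_id, precursor_mz in spectra:
        yield int(precursor_mz // bin_width), spectrum_id


def shuffle(pairs: Iterable[Tuple[int, str]]) -> Dict[int, List[str]]:
    """Shuffle: group all values that share a key (done by the framework)."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[int, List[str]]) -> Dict[int, int]:
    """Reduce: aggregate each group independently, here by counting."""
    return {key: len(values) for key, values in grouped.items()}


if __name__ == "__main__":
    spectra = [("s1", 500.2), ("s2", 500.7), ("s3", 612.3)]
    print(reduce_phase(shuffle(map_phase(spectra))))
```

Because each key is reduced independently, a framework can spread the groups across many machines, which is where the scheduling and failure handling listed above come in.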
PRIDE Cluster: Second Implementation
• Griss et al., Nat. Methods, 2013
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
• Griss et al., Nat. Methods, 2016
• Clustered all public spectra in PRIDE by April 2015
• Apache Hadoop
• Starting with 256 M spectra
• 190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide)
• 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
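The figures on this slide allow a rough back-of-the-envelope comparison of the two runs, sketched below in Python. Two assumptions go beyond the slide: "two calendar weeks" is taken as 14 days, and the mean cluster size assumes that all 111 M filtered plus 66 M identified spectra ended up in the 28 M clusters. The 340 cores times 5 days figure is an upper bound on capacity, not measured CPU time.

```python
def average_busy_cores(cpu_days: float, calendar_days: float) -> float:
    """Average number of cores kept busy over the whole run."""
    return cpu_days / calendar_days


def core_day_capacity(cores: int, calendar_days: float) -> float:
    """Upper bound on core-days available during the run."""
    return cores * calendar_days


# First iteration (2013): 610 CPU days in two calendar weeks.
avg_cores_2013 = average_busy_cores(610, 14)

# Second iteration (2016): 340 cores for 5 calendar days.
capacity_2016 = core_day_capacity(340, 5)

# Input after filtering, in millions of spectra (assumption: nothing
# else was removed before clustering).
input_2016_millions = 111 + 66
mean_cluster_size = input_2016_millions / 28

if __name__ == "__main__":
    print(f"2013: about {avg_cores_2013:.1f} cores busy on average")
    print(f"2016: at most {capacity_2016:.0f} core-days available")
    print(f"2016: {input_2016_millions} M spectra, "
          f"about {mean_cluster_size:.1f} spectra per cluster")
```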
PRIDE Cluster - Concept
Griss et al., Nat Methods, 2016
[Figure: originally submitted identified spectra (peptides NMMAACDPR and PPECPDFDPPR) are grouped by spectrum clustering, and a consensus spectrum is built for each cluster]
Threshold: At least 3 spectra in a cluster and ratio >70%.
PRIDE Cluster Home page
http://www.ebi.ac.uk/pride/cluster/#/
Examples: one perfect cluster
- 880 PSMs give the same peptide ID
- 4 species
- 28 datasets
- Same instruments
Examples: one perfect cluster (2)
PRIDE Cluster
Sequence-based
search engines
Spectrum clustering
Incorrectly or
unidentified spectra
Output of the analysis
• 1. Inconsistent spectrum clusters
• 2. Clusters including identified and unidentified spectra.
• 3. Clusters just containing unidentified spectra.
1. Re-analysis of inconsistent clusters
[Figure: originally submitted identified spectra with conflicting peptide assignments (NMMAACDPR, IGGIGTVPVGR, PPECPDFDPPR, VFDEFKPLVEEPQNLIK) are grouped by spectrum clustering into one cluster with a single consensus spectrum]
No sequence has a proportion in the cluster >50%
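The three kinds of output described in this part of the talk can be told apart with a small classifier, sketched below in Python. It assumes that the proportion is computed among the identified spectra of a cluster and that `None` marks an unidentified spectrum. The labels and the function name `classify_cluster` are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def classify_cluster(peptides: List[Optional[str]],
                     inconsistency_cutoff: float = 0.5) -> str:
    """Label a cluster from the peptide assigned to each spectrum.

    None stands for an unidentified spectrum (an assumption of this sketch).
    """
    identified = [peptide for peptide in peptides if peptide is not None]
    if not identified:
        return "unidentified"
    top_count = Counter(identified).most_common(1)[0][1]
    # Inconsistent: no sequence has a proportion above the cut-off.
    if top_count / len(identified) <= inconsistency_cutoff:
        return "inconsistent"
    if len(identified) < len(peptides):
        return "mixed"
    return "consistent"


if __name__ == "__main__":
    figure_cluster = (["NMMAACDPR"] * 4 + ["IGGIGTVPVGR"] * 2
                      + ["PPECPDFDPPR", "VFDEFKPLVEEPQNLIK"])
    print(classify_cluster(figure_cluster))
```

With the eight sequences recovered from the figure, the most frequent peptide covers 4 of 8 spectra, which does not exceed 50%, so the cluster counts as inconsistent.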
1. Re-analysis of inconsistent clusters
• Re-analysed 3,997 large (>100 spectra), inconsistent clusters with PepNovo, SpectraST and X!Tandem.
• 453 clusters (11%) were identified as peptides originating from keratins, trypsin, albumin, and hemoglobin.
• In these cases, it is likely that a contaminants database was not used in the search.
Validation
2. Inferring identifications for originally unidentified spectra
• 9.1 M unidentified spectra were contained in clusters with a reliable
identification.
• These are candidate new identifications (that need to be confirmed),
often missed due to search engine settings
• Example: 49,263 reliable clusters (containing 560,000 identified and
130,000 unidentified spectra) contained phosphorylated peptides,
many of them from non-enriched studies.
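The inference step described here can be sketched as follows: when a cluster has a reliable identification, its dominant peptide becomes a candidate for the unidentified spectra in the same cluster. This is a minimal illustration, not the published method. It assumes the cut-offs of at least 3 spectra and a ratio above 70% computed among identified spectra, and the `(spectrum id, peptide or None)` layout and function name are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Hypothetical record layout: (spectrum id, peptide or None if unidentified).
ClusterMember = Tuple[str, Optional[str]]


def candidate_identifications(cluster: List[ClusterMember],
                              min_size: int = 3,
                              min_ratio: float = 0.7) -> Dict[str, str]:
    """Propose the dominant peptide for unidentified spectra in a cluster.

    Returns an empty mapping when the cluster is not reliable. The
    proposals are candidates only and still need to be confirmed.
    """
    identified = [peptide for _, peptide in cluster if peptide is not None]
    if len(cluster) < min_size or not identified:
        return {}
    sequence, count = Counter(identified).most_common(1)[0]
    if count / len(identified) <= min_ratio:
        return {}
    return {spectrum_id: sequence
            for spectrum_id, peptide in cluster if peptide is None}


if __name__ == "__main__":
    cluster = [("s1", "NMMAACDPR"), ("s2", "NMMAACDPR"),
               ("s3", None), ("s4", "NMMAACDPR")]
    print(candidate_identifications(cluster))
```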
3. Consistently unidentified clusters
• 19 M clusters contain only unidentified spectra.
• 41,155 of these clusters have more than 100 spectra (= 12 M spectra).
• Most of them are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides.
• With various methods, we found likely identifications for about 20%.
• Vast amount of data mining remains to be done.
3. Consistently unidentified clusters
PRIDE Cluster as a Public Data Mining Resource
• http://www.ebi.ac.uk/pride/cluster
• Spectral libraries for 16 species.
• All clustering results, as well as specific subsets of interest available.
• Source code (open source) and Java API
Other Applications of spectrum clustering…
• In individual datasets or small groups of “similar” proteomics datasets:
• Can be used to target spectra that are “consistently” unidentified.
• Unidentified spectra could represent PTMs or sequence variants.
• Try “more-expensive” computational analysis methods (e.g. spectral searches, de novo).
• When mixing identified and unidentified spectra from different experiments, if PTMs not found initially are identified, one could modify the initial search parameters.
Other applications of spectrum clustering…
• Spectrum clustering can also be applied to MS/MS lipidomics studies
Summary part 2
• Using a big data approach we are able to get extra knowledge from all the public data in PRIDE Archive.
• Spectrum clustering enables QC in proteomics resources such as PRIDE Archive.
• It is possible to detect spectra that are consistently unidentified across hundreds of datasets (maybe peptide variants, or peptides with PTMs not initially considered).
• Spectrum clustering is applicable in the analysis of individual datasets (and not only for proteomics!).
Acknowledgements: People
The PRIDE Team:
Attila Csordas
Tobias Ternent
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob
All data submitters!
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Mining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsMining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsJuan Antonio Vizcaino
 
An overview of the PRIDE ecosystem of resources and computational tools for m...
An overview of the PRIDE ecosystem of resources and computational tools for m...An overview of the PRIDE ecosystem of resources and computational tools for m...
An overview of the PRIDE ecosystem of resources and computational tools for m...Juan Antonio Vizcaino
 
How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?Juan Antonio Vizcaino
 
TIB's action for research data managament as a national library's strategy in...
TIB's action for research data managament as a national library's strategy in...TIB's action for research data managament as a national library's strategy in...
TIB's action for research data managament as a national library's strategy in...Peter Löwe
 
What is Reproducibility? The R* brouhaha (and how Research Objects can help)
What is Reproducibility? The R* brouhaha (and how Research Objects can help)What is Reproducibility? The R* brouhaha (and how Research Objects can help)
What is Reproducibility? The R* brouhaha (and how Research Objects can help)Carole Goble
 
OpenAIRE in 8 minutes - Introduction to European einfrastructures session at ...
OpenAIRE in 8 minutes - Introduction to European einfrastructures session at ...OpenAIRE in 8 minutes - Introduction to European einfrastructures session at ...
OpenAIRE in 8 minutes - Introduction to European einfrastructures session at ...OpenAIRE
 
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...Carole Goble
 
ICIC 2013 Conference Proceedings Uwe Rosemann TIB
ICIC 2013 Conference Proceedings Uwe Rosemann TIBICIC 2013 Conference Proceedings Uwe Rosemann TIB
ICIC 2013 Conference Proceedings Uwe Rosemann TIBDr. Haxel Consult
 
FAIR Data, Operations and Model management for Systems Biology and Systems Me...
FAIR Data, Operations and Model management for Systems Biology and Systems Me...FAIR Data, Operations and Model management for Systems Biology and Systems Me...
FAIR Data, Operations and Model management for Systems Biology and Systems Me...Carole Goble
 
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...Carole Goble
 
Reflections on a (slightly unusual) multi-disciplinary academic career
Reflections on a (slightly unusual) multi-disciplinary academic careerReflections on a (slightly unusual) multi-disciplinary academic career
Reflections on a (slightly unusual) multi-disciplinary academic careerCarole Goble
 
FAIR data and model management for systems biology.
FAIR data and model management for systems biology.FAIR data and model management for systems biology.
FAIR data and model management for systems biology.FAIRDOM
 
Reproducible Research: how could Research Objects help
Reproducible Research: how could Research Objects helpReproducible Research: how could Research Objects help
Reproducible Research: how could Research Objects helpCarole Goble
 
Reproducibility, Research Objects and Reality, Leiden 2016
Reproducibility, Research Objects and Reality, Leiden 2016Reproducibility, Research Objects and Reality, Leiden 2016
Reproducibility, Research Objects and Reality, Leiden 2016Carole Goble
 
OEG-Tools for supporting Ontology Engineering
OEG-Tools for supporting Ontology EngineeringOEG-Tools for supporting Ontology Engineering
OEG-Tools for supporting Ontology EngineeringMaría Poveda Villalón
 
Introduction to FAIRDOM
Introduction to FAIRDOMIntroduction to FAIRDOM
Introduction to FAIRDOMCarole Goble
 
Improving the Management of Computational Models -- Invited talk at the EBI
Improving the Management of Computational Models -- Invited talk at the EBIImproving the Management of Computational Models -- Invited talk at the EBI
Improving the Management of Computational Models -- Invited talk at the EBIMartin Scharm
 
Research Shared: researchobject.org
Research Shared: researchobject.orgResearch Shared: researchobject.org
Research Shared: researchobject.orgNorman Morrison
 
Research Objects, SEEK and FAIRDOM
Research Objects, SEEK and FAIRDOMResearch Objects, SEEK and FAIRDOM
Research Objects, SEEK and FAIRDOMCarole Goble
 

Was ist angesagt? (20)

Mining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsMining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasets
 
Pride cluster presentation
Pride cluster presentation Pride cluster presentation
Pride cluster presentation
 
An overview of the PRIDE ecosystem of resources and computational tools for m...
An overview of the PRIDE ecosystem of resources and computational tools for m...An overview of the PRIDE ecosystem of resources and computational tools for m...
An overview of the PRIDE ecosystem of resources and computational tools for m...
 
How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?
 
TIB's action for research data managament as a national library's strategy in...
TIB's action for research data managament as a national library's strategy in...TIB's action for research data managament as a national library's strategy in...
TIB's action for research data managament as a national library's strategy in...
 
What is Reproducibility? The R* brouhaha (and how Research Objects can help)
What is Reproducibility? The R* brouhaha (and how Research Objects can help)What is Reproducibility? The R* brouhaha (and how Research Objects can help)
What is Reproducibility? The R* brouhaha (and how Research Objects can help)
 
OpenAIRE in 8 minutes - Introduction to European einfrastructures session at ...
OpenAIRE in 8 minutes - Introduction to European einfrastructures session at ...OpenAIRE in 8 minutes - Introduction to European einfrastructures session at ...
OpenAIRE in 8 minutes - Introduction to European einfrastructures session at ...
 
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
 
ICIC 2013 Conference Proceedings Uwe Rosemann TIB
ICIC 2013 Conference Proceedings Uwe Rosemann TIBICIC 2013 Conference Proceedings Uwe Rosemann TIB
ICIC 2013 Conference Proceedings Uwe Rosemann TIB
 
FAIR Data, Operations and Model management for Systems Biology and Systems Me...
FAIR Data, Operations and Model management for Systems Biology and Systems Me...FAIR Data, Operations and Model management for Systems Biology and Systems Me...
FAIR Data, Operations and Model management for Systems Biology and Systems Me...
 
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
 
Reflections on a (slightly unusual) multi-disciplinary academic career
Reflections on a (slightly unusual) multi-disciplinary academic careerReflections on a (slightly unusual) multi-disciplinary academic career
Reflections on a (slightly unusual) multi-disciplinary academic career
 
FAIR data and model management for systems biology.
FAIR data and model management for systems biology.FAIR data and model management for systems biology.
FAIR data and model management for systems biology.
 
Reproducible Research: how could Research Objects help
Reproducible Research: how could Research Objects helpReproducible Research: how could Research Objects help
Reproducible Research: how could Research Objects help
 
Reproducibility, Research Objects and Reality, Leiden 2016
Reproducibility, Research Objects and Reality, Leiden 2016Reproducibility, Research Objects and Reality, Leiden 2016
Reproducibility, Research Objects and Reality, Leiden 2016
 
OEG-Tools for supporting Ontology Engineering
OEG-Tools for supporting Ontology EngineeringOEG-Tools for supporting Ontology Engineering
OEG-Tools for supporting Ontology Engineering
 
Introduction to FAIRDOM
Introduction to FAIRDOMIntroduction to FAIRDOM
Introduction to FAIRDOM
 
Improving the Management of Computational Models -- Invited talk at the EBI
Improving the Management of Computational Models -- Invited talk at the EBIImproving the Management of Computational Models -- Invited talk at the EBI
Improving the Management of Computational Models -- Invited talk at the EBI
 
Research Shared: researchobject.org
Research Shared: researchobject.orgResearch Shared: researchobject.org
Research Shared: researchobject.org
 
Research Objects, SEEK and FAIRDOM
Research Objects, SEEK and FAIRDOMResearch Objects, SEEK and FAIRDOM
Research Objects, SEEK and FAIRDOM
 

Andere mochten auch

The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)Juan Antonio Vizcaino
 
Integrative_omics_lecture_feb112016_UAB
Integrative_omics_lecture_feb112016_UABIntegrative_omics_lecture_feb112016_UAB
Integrative_omics_lecture_feb112016_UABSophia Banton
 
Usability and Bioinformatics: experience and research challenges
Usability and Bioinformatics: experience and research challengesUsability and Bioinformatics: experience and research challenges
Usability and Bioinformatics: experience and research challengesbolk
 
BPIPE: a bioinformatics pipeline framework
BPIPE: a bioinformatics pipeline frameworkBPIPE: a bioinformatics pipeline framework
BPIPE: a bioinformatics pipeline frameworkMohamed Nadhir Djekidel
 
Multi-omics Pathway Visualization
Multi-omics Pathway VisualizationMulti-omics Pathway Visualization
Multi-omics Pathway VisualizationAnwesha Bohler
 
The Ondex Data Integration Framework
The Ondex Data Integration FrameworkThe Ondex Data Integration Framework
The Ondex Data Integration Frameworkbosc
 
Knowledge management for integrative omics data analysis
Proteomics and the "big data" trend: challenges and new possibilities (Talk at ISAS Dortmund)

  • 1. Proteomics and the “big data” trend: challenges and new possibilities Dr. Juan Antonio Vizcaíno Proteomics Team Leader EMBL-European Bioinformatics Institute Hinxton, Cambridge, UK
  • 2. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 Overview • Intro: Concept of “Big data” in biology and proteomics • PRIDE Archive and ProteomeXchange • PRIDE tools • Reuse of public proteomics data • Working with “Big data”: PRIDE Cluster
  • 3. “Big data” jobs http://www.indeed.co.uk/Big-Data-jobs
  • 4. “Big data”: definition Slide from: http://www.ibmbigdatahub.com/
  • 5. “Big data” is everywhere…
  • 6. “Big data” in biology: the term has so far been applied mainly to genomics
  • 7. “Big data” in biology: personalised medicine • Aim: Healthcare becomes patient-centric for the first time • Personalised medicine Slide from: http://vector.childrenshospital.org/wp-content/uploads/2016/01/What-is-precision-medicine.jpg
  • 8. One-slide intro to MS-based proteomics (Hein et al., Handbook of Systems Biology, 2012)
  • 9. Overview • Intro: Concept of “Big data” in biology and proteomics • PRIDE Archive and ProteomeXchange • PRIDE tools • Reuse of public proteomics data • Working with “Big data”: PRIDE Cluster
  • 10. Data resources at EMBL-EBI • Genes, genomes & variation: European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive, Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal • Gene & protein expression: ArrayExpress, Expression Atlas, PRIDE • Protein sequences, families & motifs: InterPro, Pfam, UniProt • Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank • Chemical biology: ChEMBL, ChEBI • Reactions, interactions & pathways: IntAct, Reactome, MetaboLights • Systems: BioModels, Enzyme Portal, BioSamples • Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
  • 12. PRIDE (PRoteomics IDEntifications) database http://www.ebi.ac.uk/pride/archive • PRIDE stores mass spectrometry (MS)-based proteomics data: • Peptide and protein expression data (identification and quantification) • Post-translational modifications • Mass spectra (raw data and peak lists) • Technical and biological metadata • Any other related information • Full support for tandem MS approaches. References: Martens et al., Proteomics, 2005; Vizcaíno et al., NAR, 2016
  • 13. ProteomeXchange Consortium http://www.proteomexchange.org • Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories. • Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK), MassIVE (UCSD, San Diego) and recently jPOST (Japan). • Common identifier space (PXD identifiers) • Two supported data workflows: MS/MS and SRM. • Main objective: Make life easier for researchers. Reference: Vizcaíno et al., Nat Biotechnol, 2014
  • 14. What is a proteomics publication in 2016? • Proteomics studies generate potentially large amounts of data and results. • Ideally, a proteomics publication needs to: • Summarize the results of the study • Provide supporting information for reliability of any results reported • Information in a publication: • Manuscript • Supplementary material • Associated data submitted to a public repository
  • 15. PRIDE Archive – over 4,000 datasets, > 51 countries and 1,700 groups. Datasets by country: • USA – 814 • Germany – 528 • UK – 338 • China – 328 • France – 222 • Netherlands – 175 • Canada – 137. Data volume: • Total: ~225 TB • Number of all files: ~560,000 • PXD000320-324: ~4 TB • PXD002319-26: ~2.4 TB • PXD001471: ~1.6 TB • 1973 datasets, i.e. 52% of all, are publicly accessible • > 90% of all ProteomeXchange data. Figure: PRIDE Archive growth (submissions per year; series: all submissions, complete). In the last year: >150 submitted datasets per month
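The data-volume list above abbreviates runs of consecutive datasets as "PXD000320-324" or "PXD002319-26". As a small illustration, the sketch below expands that shorthand into individual PXD identifiers. It assumes, from the examples on the slide, that an identifier is "PXD" followed by six digits and that the number after the hyphen replaces the trailing digits of the first identifier; the function name `expand_pxd` is hypothetical and not part of any PRIDE or ProteomeXchange tool.

```python
import re

# Assumption: a PXD identifier is "PXD" plus six digits, as in the slide's examples.
SINGLE = re.compile(r"^PXD(\d{6})$")
# Assumption: "PXD002319-26" means PXD002319 through PXD002326.
RANGE = re.compile(r"^PXD(\d{6})-(\d{1,6})$")


def expand_pxd(text):
    """Expand 'PXD000320-324' style shorthand into a list of identifiers."""
    text = text.strip()
    if SINGLE.match(text):
        return [text]
    match = RANGE.match(text)
    if match is None:
        raise ValueError(f"not a PXD identifier or range: {text!r}")
    start, suffix = match.groups()
    # The suffix overwrites the last len(suffix) digits of the start number.
    end = start[: len(start) - len(suffix)] + suffix
    if int(end) < int(start):
        raise ValueError(f"range ends before it starts: {text!r}")
    return [f"PXD{number:06d}" for number in range(int(start), int(end) + 1)]


if __name__ == "__main__":
    for shorthand in ("PXD000320-324", "PXD002319-26", "PXD001471"):
        print(shorthand, "->", ", ".join(expand_pxd(shorthand)))
```

Expanding the shorthand first makes it straightforward to look up each dataset individually, since every entry in the common identifier space is addressed by its own identifier.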
  • 16. PRIDE: Source of MS proteomics data • PRIDE Archive already provides or will soon provide MS proteomics data to other EMBL-EBI resources such as UniProt, Ensembl and the EBI Expression Atlas. http://www.ebi.ac.uk/pride/archive
  • 17. Overview • Intro: Concept of “Big data” in biology and proteomics • PRIDE Archive and ProteomeXchange • PRIDE tools • Reuse of public proteomics data • Working with “Big data”: PRIDE Cluster
  • 18. PRIDE Components: Data Submission Process. Diagram labels: PRIDE Converter 2, PRIDE Inspector, PX Submission Tool, mzIdentML, PRIDE XML. In addition to PRIDE Archive, the PRIDE team develops and maintains tools and software libraries to facilitate the handling and visualisation of MS proteomics data and the submission process
  • 19. PRIDE Inspector Toolsuite • PRIDE Inspector: standalone tool to enable visualisation and validation of MS data. • Built on top of ms-data-core-api: open source algorithms and libraries for computational proteomics. • Supported file formats: mzIdentML, mzML, mzTab (PSI standards), and PRIDE XML. • Broad functionality. https://github.com/PRIDE-Utilities/ms-data-core-api https://github.com/PRIDE-Toolsuite/pride-inspector References: Wang et al., Nat. Biotechnology, 2012; Perez-Riverol et al., Bioinformatics, 2015; Perez-Riverol et al., MCP, 2016
  • 20. PRIDE Inspector Functionality. Screenshot labels: summary and QC charts; peptide spectra annotation and visualisation; protein groups inference. • Protein view containing protein inference information • Quantification view • Multiple export options (.mgf, protein/peptide tables, mzTab file) • Direct access to PRIDE datasets • Summary and QC charts (Delta m/z, precursor charges, etc.) • Spectra view (fragmentation table, ion series annotation) • Protein inference algorithm and protein groups visualisation
  • 21. PRIDE Components: Data Submission Process. Diagram labels: PRIDE Converter 2, PRIDE Inspector, PX Submission Tool, mzIdentML, PRIDE XML. In addition to PRIDE Archive, the PRIDE team develops and maintains tools and software libraries to facilitate the handling and visualisation of MS proteomics data and the submission process
  • 22. PX Submission Tool • Desktop application for data submissions to ProteomeXchange via PRIDE • Implemented in Java 7 • Streamlines the submission process • Captures mappings between files • Retains metadata • Fast file transfer with Aspera (FASP® transfer technology); FTP also available • Command line option. Figure: submission tool screenshot
  • 23. Overview • Intro: Concept of “Big data” in biology and proteomics • PRIDE Archive and ProteomeXchange • PRIDE tools • Reuse of public proteomics data • Working with “Big data”: PRIDE Cluster
  • 24. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 Datasets are being reused more and more…. Data download volume in 2015: ~ 200 TB Vaudel et al., Proteomics, 2016
  • 25. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 Data sharing in Proteomics Vaudel et al., Proteomics, 2016
  • 26. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 Draft Human proteome papers published in 2014 Wilhelm et al., Nature, 2014 Kim et al., Nature, 2014 • Two independent groups claimed to have produced the first complete draft of the human proteome by MS. • Some of their findings are controversial and need further validation… but generated a lot of discussion and put proteomics in the spotlight. • They used many different tissues. Nature cover 29 May 2014
  • 27. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 Draft Human proteome papers published in 2014 Wilhelm et al., Nature, 2014 • Around 60% of the data used for the analysis comes from previous experiments, most of them stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE. • They complement that data with “exotic” tissues.
  • 28. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 Challenges for data reuse in proteomics • Insufficient technical and biological metadata. • Large computational infrastructure may be needed (e.g. when analysing many datasets together). • Shortage of expertise (people). • Lack of standardisation in the field.
  • 29. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 Summary of the talk so far • PRIDE Archive and other ProteomeXchange resources make data sharing possible in the MS proteomics field. • Data sharing is becoming the norm in the field. • Standalone tools: PRIDE Inspector and PX Submission tool. • Datasets are increasingly reused (many opportunities): • Example of one of the drafts of the human proteome. • But there are important challenges as well.
  • 30. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 Overview • Intro: Concept of “Big data” in biology and proteomics • PRIDE Archive and ProteomeXchange • PRIDE tools • Reuse of public proteomics data • Working with Big data: PRIDE Cluster
  • 31. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 PRIDE Cluster: Initial Motivation • Provide a QC-filtered peptide-centric view of PRIDE. • Data is stored in PRIDE Archive as originally analysed by the submitters (no data reprocessing is done). • Heterogeneous quality, difficult to make the data comparable. • Enable assessment of (published) proteomics data. Prerequisite for data reuse (e.g. in UniProt).
  • 32. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 PRIDE Cluster - Concept Griss et al., Nat Methods, 2013 Figure: originally submitted identified spectra (peptides NMMAACDPR and PPECPDFDPPR) → spectrum clustering → consensus spectrum. Threshold: At least 10 spectra in a cluster and ratio >70%.
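The threshold on this slide can be illustrated with a short sketch. This is not the PRIDE Cluster code: the function name `is_reliable` and the list-of-sequences representation are hypothetical, and computing the ratio over the identified spectra only is an assumption. The default thresholds (at least 10 spectra, ratio above 70%) are the ones stated on the slide.

```python
from collections import Counter

def is_reliable(peptides, min_size=10, min_ratio=0.7):
    """Return (reliable, top_sequence, ratio) for one cluster.

    `peptides` is a hypothetical representation: the submitted peptide
    sequence of every identified spectrum in the cluster.
    """
    if not peptides:
        return False, None, 0.0
    # Most frequent peptide sequence and its share of the cluster.
    top_sequence, count = Counter(peptides).most_common(1)[0]
    ratio = count / len(peptides)
    reliable = len(peptides) >= min_size and ratio > min_ratio
    return reliable, top_sequence, ratio

# Made-up cluster: 8 spectra identified as one peptide, 2 as another.
cluster = ["NMMAACDPR"] * 8 + ["PPECPDFDPPR"] * 2
reliable, top_sequence, ratio = is_reliable(cluster)
```

Passing `min_size=3` would apply the lower size threshold quoted for the 2016 iteration.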
  • 33. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 PRIDE Cluster - Concept
  • 34. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 PRIDE Cluster: Implementation • Griss et al., Nat. Methods, 2013 • Clustered all public, identified spectra in PRIDE • EBI compute farm, LSF • 20.7 M identified spectra • 610 CPU days, two calendar weeks • Validation, calibration • Feedback into PRIDE datasets
  • 35. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 PRIDE Cluster Iteration 2: Why? • PRIDE Archive has experienced a huge increase in data since 2013. • We wanted to develop an algorithm that could also work with unidentified spectra. Figure: PRIDE Archive growth (submissions per year; series: all submissions, complete submissions)
  • 36. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 Parallelizing Spectrum Clustering: Hadoop • Optimizes work distribution among machines. • Hadoop is an open-source framework for parallel processing based on the MapReduce programming model introduced by Google. • Solves many general issues of large parallel jobs: • Scheduling • Inter-job communication • Failure handling https://hadoop.apache.org/
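The map, shuffle and reduce steps can be sketched in plain Python. This is a single-process illustration of the programming model only: it does not use the Hadoop API and is not the PRIDE Cluster implementation. Binning spectra by precursor m/z, the dictionary layout and the function names are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(spectrum, bin_width=1.0):
    # Map: emit a (key, value) pair. The key is a precursor m/z bin
    # (illustrative choice), so similar spectra share a key.
    key = int(spectrum["precursor_mz"] // bin_width)
    return key, spectrum

def shuffle(pairs):
    # Shuffle: group all values by key. In Hadoop the framework does
    # this across machines; here it is a local dictionary.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, spectra):
    # Reduce: process one group independently of the others. A real
    # job would cluster the spectra; this sketch only counts them.
    return key, len(spectra)

# Made-up input spectra.
spectra = [
    {"id": "s1", "precursor_mz": 500.2},
    {"id": "s2", "precursor_mz": 500.7},
    {"id": "s3", "precursor_mz": 733.4},
]
grouped = shuffle(map_phase(s) for s in spectra)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
```

Because each reduce call sees only its own group, the groups can be processed on different machines, which is what makes the model suited to large parallel jobs.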
  • 37. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 PRIDE Cluster: Second Implementation • Griss et al., Nat. Methods, 2013 • Clustered all public, identified spectra in PRIDE • EBI compute farm, LSF • 20.7 M identified spectra • 610 CPU days, two calendar weeks • Validation, calibration • Feedback into PRIDE datasets • Griss et al., Nat. Methods, 2016 • Clustered all spectra that were public in PRIDE as of April 2015 • Apache Hadoop • Starting with 256 M spectra: • 190 M unidentified spectra (filtered down to the 111 M that are likely to represent a peptide) • 66 M identified spectra • Result: 28 M clusters • 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
  • 38. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 PRIDE Cluster - Concept Griss et al., Nat Methods, 2016 Figure: originally submitted identified spectra (peptides NMMAACDPR and PPECPDFDPPR) → spectrum clustering → consensus spectrum. Threshold: At least 3 spectra in a cluster and ratio >70%.
  • 39. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 PRIDE Cluster Home page http://www.ebi.ac.uk/pride/cluster/#/
  • 40. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 Examples: one perfect cluster - 880 PSMs give the same peptide ID - 4 species - 28 datasets - Same instruments
  • 41. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 Examples: one perfect cluster (2)
  • 42. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 PRIDE Cluster Sequence-based search engines Spectrum clustering Incorrectly or unidentified spectra
  • 43. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 Output of the analysis • 1. Inconsistent spectrum clusters • 2. Clusters including identified and unidentified spectra. • 3. Clusters just containing unidentified spectra.
  • 44. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 1. Re-analysis of inconsistent clusters Figure: originally submitted identified spectra (peptides NMMAACDPR, IGGIGTVPVGR, PPECPDFDPPR, VFDEFKPLVEEPQNLIK) → spectrum clustering → consensus spectrum. No sequence has a proportion in the cluster >50%.
  • 45. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 1. Re-analysis of inconsistent clusters • Re-analysed 3,997 large (>100 spectra), inconsistent clusters with PepNovo, SpectraST, X!Tandem. • 453 clusters (11%) were identified as peptides originating from keratins, trypsin, albumin, and hemoglobin. • In these cases, it is likely that a contaminants DB was not used in the original search.
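Selecting the clusters for re-analysis can be sketched as a simple filter. The data layout and the names `top_ratio` and `select_for_reanalysis` are hypothetical, and the example clusters are made up. Only the two criteria come from the slides: more than 100 spectra, and no sequence with a proportion above 50%.

```python
from collections import Counter

def top_ratio(peptides):
    # Share of the cluster taken by its most frequent peptide sequence.
    return Counter(peptides).most_common(1)[0][1] / len(peptides)

def select_for_reanalysis(clusters, min_size=100, max_ratio=0.5):
    """Return IDs of large, inconsistent clusters.

    `clusters` maps a cluster ID to the peptide sequences assigned to
    its identified spectra (hypothetical representation).
    """
    return [
        cluster_id
        for cluster_id, peptides in clusters.items()
        if len(peptides) > min_size and top_ratio(peptides) <= max_ratio
    ]

clusters = {
    # 120 spectra, dominant sequence at 75%: consistent, not selected.
    "c1": ["NMMAACDPR"] * 90 + ["IGGIGTVPVGR"] * 30,
    # 120 spectra, dominant sequence at about 42%: selected.
    "c2": ["NMMAACDPR"] * 50 + ["IGGIGTVPVGR"] * 40 + ["PPECPDFDPPR"] * 30,
    # Inconsistent but only 4 spectra: too small, not selected.
    "c3": ["NMMAACDPR"] * 2 + ["IGGIGTVPVGR"] * 2,
}
selected = select_for_reanalysis(clusters)
```

The selected clusters would then be passed to the re-analysis tools named on the slide.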
  • 52. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 2. Inferring identifications for originally unidentified spectra • 9.1 M unidentified spectra were contained in clusters with a reliable identification. • These are candidate new identifications (that need to be confirmed), often missed due to search engine settings. • Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
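The idea of transferring a reliable cluster identification to its unidentified members can be sketched as follows. This is an illustration under stated assumptions, not the published method: the function name, the `(spectrum_id, peptide)` tuple layout, and computing the ratio over identified spectra only are all hypothetical choices. The default thresholds (at least 3 spectra, ratio above 70%) are the ones quoted for the 2016 iteration.

```python
from collections import Counter

def infer_identifications(cluster, min_size=3, min_ratio=0.7):
    """Suggest a peptide for each unidentified spectrum in a cluster.

    `cluster` is a list of (spectrum_id, peptide) tuples, with
    peptide set to None for unidentified spectra (hypothetical layout).
    Returns {spectrum_id: candidate_peptide}; empty if the cluster
    does not carry a reliable identification.
    """
    identified = [pep for _, pep in cluster if pep is not None]
    if len(cluster) < min_size or not identified:
        return {}
    top_sequence, count = Counter(identified).most_common(1)[0]
    if count / len(identified) <= min_ratio:
        return {}
    # Candidates only: they still need to be confirmed.
    return {sid: top_sequence for sid, pep in cluster if pep is None}

# Made-up cluster: three identified spectra, two unidentified.
cluster = [
    ("s1", "NMMAACDPR"),
    ("s2", "NMMAACDPR"),
    ("s3", "NMMAACDPR"),
    ("s4", None),
    ("s5", None),
]
candidates = infer_identifications(cluster)
```

As the slide notes, the result is a list of candidates rather than confirmed identifications.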
  • 53. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 3. Consistently unidentified clusters • 19 M clusters contain only unidentified spectra. • 41,155 of these clusters have more than 100 spectra (= 12 M spectra). • Most of them are likely to be derived from peptides. • They could correspond to PTMs or variant peptides. • With various methods, we found likely identifications for about 20%. • A vast amount of data mining remains to be done.
  • 54. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 3. Consistently unidentified clusters
  • 55. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 PRIDE Cluster as a Public Data Mining Resource • http://www.ebi.ac.uk/pride/cluster • Spectral libraries for 16 species. • All clustering results, as well as specific subsets of interest, are available. • Source code (open source) and Java API
  • 56. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 Other applications of spectrum clustering… • In individual datasets or small groups of “similar” proteomics datasets: • Can be used to target spectra that are “consistently” unidentified. • Unidentified spectra could represent PTMs or sequence variants. • Try “more-expensive” computational analysis methods (e.g. spectral searches, de novo). • When mixing identified and unidentified spectra from different experiments, if PTMs not considered initially are identified, one could modify the initial search parameters.
  • 57. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 Other applications of spectrum clustering… • Spectrum clustering can also be applied to MS/MS lipidomics studies
  • 58. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 Summary part 2 • Using a big data approach we are able to get extra knowledge from all the public data in PRIDE Archive. • Spectrum clustering enables QC in proteomics resources such as PRIDE Archive. • It is possible to detect spectra that are consistently unidentified across hundreds of datasets (maybe peptide variants, or peptides with PTMs not initially considered). • Spectrum clustering is applicable in the analysis of individual datasets (and not only for proteomics!).
  • 59. Juan A. Vizcaíno juan@ebi.ac.uk Colloquium Dortmund, 11 August 2016 Acknowledgements: People Attila Csordas Tobias Ternent Gerhard Mayer (de.NBI) Johannes Griss Yasset Perez-Riverol Manuel Bernal-Llinares Andrew Jarnuczak Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob Acknowledgements: The PRIDE Team All data submitters!