SlideShare ist ein Scribd-Unternehmen logo
1 von 22
Downloaden Sie, um offline zu lesen
Update to the PRIDE Cluster project
Dr. Juan Antonio VizcaĂ­no
Proteomics Team Leader
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
•PRIDE stores mass spectrometry (MS)-
based proteomics data:
•Peptide and protein expression data
(identification and quantification)
•Post-translational modifications
•Mass spectra (raw data and peak lists)
•Technical and biological metadata
•Any other related information
•Full support for tandem MS approaches
PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride/archive
Martens et al., Proteomics, 2005
VizcaĂ­no et al., NAR, 2016
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
PRIDE Cluster: Initial Motivation
• Provide a QC-filtered peptide-centric view of PRIDE.
• Data is stored in PRIDE Archive as originally analysed by the
submitters (no data reprocessing is done).
• Heterogeneous quality, difficult to make the data comparable.
• Enable assessment of (published) proteomics data. Pre-
requisite for data reuse (e.g. in UniProt).
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
PRIDE Cluster - Concept
Griss et al., Nat Methods, 2016
NMMAACDPR
NMMAACDPR
PPECPDFDPPR
NMMAACDPR
Consensus spectrum
PPECPDFDPPR
NMMAACDPR
NMMAACDPR
Threshold: At least 3 spectra in a
cluster and ratio >70%.
Originally submitted identified spectra
Spectrum
clustering
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
PRIDE Cluster: Second Implementation
• Griss et al., Nat. Methods, 2013
• Clustered all public, identified
spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two
calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
• EBI farm, LSF
• Griss et al., Nat. Methods, 2016
• Clustered all public spectra in
PRIDE by April 2015
• Apache Hadoop.
• Starting with 256 M spectra.
• 190 M unidentified spectra (they
were filtered to 111 M for spectra
that are likely to represent a
peptide).
• 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on 30 node
Hadoop cluster, 340 CPU cores
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
Parallelizing Spectrum Clustering: Hadoop
• Optimizes work distribution among machines.
• Hadoop is a (open source) Framework for parallelism using
the Map-Reduce algorithm by Google.
• Solves many general issues of large parallel jobs:
• Scheduling
• inter-job communication
• failure
https://hadoop.apache.org/
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
PRIDE Cluster Home page
http://www.ebi.ac.uk/pride/cluster/#/
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
PRIDE Cluster: result of searches
http://www.ebi.ac.uk/pride/cluster/#/
A couple of examples …
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
Examples: one perfect cluster
- 880 PSMs give the same peptide ID
- 4 species
- 28 datasets
- Same instruments
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
Examples: one perfect cluster (2)
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
Output of the analysis
• 1. Inconsistent spectrum clusters
• 2. Clusters including identified and unidentified spectra.
• 3. Clusters just containing unidentified spectra.
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
Output of the analysis
• 1. Inconsistent spectrum clusters
• 2. Clusters including identified and unidentified spectra.
• 3. Clusters just containing unidentified spectra.
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
2. Inferring identifications for originally unidentified spectra
13
• 9.1 M unidentified spectra were contained in clusters with a reliable
identification.
• These are candidate new identifications (that need to be confirmed),
often missed due to search engine settings
• Example: 49,263 reliable clusters (containing 560,000 identified and
130,000 unidentified spectra) contained phosphorylated peptides,
many of them from non-enriched studies.
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
Output of the analysis
• 1. Inconsistent spectrum clusters
• 2. Clusters including identified and unidentified spectra.
• 3. Clusters just containing unidentified spectra.
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
3. Consistently unidentified clusters
• 19 M clusters contain only unidentified spectra.
• 41,155 of these spectra have more than 100 spectra (= 12 M spectra).
• Most of them are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides.
• With various methods, we found likely identifications for about 20%.
• Vast amount of data mining remains to be done.
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
3. Consistently unidentified clusters
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
3. Consistently unidentified clusters
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
PRIDE Cluster as a Public Data Mining Resource
18
• http://www.ebi.ac.uk/pride/cluster
• Spectral libraries for 16 species.
• All clustering results, as well as specific subsets of interest available.
• Source code (open source) and Java API
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
Consistently unidentified clusters
• We provide the results split per species in MGF and mzML format.
• Very interested in getting people trying to work in those.
• Available for several species (Largest clusters at present).
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
Aknowledgements: People
Attila Csordas
Tobias Ternent
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Former team members,
especially Rui Wang, Florian
Reisinger, Noemi del Toro, Jose
A. Dianes & Henning Hermjakob
Acknowledgements: The PRIDE Team
All data submitters !!!
Juan A. VizcaĂ­no
juan@ebi.ac.uk
Bioinformatics Hub HUPO 2016
Taipei, September 2016
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Reuse of public proteomics data
Reuse of public proteomics dataReuse of public proteomics data
Reuse of public proteomics dataJuan Antonio Vizcaino
 
ProteomeXchange update HUPO 2016
ProteomeXchange update HUPO 2016ProteomeXchange update HUPO 2016
ProteomeXchange update HUPO 2016Juan Antonio Vizcaino
 
Presentation PRIDE Cluster HUPO 2014
Presentation PRIDE Cluster HUPO 2014Presentation PRIDE Cluster HUPO 2014
Presentation PRIDE Cluster HUPO 2014Juan Antonio Vizcaino
 
Improving the Management of Computational Models -- Invited talk at the EBI
Improving the Management of Computational Models -- Invited talk at the EBIImproving the Management of Computational Models -- Invited talk at the EBI
Improving the Management of Computational Models -- Invited talk at the EBIMartin Scharm
 
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...Dr. Haxel Consult
 
Triples for the People (Scientists):  Liberating biological knowledge with t...
Triples for the People (Scientists):   Liberating biological knowledge with t...Triples for the People (Scientists):   Liberating biological knowledge with t...
Triples for the People (Scientists):  Liberating biological knowledge with t...Michel Dumontier
 
Data volumes in proteomics data resources: PRIDE and ProteomeXchange
Data volumes in proteomics data resources: PRIDE and ProteomeXchangeData volumes in proteomics data resources: PRIDE and ProteomeXchange
Data volumes in proteomics data resources: PRIDE and ProteomeXchangeJuan Antonio Vizcaino
 
Reproducibility of model-based results: standards, infrastructure, and recogn...
Reproducibility of model-based results: standards, infrastructure, and recogn...Reproducibility of model-based results: standards, infrastructure, and recogn...
Reproducibility of model-based results: standards, infrastructure, and recogn...FAIRDOM
 
A Reuse-based Lightweight Method for Developing Linked Data Ontologies and Vo...
A Reuse-based Lightweight Method for Developing Linked Data Ontologies and Vo...A Reuse-based Lightweight Method for Developing Linked Data Ontologies and Vo...
A Reuse-based Lightweight Method for Developing Linked Data Ontologies and Vo...MarĂ­a Poveda VillalĂłn
 
Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...Valery Tkachenko
 
Mtsr2015 goble-keynote
Mtsr2015 goble-keynoteMtsr2015 goble-keynote
Mtsr2015 goble-keynoteCarole Goble
 
Reproducibility, Research Objects and Reality, Leiden 2016
Reproducibility, Research Objects and Reality, Leiden 2016Reproducibility, Research Objects and Reality, Leiden 2016
Reproducibility, Research Objects and Reality, Leiden 2016Carole Goble
 
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...Carole Goble
 

Was ist angesagt? (20)

Reuse of public proteomics data
Reuse of public proteomics dataReuse of public proteomics data
Reuse of public proteomics data
 
Proteomics data standards
Proteomics data standardsProteomics data standards
Proteomics data standards
 
ProteomeXchange update HUPO 2016
ProteomeXchange update HUPO 2016ProteomeXchange update HUPO 2016
ProteomeXchange update HUPO 2016
 
Presentation PRIDE Cluster HUPO 2014
Presentation PRIDE Cluster HUPO 2014Presentation PRIDE Cluster HUPO 2014
Presentation PRIDE Cluster HUPO 2014
 
Improving the Management of Computational Models -- Invited talk at the EBI
Improving the Management of Computational Models -- Invited talk at the EBIImproving the Management of Computational Models -- Invited talk at the EBI
Improving the Management of Computational Models -- Invited talk at the EBI
 
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
 
Triples for the People (Scientists):  Liberating biological knowledge with t...
Triples for the People (Scientists):   Liberating biological knowledge with t...Triples for the People (Scientists):   Liberating biological knowledge with t...
Triples for the People (Scientists):  Liberating biological knowledge with t...
 
Data volumes in proteomics data resources: PRIDE and ProteomeXchange
Data volumes in proteomics data resources: PRIDE and ProteomeXchangeData volumes in proteomics data resources: PRIDE and ProteomeXchange
Data volumes in proteomics data resources: PRIDE and ProteomeXchange
 
CWA & SWAT4LS Pitch at DILS2009
CWA & SWAT4LS Pitch at DILS2009CWA & SWAT4LS Pitch at DILS2009
CWA & SWAT4LS Pitch at DILS2009
 
Reproducibility of model-based results: standards, infrastructure, and recogn...
Reproducibility of model-based results: standards, infrastructure, and recogn...Reproducibility of model-based results: standards, infrastructure, and recogn...
Reproducibility of model-based results: standards, infrastructure, and recogn...
 
A Reuse-based Lightweight Method for Developing Linked Data Ontologies and Vo...
A Reuse-based Lightweight Method for Developing Linked Data Ontologies and Vo...A Reuse-based Lightweight Method for Developing Linked Data Ontologies and Vo...
A Reuse-based Lightweight Method for Developing Linked Data Ontologies and Vo...
 
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental scienceUS-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
 
Investigating Impact Metrics for Performance for the US-EPA National Center f...
Investigating Impact Metrics for Performance for the US-EPA National Center f...Investigating Impact Metrics for Performance for the US-EPA National Center f...
Investigating Impact Metrics for Performance for the US-EPA National Center f...
 
Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...
 
Mtsr2015 goble-keynote
Mtsr2015 goble-keynoteMtsr2015 goble-keynote
Mtsr2015 goble-keynote
 
Does bigger mean better in the world of chemistry databases?
Does bigger mean better in the world of chemistry databases? Does bigger mean better in the world of chemistry databases?
Does bigger mean better in the world of chemistry databases?
 
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
 
Reproducibility, Research Objects and Reality, Leiden 2016
Reproducibility, Research Objects and Reality, Leiden 2016Reproducibility, Research Objects and Reality, Leiden 2016
Reproducibility, Research Objects and Reality, Leiden 2016
 
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
 
Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...
 

Ähnlich wie Pride cluster presentation

Experiences to learn from the MS proteomics field
Experiences to learn from the MS proteomics fieldExperiences to learn from the MS proteomics field
Experiences to learn from the MS proteomics fieldJuan Antonio Vizcaino
 
PRIDE and ProteomeXchange: A golden age for working with public proteomics data
PRIDE and ProteomeXchange: A golden age for working with public proteomics dataPRIDE and ProteomeXchange: A golden age for working with public proteomics data
PRIDE and ProteomeXchange: A golden age for working with public proteomics dataJuan Antonio Vizcaino
 
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...Juan Antonio Vizcaino
 
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
A proteomics data “gold mine” at your disposal: Now that the data is there, w...A proteomics data “gold mine” at your disposal: Now that the data is there, w...
A proteomics data “gold mine” at your disposal: Now that the data is there, w...Juan Antonio Vizcaino
 
Reuse of public data in proteomics
Reuse of public data in proteomicsReuse of public data in proteomics
Reuse of public data in proteomicsJuan Antonio Vizcaino
 
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...Juan Antonio Vizcaino
 
The ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 updateThe ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 updateJuan Antonio Vizcaino
 
Is it feasible to identify novel biomarkers by mining public proteomics data?
Is it feasible to identify novel biomarkers by mining public proteomics data?Is it feasible to identify novel biomarkers by mining public proteomics data?
Is it feasible to identify novel biomarkers by mining public proteomics data?Juan Antonio Vizcaino
 
Public proteomics data: a (mostly unexploited) gold mine for computational re...
Public proteomics data: a (mostly unexploited) gold mine for computational re...Public proteomics data: a (mostly unexploited) gold mine for computational re...
Public proteomics data: a (mostly unexploited) gold mine for computational re...Juan Antonio Vizcaino
 
Reuse of public proteomics data
Reuse of public proteomics dataReuse of public proteomics data
Reuse of public proteomics dataJuan Antonio Vizcaino
 
PRIDE and ProteomeXchange: Training webinar
PRIDE and ProteomeXchange: Training webinarPRIDE and ProteomeXchange: Training webinar
PRIDE and ProteomeXchange: Training webinarJuan Antonio Vizcaino
 
Reusing and integrating public proteomics data to improve our knowledge of th...
Reusing and integrating public proteomics data to improve our knowledge of th...Reusing and integrating public proteomics data to improve our knowledge of th...
Reusing and integrating public proteomics data to improve our knowledge of th...Juan Antonio Vizcaino
 
Reproducibility (and the R*) of Science: motivations, challenges and trends
Reproducibility (and the R*) of Science: motivations, challenges and trendsReproducibility (and the R*) of Science: motivations, challenges and trends
Reproducibility (and the R*) of Science: motivations, challenges and trendsCarole Goble
 

Ähnlich wie Pride cluster presentation (20)

Pride Cluster 062016 Update
Pride Cluster 062016 UpdatePride Cluster 062016 Update
Pride Cluster 062016 Update
 
Proteomics data standards
Proteomics data standardsProteomics data standards
Proteomics data standards
 
Proteomexchange
ProteomexchangeProteomexchange
Proteomexchange
 
Experiences to learn from the MS proteomics field
Experiences to learn from the MS proteomics fieldExperiences to learn from the MS proteomics field
Experiences to learn from the MS proteomics field
 
PRIDE and ProteomeXchange: A golden age for working with public proteomics data
PRIDE and ProteomeXchange: A golden age for working with public proteomics dataPRIDE and ProteomeXchange: A golden age for working with public proteomics data
PRIDE and ProteomeXchange: A golden age for working with public proteomics data
 
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
 
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
A proteomics data “gold mine” at your disposal: Now that the data is there, w...A proteomics data “gold mine” at your disposal: Now that the data is there, w...
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
 
Reuse of public data in proteomics
Reuse of public data in proteomicsReuse of public data in proteomics
Reuse of public data in proteomics
 
Proteomics repositories
Proteomics repositoriesProteomics repositories
Proteomics repositories
 
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
 
The ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 updateThe ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 update
 
Is it feasible to identify novel biomarkers by mining public proteomics data?
Is it feasible to identify novel biomarkers by mining public proteomics data?Is it feasible to identify novel biomarkers by mining public proteomics data?
Is it feasible to identify novel biomarkers by mining public proteomics data?
 
PRIDE and ProteomeXchange
PRIDE and ProteomeXchangePRIDE and ProteomeXchange
PRIDE and ProteomeXchange
 
Public proteomics data: a (mostly unexploited) gold mine for computational re...
Public proteomics data: a (mostly unexploited) gold mine for computational re...Public proteomics data: a (mostly unexploited) gold mine for computational re...
Public proteomics data: a (mostly unexploited) gold mine for computational re...
 
Reuse of public proteomics data
Reuse of public proteomics dataReuse of public proteomics data
Reuse of public proteomics data
 
ProteomeXchange update
ProteomeXchange updateProteomeXchange update
ProteomeXchange update
 
PRIDE and ProteomeXchange: Training webinar
PRIDE and ProteomeXchange: Training webinarPRIDE and ProteomeXchange: Training webinar
PRIDE and ProteomeXchange: Training webinar
 
Reusing and integrating public proteomics data to improve our knowledge of th...
Reusing and integrating public proteomics data to improve our knowledge of th...Reusing and integrating public proteomics data to improve our knowledge of th...
Reusing and integrating public proteomics data to improve our knowledge of th...
 
Proteomics repositories
Proteomics repositoriesProteomics repositories
Proteomics repositories
 
Reproducibility (and the R*) of Science: motivations, challenges and trends
Reproducibility (and the R*) of Science: motivations, challenges and trendsReproducibility (and the R*) of Science: motivations, challenges and trends
Reproducibility (and the R*) of Science: motivations, challenges and trends
 

Mehr von Juan Antonio Vizcaino

Introduction to the PSI standard data formats
Introduction to the PSI standard data formatsIntroduction to the PSI standard data formats
Introduction to the PSI standard data formatsJuan Antonio Vizcaino
 
Reuse of public proteomics data
Reuse of public proteomics dataReuse of public proteomics data
Reuse of public proteomics dataJuan Antonio Vizcaino
 
PRIDE resources and ProteomeXchange
PRIDE resources and ProteomeXchangePRIDE resources and ProteomeXchange
PRIDE resources and ProteomeXchangeJuan Antonio Vizcaino
 
Introduction to the Proteomics Bioinformatics Course 2018
Introduction to the Proteomics Bioinformatics Course 2018Introduction to the Proteomics Bioinformatics Course 2018
Introduction to the Proteomics Bioinformatics Course 2018Juan Antonio Vizcaino
 
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...Juan Antonio Vizcaino
 
PSI-Proteome Informatics update
PSI-Proteome Informatics updatePSI-Proteome Informatics update
PSI-Proteome Informatics updateJuan Antonio Vizcaino
 
The ELIXIR Proteomics community
The ELIXIR Proteomics community The ELIXIR Proteomics community
The ELIXIR Proteomics community Juan Antonio Vizcaino
 
The ELIXIR Proteomics Community
The ELIXIR Proteomics CommunityThe ELIXIR Proteomics Community
The ELIXIR Proteomics CommunityJuan Antonio Vizcaino
 
How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?Juan Antonio Vizcaino
 
Introduction to the Proteomics Bioinformatics Course 2017
Introduction to the Proteomics Bioinformatics Course 2017Introduction to the Proteomics Bioinformatics Course 2017
Introduction to the Proteomics Bioinformatics Course 2017Juan Antonio Vizcaino
 
Enabling automated processing and analysis of large-scale proteomics data
Enabling automated processing and analysis of large-scale proteomics dataEnabling automated processing and analysis of large-scale proteomics data
Enabling automated processing and analysis of large-scale proteomics dataJuan Antonio Vizcaino
 
Introduction to EBI for Proteomics in ELIXIR
Introduction to EBI for Proteomics in ELIXIRIntroduction to EBI for Proteomics in ELIXIR
Introduction to EBI for Proteomics in ELIXIRJuan Antonio Vizcaino
 
The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)Juan Antonio Vizcaino
 
Introduction to the Proteomics Bioinformatics Course 2016
Introduction to the Proteomics Bioinformatics Course 2016Introduction to the Proteomics Bioinformatics Course 2016
Introduction to the Proteomics Bioinformatics Course 2016Juan Antonio Vizcaino
 

Mehr von Juan Antonio Vizcaino (17)

Introduction to the PSI standard data formats
Introduction to the PSI standard data formatsIntroduction to the PSI standard data formats
Introduction to the PSI standard data formats
 
Reuse of public proteomics data
Reuse of public proteomics dataReuse of public proteomics data
Reuse of public proteomics data
 
PRIDE resources and ProteomeXchange
PRIDE resources and ProteomeXchangePRIDE resources and ProteomeXchange
PRIDE resources and ProteomeXchange
 
Proteomics repositories
Proteomics repositoriesProteomics repositories
Proteomics repositories
 
Introduction to the Proteomics Bioinformatics Course 2018
Introduction to the Proteomics Bioinformatics Course 2018Introduction to the Proteomics Bioinformatics Course 2018
Introduction to the Proteomics Bioinformatics Course 2018
 
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
 
PSI-Proteome Informatics update
PSI-Proteome Informatics updatePSI-Proteome Informatics update
PSI-Proteome Informatics update
 
The ELIXIR Proteomics community
The ELIXIR Proteomics community The ELIXIR Proteomics community
The ELIXIR Proteomics community
 
The ELIXIR Proteomics Community
The ELIXIR Proteomics CommunityThe ELIXIR Proteomics Community
The ELIXIR Proteomics Community
 
How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?
 
Proteomics repositories
Proteomics repositoriesProteomics repositories
Proteomics repositories
 
Introduction to the Proteomics Bioinformatics Course 2017
Introduction to the Proteomics Bioinformatics Course 2017Introduction to the Proteomics Bioinformatics Course 2017
Introduction to the Proteomics Bioinformatics Course 2017
 
ProteomeXchange update 2017
ProteomeXchange update 2017ProteomeXchange update 2017
ProteomeXchange update 2017
 
Enabling automated processing and analysis of large-scale proteomics data
Enabling automated processing and analysis of large-scale proteomics dataEnabling automated processing and analysis of large-scale proteomics data
Enabling automated processing and analysis of large-scale proteomics data
 
Introduction to EBI for Proteomics in ELIXIR
Introduction to EBI for Proteomics in ELIXIRIntroduction to EBI for Proteomics in ELIXIR
Introduction to EBI for Proteomics in ELIXIR
 
The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)
 
Introduction to the Proteomics Bioinformatics Course 2016
Introduction to the Proteomics Bioinformatics Course 2016Introduction to the Proteomics Bioinformatics Course 2016
Introduction to the Proteomics Bioinformatics Course 2016
 

KĂźrzlich hochgeladen

Unit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oUnit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oManavSingh202607
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Servicenishacall1
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Joonhun Lee
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSĂŠrgio Sacani
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...chandars293
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSĂŠrgio Sacani
 
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)AkefAfaneh2
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Mohammad Khajehpour
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxFarihaAbdulRasheed
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .Poonam Aher Patil
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfSumit Kumar yadav
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learninglevieagacer
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformationAreesha Ahmad
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICEayushi9330
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...ssuser79fe74
 

KĂźrzlich hochgeladen (20)

Unit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oUnit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 o
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 

Pride cluster presentation

  • 1. Update to the PRIDE Cluster project Dr. Juan Antonio VizcaĂ­no Proteomics Team Leader EMBL-European Bioinformatics Institute Hinxton, Cambridge, UK
  • 2. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 •PRIDE stores mass spectrometry (MS)- based proteomics data: •Peptide and protein expression data (identification and quantification) •Post-translational modifications •Mass spectra (raw data and peak lists) •Technical and biological metadata •Any other related information •Full support for tandem MS approaches PRIDE (PRoteomics IDEntifications) database http://www.ebi.ac.uk/pride/archive Martens et al., Proteomics, 2005 VizcaĂ­no et al., NAR, 2016
  • 3. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 PRIDE Cluster: Initial Motivation • Provide a QC-filtered peptide-centric view of PRIDE. • Data is stored in PRIDE Archive as originally analysed by the submitters (no data reprocessing is done). • Heterogeneous quality, difficult to make the data comparable. • Enable assessment of (published) proteomics data. Pre- requisite for data reuse (e.g. in UniProt).
  • 4. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 PRIDE Cluster - Concept Griss et al., Nat Methods, 2016 NMMAACDPR NMMAACDPR PPECPDFDPPR NMMAACDPR Consensus spectrum PPECPDFDPPR NMMAACDPR NMMAACDPR Threshold: At least 3 spectra in a cluster and ratio >70%. Originally submitted identified spectra Spectrum clustering
  • 5. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 PRIDE Cluster: Second Implementation • Griss et al., Nat. Methods, 2013 • Clustered all public, identified spectra in PRIDE • EBI compute farm, LSF • 20.7 M identified spectra • 610 CPU days, two calendar weeks • Validation, calibration • Feedback into PRIDE datasets • EBI farm, LSF • Griss et al., Nat. Methods, 2016 • Clustered all public spectra in PRIDE by April 2015 • Apache Hadoop. • Starting with 256 M spectra. • 190 M unidentified spectra (they were filtered to 111 M for spectra that are likely to represent a peptide). • 66 M identified spectra • Result: 28 M clusters • 5 calendar days on 30 node Hadoop cluster, 340 CPU cores
  • 6. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 Parallelizing Spectrum Clustering: Hadoop • Optimizes work distribution among machines. • Hadoop is a (open source) Framework for parallelism using the Map-Reduce algorithm by Google. • Solves many general issues of large parallel jobs: • Scheduling • inter-job communication • failure https://hadoop.apache.org/
  • 7. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 PRIDE Cluster Home page http://www.ebi.ac.uk/pride/cluster/#/
  • 8. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 PRIDE Cluster: result of searches http://www.ebi.ac.uk/pride/cluster/#/ A couple of examples …
  • 9. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 Examples: one perfect cluster - 880 PSMs give the same peptide ID - 4 species - 28 datasets - Same instruments
  • 10. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 Examples: one perfect cluster (2)
  • 11. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 Output of the analysis • 1. Inconsistent spectrum clusters • 2. Clusters including identified and unidentified spectra. • 3. Clusters just containing unidentified spectra.
  • 12. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 Output of the analysis • 1. Inconsistent spectrum clusters • 2. Clusters including identified and unidentified spectra. • 3. Clusters just containing unidentified spectra.
  • 13. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 2. Inferring identifications for originally unidentified spectra 13 • 9.1 M unidentified spectra were contained in clusters with a reliable identification. • These are candidate new identifications (that need to be confirmed), often missed due to search engine settings • Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
  • 14. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 Output of the analysis • 1. Inconsistent spectrum clusters • 2. Clusters including identified and unidentified spectra. • 3. Clusters just containing unidentified spectra.
  • 15. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 3. Consistently unidentified clusters • 19 M clusters contain only unidentified spectra. • 41,155 of these spectra have more than 100 spectra (= 12 M spectra). • Most of them are likely to be derived from peptides. • They could correspond to PTMs or variant peptides. • With various methods, we found likely identifications for about 20%. • Vast amount of data mining remains to be done.
  • 16. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 3. Consistently unidentified clusters
  • 17. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 3. Consistently unidentified clusters
  • 18. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 PRIDE Cluster as a Public Data Mining Resource 18 • http://www.ebi.ac.uk/pride/cluster • Spectral libraries for 16 species. • All clustering results, as well as specific subsets of interest available. • Source code (open source) and Java API
  • 19. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 Consistently unidentified clusters • We provide the results split per species in MGF and mzML format. • Very interested in getting people trying to work in those. • Available for several species (Largest clusters at present).
  • 20. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016
  • 21. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 Aknowledgements: People Attila Csordas Tobias Ternent Gerhard Mayer (de.NBI) Johannes Griss Yasset Perez-Riverol Manuel Bernal-Llinares Andrew Jarnuczak Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob Acknowledgements: The PRIDE Team All data submitters !!!
  • 22. Juan A. VizcaĂ­no juan@ebi.ac.uk Bioinformatics Hub HUPO 2016 Taipei, September 2016 Questions?