Spectrum clustering of PRIDE MS/MS data
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-EBI
Hinxton, Cambridge, UK
Juan A. Vizcaíno
juan@ebi.ac.uk
Seminar
20 June 2016
Overview
• Brief introduction to PRIDE
• PRIDE Cluster: motivation and concept, first implementation
• PRIDE Cluster second implementation
PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
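ProteomeXchange: A global, distributed proteomics database
• Member repositories: PASSEL (SRM data), PRIDE (MS/MS data), MassIVE (MS/MS data)
• Data types: raw data (Raw), identification/quantification results (ID/Q), metadata (Meta)
• 155 datasets/month since July 2015
• Mandatory raw data deposition since July 2015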
Motivation
• Data are stored in PRIDE as originally analysed by the submitters (no data reprocessing is done)
• Quality is heterogeneous, which makes the data difficult to compare
• Enable assessment of (published) proteomics data
• Prerequisite for data reuse (e.g. in other bioinformatics resources such as UniProt)
PRIDE Cluster - Concept
Griss et al., Nat Methods, 2013
[Figure: originally submitted identified spectra (peptides NMMAACDPR and PPECPDFDPPR) are grouped into clusters, each represented by a consensus spectrum.]
Threshold: At least 10 spectra in a cluster and ratio >70%.
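The threshold above can be sketched in a few lines. This is only an illustration, assuming that "ratio" means the share of spectra in a cluster that carry the most common peptide sequence; the function name `is_reliable` and its parameters are hypothetical and are not taken from the PRIDE Cluster code.

```python
from collections import Counter


def is_reliable(peptides, min_size=10, min_ratio=0.7):
    """Return (reliable, top_peptide, ratio) for one cluster.

    peptides: one peptide sequence per identified spectrum in the cluster.
    Assumption: the ratio is the fraction of spectra with the most
    common sequence, and it must be strictly greater than min_ratio.
    """
    if not peptides:
        return False, None, 0.0
    top_peptide, top_count = Counter(peptides).most_common(1)[0]
    ratio = top_count / len(peptides)
    reliable = len(peptides) >= min_size and ratio > min_ratio
    return reliable, top_peptide, ratio


# Hypothetical cluster built from the two sequences shown on the slide.
cluster = ["NMMAACDPR"] * 8 + ["PPECPDFDPPR"] * 2
print(is_reliable(cluster))  # (True, 'NMMAACDPR', 0.8)
```

Keeping the size and ratio limits as parameters makes it easy to try other thresholds on the same data.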
PRIDE Cluster: Implementation
• Griss et al., Nat. Methods 2013
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
PRIDE Archive
• World-leading repository for MS/MS-based proteomics data
• Founding member of ProteomeXchange
PRIDE Cluster
[Figure: sequence-based search engines and spectrum clustering, applied to incorrectly identified or unidentified spectra.]
PRIDE Cluster: Second Implementation
• Griss et al., Nat. Methods 2016, in press
• Clustered all public spectra in PRIDE by April 2015
• Apache Hadoop
• Starting with 256 M spectra.
• 190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide).
• 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
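The figures above can be cross-checked with a little arithmetic. The sketch below assumes that the clustering input was the 111 M filtered unidentified spectra plus the 66 M identified spectra, and that all 340 cores were busy for the full 5 days; neither is stated on the slide, so the derived values are rough bounds only.

```python
MILLION = 1_000_000

unidentified = 190 * MILLION           # unidentified spectra before filtering
filtered_unidentified = 111 * MILLION  # kept as likely peptide spectra
identified = 66 * MILLION
clusters = 28 * MILLION

total_start = unidentified + identified              # matches the 256 M quoted
removed_by_filter = unidentified - filtered_unidentified
clustered_input = filtered_unidentified + identified  # assumption: actual input

# Assumption: every input spectrum ends up in exactly one cluster.
mean_cluster_size = clustered_input / clusters

# Upper bound on compute, assuming all cores ran for the whole period.
max_core_days = 5 * 340

print(f"start: {total_start / MILLION:.0f} M spectra")
print(f"removed by filter: {removed_by_filter / MILLION:.0f} M spectra")
print(f"clustered input: {clustered_input / MILLION:.0f} M spectra")
print(f"mean cluster size: {mean_cluster_size:.2f} spectra")
print(f"compute upper bound: {max_core_days} core-days")
```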
PRIDE Cluster - Concept
Griss et al., Nat Methods, 2016
[Figure: originally submitted identified spectra (peptides NMMAACDPR and PPECPDFDPPR) are grouped into clusters, each represented by a consensus spectrum.]
Threshold: At least 3 spectra in a cluster and ratio >70%.
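Clustering depends on deciding when two spectra are similar enough to be grouped. The slides do not state which similarity measure PRIDE Cluster uses, so the sketch below shows one common choice purely for illustration: a normalized dot product (cosine similarity) on peaks binned by m/z. The function names and the 1.0 m/z bin width are assumptions.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins.

    peaks: iterable of (mz, intensity) pairs.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Normalized dot product of two binned spectra, in the range 0 to 1."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical peak lists: two near-identical spectra and one unrelated one.
spectrum_1 = [(175.1, 40.0), (290.2, 100.0), (403.3, 65.0)]
spectrum_2 = [(175.2, 38.0), (290.1, 95.0), (403.2, 70.0)]
spectrum_3 = [(120.0, 80.0), (510.4, 100.0)]

print(cosine_similarity(spectrum_1, spectrum_2))  # close to 1
print(cosine_similarity(spectrum_1, spectrum_3))  # 0.0, no shared bins
```

Spectra whose similarity exceeds a chosen cut-off would be candidates for the same cluster; the cut-off itself is a tuning decision not covered here.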
Output of the analysis
• 1. Inconsistent spectrum clusters
• 2. Clusters including identified and unidentified spectra
• 3. Clusters just containing unidentified spectra
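The three output categories can be expressed as a small classifier. This is a sketch under stated assumptions: a cluster counts as inconsistent when no sequence exceeds 50% of its identified spectra (the criterion given for inconsistent clusters later in the talk), the inconsistency check takes priority over the mixed category, and the fourth label `identified_only` is added only so that every cluster gets a label. All names are hypothetical.

```python
from collections import Counter


def classify_cluster(identifications, inconsistent_at_or_below=0.5):
    """Assign one cluster to an output category.

    identifications: one entry per spectrum, either a peptide sequence
    or None for an unidentified spectrum.
    """
    identified = [p for p in identifications if p is not None]
    if not identified:
        return "unidentified_only"
    top_count = Counter(identified).most_common(1)[0][1]
    # Assumption: the proportion is computed over identified spectra only.
    if top_count / len(identified) <= inconsistent_at_or_below:
        return "inconsistent"
    if len(identified) < len(identifications):
        return "identified_and_unidentified"
    return "identified_only"


# Hypothetical cluster using sequences from the slides: 4 of 8 spectra
# share the top sequence, so no sequence exceeds 50%.
mixed_sequences = [
    "NMMAACDPR", "NMMAACDPR", "IGGIGTVPVGR", "NMMAACDPR",
    "PPECPDFDPPR", "VFDEFKPLVEEPQNLIK", "NMMAACDPR", "IGGIGTVPVGR",
]
print(classify_cluster(mixed_sequences))              # inconsistent
print(classify_cluster(["NMMAACDPR"] * 3 + [None]))   # identified_and_unidentified
print(classify_cluster([None, None, None]))           # unidentified_only
```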
1. Re-analysis of inconsistent clusters
[Figure: originally submitted identified spectra within one cluster carry several different sequences (NMMAACDPR, IGGIGTVPVGR, PPECPDFDPPR, VFDEFKPLVEEPQNLIK) and share one consensus spectrum.]
No sequence has a proportion in the cluster >50%
• Re-analysed 3,997 large (>100 spectra), inconsistent clusters with PepNovo, SpectraST, X!Tandem.
• 453 clusters (11%) were identified as peptides originating from keratins, trypsin, albumin, and hemoglobin.
• In these cases, it is likely that a contaminants database was not used in the original search.
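A contaminant check of this kind can be sketched as a keyword match on the protein description returned by the re-analysis. The records below are made up for illustration, and the keyword list simply mirrors the four protein families named above; a real workflow would more likely search against a dedicated contaminants database.

```python
CONTAMINANT_KEYWORDS = ("keratin", "trypsin", "albumin", "hemoglobin")


def is_contaminant(protein_description):
    """True if the description mentions one of the contaminant families."""
    text = protein_description.lower()
    return any(keyword in text for keyword in CONTAMINANT_KEYWORDS)


# Hypothetical re-analysis results: cluster ID -> protein description.
reanalysed = {
    "cluster-1": "Keratin, type II cytoskeletal 1",
    "cluster-2": "Serum albumin",
    "cluster-3": "Uncharacterized protein",
}

flagged = sorted(cid for cid, desc in reanalysed.items() if is_contaminant(desc))
print(flagged)  # ['cluster-1', 'cluster-2']

# Share reported on the slide: 453 of 3,997 re-analysed clusters.
share_percent = 100 * 453 / 3997
print(f"{share_percent:.1f}%")  # 11.3%
```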
Validation
2. Inferring identifications for originally unidentified spectra
• 9.1 M unidentified spectra were contained in clusters with a reliable identification.
• These are candidate new identifications (that need to be confirmed), often missed due to search engine settings.
• Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
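The inference step can be pictured as propagating a cluster's dominant peptide to its unidentified members. The sketch below assumes the thresholds of the second implementation (at least 3 spectra, ratio >70%), applies the size limit to all spectra in the cluster and the ratio to the identified ones only, and uses hypothetical names throughout. The results are candidates that would still need confirmation.

```python
from collections import Counter


def infer_identifications(spectra, min_size=3, min_ratio=0.7):
    """Suggest peptides for unidentified spectra in one cluster.

    spectra: list of (spectrum_id, peptide_or_None) pairs.
    Returns {spectrum_id: candidate_peptide} for unidentified spectra,
    or an empty dict if the cluster is not reliably identified.
    """
    identified = [pep for _, pep in spectra if pep is not None]
    if len(spectra) < min_size or not identified:
        return {}
    top_peptide, top_count = Counter(identified).most_common(1)[0]
    if top_count / len(identified) <= min_ratio:
        return {}
    return {sid: top_peptide for sid, pep in spectra if pep is None}


# Hypothetical cluster: three identified spectra and one unidentified.
cluster = [
    ("s1", "NMMAACDPR"),
    ("s2", "NMMAACDPR"),
    ("s3", "NMMAACDPR"),
    ("s4", None),
]
print(infer_identifications(cluster))  # {'s4': 'NMMAACDPR'}
```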
3. Consistently unidentified clusters
• 19 M clusters contain only unidentified spectra.
• 41,155 of these clusters contain more than 100 spectra (= 12 M spectra).
• Most are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides.
• With various methods, we found likely identifications for about 20%.
• A vast amount of data mining remains to be done.
PRIDE Cluster as a Public Data Mining Resource
• http://www.ebi.ac.uk/pride/cluster
• Spectral libraries for 16 species.
• All clustering results, as well as specific subsets of interest, are available.
• Source code (open source) and Java API
Applications of spectrum clustering…
• In individual datasets or small groups of “similar” datasets:
• Can be used to target spectra that are “consistently” unidentified.
• Unidentified spectra could represent PTMs or sequence variants.
• Try “more-expensive” computational analysis methods (e.g. spectral searches, de novo).
• When mixing identified and unidentified spectra from different experiments, if PTMs that were not found initially are identified, one could modify the initial search parameters.
• For quantification purposes, the alignment of different runs could be improved by clustering the spectra first?
Acknowledgements
Johannes Griss
Rui Wang
Yasset Perez-Riverol
Steve Lewis
Henning Hermjakob
OpenMS team (led by O. Kohlbacher)
David Tabb
The rest of the PRIDE team, especially Noemi del Toro and Jose A. Dianes
Questions?

Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 

Pride Cluster 062016 Update

  • 1. Spectrum clustering of PRIDE MS/MS data Dr. Juan Antonio Vizcaíno Proteomics Team Leader EMBL-EBI Hinxton, Cambridge, UK
  • 2. Juan A. Vizcaíno juan@ebi.ac.uk Seminar 20 June 2016 Overview • Brief introduction to PRIDE • PRIDE Cluster: motivation and concept, first implementation • PRIDE Cluster second implementation
  • 3. Overview • Brief introduction to PRIDE • PRIDE Cluster: motivation and concept, first implementation • PRIDE Cluster second implementation
  • 4. PRIDE (PRoteomics IDEntifications) database http://www.ebi.ac.uk/pride • PRIDE stores mass spectrometry (MS)-based proteomics data: • Peptide and protein expression data (identification and quantification) • Post-translational modifications • Mass spectra (raw data and peak lists) • Technical and biological metadata • Any other related information • Full support for tandem MS approaches Martens et al., Proteomics, 2005 Vizcaíno et al., NAR, 2016
  • 5. ProteomeXchange: A global, distributed proteomics database • PASSEL (SRM data) • PRIDE (MS/MS data) • MassIVE (MS/MS data) • Data types: raw data, identification/quantification (ID/Q), metadata • 155 datasets/month since July 2015 • Mandatory raw data deposition since July 2015
  • 6. Overview • Brief introduction to PRIDE • PRIDE Cluster: motivation and concept, first implementation • PRIDE Cluster second implementation
  • 7. Motivation • Data is stored in PRIDE as originally analysed by the submitters (no data reprocessing is done) • Heterogeneous quality, difficult to make the data comparable • Enable assessment of (published) proteomics data • Pre-requisite for data reuse (e.g. in other bioinformatics resources such as UniProt)
  • 8. PRIDE Cluster - Concept Griss et al., Nat Methods, 2013 • [Figure: originally submitted identified spectra (e.g. NMMAACDPR, PPECPDFDPPR) are clustered and merged into a consensus spectrum] • Threshold: At least 10 spectra in a cluster and ratio >70%.
  • 9. PRIDE Cluster - Concept
  • 10. PRIDE Cluster: Implementation • Griss et al, Nat. Methods 2013 • Clustered all public, identified spectra in PRIDE • EBI compute farm, LSF • 20.7 M identified spectra • 610 CPU days, two calendar weeks • Validation, calibration • Feedback into PRIDE datasets
  • 11. Overview • Brief introduction to PRIDE • PRIDE Cluster: motivation and concept, first implementation • PRIDE Cluster second implementation
  • 12. PRIDE Archive • World-leading repository for MS/MS-based proteomics data • Founding member of ProteomeXchange
  • 13. PRIDE Cluster • Sequence-based search engines • Spectrum clustering • Incorrectly or unidentified spectra
  • 14. PRIDE Cluster: Second Implementation • First implementation (Griss et al, Nat. Methods 2013): • Clustered all public, identified spectra in PRIDE • EBI compute farm, LSF • 20.7 M identified spectra • 610 CPU days, two calendar weeks • Validation, calibration • Feedback into PRIDE datasets • Second implementation (Griss et al, Nat. Methods 2016, in press): • Clustered all public spectra in PRIDE by April 2015 • Apache Hadoop • Starting with 256 M spectra: • 190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide) • 66 M identified spectra • Result: 28 M clusters • 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
  • 15. PRIDE Cluster - Concept Griss et al., Nat Methods, 2016 • [Figure: originally submitted identified spectra (e.g. NMMAACDPR, PPECPDFDPPR) are clustered and merged into a consensus spectrum] • Threshold: At least 3 spectra in a cluster and ratio >70%.
  • 16. PRIDE Cluster - Concept
  • 17. PRIDE Cluster - Concept Griss et al., Nat Methods, 2016 • [Figure: originally submitted identified spectra (e.g. NMMAACDPR, PPECPDFDPPR) are clustered and merged into a consensus spectrum] • Threshold: At least 3 spectra in a cluster and ratio >70%.
  • 18. PRIDE Cluster - Concept
  • 19. Output of the analysis • 1. Inconsistent spectrum clusters • 2. Clusters including both identified and unidentified spectra • 3. Clusters containing only unidentified spectra
  • 20. 1. Re-analysis of inconsistent clusters • [Figure: a cluster of originally submitted identified spectra with conflicting sequences (NMMAACDPR, IGGIGTVPVGR, PPECPDFDPPR, VFDEFKPLVEEPQNLIK) and its consensus spectrum] • No sequence has a proportion in the cluster >50%
  • 21. 1. Re-analysis of inconsistent clusters • Re-analysed 3,997 large (>100 spectra), inconsistent clusters with PepNovo, SpectraST and X!Tandem. • 453 clusters (11%) were identified as peptides originating from keratins, trypsin, albumin and hemoglobin. • In these cases, it is likely that a contaminants database was not used in the original search.
  • 28. 2. Inferring identifications for originally unidentified spectra • 9.1 M unidentified spectra were contained in clusters with a reliable identification. • These are candidate new identifications (that need to be confirmed), often missed due to search engine settings. • Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
  • 29. 3. Consistently unidentified clusters • 19 M clusters contain only unidentified spectra. • 41,155 of these clusters have more than 100 spectra (= 12 M spectra). • Most are likely to be derived from peptides. • They could correspond to PTMs or variant peptides. • With various methods, we found likely identifications for about 20%. • A vast amount of data mining remains to be done.
  • 30. 3. Consistently unidentified clusters
  • 31. PRIDE Cluster as a Public Data Mining Resource • http://www.ebi.ac.uk/pride/cluster • Spectral libraries for 16 species. • All clustering results, as well as specific subsets of interest available. • Source code (open source) and Java API
  • 33. Applications of spectrum clustering… • In individual datasets or small groups of “similar” datasets: • Can be used to target spectra that are “consistently” unidentified. • Unidentified spectra could represent PTMs or sequence variants. • Try “more-expensive” computational analysis methods (e.g. spectral searches, de novo). • When mixing identified and unidentified spectra from different experiments, if PTMs that were not found initially are identified, one could modify the initial search parameters. • For quantification purposes, could the alignment of different runs be improved by clustering the spectra first?
  • 34. Acknowledgements • Johannes Griss • Rui Wang • Yasset Perez-Riverol • Steve Lewis • Henning Hermjakob • Open MS team (led by O. Kohlbacher) • David Tabb • The rest of the PRIDE team, especially Noemi del Toro and Jose A. Dianes
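
The thresholds quoted on the concept and output slides (at least 3 spectra and ratio >70% for a reliable cluster; no sequence above 50% for an inconsistent one) can be illustrated with a minimal Python sketch. This is not the PRIDE Cluster code. The function name, the "other" category, and the definition of the ratio as the share of the most common peptide among the identified spectra in a cluster are assumptions made for illustration only.

```python
from collections import Counter
from typing import List, Optional, Tuple


def classify_cluster(
    peptides: List[Optional[str]],
    min_size: int = 3,                 # "at least 3 spectra" (second implementation)
    reliable_ratio: float = 0.70,      # "ratio >70%"
    inconsistent_ratio: float = 0.50,  # "no sequence has a proportion >50%"
) -> Tuple[str, Optional[str], int]:
    """Classify one cluster given the peptide assigned to each spectrum.

    `None` marks an unidentified spectrum. Returns a tuple of
    (category, candidate peptide, number of unidentified spectra that
    would inherit the candidate identification).

    Assumption: the ratio is the count of the most common peptide divided
    by the number of identified spectra in the cluster.
    """
    identified = [p for p in peptides if p is not None]
    n_unidentified = len(peptides) - len(identified)

    # Output 3: clusters containing only unidentified spectra.
    if not identified:
        return ("unidentified", None, 0)

    top_peptide, top_count = Counter(identified).most_common(1)[0]
    ratio = top_count / len(identified)

    # Reliable cluster: unidentified members become candidate
    # identifications that still need to be confirmed (output 2).
    if len(peptides) >= min_size and ratio > reliable_ratio:
        return ("reliable", top_peptide, n_unidentified)

    # Output 1: no single sequence exceeds 50% of the identifications.
    if ratio <= inconsistent_ratio:
        return ("inconsistent", None, 0)

    # Hypothetical catch-all for clusters between the two thresholds.
    return ("other", None, 0)


if __name__ == "__main__":
    # Peptide sequences taken from the slide figures; the cluster
    # compositions themselves are made up for this example.
    print(classify_cluster(["NMMAACDPR"] * 3 + [None]))
    print(classify_cluster(["NMMAACDPR", "IGGIGTVPVGR",
                            "PPECPDFDPPR", "VFDEFKPLVEEPQNLIK"]))
    print(classify_cluster([None, None, None]))
```

Keeping the thresholds as parameters makes it easy to compare the first implementation's cut-off (at least 10 spectra) with the second one's (at least 3 spectra) on the same data.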