PRIDE: Quality control in a proteomics
data repository
Attila Csordas
Proteomics Services Team
Biocuration Conference
April 2nd, 2012



1/23
Overview

              who are we?

             what are we dealing with?

              manual curation and submission

              quick detour: ProteomeXchange

              automated curation & submission pipeline

              conclusion


       April 2, 2012
2/23
PRIDE: http://www.ebi.ac.uk/pride
The PRoteomics IDEntifications database is a centralised, primary, archival, public data repository for MS/MS proteomics data containing peptide ids, protein ids, mass spectra, protein expression values, metadata.




Acknowledgements
                 colleagues at the PRIDE team




                             @pride_ebi

                         pride-ebi@ebi.ac.uk
                         pride-support@ebi.ac.uk


       http://code.google.com/p/pride-toolsuite/
       http://code.google.com/p/pride-converter-2/


Mass spectrometry
analytical technique measuring the mass-to-charge (m/z) ratio of charged particles to determine the masses of particles, the composition of samples/molecules and the chemical structures of molecules
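
As a small worked illustration of the m/z ratio, the sketch below converts between a neutral mass and the m/z of its ion. It assumes positive-ion mode where each charge comes from one added proton, which is the usual case for peptides but is not stated on the slide; the function names are hypothetical.

```python
PROTON_MASS = 1.007276466  # mass of a proton in daltons


def mz_from_mass(neutral_mass: float, charge: int) -> float:
    """m/z of an ion formed by adding `charge` protons to a neutral molecule."""
    if charge <= 0:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


def mass_from_mz(mz: float, charge: int) -> float:
    """Neutral mass recovered from an observed m/z and an assumed charge."""
    if charge <= 0:
        raise ValueError("charge must be a positive integer")
    return mz * charge - charge * PROTON_MASS


if __name__ == "__main__":
    # The same 1000 Da molecule appears at different m/z for each charge state.
    for z in (1, 2, 3):
        print(z, round(mz_from_mass(1000.0, z), 4))
```

The same molecule shows up at a different m/z for every charge state, which is why a wrongly assigned charge leads to a wrong mass.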




Shotgun/bottom-up proteomics

[Figure: PROTOCOL diagram with the labels: proteins, peptides, MS analysis, fragmentation, MS/MS analysis, sequence database]



What is a PRIDE submission?




Growth of core data types

[Chart: three data series annotated with 130 million, 23 million and 4.6 million]




Manual curation and submission process

[Diagram: search engine output + spectra (Mascot (.dat), X!Tandem (.xml) + mgf) → PRIDE Converter → PRIDE XML]




PRIDE Inspector

initial assessment of data quality

visualise/check data

summary charts

support for submitters & reviewers/editors

more flexible than the web interface




Frequent Data Quality Issues

   1. syntactic problems, e.g. <SearchEngine>PeptideShaker</SearchEngine>, <PeptideItem>

   2a. core data missing: no protein/peptide identifications

   2b. or metadata missing: no species

   3. inconsistent/incorrect data: protein modifications
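
The first two classes of issue lend themselves to automated checks. The following is a minimal sketch, not the PRIDE team's validator: it tests well-formedness, the presence of `PeptideItem` elements, and the presence of a species annotation. The assumption that species is stored as a `cvParam` with `cvLabel="NEWT"` and the function name `check_submission` are illustrative; inconsistent data such as wrong modifications needs a semantic check and is not covered here.

```python
import xml.etree.ElementTree as ET
from typing import List


def check_submission(xml_text: str) -> List[str]:
    """Return human-readable problems found in a PRIDE-XML-like document."""
    # 1. Syntactic problems: the document must at least be well-formed XML.
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"syntactic problem: {err}"]

    problems = []

    # 2a. Core data missing: no peptide identifications at all.
    if root.find(".//PeptideItem") is None:
        problems.append("core data missing: no peptide identifications")

    # 2b. Metadata missing: no species.
    # Assumption: species is a cvParam carrying cvLabel="NEWT".
    has_species = any(p.get("cvLabel") == "NEWT" for p in root.iter("cvParam"))
    if not has_species:
        problems.append("metadata missing: no species")

    return problems


# Toy documents for illustration only; real PRIDE XML is far richer.
GOOD = (
    '<Experiment><cvParam cvLabel="NEWT" accession="9606" name="Homo sapiens"/>'
    "<PeptideItem><Sequence>PEPTIDEK</Sequence></PeptideItem></Experiment>"
)
UNCLOSED = "<Experiment><PeptideItem></Experiment>"
EMPTY = "<Experiment/>"

if __name__ == "__main__":
    for name, doc in (("good", GOOD), ("unclosed", UNCLOSED), ("empty", EMPTY)):
        print(name, check_submission(doc))
```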




Delta m/z of detected peptide precursors


delta m/z = experimental precursor ion m/z - theoretical precursor ion m/z

source of delta m/z outliers: incorrect or missing protein modifications and charge state misassignments
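
The definition above can be sketched in a few lines. The slide does not say how an outlier is defined, so the fixed tolerance of 0.1 m/z used here is purely an assumption, as are the class and function names and the toy m/z values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Precursor:
    peptide: str
    charge: int
    experimental_mz: float
    theoretical_mz: float

    @property
    def delta_mz(self) -> float:
        # Experimental minus theoretical precursor ion m/z.
        return self.experimental_mz - self.theoretical_mz


def outliers(precursors: List[Precursor], tolerance: float = 0.1) -> List[Precursor]:
    """Precursors whose absolute delta m/z exceeds the (assumed) tolerance."""
    return [p for p in precursors if abs(p.delta_mz) > tolerance]


def outlier_percentage(precursors: List[Precursor], tolerance: float = 0.1) -> float:
    """Share of outliers in percent; 0.0 for an empty list."""
    if not precursors:
        return 0.0
    return 100.0 * len(outliers(precursors, tolerance)) / len(precursors)


# Toy values for illustration only, not taken from a real submission.
EXAMPLE = [
    Precursor("PEPTIDEK", 2, 464.7400, 464.7398),
    Precursor("SAMPLER", 2, 402.7110, 402.7105),
    Precursor("MPEPTIDEK", 2, 538.2580, 530.2605),  # off by about 8 m/z
    Precursor("ANOTHERK", 3, 310.5001, 310.5000),
]

if __name__ == "__main__":
    for p in outliers(EXAMPLE):
        print(p.peptide, round(p.delta_mz, 4))
    print(outlier_percentage(EXAMPLE))
```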




Fixing modifications based on delta m/z outliers




Fixing modifications based on delta m/z outliers
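
One plausible way to turn an outlier into a candidate fix is to multiply the delta m/z by the charge and compare the resulting mass difference with known modification masses. The sketch below assumes this approach; the slides do not describe the actual procedure. The modification table is a small hand-picked subset of common monoisotopic mass shifts, and the tolerance and function name are hypothetical. It only handles the case of a modification missing from the annotation (a positive delta).

```python
from typing import Optional

# Monoisotopic mass shifts in daltons for a few common modifications.
KNOWN_MODIFICATIONS = {
    "Oxidation": 15.994915,
    "Acetyl": 42.010565,
    "Carbamidomethyl": 57.021464,
    "Phospho": 79.966331,
}


def suggest_modification(
    delta_mz: float, charge: int, tolerance: float = 0.01
) -> Optional[str]:
    """Name of the closest known modification within tolerance, or None."""
    delta_mass = delta_mz * charge  # convert the m/z difference to a mass difference
    best_name = None
    best_error = tolerance
    for name, mass in KNOWN_MODIFICATIONS.items():
        error = abs(delta_mass - mass)
        if error <= best_error:
            best_name, best_error = name, error
    return best_name


if __name__ == "__main__":
    # A doubly charged precursor that is about 8 m/z too heavy.
    print(suggest_modification(7.9975, 2))
```

A delta that matches no listed mass returns `None`, which would leave charge state misassignment or an unlisted modification as the remaining explanations.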




but the manual approach does not scale!




10 times as many & big submissions/day?




single point of submission of data to the main repositories to encourage data exchange

[Diagram labels: Published, Raw, Reprocessed; Individual submissions, Large-scale submissions; EBI PRIDE, Raw files archive, PeptideAtlas, UniProt, Other DBs (GPMDB, …); Users]



PX submission pipeline




[Diagram: PX Tool → Validation → Submission → Publication → Proteome Central; file labels: Files, Raw Files, PRIDE XML, Summary]




Automated regular submission pipeline
curation-submission time is ~1/6th of manual time

actionable curation summary

  number of files: 3
  Project: Combined personal saliva proteome and microbioproteome
  XML generator software: PRIDE Converter Toolsuite 2.0-SNAPSHOT

  Filename   Size    Species       #Proteins  #Peptides  #Spectra  #Unidentified spectra  PTMs  % delta m/z outlier
  22143.xml  3.3 GB  Homo sapiens  4128       60544      184209    123665                 3     0.0
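
A per-file summary row like the one above could be modelled as below. This is a sketch with hypothetical names and thresholds, not the pipeline's code. In the slide's example the unidentified count equals spectra minus peptides (184209 - 60544 = 123665); whether the pipeline derives it that way is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileSummary:
    filename: str
    size_gb: float
    species: str
    proteins: int
    peptides: int
    spectra: int
    ptms: int
    delta_mz_outlier_pct: float

    @property
    def unidentified_spectra(self) -> int:
        # Assumption: each peptide identification accounts for one spectrum.
        return self.spectra - self.peptides


def curation_flags(s: FileSummary, max_outlier_pct: float = 5.0) -> List[str]:
    """Issues a curator would act on; the 5% threshold is hypothetical."""
    flags = []
    if not s.species:
        flags.append("no species")
    if s.proteins == 0 or s.peptides == 0:
        flags.append("no protein/peptide identifications")
    if s.delta_mz_outlier_pct > max_outlier_pct:
        flags.append("many delta m/z outliers: check modifications and charge states")
    return flags


# The row shown on the slide.
ROW = FileSummary("22143.xml", 3.3, "Homo sapiens", 4128, 60544, 184209, 3, 0.0)

if __name__ == "__main__":
    print(ROW.unidentified_spectra, curation_flags(ROW))
```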




Conclusion

                growing amount of data


                increasingly complex data


                scalability issues


              overcoming them by automation
              and new, smarter curation strategies




Thanks for the attention!




acsordas@ebi.ac.uk
        Q&A                 @attilacsordas

