You are free to:
Copy, share, adapt, or re-mix;
Photograph, film, or broadcast;

Blog, live-blog, or post video of;

This presentation. Provided that:
You attribute the work to its author and respect the rights
and licenses associated with its components.

Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero.
Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at;
http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites
Disclaimer
• I do not (and will not) profit in any way, shape
or form, from any of the brands, products or
companies I may mention in this
presentation.
Data availability and re‐usability in the
transition from microarray to next‐generation
sequencing: can we do better?
B.F. Francis Ouellette
• Senior Scientist & Associate Director, Informatics and
Biocomputing, Ontario Institute for Cancer Research,
Toronto, ON
• Associate Professor, Department of Cell and Systems
Biology, University of Toronto, Toronto, ON.

@bffo on
• Gabriella Rustici, Eleanor Williams, B.F. Francis Ouellette,
Alvis Brazma and the Functional Genomics Data Society
http://fged.org

• Alvis Brazma - EBI
• Roger Bumgarner - U of Washington
• Cesare Furlanello - FBK – MPBA
• Michael Miller - ISB
• Francis Ouellette - OICR
• John Quackenbush - Dana-Farber
• Michael Reich - Broad
• Gabriella Rustici - EBI
• Chris Stoeckert - U Penn
• Ronald Taylor - PNNL
• Steve Chervitz Trutane - Personalis
• Jennifer Weller - UNC
• Brian Wilhelm - IRIC
• Neil Winegarden - UHN
FGED’s mission:

To be a positive agent of
change in the effective
sharing and reproducibility
of functional genomic data
Poster # 142 (Friday)
fged.org
I come here wearing many hats!
• Officer of FGED
• Data submitter to a large international cancer
genomics initiative
• Receiving and curating data from that same
initiative from 67 cancer genome projects.
• Editor in an #openaccess journal where we are just
now rewriting the data submission policy to ensure
reproducibility
• Associate Editor of an #OA DATABASE journal
• Also on the SAB of Galaxy and Genomespace
What do we do with this?
FGED
(Functional Genomics Data Society)
was
MGED
(Microarray Gene Expression
Data Society)
we evaluated the replication of data analyses in 18 articles on
microarray-based gene expression profiling. (…) We reproduced
two analyses in principle and six partially or with some
discrepancies; ten could not be reproduced. The main reason
for failure to reproduce was data unavailability, and discrepancies
were mostly due to incomplete data annotation or specification of
data processing and analysis. Repeatability of published
microarray studies is apparently limited. More strict publication
rules enforcing public data availability and explicit description of
data processing and analysis should be considered.
Does it matter?
• Ioannidis et al. (2009) were not saying that
the papers were wrong.

• But there were problems
– missing data (38%)
– missing software, hardware details (50%)
– missing method, processing details (66%)
… forensic bioinformatics [was needed] to infer what
was done to obtain the results
- Keith Baggerly
Does it matter?
• In both cases the supporting data WERE deposited
in GEO or ArrayExpress
• Forensic bioinformatics was needed and more
often than not failed
• Maybe just depositing is not quite enough?
What was in MIAME?
1. The raw data
2. The final processed (normalised) data
3. The essential sample annotation and experimental
variables
4. Sample data relationships
5. Array annotation (e.g., probe oligonucleotide
sequences)
6. The laboratory and data processing protocols
Did it work? The glass half empty…
• Where were the hiccups? MIAME was asking too
much!
• However, some now say that MIAME is much too
little to ask! (e.g., publishing fully documented code
with instructions how to run it)
• What does ‘sufficient data processing
protocols’ mean?
• Even when data and protocols were deposited,
would the reviewers check these? Probably not
• So does it help at all?
Did it work? The glass half full …
• ArrayExpress and GEO have data from well
over 6 million high throughput assays from
some 30,000 functional genomics studies
• The MIAME compliance has been increasing
over time
• Many studies have shown the reusability of
these data
• We can have an informed discussion about the
reproducibility rather than forensics
Standards for content vs
standards for format
• Developing a usable format is challenging
– If it’s too ‘flexible’, too much free text, it’s no longer a
standard, no software can reasonably parse it
– If it’s too rigid, too granular, it can’t handle new types of
data, and people end up putting things in fields that don’t
work

• Human-readable formats are useful, but machine
readability is essential!
A simple human readable format for Functional
genomics experiment metadata
• Sample-Data Relationship File (SDRF)
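As a rough illustration of why a simple tab-delimited layout is both human and machine readable, here is a minimal sketch of reading an SDRF-style table with the Python standard library. The column headings follow MAGE-TAB naming conventions, but the sample names, file names and helper functions are invented for this example and are not taken from the talk.

```python
import csv
import io

# Hypothetical SDRF content: tab-delimited, one row per sample-to-data path.
# Headings follow MAGE-TAB naming conventions; all values are invented.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[treatment]\tArray Data File\n"
    "sample_1\tHomo sapiens\tcontrol\tsample_1.CEL\n"
    "sample_2\tHomo sapiens\tdrug\tsample_2.CEL\n"
)


def read_sdrf(text):
    """Parse SDRF text into a list of dicts keyed by column heading."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def files_by_factor(rows, factor_column, file_column):
    """Group data file names by the value of one experimental factor."""
    groups = {}
    for row in rows:
        groups.setdefault(row[factor_column], []).append(row[file_column])
    return groups


rows = read_sdrf(SDRF_TEXT)
groups = files_by_factor(rows, "Factor Value[treatment]", "Array Data File")
print(groups)
```

The same few lines would work on a real SDRF file opened from disk, which is the point of keeping the format simple: no dedicated parser is needed to recover which file belongs to which sample and condition.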
Lessons learned
• Keep it simple, keep it simple, keep it simple!
• Perils of designing standards by a committee vs
advantages of community agreement
• Successful formats are mostly defined by
successful software, e.g., GFF in UCSC GB or
Bioconductor’s gene_set
• The attraction and perils of perfection – the last few
steps of full automation cost most effort
– A human may be a cheap broker between two
pieces of software (again – Bioconductor example)
What does it mean for HTS?
• (RNASeq – ChIPSeq)
• The metadata for functional genomics HTS
experiments are not so different from microarray
experiments – replace CEL files with BAM files
MINSEQE - Minimum Information about a high-throughput Nucleotide SeQuencing Experiment
1. A general description of the aim of the experiment;
2. The submitter contact details;
3. Essential sample annotation and the experimental
factors;
4. An ‘experiment’ or ‘run’ date, which may be
important for identifying batch effects;
5. Sufficient information to correctly identify biological and
technical replicates;
6. Experimental and data processing protocols;
7. Raw sequencing reads location; and processed
data.
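A checklist like this can be checked mechanically once each item is mapped to a field. The sketch below assumes a made-up flat record layout: the field names, the example submission and the `missing_items` helper are all hypothetical, and item 7 is split into two fields for convenience.

```python
# Hypothetical field names, one per MINSEQE item listed above.
# Item 7 (raw reads location; processed data) is split into two fields.
MINSEQE_ITEMS = [
    "experiment_description",
    "submitter_contact",
    "sample_annotation_and_factors",
    "experiment_or_run_date",
    "replicate_information",
    "protocols",
    "raw_reads_location",
    "processed_data",
]


def missing_items(record):
    """Return the checklist items that are absent or empty in a record."""
    return [item for item in MINSEQE_ITEMS if not record.get(item)]


# Invented example submission with two gaps: no run date, no processed data.
submission = {
    "experiment_description": "RNA-Seq of treated vs control cells",
    "submitter_contact": "submitter@example.org",
    "sample_annotation_and_factors": "organism, cell type, treatment",
    "experiment_or_run_date": "",
    "replicate_information": "3 biological replicates per condition",
    "protocols": "library prep and alignment protocol text",
    "raw_reads_location": "archive accession (placeholder)",
}

print(missing_items(submission))
```

A check of this kind only confirms that a field is filled in. Whether a protocol description is actually sufficient still needs a human reader.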
Percentage of publications from 2012
containing new gene expression data

Data type     Number of PMIDs with new data    % of data in SRA/ArrayExpress/GEO
Microarray    347                              49
RNA-Seq       334                              61
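To put the percentages into absolute terms, here is a small worked calculation. It assumes the percentage column is the share of the listed publications whose data are in a public repository. Because the slide gives rounded percentages, the resulting counts are approximate.

```python
def approx_count(total, percent):
    """Approximate number of publications behind a rounded percentage."""
    return round(total * percent / 100)


# (number of PMIDs with new data, % with data in SRA/ArrayExpress/GEO)
studies = {"Microarray": (347, 49), "RNA-Seq": (334, 61)}

deposited = {name: approx_count(n, p) for name, (n, p) in studies.items()}
not_deposited = {name: n - deposited[name] for name, (n, _) in studies.items()}

print(deposited)
print(not_deposited)
```

Under that assumption, roughly 170 microarray and 204 RNA-Seq publications had deposited data, leaving about 177 and 130 without.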
Percentage of RNA-Seq studies
providing metadata (1/2)

Original database            ArrayExpress    GEO    SRA
Experimental description     95              100    100
Contact                      100             100    0
Sample & factor info         100             100    60
Experimental or run date     0               0      60
Percentage of RNA-Seq studies
providing metadata (2/2)

Original database                     ArrayExpress    GEO          SRA
Biological and tech replicates        Yes             Sometimes    Yes
Exp and data processing protocol      60              100          0
Raw reads                             100             100          100
Processed data                        35              90           0
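One way to read the two tables together is to list, for each database, the items that fall short of full coverage. The sketch below transcribes only the numeric rows from the tables above. The `gaps` helper and the 100% threshold are choices made for this illustration.

```python
# Percentages transcribed from the two tables above (numeric rows only).
METADATA = {
    "Experimental description": {"ArrayExpress": 95, "GEO": 100, "SRA": 100},
    "Contact": {"ArrayExpress": 100, "GEO": 100, "SRA": 0},
    "Sample & factor info": {"ArrayExpress": 100, "GEO": 100, "SRA": 60},
    "Experimental or run date": {"ArrayExpress": 0, "GEO": 0, "SRA": 60},
    "Exp and data processing protocol": {"ArrayExpress": 60, "GEO": 100, "SRA": 0},
    "Raw reads": {"ArrayExpress": 100, "GEO": 100, "SRA": 100},
    "Processed data": {"ArrayExpress": 35, "GEO": 90, "SRA": 0},
}


def gaps(database, threshold=100):
    """List the items for which a database is below the given percentage."""
    return [item for item, row in METADATA.items() if row[database] < threshold]


for db in ("ArrayExpress", "GEO", "SRA"):
    print(db, gaps(db))
```

On these numbers GEO falls short only on run date and processed data, while raw reads are the one item every database covers fully.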
Things we still need to do:
• Involve folks from NCBI
• Compare methods and metrics over time (2009-2012)
• Compare methods with ENCODE, ICGC, EGA and
the databases we presented here.
• Look for shared metadata and seek to mate what
is best and core to all.
• Make sure it aligns with large funders’ current
requirements.
• Share and publish this information
Take home messages
• Archiving just something is not the same as
making data available and useful – metadata,
analysis code, usable format, …
– Storing metadata doesn’t cost too much, extracting them
from data generators does!

• Minimising the human mediation in moving data
between the LIMS, archives and analysis tools is
a more realistic goal than eliminating it – the need for
brokerage
• The main source of variability in RNA-Seq
interpretation seems to be the alignments – we
don’t know how to do this well yet. Getting the
short reads for RNA-Seq is a beginning.
• FGED: The Functional Genomics Data Society is a
very open society, and we welcome feedback and
input!

– http://fged.org
– Twitter: @fged
Acknowledgements:
• Gabriella Rustici, Eleanor Williams, Alvis Brazma and
the Functional Genomics Data Society http://fged.org
• Alvis Brazma - EBI
• Roger Bumgarner - U of Washington
• Cesare Furlanello - FBK – MPBA
• Michael Miller - ISB
• Francis Ouellette - OICR
• John Quackenbush - Dana-Farber
• Michael Reich - Broad
• Gabriella Rustici - EBI
• Chris Stoeckert - U Penn
• Ronald Taylor - PNNL
• Steve Chervitz Trutane - Personalis
• Jennifer Weller - UNC
• Brian Wilhelm - IRIC
• Neil Winegarden - UHN
