Provenance of Microarray Experiments for a
Better Understanding of Experiment Results
Helena F. Deus, University of Texas
Jun Zhao, University of Oxford
Satya Sahoo, Wright State University
Matthias Samwald, DERI, Galway
Eric Prud’hommeaux, W3C
Michael Miller, Tantric Designs
M. Scott Marshall, Leiden University Medical Center
Kei-Hoi Cheung, Yale University
Outline
• Background: microarrays, gene expression, and why provenance is important for experimental biomedical data
• Objectives
• Data: microarray workflow and gene results
• The provenance model
• Demo
• Future work
• Summary
Introduction
• High-throughput experiments, such as microarray technologies, have revolutionized the way we study disease and basic biology.
• Microarray experiments allow scientists to quantify thousands of genomic features in a single experiment.
[Figure: Affymetrix microarray gene chips; genes can be used as biomarkers for disease. Source: http://www.scq.ubc.ca/]
Introduction
• Since 1997, the number of published results based on an analysis of gene expression microarray data has grown from 30 to over 5,000 publications per year.
• Microarray data repositories and standards exist, but they lack provenance and interoperable data access.
Source: YJBM (2007) 80(4):165-78
Introduction Cont.
• A pilot study of the W3C HCLS BioRDF task force
• Bottom-up approach
• Use microarray experiments for Alzheimer’s disease as the test bed
• Aggregate results across microarray experiments
• Combine different types of data
Objectives
• To facilitate a better understanding of microarray gene results
  ◦ Efficiently query gene results
  ◦ Efficiently combine existing life science datasets
• To transform microarray gene results into Semantic Web format
• To encode provenance information about these gene results in the same format as the data itself
Microarray Workflow
Biological question
Differentially expressed genes
Sample gathering etc.
Experiment design
Microarray experiment
Image analysis
Normalization
Estimation, Clustering, Discrimination, T-test, …
Data extraction
Data analysis and modeling
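The data analysis step of the workflow can be illustrated with a minimal sketch, assuming already-normalized, linear-scale intensities for a single gene. The function names and the intensity values are hypothetical and only show how a fold change and a t statistic might be derived; the slides do not say which software or statistical variant the original studies used, so Welch's form of the t-test is an assumption here.

```python
import math
import statistics


def log2_fold_change(disease, control):
    """Log2 ratio of mean expression in disease samples vs control samples."""
    return math.log2(statistics.mean(disease) / statistics.mean(control))


def welch_t(disease, control):
    """Welch's t statistic for two independent samples (unequal variances)."""
    var_d = statistics.variance(disease)  # sample variance (n - 1 denominator)
    var_c = statistics.variance(control)
    standard_error = math.sqrt(var_d / len(disease) + var_c / len(control))
    return (statistics.mean(disease) - statistics.mean(control)) / standard_error


# Hypothetical normalized intensities for one gene (illustrative values only).
disease_samples = [4.0, 6.0, 8.0, 6.0]
control_samples = [2.0, 3.0, 4.0, 3.0]

fc = log2_fold_change(disease_samples, control_samples)
t = welch_t(disease_samples, control_samples)
print(f"log2 fold change = {fc:.2f}, t = {t:.2f}")
```

Turning the t statistic into a p-value needs the t distribution, which is not in the Python standard library, so that step is left out of the sketch.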
An Example of differentially expressed genes
An Example of gene lists from different studies
What microarray experiments analyze samples taken from the entorhinal cortex region of Alzheimer's patients?
What genes are overexpressed in the entorhinal cortex region and what is their expression fold change and associated p-value?
What other diseases may be associated with the same genes found to be linked to AD?
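The second question above can be sketched as a pattern match over subject-predicate-object triples. This is only an illustration in plain Python, not the SPARQL used in the demo: every `ex:` term, the gene identifiers, and the numbers are hypothetical placeholders rather than terms from the actual vocabulary, and "overexpressed" is assumed to mean a fold change greater than 1.

```python
# Triples as (subject, predicate, object). All ex: terms are hypothetical.
TRIPLES = [
    ("ex:genelist1", "ex:from_region", "ex:entorhinal_cortex"),
    ("ex:genelist1", "ex:has_entry", "ex:entry1"),
    ("ex:genelist1", "ex:has_entry", "ex:entry2"),
    ("ex:entry1", "ex:gene", "ex:GENE_A"),
    ("ex:entry1", "ex:fold_change", 2.5),
    ("ex:entry1", "ex:p_value", 0.001),
    ("ex:entry2", "ex:gene", "ex:GENE_B"),
    ("ex:entry2", "ex:fold_change", 0.4),
    ("ex:entry2", "ex:p_value", 0.02),
    ("ex:genelist2", "ex:from_region", "ex:hippocampus"),
    ("ex:genelist2", "ex:has_entry", "ex:entry3"),
    ("ex:entry3", "ex:gene", "ex:GENE_C"),
    ("ex:entry3", "ex:fold_change", 3.0),
    ("ex:entry3", "ex:p_value", 0.01),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def value(triples, s, p):
    """Return the object of the first triple with the given subject and predicate."""
    return match(triples, s=s, p=p)[0][2]


def overexpressed_in(triples, region):
    """List (gene, fold change, p-value) for entries with fold change > 1 in a region."""
    rows = []
    for genelist, _, _ in match(triples, p="ex:from_region", o=region):
        for _, _, entry in match(triples, s=genelist, p="ex:has_entry"):
            fold_change = value(triples, entry, "ex:fold_change")
            if fold_change > 1.0:  # assumed definition of "overexpressed"
                rows.append((
                    value(triples, entry, "ex:gene"),
                    fold_change,
                    value(triples, entry, "ex:p_value"),
                ))
    return rows


print(overexpressed_in(TRIPLES, "ex:entorhinal_cortex"))
```

Because the region is stored as a triple alongside the expression values, the provenance constraint and the data constraint are answered by the same matching mechanism.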
A Bottom-up Approach
• Separate concerns/perspectives
  ◦ Too many existing vocabularies to choose from
  ◦ Lack of standardization among existing provenance vocabularies
  ◦ Lack of a clear understanding of what needs to be captured
• Process
  ◦ Identify user query
  ◦ Define terms
  ◦ Test the query using test data
A Bottom-up Approach
Provenance models
Workflow, experimental design
Domain ontologies (DO, GO…)
Community models
Raw Data
Results
Questions
• Which genes are markers for neurodegenerative diseases?
• Was gene ALG2 differentially expressed in multiple experiments?
Provenance of Microarray experiment
• What software was used to analyse the data?
• How can the experiment be replicated?
The Provenance Data Model: Four Types of Provenance
http://purl.org/net/biordfmicroarray/ns#
RDF genelist representation
• Institutional level: metadata associated with each genelist, such as the laboratory where the experiments were performed or the reference to the genelist.
• Experimental context level: experimental protocols, such as the region of the brain and the disease (terms were partially mapped to MGED, DO and NIF).
• Data analysis and significance: the statistical analysis methodology used to select the relevant genes.
• Dataset descriptions: the version of a source dataset and who published it. The Vocabulary of Interlinked Datasets (VoID) and Dublin Core terms (dct) were used.
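The four provenance levels can be pictured as four groups of statements attached to one genelist that all flatten into the same triple form as the data. The sketch below is a simplification under stated assumptions: the class, its field names, and all `ex:` terms are hypothetical; `void:inDataset` and `dct:publisher` are real terms from VoID and Dublin Core, but whether the authors used these particular properties, or attached them directly to the genelist, is not stated in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class GeneListProvenance:
    """Hypothetical container: one dict of predicate -> object per provenance level."""
    genelist: str
    institutional: dict = field(default_factory=dict)
    experimental_context: dict = field(default_factory=dict)
    analysis: dict = field(default_factory=dict)
    dataset: dict = field(default_factory=dict)

    def to_triples(self):
        """Flatten all four levels into (subject, predicate, object) triples."""
        levels = (
            self.institutional,
            self.experimental_context,
            self.analysis,
            self.dataset,
        )
        return [
            (self.genelist, predicate, obj)
            for level in levels
            for predicate, obj in level.items()
        ]


# Illustrative values only; none of these identifiers come from the original data.
record = GeneListProvenance(
    genelist="ex:genelist1",
    institutional={"ex:performed_at": "ex:some_laboratory"},
    experimental_context={
        "ex:brain_region": "ex:entorhinal_cortex",
        "ex:disease": "ex:alzheimers_disease",
    },
    analysis={"ex:statistical_method": "ex:t_test"},
    dataset={
        "void:inDataset": "ex:dataset_v1",
        "dct:publisher": "ex:some_publisher",
    },
)

triples = record.to_triples()
for triple in triples:
    print(triple)
```

Keeping the levels as separate groups while emitting one uniform triple list mirrors the idea that the provenance types are perspectives on the same data rather than separate stores.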
Provenance types are perspectives on the data
Query federation with Diseasome
Is there a gene network for AD?
Source: PNAS 104:21, 8685 (2007)
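Conceptually, federating with a disease-gene resource amounts to joining two datasets on a shared gene identifier. The sketch below shows that join in plain Python under stated assumptions: the gene and disease identifiers, the fold-change values, and the function name are hypothetical, and the real demo would issue a federated SPARQL query rather than use in-memory structures.

```python
# Genes linked to AD in the microarray results: gene -> fold change (hypothetical).
AD_GENES = {"ex:GENE_A": 2.5, "ex:GENE_B": 1.8}

# Gene-disease associations from an external dataset (hypothetical).
DISEASE_GENES = [
    ("ex:GENE_A", "ex:disease_X"),
    ("ex:GENE_A", "ex:alzheimers_disease"),
    ("ex:GENE_C", "ex:disease_Y"),
]


def other_diseases(ad_genes, associations, exclude):
    """Map each AD-linked gene to the other diseases it is associated with."""
    result = {}
    for gene, disease in associations:
        if gene in ad_genes and disease != exclude:
            result.setdefault(gene, set()).add(disease)
    return result


linked = other_diseases(AD_GENES, DISEASE_GENES, exclude="ex:alzheimers_disease")
print(linked)
```

The join only works if both datasets identify genes the same way, which is one reason the shared RDF representation matters.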
Demo
• Go to http://purl.org/net/biordfmicroarray/demo
Conclusions
• Levels of provenance: 1) institutional; 2) experimental context; 3) statistical analysis and significance; 4) dataset description
• Provenance as RDF: SPARQL queries can express constraints on both the origins and the context of the data
• Data model is driven by the biological question: a bottom-up approach shields the model from rapidly evolving ontologies while still enabling links to widely used ontologies
• Mapping is facilitated: mapping to existing provenance vocabularies, such as OPM, PML and Provenir, is facilitated by:
  ◦ biordf:has_input_value, which can be made a sub-property of the inverse of the OPM property used
  ◦ biordf:derives_from_region, which can become a sub-property of the OPM property wasDerivedFrom
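The sub-property mapping mentioned above can be sketched as a small inference step: every statement made with a local property also holds for its declared parent property. In this minimal illustration the `opm:` prefix, the `ex:` subject and object, and the function name are assumptions, and each property is limited to a single parent for brevity; a real RDF store would derive this from rdfs:subPropertyOf declarations.

```python
# Declared mapping: local property -> parent property (opm: prefix is assumed).
SUBPROPERTY_OF = {"biordf:derives_from_region": "opm:wasDerivedFrom"}

# Hypothetical statement using the local property.
DATA = [("ex:sample1", "biordf:derives_from_region", "ex:entorhinal_cortex")]


def infer_superproperties(triples, subproperty_of):
    """Add (s, parent, o) for every (s, p, o) where p has a declared parent."""
    inferred = set(triples)
    frontier = list(triples)
    while frontier:
        s, p, o = frontier.pop()
        parent = subproperty_of.get(p)
        if parent is not None:
            new_triple = (s, parent, o)
            if new_triple not in inferred:  # also guards against mapping cycles
                inferred.add(new_triple)
                frontier.append(new_triple)
    return inferred


closure = infer_superproperties(DATA, SUBPROPERTY_OF)
for triple in sorted(closure):
    print(triple)
```

With this in place, a query written against the parent property also finds data recorded with the local term, so the local vocabulary can stay stable while remaining linked to the external model.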
Summary and Future Work
• Provenance modeling in a semantic web application
  ◦ Query genes gathered from specific samples, in a given condition or from given organizations
  ◦ Query genes produced through a particular statistical analysis process
  ◦ Query for information about genes from the most recent dataset
• The bottom-up approach
  ◦ Separate concerns
  ◦ Create a minimum set of terms required for the motivating queries
• Future work
  ◦ To integrate our model with provenance information generated in scientific workflow workbenches
  ◦ To integrate provenance information as part of the Excel spreadsheets where most biologists report their results
Acknowledgement
• W3C BioRDF group
  ◦ Kei Cheung, Michael Miller, M. Scott Marshall, Eric Prud’hommeaux, Satya Sahoo, Matthias Samwald
• The HCLS IG, as well as Helen Parkinson, James Malone, Misha Kapushesky and Jonas Almeida


Editor's notes

  1. A revolution in biology and biomedicine has taken place in the past 10 years with the invention of microarray technologies. A microarray is a small chip (point to figure on the left) where fragments of every known gene have been mapped to precise positions on the chip. Microarrays make it possible to measure how much of each of those genes is being expressed in cells. This technique is very often used, for example, to determine which genes may be the cause of specific diseases (point to figure on the right). A simplistic view of performing a microarray experiment involves collecting sample cells from both the experimental case (for example, a disease sample) and a control. Inside the cell, DNA is transcribed into mRNA, which is then translated into whatever biological product is causing or preventing the disease. mRNA molecules are complementary to the DNA that is bound to the array. To measure gene expression, the mRNA molecules are first labelled with fluorescent dyes and deposited either on a single array (represented in the figure) or on separate arrays for control vs test. The mRNA naturally binds to the DNA on the array and, because the dyes are fluorescent, each spot emits light according to the amount of mRNA present in the cell. As a result, it is possible to compare how strongly each gene is expressed in the disease sample relative to the normal sample, simply based on the brightness of each spot on the array.
  2. Microarray technologies have been so successful in predicting disease that they have become standard practice in most biomedical studies. This has resulted in over 5,000 results being published each year. Many of these datasets are made publicly available in repositories such as EBI’s ArrayExpress or NCBI’s Gene Expression Omnibus. Beyond studying a single disease, the data deposited in these repositories is often used, for example, for translational studies where individual genes are studied across multiple experiments in an attempt to understand the underlying molecular biology. Although some standards such as MIAME exist for reporting the results of microarray experiments, there is still a lack of interoperable data access.
  3. This work has resulted from ongoing efforts at the W3C HCLS BioRDF task force to understand neurodegenerative diseases. For our pilot study, we have taken a bottom-up approach to understand which genes affect Alzheimer’s disease (AD) patients and to what extent those results agree across many experiments. For example, one characteristic of AD is the formation of intracellular neurofibrillary tangles that affect neurons in brain regions involved in memory function. It is therefore important to have metadata such as the cell type(s), cell histopathology, and brain region(s) for comparing and integrating the results across different AD microarray experiments. It is also important to consider the (raw) data source and the types of analysis performed on the data to arrive at meaningful interpretations. Finally, gene expression data may be combined with other types of data, including genomic functions, pathways, and associated diseases, to broaden the spectrum of integrative data analysis.
  4. Separate concerns of interest; understanding the need for provenance vs. the steep learning curve of existing ontologies; local, stable terminologies vs. less stable, external vocabularies; identify the minimum set of terms required for users’ queries.
  5. In order to explicitly represent microarray experiments in Semantic Web formats, we first needed to understand the microarray workflow and how an experiment translates into results. A microarray experiment usually starts with a question. For example, are there genes that affect the incidence of Alzheimer’s disease? Why are certain areas of the brain more affected by neurofibrillary tangles, and could there be a list of genes responsible for that? According to the question, the experiment is designed to collect samples from specific areas of the brain from disease and normal subjects. These samples are then used in the microarray experiment, which is only the first step of data processing. The microarrays are scanned, resulting in an image that has to be analysed to extract the intensity signals. Before they can be used, the raw results need to be normalized in order to reduce the noise. Finally, a variety of statistical techniques and methods are used to trim the list of tens of thousands of genes down to only a few highly significant ones that may be responsible for the disease.
  6. The result of a microarray experiment is a list of differentially expressed genes. The standard way of reporting these results is a heatmap with the samples as columns and the genes as rows. These results are normally reported as a table in a PDF or, if we are lucky, the experimentalists provide the list of differentially expressed genes as a text file in the supplementary material. Very seldom is the workflow for data collection, data processing and data analysis recorded in any format other than the methodology section of an article.
  7. Different studies also tend to report their results using different data structures, making it harder to compare the experimental results. For example, in the three experiments that we selected, although the software package used to analyse the data was the same and the microarray workflow was very similar, the results were reported in very different ways. This, in part, reflects the necessity to report not only the sample-specific gene expression values, indicated in dark blue, but also the annotations available for each of the genes discovered, indicated in light blue, as well as the aggregated gene expression values, indicated in green. In some cases, the average signal was used to derive the final list of genes (point to genelist 2) and in others, only the fold change and its associated p-value were used (genelist 3). For our bottom-up approach we have identified, and created a namespace for, the terminology necessary to describe in each of these genelists the portions that are most informative about the procedure used to preprocess and analyse the data, without creating an extensive list of terms that would multiply the amount of data to be represented.
  8. These multiple metadata types, or layers of information, allow us to answer specific biomedical questions. For example, a biologist or physician who wants to compare several experiments about the same disease might use information about the samples that were analysed to answer questions such as “What microarray experiments….”
  9. Information about the fold change and p-value, on the other hand, enables a very different class of questions, driven by a need to understand the reliability of the results, such as “What genes are overexpressed…” A statistician might need to look at these values before cross-experiment comparison is possible.
  10. Finally, for a researcher interested in a deeper understanding of the molecular biology behind diseases, the gene annotation information is necessary to answer questions like “what other diseases may be…”. It is in this type of question that the advantage of using RDF representations of genelists is most obvious.
  11. To represent the genelists as RDF we chose to follow a bottom-up approach; that is, our model is derived from the data but can still link to upper ontologies. The reason for this was twofold. First, although there are several provenance ontologies, none of them was granular enough to represent our data. Second, our model arose from the experimental results and, most importantly, we wanted to be able to answer biological questions. Because the process of knowledge discovery often leads back to the raw data, it is not uncommon for data models to be extended with new data derived from the collection of new experimental results. Furthermore, the bottom-up approach shields our model from rapidly evolving ontologies while at the same time enabling links to other community-accepted ontologies.
  15. To answer our varied classes of questions, we have identified four types of provenance for our data model. Starting from the center, the provenance levels are the institutional level, the experimental context level, the data analysis and significance level, and the dataset description level.
  16. The institutional level contains metadata about the genelists, such as the laboratory where the experiments were performed or the link to the publication where the results were published. These help determine the trustworthiness of the data. The experimental context level includes data about the brain region from which samples were collected or the histology of the cells.
  17. The data analysis and significance level is concerned with information about the statistical procedures that were used to convert the raw data into a list of differentially expressed genes. This information helps determine the statistical significance of the results. Finally, the dataset description level is the description of the RDF dataset itself, such as the version, the publisher or the URL where it is made available. The dataset description level makes use of the Vocabulary of Interlinked Datasets (VoID) and Dublin Core terms.
  18. Each of the provenance levels described is useful for answering a specific class of questions, depending on the perspective of whoever is querying the data. Someone interested in knowing the laboratory where the experiment was performed or where the raw data is published, for example, would need institutional level provenance information. Other types of questions, such as finding all the experiments that used samples from the entorhinal cortex of Alzheimer’s disease patients, would use experimental context level data. To answer questions such as the p-value associated with a measured fold change in gene expression, data from the data analysis and significance level is necessary. Finally, the dataset description level can be used to find more RDF datasets that contain information about genes and diseases.
  22. As part of this work, we have created federated queries that link our AD-related genes to genes in the diseasome. The diseasome is a community effort to link known human diseases to the genes known to be related to those diseases. By federating our list of AD-related genes with the diseasome, we were able to find other diseases that are affected by the same genes. This type of discovery is greatly facilitated by the RDF representation of the genelists.
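
The kind of lookup described in the notes above (filter a genelist by fold change and p-value, then find other diseases that share the remaining genes) can be illustrated with a minimal sketch in plain Python. This is not the SPARQL used in the work, and it does not touch any real endpoint; it only mimics the join over in-memory data. All gene symbols, disease names, field names and thresholds below are hypothetical placeholders, not results from the study.

```python
from collections import defaultdict

# Hypothetical genelist entries. Each record mixes values from the
# provenance levels in the notes: data analysis and significance
# (fold_change, p_value), experimental context (brain_region) and
# institutional (lab). Fold change is assumed to be signed, with
# negative values meaning lower expression in the disease sample.
GENELIST = [
    {"gene": "GENE_A", "fold_change": 2.4, "p_value": 0.001,
     "brain_region": "entorhinal cortex", "lab": "Lab 1"},
    {"gene": "GENE_B", "fold_change": 1.1, "p_value": 0.20,
     "brain_region": "entorhinal cortex", "lab": "Lab 1"},
    {"gene": "GENE_C", "fold_change": -3.0, "p_value": 0.004,
     "brain_region": "hippocampus", "lab": "Lab 2"},
]

# Hypothetical stand-in for a diseasome-like dataset:
# (disease, gene) association pairs.
DISEASE_GENES = [
    ("Disease X", "GENE_A"),
    ("Disease Y", "GENE_C"),
    ("Disease Y", "GENE_B"),
    ("Disease Z", "GENE_D"),
]


def significant_genes(entries, max_p=0.05, min_abs_fold=2.0):
    """Return the genes that pass both thresholds (assumed cut-offs)."""
    return {
        e["gene"]
        for e in entries
        if e["p_value"] <= max_p and abs(e["fold_change"]) >= min_abs_fold
    }


def diseases_sharing_genes(genes, associations):
    """Group the given genes by every disease they are associated with."""
    shared = defaultdict(set)
    for disease, gene in associations:
        if gene in genes:
            shared[disease].add(gene)
    return dict(shared)


if __name__ == "__main__":
    genes = significant_genes(GENELIST)
    shared = diseases_sharing_genes(genes, DISEASE_GENES)
    for disease, hits in sorted(shared.items()):
        print(disease, sorted(hits))
```

In an RDF setting the same join would be expressed as a query over shared gene identifiers rather than Python dictionaries; the sketch only shows why keeping the significance values alongside each gene makes this filtering step possible.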