From Scientific Workflow Patterns
to 5-star Linked Open Data
Alban Gaignard Hala Skaf-Molli Audrey Bihouée
Nantes Academic Hospital (CHU de Nantes), CNRS, France
LINA, Nantes University, CNRS, France
Institut du Thorax, Nantes University, INSERM, CNRS, France
8th USENIX workshop on Theory and Practice of Provenance (TaPP’16)
Washington DC
Needs for linked experiment reports
TaPP '16, A. Gaignard, H. Skaf-Molli, A. Bihouée
Motivations: reusing (massive) RNA-seq data
TopHat: algorithm to align multiple sequence reads to a reference
genome (known genes).
                1 sample     300 samples
Input data      2 x 17 GB    10.2 TB
1-core CPU      170 hours    5.9 years
32-core CPU     32 hours     14 months
Output data     12 GB        3.6 TB
Challenges
Algorithmic performance, storage, preservation,
reuse (limit recompute) & share.
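The 300-sample figures can be approximated by a linear extrapolation of the single-sample measurements. The sketch below is only an illustration: it assumes perfectly linear scaling, decimal units (1 TB = 1000 GB), a 365-day year and a 30-day month, none of which are stated on the slide. Under these assumptions the storage figures match the slide, while the CPU times land slightly below the slide's rounded 5.9 years and 14 months.

```python
# Linear extrapolation of the single-sample TopHat figures to 300 samples.
# Assumptions (not stated on the slide): cost grows linearly with the number
# of samples, 1 TB = 1000 GB, 1 year = 365 days, 1 month = 30 days.

SAMPLES = 300


def scale(per_sample: float, n: int = SAMPLES) -> float:
    """Total cost for n samples under linear scaling."""
    return per_sample * n


input_tb = scale(2 * 17) / 1000            # two 17 GB input files per sample
output_tb = scale(12) / 1000               # 12 GB of output per sample
one_core_years = scale(170) / (24 * 365)   # 170 hours per sample on 1 core
many_core_months = scale(32) / (24 * 30)   # 32 hours per sample on 32 cores

print(f"Input data : {input_tb:.1f} TB")
print(f"Output data: {output_tb:.1f} TB")
print(f"1-core CPU : {one_core_years:.1f} years")
print(f"32-core CPU: {many_core_months:.1f} months")
```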
Motivations: reusing experiment results
Scientific experiment: RNA sequencing to quantify gene expression
levels under multiple biological conditions.
Need for scientific context: metadata
Expected result: human- and machine-tractable reports
Annotated “Material & Methods”
Links to some workflow artifacts (algorithms, data)
5-star Linked Open Data
W3C standards for machine and human
readable data on the web.
⭑⭑⭑⭑⭑: takes time and expertise!
How to ease this process?
● Workflow engines → automation
● PROV → workflow runs as linked data
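As an illustration of "workflow runs as linked data", the sketch below records one executed step as W3C PROV-O statements and serialises them as N-Triples. The PROV terms (`prov:Activity`, `prov:Entity`, `prov:used`, `prov:wasGeneratedBy`) are part of the standard; the `example.org` namespace, the step name and the file names are hypothetical and do not come from the slides.

```python
# Sketch: one workflow step recorded as PROV-O statements (N-Triples output).
# The EX namespace and every identifier built from it are hypothetical.

PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/run1/"  # hypothetical namespace for one workflow run


def step_to_prov(activity, inputs, outputs):
    """Return (subject, predicate, object) IRI triples for one executed step."""
    triples = [(EX + activity, RDF_TYPE, PROV + "Activity")]
    for name in inputs:
        triples.append((EX + name, RDF_TYPE, PROV + "Entity"))
        triples.append((EX + activity, PROV + "used", EX + name))
    for name in outputs:
        triples.append((EX + name, RDF_TYPE, PROV + "Entity"))
        triples.append((EX + name, PROV + "wasGeneratedBy", EX + activity))
    return triples


def to_ntriples(triples):
    """Serialise IRI-only triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


trace = step_to_prov("alignment", ["reads.fastq", "genome.fa"], ["aligned.bam"])
print(to_ntriples(trace))
```

Even this single step yields seven statements, which hints at why raw traces quickly become fine-grained.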
PROV only
too fine-grained
no domain concepts
Provenance as a Linked Experiment Report
Few but meaningful statements
Problem statement & objectives
Problem statement
Scientific workflows produce massive raw results. Publishing them
into curated, queryable linked data repositories requires a lot of
time and expertise.
Can we exploit provenance traces to ease the publication of
scientific results as Linked Data?
Objectives
(1) Leverage annotated workflow patterns to generate
provenance mining rules.
(2) Refine provenance traces into linked experiment reports.
Rules generation
Approach
Input domain-specific annotations (❶,❷)
Workflow patterns ❶
Sequence patterns, with possibly intermediate steps
● P-PLAN ontology: Step, Variable, hasInputVar, hasOutputVar
● EDAM ontology: hasFunction, RNA sequence, Genome map
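A sequence pattern "step A, then possibly some intermediate steps, then step B" can be sketched as a reachability test over the workflow's data dependencies. In the sketch below the dictionary keys mirror the P-PLAN and EDAM terms named above (`hasInputVar`, `hasOutputVar`, `hasFunction`), while the step names, variable names and function labels are hypothetical; this is not the authors' implementation.

```python
# Sketch: matching a sequence pattern with optional intermediate steps.
# Step names, variable names and function labels are hypothetical.

workflow = {
    "trim": {"hasInputVar": ["raw_reads"],
             "hasOutputVar": ["clean_reads"],
             "hasFunction": "Sequence trimming"},
    "align": {"hasInputVar": ["clean_reads", "genome"],
              "hasOutputVar": ["aligned_reads"],
              "hasFunction": "Read mapping"},
    "count": {"hasInputVar": ["aligned_reads"],
              "hasOutputVar": ["expression_table"],
              "hasFunction": "Quantification"},
}


def successors(wf, step):
    """Steps that consume at least one output variable of `step`."""
    outs = set(wf[step]["hasOutputVar"])
    return [s for s, d in wf.items() if outs & set(d["hasInputVar"])]


def reaches(wf, first, last):
    """True if `last` is downstream of `first`, directly or through other steps."""
    stack, seen = successors(wf, first), set()
    while stack:
        step = stack.pop()
        if step == last:
            return True
        if step not in seen:
            seen.add(step)
            stack.extend(successors(wf, step))
    return False


def match_sequence(wf, func_a, func_b):
    """All (a, b) step pairs where a has func_a, b has func_b and a precedes b."""
    return [(a, b) for a, da in wf.items() for b, db in wf.items()
            if da["hasFunction"] == func_a
            and db["hasFunction"] == func_b
            and reaches(wf, a, b)]


print(match_sequence(workflow, "Sequence trimming", "Quantification"))
```

Here the trimming and quantification steps match the pattern even though the alignment step sits between them.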
Experiment report template ❷
Link scientific claims, statements, material and methods
● MicroPublication ontology: Material, Method, Claims
● Experimental Factor Ontology: Transcriptome, Gene expression
● NCBI taxonomy: Homo sapiens
● Open Annotation model: hasBody, hasTarget
PoeM: generating PrOvEnance Mining rules ❸
Rule building blocks (figure labels): SPARQL property path, SPARQL basic graph pattern, SPARQL CONSTRUCT query
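The three SPARQL building blocks named above can be combined into a single rule. The sketch below assembles a CONSTRUCT query as a string: a property path covers the "possibly intermediate steps", a basic graph pattern matches tool labels, and the CONSTRUCT template emits the report statements. The `build_rule` helper, the `ex:` vocabulary and the `rdfs:label` matching are assumptions made for illustration; the actual rules generated by PoeM are not reproduced in this text.

```python
# Sketch: assembling a provenance mining rule as a SPARQL CONSTRUCT query.
# The ex: namespace, its properties and build_rule itself are hypothetical.

PREFIXES = """PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/report#>
"""


def build_rule(first_tool: str, last_tool: str) -> str:
    """Return a CONSTRUCT query for 'first_tool ... last_tool' sequences."""
    return f"""{PREFIXES}
CONSTRUCT {{
  ?result ex:producedWith ?first_step ;
          ex:refinedBy    ?last_step .
}}
WHERE {{
  # Basic graph pattern: match activities by their tool label.
  ?first_step rdfs:label "{first_tool}" .
  ?last_step  rdfs:label "{last_tool}" .
  ?result prov:wasGeneratedBy ?last_step .
  # Property path: one or more used/wasGeneratedBy hops between the two steps.
  ?last_step (prov:used/prov:wasGeneratedBy)+ ?first_step .
}}"""


query = build_rule("TopHat", "Quantification tool")
print(query)
```

The resulting string could be submitted to any SPARQL 1.1 engine holding the PROV trace; the second tool label here is a placeholder.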
PoeM: sample generated rule ❸
<If> part
<Then> part
First experiments & results
Experiment context
SyMeTRIC: systems medicine project (2015-2017, “Connect Talent”
call), funded by the French region Pays de la Loire.
Experiment
Material & methods
● Real-life RNA-seq workflow to study 3 mouse populations
● Workflow implemented in Galaxy, run on 2 biological samples
● PROV traces exported from Galaxy histories (API)
Results (for 1 biological sample)
● 60 h CPU (12 cores for genome alignment), 21 GB storage
● 3 s to export 81 PROV triples from the Galaxy history
● 2 s to apply the rule and produce 35 MicroPublication triples
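Applying a rule to a trace amounts to matching an "if" pattern and emitting a few higher-level statements. The sketch below does this in plain Python on a three-statement trace. The `ex:` predicates, the `run:` identifiers and the `apply_rule` helper are hypothetical, and the real experiment evaluated SPARQL rules rather than Python code.

```python
# Sketch: an if/then rule applied to a tiny PROV-like trace.
# The ex: predicates and run: identifiers are hypothetical.

trace = {
    ("run:align", "rdfs:label", "TopHat"),
    ("run:align", "prov:used", "run:reads"),
    ("run:bam", "prov:wasGeneratedBy", "run:align"),
}


def apply_rule(triples, tool_label):
    """If an entity was generated by an activity labelled `tool_label`,
    then state that the entity is a result and the activity is its method."""
    labelled = {s for s, p, o in triples
                if p == "rdfs:label" and o == tool_label}
    report = set()
    for s, p, o in triples:
        if p == "prov:wasGeneratedBy" and o in labelled:
            report.add((s, "ex:isResultOf", o))
            report.add((o, "ex:isMethod", tool_label))
    return report


report = apply_rule(trace, "TopHat")
for statement in sorted(report):
    print(statement)
```

Matching on the label string mirrors the syntactic matching between workflow patterns and PROV labels that the authors list as a limitation.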
Conclusion & perspectives
Semi-automated approach
(1) PoeM generates semantic web rules
(2) PoeM rules applied on PROV traces to assemble
linked experiment reports (MicroPublication)
Limitations:
- Sequence workflow patterns only
- SPARQL property paths with complex WF patterns?
- Syntactic matching between WF patterns and PROV labels
Usage scenarios:
→ Query workflow datasets with domain concepts
→ Populate RDF repositories with WF results
Future work
(1) WF patterns: split-merge, “common motifs”
(2) Genericity: other domains / other reports (RO, Nanopub.)
(3) PROV heterogeneity: multi-systems PROV
(4) Evaluation: involving biologists, at larger scale
Questions?
Demo: http://poem.univ-nantes.fr
Contact: alban.gaignard@univ-nantes.fr
Acknowledgments
BiRD bioinformatics facility, Connect Talent call
Larger scale experiments (PROV traces)
1232 edges
49 edges

Weitere ähnliche Inhalte

Was ist angesagt?

Haplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsHaplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsGenome Reference Consortium
 
Mapping millions of peptidoforms to Genome Coordinates
Mapping millions of peptidoforms to Genome CoordinatesMapping millions of peptidoforms to Genome Coordinates
Mapping millions of peptidoforms to Genome CoordinatesYasset Perez-Riverol
 
Systematic integration of millions of peptidoform evidences into Ensembl and ...
Systematic integration of millions of peptidoform evidences into Ensembl and ...Systematic integration of millions of peptidoform evidences into Ensembl and ...
Systematic integration of millions of peptidoform evidences into Ensembl and ...Yasset Perez-Riverol
 
Defcon 2011 network forensics 解题记录
Defcon 2011 network forensics 解题记录Defcon 2011 network forensics 解题记录
Defcon 2011 network forensics 解题记录insight-labs
 
The Lily RowLog library
The Lily RowLog libraryThe Lily RowLog library
The Lily RowLog libraryNGDATA
 
New methods hackathon mhc project
New methods   hackathon mhc projectNew methods   hackathon mhc project
New methods hackathon mhc projectGenomeInABottle
 
Big data solution for ngs data analysis
Big data solution for ngs data analysisBig data solution for ngs data analysis
Big data solution for ngs data analysisYun Lung Li
 
Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Genome Reference Consortium
 
Epi tect chi pqpcr_2013
Epi tect chi pqpcr_2013Epi tect chi pqpcr_2013
Epi tect chi pqpcr_2013Elsa von Licy
 
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...GENALICE
 
Population Calling: a powerful tool for novel mutation detection in larger sa...
Population Calling: a powerful tool for novel mutation detection in larger sa...Population Calling: a powerful tool for novel mutation detection in larger sa...
Population Calling: a powerful tool for novel mutation detection in larger sa...GENALICE
 

Was ist angesagt? (20)

Getting the most from the reference assembly
Getting the most from the reference assemblyGetting the most from the reference assembly
Getting the most from the reference assembly
 
Haplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsHaplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long reads
 
Mapping millions of peptidoforms to Genome Coordinates
Mapping millions of peptidoforms to Genome CoordinatesMapping millions of peptidoforms to Genome Coordinates
Mapping millions of peptidoforms to Genome Coordinates
 
Systematic integration of millions of peptidoform evidences into Ensembl and ...
Systematic integration of millions of peptidoform evidences into Ensembl and ...Systematic integration of millions of peptidoform evidences into Ensembl and ...
Systematic integration of millions of peptidoform evidences into Ensembl and ...
 
Ashg grc workshop2015_tg
Ashg grc workshop2015_tgAshg grc workshop2015_tg
Ashg grc workshop2015_tg
 
agbt 2016 workshop lindsay
agbt 2016 workshop lindsayagbt 2016 workshop lindsay
agbt 2016 workshop lindsay
 
Defcon 2011 network forensics 解题记录
Defcon 2011 network forensics 解题记录Defcon 2011 network forensics 解题记录
Defcon 2011 network forensics 解题记录
 
TAGC2016 schneider
TAGC2016 schneiderTAGC2016 schneider
TAGC2016 schneider
 
The Lily RowLog library
The Lily RowLog libraryThe Lily RowLog library
The Lily RowLog library
 
New methods hackathon mhc project
New methods   hackathon mhc projectNew methods   hackathon mhc project
New methods hackathon mhc project
 
Big data solution for ngs data analysis
Big data solution for ngs data analysisBig data solution for ngs data analysis
Big data solution for ngs data analysis
 
Sept2016 sv nabsys
Sept2016 sv nabsysSept2016 sv nabsys
Sept2016 sv nabsys
 
AGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: FultonAGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: Fulton
 
Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...
 
AGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: SchneiderAGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: Schneider
 
AGBT 2016 Workshop Magrini
AGBT 2016 Workshop MagriniAGBT 2016 Workshop Magrini
AGBT 2016 Workshop Magrini
 
Epi tect chi pqpcr_2013
Epi tect chi pqpcr_2013Epi tect chi pqpcr_2013
Epi tect chi pqpcr_2013
 
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...
 
Population Calling: a powerful tool for novel mutation detection in larger sa...
Population Calling: a powerful tool for novel mutation detection in larger sa...Population Calling: a powerful tool for novel mutation detection in larger sa...
Population Calling: a powerful tool for novel mutation detection in larger sa...
 
Agbt2015 workshop schneider
Agbt2015 workshop schneiderAgbt2015 workshop schneider
Agbt2015 workshop schneider
 

Ähnlich wie PoemTapp16

The Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceThe Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceRobert Grossman
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
 
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...Baptiste Mayjonade
 
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...Alejandra Gonzalez-Beltran
 
Extreme Scripting July 2009
Extreme Scripting July 2009Extreme Scripting July 2009
Extreme Scripting July 2009Ian Foster
 
ICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick ProvartICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick ProvartAraport
 
Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128GenomeInABottle
 
wings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualizewings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualizeAnn Loraine
 
140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposalGenomeInABottle
 
Mining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsMining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsJuan Antonio Vizcaino
 
The data, they are a-changin’
The data, they are a-changin’The data, they are a-changin’
The data, they are a-changin’ Paolo Missier
 
Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...GenomeInABottle
 
2011 jeroen vanhoudt_ngs
2011 jeroen vanhoudt_ngs2011 jeroen vanhoudt_ngs
2011 jeroen vanhoudt_ngsDin Apellidos
 
Bic bellec 2016_small
Bic bellec 2016_smallBic bellec 2016_small
Bic bellec 2016_smallPierre Bellec
 
SHARP: Harmonizing Galaxy and Taverna workflow provenance
SHARP: Harmonizing Galaxy and Taverna workflow provenanceSHARP: Harmonizing Galaxy and Taverna workflow provenance
SHARP: Harmonizing Galaxy and Taverna workflow provenanceSyed Muhammad Ali Hasnain
 
SHARP: harmonizing Galaxy and Taverna worflow provenance
SHARP: harmonizing Galaxy and Taverna worflow provenanceSHARP: harmonizing Galaxy and Taverna worflow provenance
SHARP: harmonizing Galaxy and Taverna worflow provenanceGaignard Alban
 
RAMP Data Challenge
RAMP Data Challenge RAMP Data Challenge
RAMP Data Challenge Proto204
 

Ähnlich wie PoemTapp16 (20)

The Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceThe Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data Science
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...
 
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...
 
Thesis biobix
Thesis biobixThesis biobix
Thesis biobix
 
Extreme Scripting July 2009
Extreme Scripting July 2009Extreme Scripting July 2009
Extreme Scripting July 2009
 
ICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick ProvartICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick Provart
 
Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128
 
wings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualizewings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualize
 
140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal
 
Mining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsMining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasets
 
The data, they are a-changin’
The data, they are a-changin’The data, they are a-changin’
The data, they are a-changin’
 
Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...
 
ISMB Workshop 2014
ISMB Workshop 2014ISMB Workshop 2014
ISMB Workshop 2014
 
2011 jeroen vanhoudt_ngs
2011 jeroen vanhoudt_ngs2011 jeroen vanhoudt_ngs
2011 jeroen vanhoudt_ngs
 
Bic bellec 2016_small
Bic bellec 2016_smallBic bellec 2016_small
Bic bellec 2016_small
 
SHARP: Harmonizing Galaxy and Taverna workflow provenance
SHARP: Harmonizing Galaxy and Taverna workflow provenanceSHARP: Harmonizing Galaxy and Taverna workflow provenance
SHARP: Harmonizing Galaxy and Taverna workflow provenance
 
SHARP: harmonizing Galaxy and Taverna worflow provenance
SHARP: harmonizing Galaxy and Taverna worflow provenanceSHARP: harmonizing Galaxy and Taverna worflow provenance
SHARP: harmonizing Galaxy and Taverna worflow provenance
 
BioSB meeting 2015
BioSB meeting 2015BioSB meeting 2015
BioSB meeting 2015
 
RAMP Data Challenge
RAMP Data Challenge RAMP Data Challenge
RAMP Data Challenge
 

Kürzlich hochgeladen

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 

Kürzlich hochgeladen (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

PoemTapp16

  • 1. From Scientific Workflow Patterns to 5-star Linked Open Data Alban Gaignard Hala Skaf-Molli Audrey Bihouée Nantes Academic Hospital (CHU de Nantes), CNRS, France LINA, Nantes University, CNRS, France Institut du Thorax, Nantes University, INSERM, CNRS, France 8th USENIX workshop on Theory and Practice of Provenance (TaPP’16) Washington DC
  • 2. Needs for linked experiment reports 2
  • 3. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 3
  • 4. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 4 1 sample Input data 2 x 17 Gb 1-core CPU 170 hours 32-cores CPU 32 hours Output data 12 Gb
  • 5. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 5 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb 1 sample Input data 2 x 17 Gb 1-core CPU 170 hours 32-cores CPU 32 hours Output data 12 Gb
  • 6. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 6 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb Challenges Algorithmic performance, storage, preservation, reuse (limit recompute) & share. 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb 1 sample Input data 2 x 17 Gb 1-core CPU 170 hours 32-cores CPU 32 hours Output data 12 Gb
  • 7. Motivations: reusing experiment results. Scientific experiment: RNA sequencing to quantify gene expression levels under multiple biological conditions.
  • 8. Motivations: reusing experiment results. Scientific experiment: RNA sequencing to quantify gene expression levels under multiple biological conditions. Need for scientific context: metadata.
  • 9. Expected result: human- and machine-tractable reports. Annotated “Material & Methods”. Links to some workflow artifacts (algorithms, data).
  • 10. 5-star Linked Open Data. W3C standards for machine- and human-readable data on the web. ⭑⭑⭑⭑⭑: requires time and expertise!
  • 11. 5-star Linked Open Data. W3C standards for machine- and human-readable data on the web. ⭑⭑⭑⭑⭑: requires time and expertise! How to ease this process? ● Workflow engines → automation ● PROV → workflow runs as linked data
  • 12. PROV only
  • 13. PROV only: too fine-grained, no domain concepts
  • 14. Provenance as a Linked Experiment Report: few + meaningful statements
  • 15. Problem statement & objectives. Problem statement: scientific workflows produce massive raw results. Publishing them into curated, queryable linked data repositories requires a lot of time and expertise. Can we exploit provenance traces to ease the publication of scientific results as Linked Data?
  • 16. Problem statement & objectives. Problem statement: scientific workflows produce massive raw results. Publishing them into curated, queryable linked data repositories requires a lot of time and expertise. Can we exploit provenance traces to ease the publication of scientific results as Linked Data? Objectives: (1) Leverage annotated workflow patterns to generate provenance mining rules. (2) Refine provenance traces into linked experiment reports.
  • 18. Approach
  • 19. Input domain-specific annotations (❶, ❷). Workflow patterns ❶: sequence patterns, possibly with intermediate steps ● P-PLAN ontology: Step, Variable, hasInputVar, hasOutputVar ● EDAM ontology: hasFunction, RNA sequence, Genome map
  • 20. Input domain-specific annotations (❶, ❷). Workflow patterns ❶: sequence patterns, possibly with intermediate steps ● P-PLAN ontology: Step, Variable, hasInputVar, hasOutputVar ● EDAM ontology: hasFunction, RNA sequence, Genome map. Experiment report template ❷: link scientific claims, statements, material and methods ● MicroPublication ontology: Material, Method, Claims ● Experimental Factor Ontology: Transcriptome, Gene expression ● NCBI taxonomy: Homo sapiens ● Open Annotation model: hasBody, hasTarget
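To illustrate what matching a sequence pattern "possibly with intermediate steps" means on a provenance trace, here is a minimal sketch that uses plain Python tuples instead of an RDF store. The file and activity names are hypothetical, and the traversal is only an analogue of a transitive path over prov:wasGeneratedBy followed by prov:used; it is not the authors' implementation.

```python
from collections import defaultdict

# Toy PROV trace as (subject, predicate, object) triples; all names are hypothetical.
TRACE = [
    ("aligned.bam", "prov:wasGeneratedBy", "run_align"),
    ("run_align", "prov:used", "reads.fastq"),
    ("sorted.bam", "prov:wasGeneratedBy", "run_sort"),
    ("run_sort", "prov:used", "aligned.bam"),
    ("counts.tsv", "prov:wasGeneratedBy", "run_count"),
    ("run_count", "prov:used", "sorted.bam"),
]


def upstream_entities(trace, entity):
    """Entities reachable from `entity` via one or more (wasGeneratedBy, used) hops."""
    generated_by = defaultdict(set)
    used = defaultdict(set)
    for subject, predicate, obj in trace:
        if predicate == "prov:wasGeneratedBy":
            generated_by[subject].add(obj)
        elif predicate == "prov:used":
            used[subject].add(obj)

    seen, stack = set(), [entity]
    while stack:
        current = stack.pop()
        for activity in generated_by[current]:
            for source in used[activity]:
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
    return seen


def matches_sequence(trace, first_input, last_output):
    """True if `last_output` derives from `first_input`, however many steps lie between."""
    return first_input in upstream_entities(trace, last_output)


if __name__ == "__main__":
    # The sort step is an intermediate step that the pattern does not need to name.
    print(matches_sequence(TRACE, "reads.fastq", "counts.tsv"))
```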
  • 21. PoeM: generating PrOvEnance Mining rules ❸. Building blocks: SPARQL property path, SPARQL basic graph pattern, SPARQL CONSTRUCT query.
  • 22. PoeM: sample generated rule ❸, with an <If> part and a <Then> part.
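The slides do not reproduce the generated rule as text, so the following is only a hedged sketch of what such a rule could look like: a SPARQL CONSTRUCT query whose WHERE clause plays the <If> role (a basic graph pattern plus a property path over the PROV trace) and whose CONSTRUCT template plays the <Then> role. The `ex:` namespace, the `ex:hasMaterial` and `ex:hasResult` properties, and the function name are placeholders, not terms from PoeM or from the MicroPublication ontology; the label filter mirrors the syntactic label matching mentioned in the talk's limitations.

```python
def _escape(label):
    """Escape backslashes and double quotes so the label is a valid SPARQL string."""
    return label.replace("\\", "\\\\").replace('"', '\\"')


def build_mining_rule(input_label, output_label):
    """Assemble a CONSTRUCT rule for a sequence pattern with optional intermediate steps.

    Hypothetical helper: the ex: terms below are placeholders for report vocabulary.
    """
    return f"""PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex: <http://example.org/report#>

# <Then> part: statements added to the experiment report
CONSTRUCT {{
  ex:report ex:hasMaterial ?in ;
            ex:hasResult ?out .
}}
# <If> part: pattern matched against the PROV trace
WHERE {{
  ?in rdfs:label ?inLabel .
  ?out rdfs:label ?outLabel .
  FILTER(CONTAINS(STR(?inLabel), "{_escape(input_label)}"))
  FILTER(CONTAINS(STR(?outLabel), "{_escape(output_label)}"))
  ?out (prov:wasGeneratedBy/prov:used)+ ?in .
}}"""


if __name__ == "__main__":
    print(build_mining_rule("RNA sequence", "Genome map"))
```

The `+` on the property path is what lets the rule skip over intermediate steps between the annotated input and output.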
  • 24. Experiment context. SyMeTRIC: systems medicine project (2015-2017, call “Connect Talent”), funded by the French region Pays de la Loire.
  • 25. Experiment. Material & methods ● Real-life RNA-seq workflow to study 3 mouse populations ● Workflow implemented in Galaxy, run on 2 biological samples ● PROV traces exported from Galaxy Histories (API)
  • 26. Experiment. Material & methods ● Real-life RNA-seq workflow to study 3 mouse populations ● Workflow implemented in Galaxy, run on 2 biological samples ● PROV traces exported from Galaxy Histories (API). Results (for 1 biological sample) ● 60 h CPU (12 cores for genome alignment), 21 Gb storage ● 3 s to export 81 PROV triples from the Galaxy history ● 2 s to apply the rule and produce 35 MicroPublication triples
  • 27. Conclusion & perspectives. Semi-automated approach: (1) PoeM generates semantic web rules; (2) PoeM rules are applied to PROV traces to assemble linked experiment reports (MicroPublication). Limitations: sequence workflow patterns only; unclear applicability of SPARQL property paths to complex WF patterns; syntactic matching between WF patterns and PROV labels. Usage scenarios: → query workflow datasets with domain concepts → populate RDF repositories with WF results
  • 28. Conclusion & perspectives. Future work: (1) WF patterns: split-merge, “common motifs”; (2) Genericity: other domains / other reports (RO, Nanopub.); (3) PROV heterogeneity: multi-system PROV; (4) Evaluation: involving biologists, at a larger scale
  • 29. Questions? Demo: http://poem.univ-nantes.fr Contact: alban.gaignard@univ-nantes.fr Acknowledgments: BiRD bioinformatics facility, Connect Talent Call
  • 30. Larger-scale experiments (PROV traces): 1232 edges
  • 31. Larger-scale experiments (PROV traces): 49 edges