(Credit to Varun Ratnakar and Yolanda Gil).
The automation of important aspects of scientific data analysis would significantly accelerate the pace of science and innovation. Although important aspects of data analysis can be automated, the hypothesize-test-evaluate discovery cycle is largely carried out by hand by researchers. This introduces a significant human bottleneck, which is inefficient and can lead to erroneous and incomplete explorations. We introduce a novel approach to automate the hypothesize-test-evaluate discovery cycle with an intelligent system that a scientist can task to test hypotheses of interest in a data repository. Our approach captures three types of data analytics knowledge: 1) common data analytic methods represented as semantic workflows; 2) meta-analysis methods that aggregate those results, represented as meta-workflows; and 3) data analysis strategies that specify for a type of hypothesis what data and methods to use, represented as lines of inquiry. Given a hypothesis specified by a scientist, appropriate lines of inquiry are triggered, which lead to retrieving relevant datasets, running relevant workflows on that data, and finally running meta-workflows on workflow results. The scientist is then presented with a level of confidence on the initial hypothesis (or a revised hypothesis) based on the data and methods applied. We have implemented this approach in the DISK system, and applied it to multi-omics data analysis.
Automated Hypothesis Testing with Large Scale Scientific Workflows
1. Automated Hypothesis Testing with
Large Scale Scientific Workflows
Yolanda Gil
Daniel Garijo
Rajiv Mayani
Varun Ratnakar
Information Sciences Institute
& Department of Computer Science
University of Southern California
http://www.isi.edu
Parag Mallick
Ravali Adusumilli
Hunter Boyce
Stanford School of Medicine
Canary Center for Early Cancer Detection
Stanford University
http://mallicklab.stanford.edu
http://www.disk-project.org
2. Talk Outline
๏ Motivation
๏ Research Challenges
1. Representing Hypotheses
2. Representing Lines of Inquiry
3. Meta-analysis to review workflow results
๏ DISK Scenario walkthrough
๏ Results in cancer multi-omics
๏ Related work
๏ Contributions and Future Work
3. Scientific Data Analysis Today:
Inefficient, Incomplete, Irreproducible
๏ Data analysis is time consuming
๏ Not systematic
๏ Not updated when new data/methods
become available
๏ Hard/impractical to reproduce prior
work
๏ Overall process is manually done:
inefficient and error-prone
๏ Analytic knowledge is
compartmentalised
[Figure: discovery cycle with the steps New hypothesis; Formulate line of inquiry (data + method); Retrieve data; Run workflows (methods); Meta-analysis of results]
4. Our Focus: Cancer Multi-Omics
๏ Data Availability and Complexity:
• The multi-omic domain is filled with multiple levels of heterogeneous data that are
regularly expanding in volume and complexity through projects like The Cancer
Genome Atlas (TCGA) and the associated Clinical Proteomic Tumor Analysis
Consortium (CPTAC)
5. Our Focus: Cancer Multi-Omics
๏ Analytic Complexity:
• Multi-omic analysis requires the use of dozens of interconnected tools, each of which may require substantial domain knowledge.
[Figure: tools shown: MAQ, BWA, BWA-SW (SE only), PERM, SOAPv2, MOSAIK, NOVOALIGN, SAMTOOLS, PICARD, GATK, IGVtools]
Domain knowledge is isolated
6. Our Focus: Cancer Multi-Omics
๏ Multiple types and complexities
of hypotheses:
• Hypotheses span the range from
single-gene/single dataset to
multi-gene/multi-ome/multi-
dataset
• Is this protein found in this sample?
• Is this gene found in this sample?
• Is this protein associated with a
certain cancer?
• Which proteins are associated with a
certain cancer?
• ..
• ..
7. Talk Outline
๏ Motivation
๏ Our Approach & Research Challenges
1. Representing Hypotheses
2. Representing Lines of Inquiry
3. Meta-analysis to review workflow results
๏ DISK Scenario walkthrough
๏ Results in cancer multi-omics
๏ Related work
๏ Contributions and Future Work
8. Our Approach: Hypotheses-Driven Discovery
๏ Represent scientist
hypotheses
๏ Formulate lines of inquiry
that express how a type of
hypothesis can be pursued by
data analysis workflows
๏ Design a meta-analysis that
examines the results of lines of
inquiry and either validates or
revises the original hypotheses
๏ Develop an intelligent agent
that can report and explain
new findings to the scientist
[Figure: approach overview. Components shown: Hypothesis; Lines of Inquiry (specify relevant analytic methods (workflows), type of data needed, and how to combine results); Query to retrieve Data; Data Analysis Workflows; Workflow Bindings; Meta-Workflows (Confidence Estimation, Benchmarking); Revised hypothesis & interesting findings]
9. Representing Hypotheses
[Figure: approach overview diagram, as on slide 8]
10. Requirements from Omics
๏ Graph-based hypothesis
representation
• Entities are nodes
• Relationships are links
๏ Annotations on graphs
• Represent qualifications of hypotheses:
confidence and evidence
๏ Representing hypothesis evolution
• Graph versioning
Graph representation in RDF
๏ Standard semantic web language
๏ Scalable reasoners available
๏ Qualifications and provenance
through triple reification
๏ Versioning through multiple
named graphs
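The RDF features listed above can be illustrated without any RDF library. The following is a minimal sketch, assuming a hypothetical in-memory store of named graphs; the graph names, the `stmt:S1` identifier, and the `hyp:` prefix on `hasConfidenceValue` and `revisionOf` are illustrative assumptions, while `rdf:Statement`, `rdf:subject`, `rdf:predicate`, and `rdf:object` are the standard RDF reification terms.

```python
from collections import defaultdict

# Hypothetical in-memory quad store: named graph -> set of (subject, predicate, object).
quads = defaultdict(set)

def add(graph, s, p, o):
    """Assert the triple (s, p, o) inside the named graph `graph`."""
    quads[graph].add((s, p, o))

def reify(graph, stmt_id, s, p, o):
    """Describe a triple as a resource (RDF reification) so it can be annotated."""
    add(graph, stmt_id, "rdf:type", "rdf:Statement")
    add(graph, stmt_id, "rdf:subject", s)
    add(graph, stmt_id, "rdf:predicate", p)
    add(graph, stmt_id, "rdf:object", o)

def annotations(graph, stmt_id):
    """Return the properties attached to a reified statement, minus the reification core."""
    core = {"rdf:type", "rdf:subject", "rdf:predicate", "rdf:object"}
    return {p: o for s, p, o in quads[graph] if s == stmt_id and p not in core}

triple = ("bio:PRKCDBP", "hyp:expressedIn", "user:TCGA-AA-3561-01A-22")

# Version 1 of the hypothesis lives in its own named graph.
add("graph:Hy1", *triple)

# The revision is a separate named graph; a link records which version it revises.
add("graph:Hy1-rev", *triple)
add("graph:meta", "graph:Hy1-rev", "hyp:revisionOf", "graph:Hy1")

# A qualification (confidence) is attached to the reified statement, not to the triple itself.
reify("graph:Hy1-rev-qual", "stmt:S1", *triple)
add("graph:Hy1-rev-qual", "stmt:S1", "hyp:hasConfidenceValue", 0.0)

print(annotations("graph:Hy1-rev-qual", "stmt:S1"))
```

Keeping each version in its own graph means earlier versions are never overwritten, which is the property the versioning bullet relies on.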
12. Lifecycle of a hypothesis
[Figure: hypothesis graphs built from three vocabularies: Biology ontology (bio:), Hypothesis ontology (hyp:), User data definitions (user:)]
Graph Hy1: bio:PRKCDBP hyp:expressedIn user:TCGA-AA-3561-01A-22
Graph Hy2: bio:PRKCDBP hyp:associatedWith bio:ColonCancer
13. 1. Initial Hypothesis, Data & Workflows
[Figure: hypothesis graph, initial state]
Hypothesis Statement Hy1: PRKCDBP expressedIn TCGA-AA-3561-01A-22
Data Available: XX_3561Proteome_VU.zip (MassSpecData), AA_3561_EX2 (Experiment), TCGA-AA-3561-01A-22 (Sample), TCGA-AA-3561 (Patient); links: producedData, experimentedOn, collectedFrom
Workflows Available: Proteomics, Proteogenomics
14. 2. Running workflows on Data
[Figure: hypothesis graph after a workflow execution]
Hypothesis Statement Hy1: PRKCDBP expressedIn TCGA-AA-3561-01A-22
Data Available: XX_3561Proteome_VU.zip (MassSpecData), AA_3561_EX2 (Experiment), TCGA-AA-3561-01A-22 (Sample), TCGA-AA-3561 (Patient); links: producedData, experimentedOn, collectedFrom
Workflows Available: Proteomics, Proteogenomics
Workflow Execution W1 (links: hasWorkflowTemplate, used)
15. 3. Meta reasoning about workflow results
[Figure: hypothesis graph after a meta-workflow execution]
Hypothesis Statement Hy1: PRKCDBP expressedIn TCGA-AA-3561-01A-22
Data Available: XX_3561Proteome_VU.zip (MassSpecData), AA_3561_EX2 (Experiment), TCGA-AA-3561-01A-22 (Sample), TCGA-AA-3561 (Patient); links: producedData, experimentedOn, collectedFrom
Workflows Available: Proteomics, Proteogenomics
Workflow Execution W1 (links: hasWorkflowTemplate, used)
Meta-Workflow Execution MW1 (links: used, produced)
Revised Hypothesis Statement Hy1' (revisionOf Hy1): PRKCDBP expressedIn TCGA-AA-3561-01A-22
Qualifications of Hy1': Statement Hy1'-S1, hasConfidenceValue 0
Provenance of Hy1': hasProvenance (links: used, produced)
16. 4. New Data becomes available
[Figure: hypothesis graph with a second experiment and dataset added]
Hypothesis Statement Ha1: PRKCDBP expressedIn TCGA-AA-3561-01A-22
Data Available: XX_3561Proteome_VU.zip (MassSpecData), XX_3561_DD.zip (RNASeqData), AA_3561_EX1 (Experiment), AA_3561_EX2 (Experiment), TCGA-AA-3561-01A-22 (Sample), TCGA-AA-3561 (Patient); links: producedData, experimentedOn, collectedFrom
Workflows Available: Proteomics, Proteogenomics
17. 5. New Multi-Workflows are also run
[Figure: hypothesis graph with two workflow executions]
Hypothesis Statement Ha1: PRKCDBP expressedIn TCGA-AA-3561-01A-22
Data Available: XX_3561Proteome_VU.zip (MassSpecData), XX_3561_DD.zip (RNASeqData), AA_3561_EX1 (Experiment), AA_3561_EX2 (Experiment), TCGA-AA-3561-01A-22 (Sample), TCGA-AA-3561 (Patient); links: producedData, experimentedOn, collectedFrom
Workflows Available: Proteomics, Proteogenomics
Workflow Execution W1 (link: used)
Workflow Execution W2 (link: used)
18. 6. Hypothesis Revision
[Figure: hypothesis graph after the second meta-workflow execution]
Hypothesis Statement Ha1: PRKCDBP expressedIn TCGA-AA-3561-01A-22
Data Available: XX_3561Proteome_VU.zip (MassSpecData), XX_3561_DD.zip (RNASeqData), AA_3561_EX1 (Experiment), AA_3561_EX2 (Experiment), TCGA-AA-3561-01A-22 (Sample), TCGA-AA-3561 (Patient); links: producedData, experimentedOn, collectedFrom
Workflows Available: Proteomics, Proteogenomics
Workflow Execution W1 (link: used)
Workflow Execution W2 (link: used)
Meta-Workflow Execution MW2 (links: used, produced)
Revised Hypothesis Statement Ha1' (revisionOf Ha1): PRKCDBP (Mutated) expressedIn TCGA-AA-3561-01A-22
Qualifications of Ha1': Statement Ha1'-S1, hasConfidenceValue 0.98
Provenance of Ha1': hasProvenance (links: used, produced)
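The six lifecycle steps amount to a chain of hypothesis versions, each pointing back at the one it revises. A minimal sketch of that chain follows, assuming a hypothetical `Hypothesis` record; the field names and the `revise` and `history` helpers are illustrative and are not taken from DISK.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Hypothesis:
    """One version of a hypothesis statement (hypothetical record layout)."""
    name: str
    statement: tuple                      # (subject, predicate, object)
    confidence: Optional[float] = None    # None until a meta-workflow has run
    provenance: tuple = ()                # executions that led to this version
    revision_of: Optional["Hypothesis"] = None

def revise(previous, name, statement, confidence, provenance):
    """Create a new version; the previous version is kept, never modified."""
    return Hypothesis(name, statement, confidence, tuple(provenance), previous)

def history(hypothesis):
    """Follow revision_of links from the newest version back to the original."""
    names = []
    while hypothesis is not None:
        names.append(hypothesis.name)
        hypothesis = hypothesis.revision_of
    return names

# Values taken from the slides: Ha1 is revised to Ha1' with confidence 0.98.
ha1 = Hypothesis("Ha1", ("PRKCDBP", "expressedIn", "TCGA-AA-3561-01A-22"))
ha1_revised = revise(
    ha1,
    "Ha1'",
    ("PRKCDBP (Mutated)", "expressedIn", "TCGA-AA-3561-01A-22"),
    0.98,
    ["W1", "W2", "MW2"],
)

print(history(ha1_revised))
```

Making the record immutable mirrors the slides: a revision never overwrites the statement it revises.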
19. Representing Lines of Inquiry & Data analysis workflows
[Figure: approach overview diagram, as on slide 8]
20. Lines of Inquiry
๏ Capture how to set up potential analyses that can be pursued to test a certain type of
hypothesis
Hypothesis Pattern: bio:Protein ?p hyp:expressedIn bio:Sample ?s
Data Query Pattern: Patient ?p, Sample ?s, Experiment ?e, DataFile ?d; links: collectedFrom, experimentedOn, producedData
Data Analytic Workflows: Proteomics, Proteogenomics (input: DataFile ?d)
Meta-workflows: Comparison, Confidence estimation, Benchmarking
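A line of inquiry can be read as two steps: match the hypothesis against a pattern to bind variables, then use those bindings in a data query. The sketch below illustrates this under stated assumptions: the catalog layout, the pairing of each experiment with a data file, the `OTHER_EX` row, and the function names are all hypothetical, and the matching is plain Python rather than a real query language.

```python
# Hypothesis pattern from the slide: bio:Protein ?p  hyp:expressedIn  bio:Sample ?s
PATTERN = (("?p", "bio:Protein"), "hyp:expressedIn", ("?s", "bio:Sample"))

# Hypothetical type assertions for the entities in the running example.
TYPES = {
    "bio:PRKCDBP": "bio:Protein",
    "user:TCGA-AA-3561-01A-22": "bio:Sample",
}

# Hypothetical catalog rows: (experiment, sample experimented on, data file, data type).
# Which experiment produced which file is an assumption made for illustration.
CATALOG = [
    ("AA_3561_EX2", "user:TCGA-AA-3561-01A-22", "XX_3561Proteome_VU.zip", "MassSpecData"),
    ("AA_3561_EX1", "user:TCGA-AA-3561-01A-22", "XX_3561_DD.zip", "RNASeqData"),
    ("OTHER_EX", "user:other-sample", "other.zip", "MassSpecData"),
]

def match_hypothesis(hypothesis, pattern=PATTERN, types=TYPES):
    """Return variable bindings if the hypothesis triple fits the pattern, else None."""
    (s_var, s_type), predicate, (o_var, o_type) = pattern
    s, p, o = hypothesis
    if p != predicate or types.get(s) != s_type or types.get(o) != o_type:
        return None
    return {s_var: s, o_var: o}

def query_data(bindings, catalog=CATALOG):
    """Data query: files produced by experiments on the sample bound to ?s."""
    return [(f, kind) for _, sample, f, kind in catalog if sample == bindings["?s"]]

hypothesis = ("bio:PRKCDBP", "hyp:expressedIn", "user:TCGA-AA-3561-01A-22")
bindings = match_hypothesis(hypothesis)
files = query_data(bindings) if bindings else []
print(bindings)
print(files)
```

A hypothesis with a different predicate, such as `hyp:associatedWith`, does not match this pattern and would need its own line of inquiry.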
22. Automated Workflow Generation in WINGS by Reasoning about
Semantic Constraints
Example: all input data must be from the human species, i.e. must have HS in their metadata
The workflow system uses this constraint to select only datasets that have HS in their metadata, so that the generated workflows are valid
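The HS example can be sketched as a metadata filter. This is an illustration of the idea only, not the WINGS API: the `species` key, the dataset names, and the `MM` value are hypothetical.

```python
def satisfies(metadata, constraint):
    """True if every key in the constraint has the required value in the metadata."""
    return all(metadata.get(key) == value for key, value in constraint.items())

# Hypothetical dataset metadata; only the HS value comes from the slide.
datasets = {
    "sample_a.zip": {"species": "HS"},
    "sample_b.zip": {"species": "MM"},
    "sample_c.zip": {},  # no species recorded, so it cannot be shown to be valid
}

constraint = {"species": "HS"}
valid = sorted(name for name, md in datasets.items() if satisfies(md, constraint))
print(valid)
```

Datasets with missing metadata are rejected rather than assumed valid, which is the conservative choice when a constraint must hold for every input.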
23. Representing Hypotheses
[Figure: approach overview diagram, as on slide 8]
24. Meta-workflows:
1) Comparison Meta-Workflows
[Figure: Variant Detection feeds a Custom Protein DB; Protein Identification is run against the Custom DB and the Reference DB, each yielding Protein IDs; the Comparison Meta-Workflow produces a Similarity Score. Data dependent: Peptide Level, Protein Level, Scan Level]
๏ Goals:
• Compare results amongst multiple workflows
• Measure the global similarity amongst multiple workflows
• Provide users with explanation of workflow-dependent
differences in results
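One way to make the comparison goals concrete is a set-overlap score between the protein IDs returned by two workflows. The slides do not say which similarity measure is used; the sketch below swaps in Jaccard similarity as an assumption, and the protein IDs are made up.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty results are treated as identical
    return len(a & b) / len(a | b)

def compare(custom_ids, reference_ids):
    """Global similarity plus the workflow-dependent differences behind it."""
    custom, reference = set(custom_ids), set(reference_ids)
    return {
        "similarity": jaccard(custom, reference),
        "only_custom": sorted(custom - reference),
        "only_reference": sorted(reference - custom),
    }

# Hypothetical protein IDs from a custom-DB search and a reference-DB search.
report = compare(["P1", "P2", "P3", "P4"], ["P2", "P3", "P5"])
print(report)
```

Returning the two difference lists alongside the score supports the third goal: the user can see which identifications depend on the choice of workflow.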
25. Meta-workflows:
2) Benchmark Meta-Workflows
๏ Goals:
• Evaluation of workflow performance
• Training of confidence estimation models (probabilistic)
[Figure: the Benchmark Meta-Workflow produces ROC curves and true/false positive rates, which feed the probabilistic models]
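The evaluation step can be sketched as computing true and false positive rates over a benchmark with known answers. The scores, the ground-truth set, and the function names below are hypothetical; the slides name ROC and true/false positive rates but not how they are computed.

```python
def rates(scores, truth, threshold):
    """True positive rate and false positive rate at one score threshold."""
    tp = fp = fn = tn = 0
    for item, score in scores.items():
        predicted = score >= threshold
        actual = item in truth
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

def roc_points(scores, truth):
    """(TPR, FPR) pairs, sweeping the threshold from the highest score down."""
    thresholds = sorted(set(scores.values()), reverse=True)
    return [rates(scores, truth, t) for t in thresholds]

# Hypothetical benchmark: workflow scores per protein, and the proteins truly present.
scores = {"A": 0.9, "B": 0.8, "C": 0.4, "D": 0.2}
truth = {"A", "C"}
points = roc_points(scores, truth)
print(points)
```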
26. Meta-workflows:
3) Confidence estimation Meta-Workflows
๏ Goals:
• Composite results from multiple workflows
• Estimate confidence of the workflow result
• Use estimated confidence to update hypothesis
[Figure: Protein Identification is run against the Custom DB and the Reference DB, each yielding Protein IDs; a Probabilistic Model, trained by the Benchmark Meta-Workflow, is used to estimate confidence and update the hypothesis]
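The slides do not specify the probabilistic model. As an illustration of combining results from multiple workflows, the sketch below swaps in a noisy-OR combination, which assumes the workflows provide independent evidence; the per-workflow probabilities and the `update_hypothesis` helper are hypothetical.

```python
from math import prod

def noisy_or(probabilities):
    """Probability that at least one independent piece of evidence is correct."""
    return 1.0 - prod(1.0 - p for p in probabilities)

def update_hypothesis(statement, probabilities):
    """Attach the combined confidence to a hypothesis statement."""
    return {"statement": statement, "confidence": noisy_or(probabilities)}

# Hypothetical per-workflow probabilities that the protein was correctly identified.
evidence = {"custom_db_search": 0.7, "reference_db_search": 0.6}
revised = update_hypothesis(
    ("PRKCDBP", "expressedIn", "TCGA-AA-3561-01A-22"),
    evidence.values(),
)
print(revised)
```

The independence assumption is strong for workflows that share input data, so a model trained on benchmark results would normally replace this simple combination.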
28. DISK Walkthrough: Initial Hypothesis
๏ Initial hypothesis is provided by the user
• PRKCDBP protein is expressed in a patient sample
29. DISK Walkthrough: Lines of Inquiry
๏ The line of inquiry suggests finding data from different experiments done with the
patient’s sample, then running multi-omic workflows, and then combining the evidence into a
confidence score
General hypothesis pattern
Data query pattern: search for different experiments
that produced omics data (e.g. of type RNASeqData and
MassSpecData)
Data analysis workflows to run on genomics and
proteomics data (more omics in the future)
Meta-workflows to assess confidence in the
hypothesis based on workflow results
30. DISK Walkthrough: Data & Workflows
To test a hypothesis that a protein is present in a patient’s sample:
๏ Retrieve mass spec and RNASeq data
๏ Use workflows
• Wf1: Proteome only
• Wf2: ProteoGenomic
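The choice of workflows depends on which data types the query returns. The sketch below assumes that Wf1 needs only mass spec data and that Wf2 needs both mass spec and RNASeq data; these input requirements are an assumption based on the workflow names, not something the slides state.

```python
# Assumed input requirements per workflow (hypothetical).
WORKFLOWS = {
    "Wf1: Proteome only": {"MassSpecData"},
    "Wf2: ProteoGenomic": {"MassSpecData", "RNASeqData"},
}

def runnable(available_types):
    """Workflows whose required data types are all present in the retrieved data."""
    available = set(available_types)
    return sorted(name for name, needed in WORKFLOWS.items() if needed <= available)

print(runnable(["MassSpecData"]))
print(runnable(["MassSpecData", "RNASeqData"]))
```

Under these assumptions, retrieving RNASeq data in addition to mass spec data is what makes the second workflow applicable.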
32. DISK Walkthrough: Revised Hypothesis
๏ The hypothesis is revised and given a confidence value:
• A mutated form of the protein PRKCDBP is expressed in the patient’s sample
TCGA-AA-3561-01A-22, with confidence 0.9887
33. DISK Walkthrough: Provenance Details
๏ Hypothesis provenance stores information about workflows run and the data used
• Workflow execution provenance is published by WINGS using the W3C PROV standard.
35. DISK: Automated DIscovery of Scientific Knowledge
[Figure: DISK architecture. Components shown: Hypotheses; Hypothesis Evaluation; Formulate Lines of Inquiry; Lines of Inquiry; Data Retrieval; Data Repository; Workflow Binding; Analytic Workflows; Meta-Workflows (Confidence Estimation, Benchmarking); Meta-Analysis of Results; Interactive Discovery Agent; Revised hypotheses & interesting findings; WINGS Intelligent Workflow System (Workflow Constraints, Workflow Reasoning, Workflow Provenance, Open Publication of Results as Linked Data)]
37. Impact on Cancer Multi-Omics
๏ Replicated the proteogenomic analysis of colorectal cancer in [Zhang et al 2014]
๏ Successfully reproduced the paper's findings, comparing results at multiple levels (final figure,
supplementary tables, etc.)
๏ Took months and direct conversations with the authors to replicate paper figures and
supplemental figures
๏ Applying the analysis approach to a new cancer type now takes minutes
• Useful when TCGA is integrated
๏ Expanded analysis to
• compare how sensitive findings were to workflow details
[Figure: two density plots of Spearman correlation. Panel 1: correlation between mRNA-protein abundance (within samples). Panel 2: correlation between mRNA-protein variation (across samples)]
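The plots on this slide report Spearman correlations between mRNA and protein measurements. For reference, Spearman correlation is the Pearson correlation of the ranks; the sketch below computes it in plain Python on made-up values and assumes neither input is constant.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranked values."""
    return pearson(ranks(x), ranks(y))

# Hypothetical mRNA and protein abundances for four genes in one sample.
mrna = [1.0, 2.0, 3.0, 4.0]
protein = [10.0, 30.0, 20.0, 40.0]
rho = spearman(mrna, protein)
print(round(rho, 3))
```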
39. Related Work
1) Discovery Systems
๏ [Lenat 1976]
๏ [Lindsay et al 1980]
๏ [Langley 1981]
๏ [Falkenhainer 1985]
๏ [Kulkarni and Simon 1988]
๏ [Cheeseman et al 1989]
๏ [Zytkow et al 1990]
๏ [Simon 1996]
๏ [Valdes-Perez 1997]
๏ [Todorovski et al 2000]
๏ [Schmidt and Lipson 2009]
40. Related Work:
2) Hypothesis Representation as Graphs
๏ Existing vocabularies are related but need to be extended to represent hypotheses in
DISK
• SWAN [Gao et al 2006]
• EXPO [Soldatova and King 2006]
• Nanopublications [Groth et al 2010]
• Ovopublications [Callahan and Dumontier 2013]
• Micropublications [Clark et al 2014]
• LSC
• BEL
42. Contributions
๏ Represent scientist hypotheses
• Hypothesis ontology includes revisions & provenance
๏ Formulate lines of inquiry that express how a type of hypothesis can be
pursued with a data analysis workflow
• Lines of inquiry outline what type of data and workflows to use, and customize
them to the hypotheses at hand
๏ Design a meta-analysis to assess the results of lines of inquiry and revise the
original hypotheses
• Meta-analysis workflows assess diverse evidence
43. Ongoing & Future Work
๏ Ongoing work:
• Interactive Discovery Agent that explains interesting findings
• Continuous analysis of data (TCGA/CPTAC) as it grows
• Extending and generalizing meta-workflows
• Using DISK in geosciences: Subsurface water resource modeling
๏ Future challenges:
• More complex hypotheses about several entities
• Incorporate evidence over time
• Designing domain-independent meta-workflows
• Resource-bound hypothesis exploration