Web Science 2.0 - in silico science

Web Science 2.0

Conducting in silico research in the Web
from hypothesis to publication
Mark Wilkinson

Isaac Peral Senior Researcher in Biological Informatics
Centro de Biotecnología y Genómica de Plantas, UPM, Madrid, Spain
Adjunct Professor of Medical Genetics, University of British Columbia
Vancouver, BC, Canada.

Context

Multiple recent surveys of high-throughput biology

reveal that upwards of 50% of published studies

are not reproducible

- Baggerly, 2009
- Ioannidis, 2009

Context

“the most common errors are simple,

the most simple errors are common”

- Baggerly, 2009

Context

These errors pass peer review

The researcher is unaware of the error

The process that led to the error is not recorded

Therefore it cannot be detected during peer-review

Context

Discovery of such errors has resulted in retractions

and even shut-down clinical trials

Context

In March, 2012, the US Institute of Medicine said

“Enough is enough!”

Context
Institute of Medicine Recommendations
For Conduct of High-Throughput Research:

1. Rigorously-described, -annotated, and -followed data
management procedures

2. “Lock down” the computational analysis pipeline once it
has been selected

3. Publish the workflow in a formal manner, together with the
full starting and result datasets

Evolution of Translational Omics Lessons Learned and the Path Forward. The
Institute of Medicine of the National Academies, Report Brief, March 2012.

Achieving these recommendations

requires integration of existing technologies

and invention of new ones

Context
“While it took 2,300 years after the first report of angina for the condition to be commonly taught in medical
curricula, modern discoveries are being disseminated at an increasingly rapid pace.”

The Healthcare Singularity and the Age of Semantic Medicine, Michael Gillam, et al, The Fourth Paradigm: Data-Intensive Scientific Discovery, Tony Hey (Editor), 2009

Slide adapted with permission from Joanne Luciano, Presentation at Health Web Science Workshop 2012, Evanston IL, USA, June 22, 2012.

“The Singularity”

The X-intercept is where, the moment a discovery is made,
it is immediately put into practice

(not only medical practice, but any research endeavour...)

The technology required to achieve this
does not yet exist

You are here

Scientific research would have to be conducted
within a medium that
immediately interpreted and disseminated
the results...

You are here

...in a form that immediately (actively!) affected the
research of others...

You are here

...without requiring them to be aware
of these new discoveries.

I'd like to show you how close
we now are to this vision

and how we got there

We wanted to duplicate
a real, peer-reviewed, bioinformatics analysis

simply by building a model in the Web
describing what the answer
(if one existed)
would look like

...the machine had to make
every other decision
on its own

Brief Digression

“in” the Web??

By clicking here you cause this incredibly
powerful computational tool called The Web
to retrieve a chunk of text and images that
can only be understood by a human...

To achieve this vision

We must learn how to
do research IN the Web

Not OVER the Web

Gordon, P.M.K., Soliman, M.A., Bose, P., Trinh, Q., Sensen, C.W., Riabowol, K.: Interspecies
data mining to predict novel ING-protein interactions in human. BMC Genomics. 9, 426 (2008).

Original Study Simplified

Using what is known about interactions in fly & yeast

predict new interactions with your
human protein of interest

“Pseudo-code” Abstracted Workflow

Given a protein P in Species X

Find proteins similar to P in Species Y
Retrieve interactors in Species Y
Sequence-compare Y-interactors with Species X genome
(1) → Keep only those with homologue in X

Find proteins similar to P in Species Z
Retrieve interactors in Species Z
Sequence-compare Z-interactors with (1)

→ Putative interactors in Species X
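The abstracted workflow above can be sketched in a few lines of Python. This is only an illustration, not the code used in the original study: the three lookup functions (`similar`, `interactors`, `homologues`) are hypothetical stand-ins for sequence-similarity and interaction resources, the toy tables are made up, and the final step is simplified to a set intersection over Species X proteins.

```python
def putative_interactors(protein, species_x, species_y, species_z,
                         similar, interactors, homologues):
    """Return proteins in species_x supported by both model organisms."""

    def candidates_via(model_species):
        # Proteins in species_x that are homologous to an interactor of a
        # protein similar to `protein` in the given model organism.
        found = set()
        for match in similar(protein, model_species):
            for partner in interactors(match):
                found |= homologues(partner, species_x)
        return found

    step_1 = candidates_via(species_y)          # the set marked (1) above
    return step_1 & candidates_via(species_z)   # keep those also found via Z


# Toy lookup tables (entirely made up) standing in for real resources.
SIMILAR = {("P", "Y"): {"y1"}, ("P", "Z"): {"z1"}}
INTERACTS = {"y1": {"y2", "y3"}, "z1": {"z2"}}
HOMOLOGUES = {("y2", "X"): {"A"}, ("y3", "X"): {"B"}, ("z2", "X"): {"A"}}

result = putative_interactors(
    "P", "X", "Y", "Z",
    similar=lambda p, s: SIMILAR.get((p, s), set()),
    interactors=lambda p: INTERACTS.get(p, set()),
    homologues=lambda p, s: HOMOLOGUES.get((p, s), set()),
)
print(result)
```

In the toy data, protein "B" is supported only by Species Y and is therefore dropped; only "A" is supported by both model organisms.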

Modeling the answer...

OWL

Web Ontology Language (OWL) is the
language approved by the W3C
for representing knowledge in the Web


Note that every word in this
diagram is, in reality, a URL
(because it is OWL)

The model of the answer is
published in The Web
and borrows ideas from other
models published in The Web


ProbableInteractor
is homologous to (
(Potential Interactor from ModelOrganism1…)
and
(Potential Interactor from ModelOrganism2…)
)

Probable Interactor is defined in OWL as a subclass of Potential Interactor
that requires homologous pairs of interacting proteins to exist in both
comparator model organisms.

(Effectively, an intersection)

Publish our OWL model of a Probable Interactor

in the Web

Running the Web Science Experiment

In a local data-file

provide the protein we are interested in

and the two species we wish to use in our comparison

taxon:9606 a i:OrganismOfInterest . # human
uniprot:Q9UK53 a i:ProteinOfInterest . # ING1
taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly

The tricky bit is...

In the abstract, the
search for homology is
“generic” – ANY
Protein, ANY model
system

But when the machine
does the experiment, it
must use specific
resources because the
answer requires
information from two
declared species:

taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly

This is the question we ask:
(the query language here is SPARQL)

PREFIX i: <http://sadiframework.org/ontologies/InteractingProteins.owl#>

SELECT ?protein
FROM <file:/local/workflow.input.n3>
WHERE {

?protein a i:ProbableInteractor .

}

The reference (URL) to our OWL model of the answer

Our system then derives (and executes) the following workflow automatically

These are different
Web services!

...selected at run-time
based on the same model
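One way to picture run-time selection is a registry lookup keyed on what a service does and which species it covers. The sketch below is an assumption-laden illustration, not SADI's actual registry or API: the `Service` record, the service names, and the `"interactors"` capability label are all hypothetical; only the taxon identifiers come from the input file shown earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str        # hypothetical service name
    capability: str  # what the service can compute
    taxon: str       # which species it covers


# Hypothetical registry entries; a real registry would be discovered on the Web.
REGISTRY = [
    Service("yeast-interaction-lookup", "interactors", "taxon:4932"),
    Service("fly-interaction-lookup", "interactors", "taxon:7227"),
]


def select_service(capability, taxon, registry=REGISTRY):
    """Pick the first registered service matching the capability and species."""
    for service in registry:
        if service.capability == capability and service.taxon == taxon:
            return service
    raise LookupError(f"no service offers {capability!r} for {taxon}")


# The same abstract step ("find interactors") resolves to different services
# depending on which model organism was declared in the input data.
for taxon in ("taxon:4932", "taxon:7227"):
    print(taxon, "->", select_service("interactors", taxon).name)
```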

There are four very cool things about what you just saw...


The system was able to
create a workflow based on
an OWL model (ontology)


The system was able to create a
COMPUTATIONAL workflow
based on a BIOLOGICAL model


The workflow it created
(i.e. the services chosen)
differed depending on
context


Tool selection was guided
by the encoded knowledge of domain-experts
worldwide

We got the answer

“simply” by designing a model of the answer!

A “Smart” Biomedical Resource Representation System

A Web application that answers
SPARQL-DL queries

Query-answering
Enhanced by SADI

Imagine a “virtual database”

all of the data
from all databases
+
result of
every conceivable analysis

How can we query that database?

What is the phenotype of every allele of the
Antirrhinum majus DEFICIENS gene?

SELECT ?allele ?image ?desc

WHERE {
locus:DEF genetics:hasVariant ?allele .
?allele info:visualizedByImage ?image .
?image info:hasDescription ?desc
}

Note that there is no “FROM” clause!
We don't tell it where it should get the information.
The machine has to figure that out by itself...
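A rough way to see how a query can be answered without a FROM clause: treat each predicate as something a service has advertised that it can provide, and resolve the triple patterns one at a time. The sketch below is a toy under stated assumptions, not how SHARE is implemented: the lookup tables, the `make_service` helper, and all allele, image, and description values are made up; only the three predicate names come from the query above.

```python
def make_service(table):
    """Wrap a toy lookup table as a 'service': subject in, list of objects out."""
    return lambda subject: table.get(subject, [])


# Hypothetical providers, indexed by the predicate they can supply.
SERVICES = {
    "genetics:hasVariant": make_service({"locus:DEF": ["allele:A1", "allele:A2"]}),
    "info:visualizedByImage": make_service({"allele:A1": ["image:I1"]}),
    "info:hasDescription": make_service({"image:I1": ["toy description"]}),
}

# The WHERE clause as (subject, predicate, object-variable) patterns.
QUERY = [
    ("locus:DEF", "genetics:hasVariant", "?allele"),
    ("?allele", "info:visualizedByImage", "?image"),
    ("?image", "info:hasDescription", "?desc"),
]


def answer(patterns, services):
    rows = [{}]
    for subject, predicate, obj_var in patterns:
        service = services[predicate]          # discovery: predicate -> provider
        next_rows = []
        for row in rows:
            bound = row.get(subject, subject)  # bound variable, else a constant
            for value in service(bound):
                next_rows.append({**row, obj_var: value})
        rows = next_rows
    return rows


rows = answer(QUERY, SERVICES)
print(rows)
```

The sketch handles only patterns whose subject is a constant or an already-bound variable, which is enough for chained queries of this shape.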

Enter that query into
SHARE

...and in a few seconds you get your answer.

The query results are live hyperlinks
to the respective Database or images

Neither SADI nor SHARE

know anything about

plant biology or genetics

What pathways does UniProt protein P47989 belong to?

PREFIX pred: <http://sadiframework.org/ontologies/predicates.owl#>
PREFIX ont: <http://ontology.dumontierlab.com/>
PREFIX uniprot: <http://lsrn.org/UniProt:>
SELECT ?gene ?pathway
WHERE {
uniprot:P47989 pred:isEncodedBy ?gene .
?gene ont:isParticipantIn ?pathway .
}

Note again that there is no “FROM” clause…

I have not told SHARE where to look for the
answer, I am simply asking my question

Two different providers of gene information
and two different providers of pathway information
(among KEGG, NCBI, and GO)
were found & accessed

The results are all links to the original data


Neither SADI nor SHARE

know anything about

proteins or biochemical pathways

Recap
what we just saw

We posed, and answered,
fairly complex multi-database queries

WITHOUT A DATA WAREHOUSE

Demo #2
An example from the Clinical domain

Show me the latest Blood Urea Nitrogen (BUN) and Creatinine levels
of patients who appear to be rejecting their transplants

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX patient: <http://sadiframework.org/ontologies/patients.owl#>
PREFIX l: <http://sadiframework.org/ontologies/predicates.owl#>
SELECT ?patient ?bun ?creat
FROM <http://sadiframework.org/ontologies/patients.rdf>
WHERE {
?patient rdf:type patient:LikelyRejecter .
?patient l:latestBUN ?bun .
?patient l:latestCreatinine ?creat .
}

Likely Rejecter:

A patient who has creatinine levels
that are increasing over time

- Mark D Wilkinson's definition

Likely Rejecter:

…but there is no “likely rejecter”
column or table in our database…
only blood chemistry measurements
at various time-points

Likely Rejecter:

So the data required to answer this question
DOESN'T EXIST!

Now…

Two “magical” events occur…

The machine decides

by itself

that it needs to do a
Linear Regression analysis
on the blood creatinine measurements
in order to answer your question

The machine decides

by itself

how and where that analysis
can be done

and does it automatically!

The SHARE system utilizes SADI to discover
analytical services on the Web that do linear regression analysis
and sends the data to be analysed


Neither SADI nor SHARE

know anything about

blood chemistry, or mathematics

So how does the machine know
what to do??

Ontologies explicitly define the kinds of
things that (can) exist…

…and what those things are “like”

i.e. what properties they have
(color, weight, shape, texture, temperature, “state”)
and what relationships they have to one another
(inside-of, adjacent-to, part-of, binds-to, controls, inhibits,
degrades, etc.)

So we create
ontologies about biology
and health

We* publish them on the Web

* We… or anybody! Anybody can publish an ontology!

My definition of a Likely Rejecter is encoded in
a machine-readable document written in OWL, the Web Ontology Language

Basically:

“the regression line over creatinine measurements should have an increasing slope”
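As a worked illustration of that definition, the sketch below fits an ordinary least-squares line to (time, creatinine) pairs and checks the sign of the slope. It is a plain-Python stand-in, not the OWL encoding or the Web service SHARE actually calls; reading "increasing" as "slope greater than zero" is an assumption, and the measurement values are made up.

```python
def regression_slope(points):
    """Ordinary least-squares slope of y against x for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("need measurements at two or more distinct time-points")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def is_likely_rejecter(creatinine_by_time):
    # Assumption: "increasing slope" is read as a strictly positive slope.
    return regression_slope(creatinine_by_time) > 0


# Made-up (time-point, creatinine) measurements for two patients.
rising = [(0, 1.0), (1, 1.2), (2, 1.5)]
falling = [(0, 1.5), (1, 1.2), (2, 1.0)]

print(is_likely_rejecter(rising), is_likely_rejecter(falling))
```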

Our ontology refers to other ontologies (possibly published by other people)
to learn about what the properties of “regression models” are
e.g. that regression models have slopes and intercepts
and that slopes and intercepts have decimal values

SHARE examines the query

Looks on the Web for ontologies that describe the
problem it is trying to solve, and “reads” them

then uses that “knowledge” to figure out which
data-sources and analytical tools it needs
to answer the query

The way SHARE “interprets” data varies
depending on the context of the query
(i.e. which ontologies it reads – Mine? Yours?)

and on what part of the query
it is trying to answer at any given moment
(which ontological concept is relevant to that clause)

Data exhibits “late binding”

Late binding:

“purpose and meaning”
of the data is
not determined until
the moment it is required

a.k.a. the “semantics” of the data

Benefit
of late binding

Data is amenable to
constant re-interpretation

Example?

Blood Creatinine measurements

were not dictated to be (only)

Blood Creatinine measurements

Example?

The data had the 'qualities/properties' that

allowed one machine to interpret them as

Blood Creatinine measurements

(e.g. to determine which patients were rejecting)

Example?

But the data also had the 'qualities/properties' that

allowed another machine to interpret them as

Simple X/Y coordinate data

(e.g. the Linear Regression calculation tool)
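The idea can be sketched with plain records: nothing in the data says what it is "for", and each consumer decides what the records mean at the moment it reads them. This is a loose analogy in Python rather than the RDF/OWL mechanism itself; the field names and values are hypothetical.

```python
# Made-up measurement records; no field says what the data is "for".
RECORDS = [
    {"patient": "p1", "analyte": "creatinine", "time": 0, "value": 1.0},
    {"patient": "p1", "analyte": "creatinine", "time": 1, "value": 1.3},
    {"patient": "p1", "analyte": "urea", "time": 0, "value": 14.0},
]


def as_creatinine_measurements(records):
    """Clinical view: the records that qualify as creatinine measurements."""
    return [r for r in records if r.get("analyte") == "creatinine"]


def as_xy_points(records):
    """Regression-tool view: anything with a time and a value is an X/Y point."""
    return [(r["time"], r["value"]) for r in records
            if "time" in r and "value" in r]


# The same records are interpreted twice, each time only when needed.
points = as_xy_points(as_creatinine_measurements(RECORDS))
print(points)
```

The regression view never looks at the `analyte` field, so it would accept any time/value data, which is the point of deferring interpretation.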

We built a model of the proposed answer

Our system converted the model into the experiment

The analytical tools chosen for that
experiment changed depending on

context

even though the biological model driving
their selection was the same

i.e.

The published model is re-usable

In different contexts... By different researchers

and because the model IS the experiment

the published EXPERIMENT is re-usable!!

simply point the same query at your own dataset...

The publication is an
executable document!

Every component of the model

Every component of the input data

Every component of the output data

is a URL

Therefore the question, the experiment, and the
answer, are immediately published IN the Web

The answer, and the knowledge derived from it,
is immediately available to search engines
and moreover, can affect the outcome of other
Web Science experiments

An experiment... based on a hypothesis

now modeled in OWL

Does this OWL Class represent the Hypothesis?

I think it does!

We modeled the answer...
...but the answer was hypothetical

Change the way we think of “hypotheses”

In Web Science 2.0

Model what the world would “look like”
if your hypothesis were true

Then ask “is there any data that
fits that model?”

Like the blind men examining an elephant

Seemingly different aspects of research
when viewed from the perspective of Web Science
become the same “thing”

The Model

Our vision of Web Science 2.0

[Diagram labels: Hypothesis, Query, Workflow, Ontology, Result, Materials & Methods]

These can be automatically derived through
provenance information during workflow execution
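A minimal sketch of that idea, assuming a workflow is just an ordered list of named steps: record what each step received and produced while it runs, then render the log as text. The step names, the log format, and both helper functions are hypothetical; this is not the provenance model SADI or SHARE uses.

```python
def run_with_provenance(steps, data):
    """Run (name, function) steps in order, logging each input and output."""
    log = []
    for name, step in steps:
        result = step(data)
        log.append({"step": name, "input": data, "output": result})
        data = result
    return data, log


def methods_section(log):
    """Render the provenance log as a numbered, human-readable list."""
    return "\n".join(
        f"{i}. {entry['step']}: {entry['input']!r} -> {entry['output']!r}"
        for i, entry in enumerate(log, start=1)
    )


# Toy workflow: the arithmetic stands in for real analytical services.
STEPS = [("double", lambda x: x * 2), ("add one", lambda x: x + 1)]

final, provenance = run_with_provenance(STEPS, 3)
print(final)
print(methods_section(provenance))
```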

Please join us!

SADI and SHARE are Open-Source projects

http://sadiframework.org

University of British Columbia

Luke McCarthy – Lead Dev. Everything...
Edward Kawas – SADI Service auto-generator

Benjamin VanderValk Ian Wood
SHARE & SADI & Experimental modeling & Experimental modeling project
myHeath Button

Soroush Samadian
Cardiovascular data modeling and queries

C-BRASS Collaborators at other sites

U of New Brunswick Carleton University

Dr. Chris Baker Dr. Michel Dumontier
Alexandre Riazanov Marc-Alexandre Nolin
Leonid Chepelev
Steve Etlinger
Nichaella Kieth
Jose Cruz
