This is a moderately technical overview of SADI principles and capabilities and of IPSNP tools, including a survey of Life Science case studies. It is designed to be accessible to a general Computer Science and Software Engineering audience.
See also the sequel talk "A practical introduction to SADI semantic Web services and HYDRA query tool"
Comprehensive Self-Service Life Science Data Federation with SADI semantic Web services and HYDRA
1. COMPREHENSIVE SELF-SERVICE
LIFE SCIENCE DATA FEDERATION
WITH SADI SEMANTIC WEB SERVICES
AND HYDRA
Alexandre Riazanov, CTO
IPSNP Computing Inc
Oslo University, Sep 23, 2015
2. WHO WE ARE
• IPSNP Computing Inc -- a Canadian startup,
building on and commercializing prior academic
research on SADI.
• Founded to develop an industrial-strength query
tool for SADI, to supersede a research
proof-of-concept prototype.
• Looking for customers/partners and investors.
3. BIOMEDICAL RESEARCHERS AND CLINICIANS USE DATA
FROM MULTIPLE SOURCES
• Online and in-house databases, spreadsheets.
• Web services, e.g., literature search, etc.
• Nomenclatures, ontologies, controlled
vocabularies.
• Web sites, scientific publications, patents, etc.
• Algorithms, e.g., BLAST, molecular structure
prediction, various text mining programs, etc.
4. BIG VISION: FEDERATED QUERYING OF
HETEROGENEOUS AND DISTRIBUTED DATA SOURCES
• We want to query 1000s of data sources as a
single database.
• We want more agility than data warehousing
can provide: e.g., just-in-time algorithm
execution, plug-and-play data source addition,
live data querying.
• We want to use simple and declarative queries,
not to program workflow scripts.
6. WE CAN ACTUALLY DO THIS
WITH SEMANTIC WEB SERVICES
Here is how our data federation engine HYDRA works:
7. HOW IS THIS ALL POSSIBLE?
• Key ingredient: the SADI framework for
Semantic Web services (Semantic Automated
Discovery and Integration).
• SADI services are:
• RESTful services
• consuming and producing one format -- RDF,
• with semantic descriptions (in OWL) fully defining
their functionality.
8. PLAN OF THE TALK
• What are SADI services?
• Automatic service discovery and
invocation in query engines (HYDRA).
• Self-service querying vision.
• Query composition with HYDRA GUI.
• An overview of Bioinformatics and Clinical
Intelligence case studies.
Tons of screenshots!
9. SADI SERVICE I/O
• Input: RDF description of an input object.
• Output: another RDF graph providing more
(computed or retrieved) info about the input
object or linking it to other objects.
• Since all SADI services “talk the same
language” (RDF), they are 100% syntactically
interoperable:
– output of one SADI service can be directly
consumed by any other SADI services.
“Describe your input, and I will tell you something else about it.”
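
The input/output contract described on this slide can be illustrated with a small sketch. It models RDF triples as plain Python tuples rather than using an RDF library, and the service name, the `ex:` identifiers and the lookup table are all hypothetical; none of them come from SADI itself.

```python
# Minimal sketch: an RDF graph is modelled as a set of
# (subject, predicate, object) tuples.
# All identifiers (ex:...) and the lookup data are hypothetical.

# Stand-in for whatever database or algorithm a service wraps.
GENE_TO_PROTEIN = {"ex:gene1": "ex:proteinA"}


def encodes_service(input_graph):
    """Return new triples about every input node typed as ex:Gene."""
    output = set()
    for subject, predicate, obj in input_graph:
        if predicate == "rdf:type" and obj == "ex:Gene":
            protein = GENE_TO_PROTEIN.get(subject)
            if protein is not None:
                # The new statement is attached to the same node that came in,
                # so another service can consume it directly.
                output.add((subject, "ex:encodes", protein))
    return output


request = {("ex:gene1", "rdf:type", "ex:Gene")}
response = encodes_service(request)
print(sorted(response))
```

Because the response is again a set of triples about the input node, it can be merged with the request and handed to any other service of the same shape.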
10. COMPLETE SEMANTIC DESCRIPTIONS
OF SERVICE FUNCTIONALITY
• SADI services carry semantic descriptions of their
I/O that completely define what the service expects
and can accept as input, and what RDF assertions the
service can output.
• Unique and extremely powerful property: it facilitates
completely automatic discovery and orchestration of services.
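
A rough sketch of why machine-readable descriptions make discovery automatic is given here. It is a deliberate simplification: real SADI descriptions are OWL classes and matching them requires reasoning, whereas this sketch reduces a description to two strings compared for equality. The registry contents and all `ex:` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescription:
    """Hypothetical, simplified stand-in for a semantic service description."""
    name: str
    input_class: str       # class an input node must belong to
    output_predicate: str  # property the service attaches to that node


# Hypothetical registry contents.
REGISTRY = [
    ServiceDescription("gene_to_protein", "ex:Gene", "ex:encodes"),
    ServiceDescription("protein_to_function", "ex:Protein", "ex:hasFunction"),
    ServiceDescription("drug_to_ingredient", "ex:DrugProduct", "ex:hasIngredient"),
]


def discover(registry, input_class, wanted_predicate):
    """Names of services that accept input_class and produce wanted_predicate."""
    return [
        service.name
        for service in registry
        if service.input_class == input_class
        and service.output_predicate == wanted_predicate
    ]


print(discover(REGISTRY, "ex:Protein", "ex:hasFunction"))
```

A client that needs the `ex:hasFunction` property of a protein can find a suitable service without anyone having wired the two together in advance.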
11. HYDRA QUERY ENGINE
● Given a SPARQL query, HYDRA analyses it
by using an intelligent logic-based algorithm
(proprietary, unlike SADI itself).
● HYDRA requests descriptions of potentially
useful services from available SADI service
registries.
● HYDRA processes the descriptions and
figures out which services have to be
invoked, on what data and in what order.
SPARQL is a W3C
standard semantic
query language --
much more intuitive
than SQL.
12. QUERY EXAMPLE
• Find documents mentioning "haloalkane dehalogenase
activity", extract information about mutations and visualise the
mutations on 3D protein structure images.
• HYDRA automatically finds and orchestrates 5 services from
our registry:
– PubMed search: keyword query ⟶ document PubMed IDs
– PDF retrieval: PubMed ID ⟶ PDF file URL
– ASCII extraction: PDF file ⟶ ASCII text
– Text mining: ASCII text ⟶ mutation info
– Visualisation: mutation & protein ⟶ 3D image (Jmol)
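
The ordering of the five services can be illustrated with a naive forward-chaining sketch. This is not HYDRA's algorithm, which is proprietary and logic-based; it only shows how an invocation order can fall out of input/output descriptions. The data-kind labels are hypothetical, and the sketch assumes that the text-mining step yields both the mutation and the protein it belongs to.

```python
# Service name -> (kinds of data it needs, kinds of data it produces).
# Labels are hypothetical simplifications of the five services on this slide.
# Assumption: text mining yields both the mutation and its protein.
SERVICES = {
    "visualisation": ({"mutation", "protein"}, {"image_3d"}),
    "text_mining": ({"ascii_text"}, {"mutation", "protein"}),
    "ascii_extraction": ({"pdf_file"}, {"ascii_text"}),
    "pdf_retrieval": ({"pubmed_id"}, {"pdf_file"}),
    "pubmed_search": ({"keyword_query"}, {"pubmed_id"}),
}


def plan(available, goal, services):
    """Order services so that each one's inputs exist before it runs.

    Naive forward chaining: it schedules every service that becomes
    applicable, not only those strictly needed for the goal.
    Returns None if the goal cannot be reached.
    """
    known = set(available)
    order = []
    progress = True
    while goal not in known and progress:
        progress = False
        for name, (needs, produces) in services.items():
            if name not in order and needs <= known:
                known |= produces
                order.append(name)
                progress = True
    return order if goal in known else None


print(plan({"keyword_query"}, "image_3d", SERVICES))
```

Starting from a keyword query alone, the sketch schedules PubMed search, PDF retrieval, ASCII extraction, text mining and visualisation in that order, even though the services are listed in reverse.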
13. RESULTS
Deploying mutation impact text-mining software with the SADI Semantic Web Services framework
http://www.biomedcentral.com/qc/1471-2105/12/S4/S6
14. WHAT IS SO COOL ABOUT IT?
• Data federation at its best:
– independent, heterogeneous data sources (PubMed
doc search, PubMed Central for PDFs);
– not only data is integrated: ASCII extraction, text
mining and 3D visualisation are algorithms!
• Execution is completely automatic: HYDRA finds and
invokes the services without any help from the user.
15. MORE QUERY EXAMPLES
• Find drug products that contain active ingredient X.
• Find drugs that have been studied in clinical trials targeting
infections caused by bacteria X.
• Annotate a DNA sequence X with molecular functions of
proteins produced by the corresponding gene.
• Find patients with precondition X diagnosed with infections Y
resulting from procedure Z.
• Many, many other questions that Life Scientists and
Clinicians ask on a daily basis.
18. HERE IS AN EVEN BIGGER VISION:
Self-service ad hoc querying of federated data.
19. HYDRA IMPLEMENTS SEMANTIC QUERYING
• Users need not know how the source data
is organised or accessed.
• They just need to know the terminology of
their subject domain.
• Queries are completely declarative:
specify what you want to find, not how.
20. HYDRA ALSO SUPPORTS
CONCEPT HIERARCHIES AND RULES
● Some queries would be too complex if we could not
exploit generality:
o a query concerning all antibiotics requires
generalisation, otherwise all types of antibiotics would
have to be enumerated in the query.
● A much better way to do this is to import a classification of
drugs and use it in query execution.
● HYDRA facilitates such reasoning and even more
complex reasoning with rules.
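
The benefit of importing a classification can be sketched as follows. The hierarchy fragment, the prescription records and the function names are hypothetical and only illustrate the idea; HYDRA's actual reasoning machinery is not shown here.

```python
# Hypothetical fragment of a drug classification: child -> parent.
SUBCLASS_OF = {
    "penicillin": "beta_lactam",
    "cephalosporin": "beta_lactam",
    "beta_lactam": "antibiotic",
    "macrolide": "antibiotic",
    "antibiotic": "drug",
    "statin": "drug",
}

# Hypothetical source data: (patient, prescribed drug class).
PRESCRIPTIONS = [
    ("patient1", "penicillin"),
    ("patient2", "statin"),
    ("patient3", "macrolide"),
]


def is_a(concept, ancestor, hierarchy):
    """True if concept equals ancestor or lies below it in the hierarchy."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = hierarchy.get(concept)  # None once the root is passed
    return False


def patients_on(drug_class):
    """Patients prescribed anything classified under drug_class."""
    return [
        patient
        for patient, drug in PRESCRIPTIONS
        if is_a(drug, drug_class, SUBCLASS_OF)
    ]


print(patients_on("antibiotic"))  # ['patient1', 'patient3']
```

The query names only the general concept; the subtypes are supplied by the classification rather than enumerated by the user.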
21. THERE ARE NO FUNDAMENTAL OBSTACLES
TO SELF-SERVICE QUERYING
We just need an adequate user interface
for building queries.
35. BIOINFORMATICS AND CHEMINFORMATICS CASE
STUDIES AND PILOTS WITH SADI AND HYDRA
• Integrating genomics text mining results with online
biomedical data and visualisation algorithms.
• Integrating programs for lipid molecule structural
analysis and classification.
• Interpreting toxicity experiment data by discovering
relevant info in online databases.
• Large-scale retrieval of toxicity information from
publications.
36. INTERPRETING TOXICITY EXPERIMENT DATA
• Partner: university lab studying effects of
environmental pollutants.
• Querying needs: finding relevant prior experiments,
gene annotation, protein domain annotation, etc.
• Data sources: ArrayExpress, BLAST, HMMER3,
RefSeq, Pfam, ORFPredictor, GO, UniProt, NCBI
Taxonomy -- all queried as a single DB!
37. SUBTASK: DNA MICROARRAY ANNOTATION
• Toxicity experiments with microarrays: which DNA sequences
are under/overexpressed after an organism’s exposure to toxin X?
• Interpretation requires knowing affected protein functions and
domains.
• HYDRA virtually implements this workflow:
38. RETRIEVAL OF TOXICITY DATA FROM
PUBLICATIONS
• Customer: government agency (Canada).
• Querying needs: online publication search by
organism and chemical types, text-mining for
toxicity data.
• Data sources: NCBI Taxonomy and ChEBI with
free-text search, PubMed search, electronic
libraries, journal Web sites, Google Scholar,
specialised text-mining algorithm, text utilities.
Apparent value: some queries save a postdoc many man-weeks of work.
39. CLASSIFYING NEW LIPID MOLECULES
• One of the early experiments with SADI.
• A group at Carleton U. had a program for
identifying functional groups in a molecule
structure.
• A group at U. of New Brunswick had a classifier
estimating lipid classes based on
presence/absence of functional groups.
• Publishing the prototypes as SADI services
allowed us to integrate them with each other and
relevant external resources.
40. CLINICAL IT CASE STUDIES AND PILOTS
WITH SADI AND HYDRA
• Ad hoc querying of clinical data for Hospital-Acquired
Infections surveillance and research
(with UNB, McGill SoM and Ottawa H.)
• Ongoing pilot with a US hospital.
• Looking for pilot opportunities for Clinical Trial
Cohort selection:
• trial eligibility criteria can be implemented as queries
over heterogeneous and distributed clinical data;
• benefits: cost reduction and timely alerts.
41. THANK YOU!
Further materials/services are available on request:
• Live and recorded demos.
• Publications on previous (academic) case studies.
• Training/consulting.
• http://ipsnp.com/ (Canada) and http://ipsnp.co/ (UK)