The document discusses an experiment by InfoCodex, Merck, and Thomson Reuters to use machine intelligence to discover novel biomarkers for diabetes and obesity by analyzing over 120,000 biomedical publications without human feedback. The technology used linguistic, information theory and self-organization approaches to analyze text and uncover unnoticed correlations. The output was many potential biomarker candidates, with some having high potential. While the approach showed promise, improvements were still needed to reduce noise and a hybrid human-machine approach may be best.
Carlo Trugenberger: Scientific Discovery by Machine Intelligence: A New Avenue for Drug Research
1. InfoCodex Semantic Technologies
Turning Information into Knowledge
Scientific Discovery by Machine Intelligence:
A New Avenue for Drug Research?
Dr. Carlo A. Trugenberger
Co-Founder and Chief Scientific Officer
InfoCodex Semantic Technologies AG, CH-9470 Buchs
September 2, 2015
www.InfoCodex.com
Semantics 2015
2. InfoCodex Semantic Technologies
Big changes in pharmaceutical research
The end of the blockbuster era?
Challenges:
➢ Costs are exploding
➢ Regulatory pressure
➢ Personalized medicine
➢ Outsourcing of critical processes
Critical for survival:
➢ Shorten time-to-market
➢ Early recognition of dead ends
Opportunities:
➢ Genomics / Proteomics
➢ Big data / data mining
➪ structure-based design
➢ Drugs are “computed” rather than discovered
Critical to beat competition:
➢ Data + data analysis power
➢ Machine intelligence
3. InfoCodex Semantic Technologies
The data deluge as an opportunity for eDiscovery
Traditional bioinformatics: structured data
sequence alignment, gene finding, genome assembly, protein structure prediction, gene expression…
New idea: exploit unstructured data
PubMed: 22 million citations, growing at a rate of 1.7 papers per minute
Experiment: Merck + Thomson Reuters + InfoCodex
Is it possible to drive drug research by text mining large pools of biomedical documents?
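To put the PubMed growth rate quoted above into perspective, the following minimal sketch converts it into daily and yearly volumes. It assumes the rate is constant and uses a 365-day year; both are simplifications introduced here, not statements from the slides.

```python
# Back-of-the-envelope conversion of the quoted PubMed growth rate.
PAPERS_PER_MINUTE = 1.7  # figure quoted on the slide

MINUTES_PER_DAY = 60 * 24
DAYS_PER_YEAR = 365  # assumption: constant rate over a non-leap year


def papers_per_day(rate_per_minute: float) -> float:
    """Scale a per-minute publication rate to a per-day volume."""
    return rate_per_minute * MINUTES_PER_DAY


def papers_per_year(rate_per_minute: float) -> float:
    """Scale a per-minute publication rate to a per-year volume."""
    return papers_per_day(rate_per_minute) * DAYS_PER_YEAR


if __name__ == "__main__":
    print(f"{papers_per_day(PAPERS_PER_MINUTE):,.0f} papers per day")
    print(f"{papers_per_year(PAPERS_PER_MINUTE):,.0f} papers per year")
```

Under these assumptions the rate corresponds to roughly 2,400 new papers a day, or close to 900,000 a year, which is far more than any team can read manually.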
4. InfoCodex Semantic Technologies
The Experiment of Merck & Co with InfoCodex
The tasks:
➢ Discover novel biomarkers for diabetes and obesity (D&O) by analyzing 120,000 medical publications (PubMed + ClinicalTrials.org + internal)
➢ Blind experiment, no human feedback
The aim:
➢ Test pure machine intelligence for “semantic drug research”
Biomarkers: a $13.6 billion market in 2011, growing to $25 billion by 2016.
5. InfoCodex Semantic Technologies
Semantic technologies in the pharma industry
Most existing projects use NLP to extract triples “entity 1 - relation - entity 2” sentence by sentence ➪ this helps to curate ontologies / libraries
However: this is not a discovery approach. Relations found this way have been explicitly written by human authors and are thus already known in one way or another.
Going beyond triples: analyze text collections globally to identify small, seemingly unrelated and unnoticed facts dispersed over isolated texts, assembling the scattered pieces of a puzzle
Critical: machine intelligence
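The difference between sentence-level triples and a global analysis can be illustrated with a toy sketch. It is not the InfoCodex algorithm: the corpus, the term names (`protein_x`, `protein_y`) and the functions `cooccurrence_graph` and `indirect_links` are hypothetical. It only shows how two facts stated in separate documents can be joined into a link that no single author wrote down.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy corpus: each document is reduced to the set of terms it mentions.
DOCS = {
    "doc1": {"protein_x", "insulin_resistance"},
    "doc2": {"insulin_resistance", "diabetes"},
    "doc3": {"protein_y", "hypertension"},
}


def cooccurrence_graph(docs):
    """Map each term to the set of terms it shares at least one document with."""
    graph = defaultdict(set)
    for terms in docs.values():
        for a, b in combinations(sorted(terms), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def indirect_links(docs, target):
    """Return {candidate: bridging terms} for terms that are never mentioned
    together with `target` but are connected to it through a shared term."""
    graph = cooccurrence_graph(docs)
    direct = graph[target]
    found = defaultdict(set)
    for bridge in direct:
        for candidate in graph[bridge]:
            if candidate != target and candidate not in direct:
                found[candidate].add(bridge)
    return dict(found)


if __name__ == "__main__":
    print(indirect_links(DOCS, "diabetes"))
```

In this toy corpus no document mentions `protein_x` and `diabetes` together, so triple extraction would never link them. The cross-document view surfaces the pair through the shared term `insulin_resistance`.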
6. InfoCodex Semantic Technologies
The Technology: eDiscovery by InfoCodex
Linguistics + Information Theory + Self-Organization
➢ Completely automatic semantic analysis of content
➢ Designed to uncover unnoticed correlations in information distributed across document groups and collections (unlike NLP)
➢ “Assemble the pieces of a puzzle”
➢ Knowledge discovery as opposed to information extraction
8. InfoCodex Semantic Technologies
Step 1: establish reference models for biomarkers / phenotypes
➢ Cluster documents describing known biomarkers (224 references found)
➢ Reference model for each cluster → meanings for “biomarkers diabetes” …
Step 2: determine the meaning of unknown words by machine inference
Step 3: analyze documents and generate a list of potential D&O biomarkers / phenotypes by comparison with the reference models
Step 4: establish confidence levels
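The steps above can be sketched in miniature. This is a simplified stand-in under stated assumptions, not the InfoCodex implementation: the reference model is a plain bag-of-words centroid, the comparison is cosine similarity, and that similarity doubles as a crude confidence level. Step 2 is left out. All function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (a stand-in for a real semantic encoding)."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Step 1: average term-count vectors into a single reference model."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(reference_docs, candidate_docs):
    """Steps 3 and 4: compare each candidate with the reference model and
    use the similarity as a rough confidence score, highest first."""
    model = centroid([vectorize(doc) for doc in reference_docs])
    scored = [
        (name, cosine(vectorize(text), model))
        for name, text in candidate_docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    reference = [
        "elevated serum marker predicts diabetes risk",
        "marker level correlates with obesity and diabetes",
    ]
    candidates = {
        "c1": "novel serum protein level correlates with diabetes risk",
        "c2": "bridge construction uses steel cables",
    }
    for name, score in rank_candidates(reference, candidates):
        print(name, round(score, 3))
```

A production system would replace each piece with something far richer, but the shape stays the same: a model built from known examples, a comparison step, and a score that can be thresholded.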
[Diagram label: encoded meanings]
9. InfoCodex Semantic Technologies
Determination of the meaning of unknown words: machine inference
Example: “Hctz” is a “diuretic drug” and is a synonym of “hydrochlorothiazide”
Such relations are established solely by machine intelligence combined with an internal knowledge base
Co-occurrences with words in the internal knowledge base → most probable hypernym → “is a”, “has to do”
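A minimal sketch of hypernym inference by co-occurrence follows. It assumes a tiny hand-made knowledge base and simple sentence-level voting; the sentences, the `KNOWLEDGE_BASE` entries and the function `infer_hypernym` are hypothetical illustrations, not the actual InfoCodex knowledge base or inference procedure.

```python
from collections import Counter

# Hypothetical miniature knowledge base: known word -> hypernym ("is a").
KNOWLEDGE_BASE = {
    "furosemide": "diuretic drug",
    "hydrochlorothiazide": "diuretic drug",
    "metformin": "antidiabetic drug",
}

# Hypothetical sentences standing in for the analyzed literature.
SENTENCES = [
    "patients received hctz or furosemide",
    "hctz also known as hydrochlorothiazide lowered blood pressure",
    "hctz was combined with metformin",
    "metformin improved glucose control",
]


def infer_hypernym(unknown, sentences, knowledge_base):
    """Vote for the hypernyms of known words that co-occur with `unknown`.

    Returns the winning hypernym and its share of all votes, or (None, 0.0)
    if the unknown word never co-occurs with a known word.
    """
    votes = Counter()
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        if unknown not in tokens:
            continue
        for token in tokens:
            if token in knowledge_base:
                votes[knowledge_base[token]] += 1
    if not votes:
        return None, 0.0
    hypernym, count = votes.most_common(1)[0]
    return hypernym, count / sum(votes.values())


if __name__ == "__main__":
    print(infer_hypernym("hctz", SENTENCES, KNOWLEDGE_BASE))
```

Here “hctz” co-occurs twice with known diuretics and once with an antidiabetic drug, so “diuretic drug” wins with two thirds of the votes. The vote share can serve as a rough indication of how much to trust the inferred relation.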
11. InfoCodex Semantic Technologies
The Results
Lots of “needles in the haystack”: tens of extremely interesting and valuable candidates with very high potential
Many uninteresting candidates: too much noise (the problem has been identified and corrected)
12. InfoCodex Semantic Technologies
Conclusion
✓ Approach has high potential for discovery
✓ Approach has potential to impact pharma research
  ❑ Speed up time-to-market
  ❑ Early recognition of dead ends
✗ Improvements in the process are needed: problems have been identified and corrected
➢ Most promising is a hybrid approach
  ❑ Human expertise in the formulation of reference models
  ❑ Human curation of candidates prior to passing them to the laboratory
✓ Possibly inevitable development