Text mining is increasingly being used not only to find patents, but also to provide systematic and automated analysis of the patents and associated literature. Automated workflows are used to alert subscribers to relevant patents, and provide tailored summaries of the patents for fast review. This talk will present new use cases for text mining patents, and demonstrate recent developments in the I2E platform, including better multilingual support, improved visualization, and easier extraction of information from tables. Finally, it will demonstrate how the use of federated text mining facilitates opposition searching across multiple data sources such as literature and grants as well as patents themselves.
2. Agenda
Natural Language Processing (NLP) and Text Mining
Customer examples
− Systematic automated analysis
− Patents and associated literature
− Tailored summaries for fast review.
New developments
− Improved visualization
− Better multilingual support
− Easier extraction of information from tables
Federated text mining
− Facilitates opposition searching
− Across multiple data sources
− E.g. Literature, grants as well as patents.
Nice ICIC 2015 Andrew Hinton
3. Search vs. Text Mining
Search Engine
− Filter to find most relevant documents, then read
Text Mining
− Manipulate the text to discover what is there
Sources: News Feeds, Scientific Literature, Patents, Internal Reports, Social Media
Example of extracted results:
company    activity    company
Sanofi     bid         Aventis
Roche      partner     Antisoma
Natural Language Processing (NLP) to understand meaning
Statistics to provide trends
4. Challenges in Unstructured Data
Different word, same meaning
− cyclosporine
− ciclosporin
− Neoral
− Sandimmune
Different expression, same meaning
− Non-smoker
− Does not smoke
− Does not drink or smoke
− Denies tobacco use
Different grammar, same meaning
− 5mg/kg of cyclosporine per day
− 5mg/kg per day of cyclosporine
− cyclosporine 5mg/kg per day
Same word, different context
− Diagnosed with diabetes
− Family history of diabetes
− No family history of diabetes
NLP
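Two of the challenges above, synonym variation and context, can be illustrated with a minimal Python sketch. The synonym table, the cue phrases, and the function names are assumptions made for this illustration; they are not part of I2E, and a real system would rely on curated ontologies and full linguistic analysis rather than string checks.

```python
from typing import Optional

# Hypothetical synonym table: every surface form maps to one preferred term.
SYNONYMS = {
    "cyclosporine": "cyclosporine",
    "ciclosporin": "cyclosporine",
    "neoral": "cyclosporine",
    "sandimmune": "cyclosporine",
}

# Hypothetical cue phrases that change the context of a finding.
NEGATION_CUES = ("no ", "denies ", "does not ")
FAMILY_CUE = "family history"


def normalize_drug(mention: str) -> Optional[str]:
    """Return the preferred term for a drug mention, or None if unknown."""
    return SYNONYMS.get(mention.strip().lower())


def classify_context(sentence: str, concept: str) -> str:
    """Label a concept mention as 'negated', 'family', 'patient' or 'absent'."""
    text = sentence.lower()
    if concept.lower() not in text:
        return "absent"
    # A negation cue at the start of the sentence or after a space wins first.
    if any(text.startswith(cue) or f" {cue}" in text for cue in NEGATION_CUES):
        return "negated"
    if FAMILY_CUE in text:
        return "family"
    return "patient"


for s in ("Diagnosed with diabetes",
          "Family history of diabetes",
          "No family history of diabetes"):
    print(s, "->", classify_context(s, "diabetes"))
```

The three diabetes sentences from the slide receive three different labels even though they share the same keyword, which is the distinction a plain keyword search cannot make.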
5. Natural Language Processing (NLP)
Groups words into meaningful units
Morphology allows search for different forms of words
We find that p42mapk phosphorylates c-Myb on serine and threonine .
Purified recombinant p42 MAPK was found to phosphorylate Wee1 .
− sentences
− morphology: different forms
− noun groups: match entities
− verb groups: match actions
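As a rough illustration of morphology-aware matching, the sketch below builds one regular expression that finds several inflected forms of a verb in the two example sentences. The helper names and the simple suffix rules are assumptions for this example only; they handle regular English verbs and are far cruder than a real morphological analyzer.

```python
import re


def verb_forms_pattern(lemma: str):
    """Build a regex matching common inflections of a regular English verb."""
    if lemma.endswith("e"):
        # phosphorylate -> phosphorylat + e / es / ed / ing
        stem, endings = lemma[:-1], "e|es|ed|ing"
    else:
        # inhibit -> inhibit + (nothing) / s / es / ed / ing
        stem, endings = lemma, "|s|es|ed|ing"
    return re.compile(rf"\b{re.escape(stem)}(?:{endings})\b", re.IGNORECASE)


def normalize_entity(name: str) -> str:
    """Collapse spacing, hyphen and case variants of an entity name."""
    return re.sub(r"[\s\-]+", "", name).lower()


SENTENCES = [
    "We find that p42mapk phosphorylates c-Myb on serine and threonine .",
    "Purified recombinant p42 MAPK was found to phosphorylate Wee1 .",
]

pattern = verb_forms_pattern("phosphorylate")
for sentence in SENTENCES:
    match = pattern.search(sentence)
    print(match.group(0) if match else None)
```

With the lemma "phosphorylate", the pattern picks up "phosphorylates" in the first sentence and "phosphorylate" in the second, while `normalize_entity` maps both "p42mapk" and "p42 MAPK" to the same key.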
6. From Words to Meaning
“Among them, nimesulide, a selective COX2 inhibitor, …”
nimesulide inhibits COX2 (Entrez Gene ID: 5743)
Identifying entities and relations
Linguistics to establish relationships
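The step from words to meaning can be sketched in Python as follows. The example turns the appositive phrase "X, a selective Y inhibitor" into a structured relation and maps the target to an identifier. The regular expression, the tiny lookup table and the `Relation` type are assumptions for illustration; only the gene ID 5743 for COX2 comes from the slide.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical mini-ontology: surface form -> Entrez Gene ID (5743 is from the slide).
GENE_IDS = {"cox2": "5743", "cox-2": "5743"}


class Relation(NamedTuple):
    subject: str
    relation: str
    target_gene_id: str


# Appositive pattern: "<drug>, a [modifier] <target> inhibitor"
INHIBITOR_PATTERN = re.compile(
    r"(?P<drug>\w+), an? (?:\w+ )?(?P<target>[\w-]+) inhibitor",
    re.IGNORECASE,
)


def extract_inhibition(sentence: str) -> Optional[Relation]:
    """Return (drug, 'inhibits', gene ID) if the sentence matches the pattern."""
    match = INHIBITOR_PATTERN.search(sentence)
    if match is None:
        return None
    gene_id = GENE_IDS.get(match.group("target").lower())
    if gene_id is None:
        return None  # target not in the ontology, so no normalized relation
    return Relation(match.group("drug"), "inhibits", gene_id)


print(extract_inhibition("Among them, nimesulide, a selective COX2 inhibitor, ..."))
```

The noun "inhibitor" is converted into the relation "inhibits", so the same structured result could be produced from differently phrased sentences once further patterns are added.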
8. Challenges: A Big Data Future
High indexing performance
• Millions of documents, TBs of storage
• Ontologies with 100,000s of terms
• Handles large documents with ease
• Open, configurable pipeline
• Advanced table processing
VOLUME VARIETY VELOCITY VISUALIZATION
Connected data technology
• Unified heterogeneous document types across federated servers
• Connect – Normalize – Use
• Structured, semi-structured and unstructured
Distributed indexing and querying
• Multi-processor
• Multi-machine
Integrates in enterprise applications, portals, pipelines and workflows
• Open web services API
• Public query language
Strong integrated visualization
13. Patent Analytics with I2E
Comprehensive Effective Search For Patent Landscaping
CHALLENGE
Patents are a valuable source of novel data. Identifying drug targets for specific indications is often slow and manual, as patents are long and the language is obscure. Finding targets for 3 therapeutic areas took 50 FTE days.
SOLUTION
A pipeline was built that used queries to extract target, indication, invention type and organisations and feed them into a database. Recall was 10x that of the manual process, with good precision, and the pipeline also provided target relevance scores.
BENEFIT
The integrated process drastically reduces the FTEs required to keep the organization up-to-date on recent findings published in the patent literature.
18. Processing Chinese Patents
OCR (Optical Character Recognition) quality can be worse than for English, especially for chemical names
Useful capabilities, without full NLP, include
− terminology matching, N-words apart, co-occurrence in sentences or regions
− integration with ChemAxon’s Chinese Name-to-Structure
− deals with appropriate word order for Chinese names
− provides chemically aware OCR correction
OCR output: 我们日前巳经开友了中文化字名称的OCR白动纠错工力能
Corrected: 我们目前已经开发了中文化学名称的OCR自动纠错功能
(“We have now developed an automatic OCR error-correction function for Chinese chemical names.”)
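The effect of OCR correction on the sentence above can be imitated with a simple substitution table, as in this Python sketch. This is not the chemically aware correction described on the slide: the confusion table and function name are assumptions, and the pairs were read off by comparing the two versions of this one sentence. Some "errors" such as 日前 are valid words in other contexts, so a real system would need a lexicon and context to decide when to correct.

```python
# Hypothetical confusion table: misrecognized sequence -> intended sequence.
OCR_CONFUSIONS = {
    "日前": "目前",
    "巳经": "已经",
    "开友": "开发",
    "化字": "化学",
    "白动": "自动",
    "工力能": "功能",  # one character split into two by the OCR engine
}


def correct_ocr(text: str, confusions=OCR_CONFUSIONS) -> str:
    """Replace known OCR confusions, longest sequences first."""
    for wrong in sorted(confusions, key=len, reverse=True):
        text = text.replace(wrong, confusions[wrong])
    return text


raw = "我们日前巳经开友了中文化字名称的OCR白动纠错工力能"
print(correct_ocr(raw))
```

Applied to the OCR output line, the table reproduces the corrected line shown on the slide.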
19. Testing N2S
Gold standards are expensive to build and typically small-scale
We want to provide large sets of systematic names based on real patent examples
To achieve this we used linguistic patterns based on the form of the name and its context
Not trying to pull out all names, just those where there is high confidence that it is a systematic name
20. Creating a Test Set from Chinese Patents
Extracted ~1200 chemical names from a sample of 40 Chinese patent documents (>94% precision).
22. Text Analytics for Data Tables
Valuable information is often reported in a combination of text and semi-structured data tables
− E.g. pharmaceutical safety & toxicity tests
I2E provides a unique capability to find, highlight, and extract the relevant data from tables
23. Linking information within documents
Connecting information found in different parts of the document, for example finding a compound as “Example 12” in a patent and linking to a table where numerical data is reported
Patent document
…
Combined into a row of data in the structured results table
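The linking step can be pictured as a join on the example number, as in the Python sketch below. The text lines, table rows, field names and values are invented placeholders for illustration and do not come from any patent; the function is an assumption about how such a join could look, not a description of how I2E does it.

```python
import re
from typing import Dict, List

# Matches lines such as "Example 12: <compound name>".
EXAMPLE_RE = re.compile(r"Example (\d+)[:.]\s*(.+)")


def link_examples(text_lines: List[str], table_rows: List[Dict]) -> List[Dict]:
    """Join compounds named in the text with table rows sharing the example number."""
    names = {}
    for line in text_lines:
        match = EXAMPLE_RE.match(line.strip())
        if match:
            names[int(match.group(1))] = match.group(2).strip()

    linked = []
    for row in table_rows:
        number = row["example"]
        if number in names:
            # One combined row: example number, compound name, then the table data.
            combined = {"example": number, "compound": names[number]}
            combined.update({k: v for k, v in row.items() if k != "example"})
            linked.append(combined)
    return linked


# Placeholder inputs, invented for this sketch.
text = ["Example 11: compound A", "Example 12: compound B"]
table = [{"example": 12, "ic50_nM": 40.0}, {"example": 13, "ic50_nM": 7.5}]
print(link_examples(text, table))
```

Only example 12 appears in both the text and the table, so it is the only one that yields a combined row.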
24. Recent Developments in Table Processing
Correction of the table structures:
26. Visualization of Chemical & Inhibition metrics
Dynamically explore SAR data directly extracted from Patents
27. Linking Out Using Chemical Structures
Text mining facilitates easier exploration of chemical information “trapped” in patents
29. Connected Data Technology
Normalize data: unify same concepts or relationships
Unify querying: query over heterogeneous data sources
Link content servers: query across enterprise or hosted content
Merge results: combine information found from all sources
31. Opposition Searching
Single search over multiple data sources on different servers, providing a single set of results
Information from differently structured data is brought together and ordered by year
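A federated search of this kind can be sketched in Python with three in-memory "servers" whose records use different field names. All records, field names and the `federated_search` function are invented for illustration; a real deployment would query remote indexes through their own APIs rather than local lists.

```python
from typing import Dict, List

# Placeholder data standing in for three differently structured sources.
PATENTS = [{"title": "Kinase inhibitor compounds", "pub_year": 2012}]
LITERATURE = [{"article_title": "A kinase inhibitor study", "year": 2009}]
GRANTS = [
    {"project": "Kinase inhibitor screening", "fiscal_year": 2010},
    {"project": "Unrelated topic", "fiscal_year": 2011},
]

# Per-source mapping onto a common schema: (records, title field, year field).
SOURCES = {
    "patents": (PATENTS, "title", "pub_year"),
    "literature": (LITERATURE, "article_title", "year"),
    "grants": (GRANTS, "project", "fiscal_year"),
}


def federated_search(term: str) -> List[Dict]:
    """Run one query over every source, normalize the hits, and order by year."""
    results = []
    for source, (records, title_key, year_key) in SOURCES.items():
        for record in records:
            if term.lower() in record[title_key].lower():
                results.append({
                    "source": source,
                    "title": record[title_key],
                    "year": record[year_key],
                })
    return sorted(results, key=lambda hit: hit["year"])


for hit in federated_search("kinase inhibitor"):
    print(hit["year"], hit["source"], hit["title"])
```

Ordering the merged hits by year puts the earliest disclosure first regardless of which source it came from, which is the view an opposition search needs.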
32. In Summary
Text mining of patents is actively being used within Pharma
Text mining can be automated within workflows
Key Advantages
− Multilingual NLP searching
− Searching for Numeric and Tabular data
− Chemical structure search
− Linking-out from & visualization of unstructured data
− Connecting across data silos