Systematic, Automated Analysis of Patents and Related Literature

Andrew Hinton, ICIC, Nice, France, October 19th, 2015

Agenda
Natural Language Processing (NLP) and Text Mining
Customer examples
− Systematic automated analysis
− Patents and associated literature
− Tailored summaries for fast review.
New developments
− Improved visualization
− Better multilingual support
− Easier extraction of information from tables
Federated text mining
− Facilitates opposition searching
− Across multiple data sources
− E.g. literature and grants as well as patents

Search vs. Text Mining
Search Engine: filter to find the most relevant documents, then read
Text Mining: manipulate the text to discover what is there
Sources: News Feeds, Scientific Literature, Patents, Internal Reports, Social Media
Example of extracted relationships:
company   activity   company
Sanofi    bid        Aventis
Roche     partner    Antisoma
Natural Language Processing (NLP) to understand meaning
Statistics to provide trends

Challenges in Unstructured Data
Different word, same meaning
− cyclosporine
− ciclosporin
− Neoral
− Sandimmune
Different expression, same meaning
− Non-smoker
− Does not smoke
− Does not drink or smoke
− Denies tobacco use
Different grammar, same meaning
− 5mg/kg of cyclosporine per day
− 5mg/kg per day of cyclosporine
− cyclosporine 5mg/kg per day
Same word, different context
− Diagnosed with diabetes
− Family history of diabetes
− No family history of diabetes
NLP
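
Two of the challenges above (different words with the same meaning, and the same word in different contexts) can be illustrated with a minimal sketch. It assumes a tiny hand-written synonym table and a few cue phrases; the names `SYNONYMS`, `normalize_drug` and `diabetes_context` are hypothetical and are not taken from any product mentioned in this talk.

```python
import re

# Hypothetical synonym table: every variant maps to one preferred term.
SYNONYMS = {
    "cyclosporine": "cyclosporine",
    "ciclosporin": "cyclosporine",
    "neoral": "cyclosporine",
    "sandimmune": "cyclosporine",
}


def normalize_drug(text):
    """Return the preferred term for the first known drug name in text, else None."""
    for token in re.findall(r"[A-Za-z]+", text.lower()):
        if token in SYNONYMS:
            return SYNONYMS[token]
    return None


def diabetes_context(text):
    """Classify a mention of diabetes by its context, using simple cue phrases."""
    lowered = text.lower()
    if "diabetes" not in lowered:
        return "not mentioned"
    # Check the negated cue first, because it contains the positive cue.
    if re.search(r"\bno family history of\b", lowered):
        return "negated family history"
    if re.search(r"\bfamily history of\b", lowered):
        return "family history"
    return "patient"


print(normalize_drug("Neoral 5mg/kg per day"))
print(diabetes_context("No family history of diabetes"))
```

A real system would rely on curated ontologies and proper linguistic analysis rather than cue phrases; the sketch only shows why string search alone is not enough.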

Natural Language Processing (NLP)
Groups words into meaningful units
Morphology allows search for different forms of words
We find that p42mapk phosphorylates c-Myb on serine and threonine .
Purified recombinant p42 MAPK was found to phosphorylate Wee1 .
sentences
morphology: different forms
noun groups: match entities
verb groups: match actions
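
As a rough illustration of searching across word forms, the sketch below uses a plain prefix match as a stand-in for real morphological analysis, and a simple key function to treat "p42mapk" and "p42 MAPK" as the same entity. The function names `find_word_forms` and `entity_key` are hypothetical.

```python
import re


def find_word_forms(stem, sentence):
    """Return every word in sentence that starts with the given stem."""
    pattern = re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)
    return pattern.findall(sentence)


def entity_key(name):
    """Collapse spacing, hyphens and case so name variants compare equal."""
    return re.sub(r"[\s-]+", "", name).lower()


sentences = [
    "We find that p42mapk phosphorylates c-Myb on serine and threonine.",
    "Purified recombinant p42 MAPK was found to phosphorylate Wee1.",
]
for sentence in sentences:
    print(find_word_forms("phosphorylat", sentence))

print(entity_key("p42mapk") == entity_key("p42 MAPK"))
```

The first loop prints `['phosphorylates']` and then `['phosphorylate']`; the final line prints `True`.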

From Words to Meaning
“Among them, nimesulide, a selective COX2 inhibitor, …”
Identifying entities and relations
− nimesulide inhibits COX2 (Entrez Gene ID: 5743)
Linguistics to establish relationships
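
The step from words to a relation can be sketched with a single apposition pattern ("X, a selective Y inhibitor"). This is a minimal sketch under stated assumptions: the regular expression, the `GENE_IDS` lookup and the function name `extract_inhibition` are hypothetical, and a linguistic engine would use parsed noun and verb groups rather than one regex.

```python
import re

# Hypothetical lookup from gene synonym to Entrez Gene ID (5743 is the ID shown on the slide).
GENE_IDS = {"COX2": 5743}

# Matches phrases such as "nimesulide, a selective COX2 inhibitor".
APPOSITION = re.compile(
    r"(?P<drug>\w+), an? (?:selective )?(?P<target>\w+) inhibitor"
)


def extract_inhibition(sentence):
    """Return a (drug, relation, target) triple, or None if the pattern is absent."""
    match = APPOSITION.search(sentence)
    if match is None:
        return None
    target = match.group("target")
    # Fall back to the raw target text when no gene ID is known.
    return (match.group("drug"), "inhibits", GENE_IDS.get(target, target))


print(extract_inhibition("Among them, nimesulide, a selective COX2 inhibitor, ..."))
```

The call prints `('nimesulide', 'inhibits', 5743)`.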

Patent Filings at IP5 Offices, 1980-2012

Challenges: A Big Data Future
Volume, Variety, Velocity, Visualization
High indexing performance
• Millions of documents, TBs of storage
• Ontologies with 100,000s of terms
• Handles large documents with ease
• Open, configurable pipeline
• Advanced table processing
Connected data technology
• Unified heterogeneous document types across federated servers
• Connect – Normalize – Use
• Structured, semi-structured and unstructured
Distributed indexing and querying
• Multi-processor
• Multi-machine
Integrates in enterprise applications, portals, pipelines and workflows
• Open web services API
• Public query language
Strong integrated visualization

Examples & Use Cases

Speed to insight Example I: Patents

Business Impact and value

Leveraging Text Analytics in Patents to Empower Business Decisions

Patent Analytics with I2E
Comprehensive, Effective Search for Patent Landscaping
CHALLENGE
Patents are a valuable source of novel data. Identifying drug targets for specific indications is often slow and manual, as patents are long and the language is obtuse. Finding targets for 3 therapeutic areas took 50 FTE days.
SOLUTION
A pipeline was built that used queries to extract target, indication, invention type and organisations, and fed the results into a database. Recall was 10x that of the manual process, with good precision, plus target relevance scores.
BENEFIT
The integrated process drastically reduces the FTEs required to keep the organization up-to-date on recent findings published in the patent literature.
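
The shape of such a pipeline (extract target, indication, invention type and organisation, then load a database) can be sketched as follows. This is not the pipeline described above: the vocabularies, the table layout, the mention-count score and the sample text are all hypothetical placeholders, and simple substring matching stands in for the real extraction queries.

```python
import sqlite3
from itertools import product

# Hypothetical vocabularies; a real pipeline would use curated ontologies.
TARGETS = ["COX2", "JAK2"]
INDICATIONS = ["rheumatoid arthritis", "asthma"]
INVENTION_TYPES = ["compound", "formulation"]


def find_terms(terms, text):
    """Return the terms that occur in text (case-insensitive substring match)."""
    lowered = text.lower()
    return [term for term in terms if term.lower() in lowered]


def extract_rows(patent_id, organisation, text):
    """Build one row per (target, indication, invention type) combination found."""
    rows = []
    combos = product(
        find_terms(TARGETS, text),
        find_terms(INDICATIONS, text),
        find_terms(INVENTION_TYPES, text),
    )
    for target, indication, invention_type in combos:
        # Naive stand-in for a relevance score: number of target mentions.
        score = text.lower().count(target.lower())
        rows.append((patent_id, organisation, target, indication, invention_type, score))
    return rows


def load(conn, rows):
    """Create the findings table if needed and insert the extracted rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS findings (patent_id TEXT, organisation TEXT, "
        "target TEXT, indication TEXT, invention_type TEXT, score INTEGER)"
    )
    conn.executemany("INSERT INTO findings VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


# Made-up sample document, for illustration only.
sample_text = (
    "A compound that inhibits JAK2 for the treatment of asthma. "
    "JAK2 activity was reduced."
)
conn = sqlite3.connect(":memory:")
load(conn, extract_rows("EXAMPLE-0001", "Example Pharma", sample_text))
print(conn.execute("SELECT target, indication, score FROM findings").fetchall())
```

Keeping extraction and loading as separate functions means the same rows could be sent to a different store without touching the extraction step.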

Visualization of results

Multilingual Text Mining

Processing Chinese Patents
OCR (Optical Character Recognition) quality can be worse than for English, especially for chemical names
Useful capabilities, without full NLP, include
− terminology matching, N-words apart, co-occurrence in sentences or regions
− integration with ChemAxon’s Chinese Name-to-Structure
− handling of appropriate word order for Chinese names
− chemically aware OCR correction
OCR output: 我们日前巳经开友了中文化字名称的OCR白动纠错工力能
Corrected: 我们目前已经开发了中文化学名称的OCR自动纠错功能 ("We have now developed automatic OCR error correction for Chinese chemical names")
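
One simple way to approximate this kind of correction is a lexicon of known words plus a table of look-alike characters. The sketch below is an assumption-laden illustration, not the method used in the product: `CONFUSIONS` and `LEXICON` are hypothetical and cover only the example sentence, and a chemically aware corrector would draw on a lexicon of chemical name parts instead.

```python
from itertools import product

# Hypothetical look-alike table: correct character -> strings OCR may emit instead.
CONFUSIONS = {
    "目": ["日"],
    "已": ["巳"],
    "发": ["友"],
    "学": ["字"],
    "自": ["白"],
    "功": ["工力"],  # one character misread as two
}

# Hypothetical lexicon of words that are expected in the text.
LEXICON = ["目前", "已经", "开发", "化学", "自动", "功能"]


def garbled_variants(term):
    """Generate every misreading of term allowed by the confusion table."""
    options = [[ch] + CONFUSIONS.get(ch, []) for ch in term]
    return {"".join(combo) for combo in product(*options)} - {term}


def build_corrections(lexicon):
    """Map each possible garbled variant back to its lexicon word."""
    corrections = {}
    for term in lexicon:
        for variant in garbled_variants(term):
            corrections[variant] = term
    return corrections


def correct_ocr(text, lexicon=LEXICON):
    """Replace garbled variants with lexicon words, longest variants first."""
    corrections = build_corrections(lexicon)
    for variant in sorted(corrections, key=len, reverse=True):
        text = text.replace(variant, corrections[variant])
    return text


print(correct_ocr("我们日前巳经开友了中文化字名称的OCR白动纠错工力能"))
```

Because a garbled variant can itself be a valid word (日前 is one), a production corrector would need context or confidence scores before replacing anything.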

Testing N2S (Name-to-Structure)
Gold standards are expensive to build and typically small-scale
We want to provide large sets of systematic names based on real patent examples
To achieve this we used linguistic patterns based on the form of the name and its context
Not trying to pull out all names, just those where there is high confidence that it is a systematic name

Creating a Test Set from Chinese Patents
Extracted ~1200 chemical names from a sample of 40 Chinese patent documents (>94% precision)

Effective Text Mining of Tables

Text Analytics for Data Tables
Valuable information is often reported in a combination of text and semi-structured data tables
− E.g. pharmaceutical safety & toxicity tests
I2E provides a unique capability to find, highlight, and extract the relevant data from tables

Linking information within documents
Connecting information found in different parts of the document
For example, finding a compound as “Example 12” in a patent and linking to a table where numerical data is reported
Patent document
…
Combined into a row of data in the structured results table
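
The linking step can be sketched as a join on the example number. This is a minimal sketch, assuming the text names compounds on lines of the form "Example N: name" and that the table has already been parsed into rows; the function name, the row layout and the sample values are hypothetical.

```python
import re

# Assumes compounds are introduced on lines such as "Example 12: compound B".
EXAMPLE_LINE = re.compile(r"^Example (\d+):\s*(.+)$", re.MULTILINE)


def link_examples(text, table_rows):
    """Join 'Example N' mentions in the text to table rows with the same number."""
    names = {int(num): name.strip() for num, name in EXAMPLE_LINE.findall(text)}
    linked = []
    for row in table_rows:
        number = row["example"]
        if number in names:
            measurements = {k: v for k, v in row.items() if k != "example"}
            linked.append({"example": number, "compound": names[number], **measurements})
    return linked


# Made-up sample data, for illustration only.
patent_text = "Example 11: compound A\nExample 12: compound B\n"
table = [{"example": 12, "ic50_nm": 35.0}]
print(link_examples(patent_text, table))
```

Each linked dictionary corresponds to one combined row of the structured results table.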

Recent Developments in Table Processing
Correction of table structures

Improved Visualization

Visualization of Chemical & Inhibition metrics
Dynamically explore SAR data directly extracted from Patents

Linking Out Using Chemical Structures
Text mining facilitates easier exploration of chemical information “trapped” in patents

Text Mining across Multiple Data Sources

Connected Data Technology
Normalize data: unify same concepts or relationships
Unify querying: query over heterogeneous data sources
Link content servers: query across enterprise or hosted content
Merge results: combine information found from all sources

Connected Data Technology
Single Query over Multiple Data Sources and Network Locations

Opposition Searching
Single search over multiple data sources on different servers, providing a single set of results
Information from differently structured data is brought together and ordered by year
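
The idea of one query over differently structured sources, with results normalized, merged and ordered by year, can be sketched as below. In-memory lists stand in for separate content servers; the record layouts, field names, titles and the `Hit` class are hypothetical and do not reflect any real API.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    source: str
    title: str
    year: int


# Hypothetical stand-ins for three servers, each with its own record layout.
PATENTS = [
    {"title": "Kinase inhibitor compounds", "pub_date": "2011-05-03"},
    {"title": "Coating method", "pub_date": "2010-01-01"},
]
LITERATURE = [{"article_title": "A kinase inhibitor screen", "year": 2009}]
GRANTS = [{"name": "Kinase biology programme", "start_year": "2013"}]

# Normalize: map each source's layout onto the shared Hit structure.
SOURCES = [
    (PATENTS, lambda r: Hit("patents", r["title"], int(r["pub_date"][:4]))),
    (LITERATURE, lambda r: Hit("literature", r["article_title"], r["year"])),
    (GRANTS, lambda r: Hit("grants", r["name"], int(r["start_year"]))),
]


def federated_search(term):
    """Run one query over all sources and return merged hits ordered by year."""
    hits = []
    for records, normalize in SOURCES:
        for record in records:
            hit = normalize(record)
            if term.lower() in hit.title.lower():
                hits.append(hit)
    return sorted(hits, key=lambda hit: hit.year)


for hit in federated_search("kinase"):
    print(hit.year, hit.source, hit.title)
```

Normalizing before matching keeps the query logic identical for every source, so adding a source only requires a new normalizer.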

In Summary
Text mining of patents is actively being used within Pharma
Text mining can be automated within workflows
Key Advantages
− Multilingual NLP searching
− Searching for Numeric and Tabular data
− Chemical structure search
− Linking-out from & visualization of unstructured data
− Connecting across data silos
