Hanjo Kim from Standigm presents on using ChemAxon's ChemCurator to process structures and related data from patents sourced from Google Patents, PDF files, and text files.
Patent Data for Artificial Intelligence based Drug Discovery
1. AI Drug Discovery in Patent Space
Hanjo Kim
Principal Scientist at Standigm Inc.
hanjo.kim@standigm.com
business@standigm.com
apply@standigm.com
www.standigm.com
2. Disclaimer
• Statements of fact and opinions expressed in this presentation and on the following slides are solely those of the presenter and not necessarily those of Standigm Inc.
3. Standigm Inc.
• Founded in 2015 by three researchers at Samsung Advanced Institute of Technology
• Jinhan Kim, PhD Artificial Intelligence (The University of Edinburgh)
• Sang Ok Song, PhD Chemical Engineering (Seoul National University)
• So Jeong Yun, PhD Systems Biology (POSTECH)
• Funding raised: $23M
• SK Holdings, Mirae Asset Capital, Mirae Asset Venture Investment, DSC Investment, Wonik Investment, Atinum Investment, LB Investment, Kakao Ventures
• Standigm is a drug discovery company that generates and optimizes therapeutic lead compounds using advanced artificial intelligence, with the aim of licensing them out
• Locations: Seoul, Korea (33); Ann Arbor, Michigan (2); Cambridge, UK (1)
• Team: AI, 16; Biology, 6; Chemistry, 8; Systems Biology, 4; Advisor, 3
• PhD: 20/37*
* Excluding Operation (5) and Patent attorney (1)
4. The AI solution
• The Standigm AI solution is industrializing drug discovery
• Discovery at Scale
• Stages shown: Disease, Target, Hit, Lead, Preclinical, Clinical, Drug
• Drug repositioning
• Platforms: BEST™, ASK™, Insight™, FIRST*
* developing
• Standigm ASK™ is freely available at https://icluenask.standigm.com
5. Standigm BEST Platform
• Standigm BEST
• Standigm ASK: knowledge-based biology platform for novel targets, pathways, and MoA discovery
• Standigm FIRST: hit generation platform for novel and/or undruggable targets
• Generative Models: Graph-based VAE; Scaffold-based conditional enumerator; Novel Molecular Representation
• Scoring Functions: Simulations; AI rescoring models; Machine learning models
• Compound Database: Known Molecules; Seed Molecules; Novel Virtual Structures
• Commercial Library; Privileged Standigm Library; Public Library
• Target Database: Public data (gene, protein, function)
• BEST Feasibility
• Strategy setup; Hit Generation; Hit-2-Lead
• Predictive Models: ADME/Tox predictors; Novelty (patentability); Synthetic accessibility; Filters/Ranking models
• External CROs: Organic synthesis; In vitro/in vivo Assays
• Novel/Commercial Hits; Lead Series
6. Graph-based VAE
• Chemical space, Encoder (E), Latent space (Z), Decoder (D), Chemical space
• Learning chemical space: Training DB ~4M
• X: original chemical space
• Z: latent space
• Y: Property/Target information (property 1, property 2, property 3)
• Encoder q(z|x); decoder p(x|z); predictor q(y|z)
• Contextualizing: substructures, topology, shape, etc.
• Seed molecules, decoder p(x|z)
• Analogue structure generation: functionally similar but novel scaffolds/molecules
• Lead optimization: novel molecules with better desired properties
• Smart library expansion
• IP generation & expansion
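The analogue-generation idea on this slide (take the latent vector of a seed molecule, perturb it, decode the neighbours) can be sketched as follows. This is a minimal illustration under stated assumptions, not Standigm's implementation: the `decode` callable stands in for a trained decoder p(x|z), and the function names, the Gaussian perturbation, and the `sigma` parameter are all hypothetical.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def sample_around_seed(z_seed: Sequence[float], n_samples: int,
                       sigma: float, rng: random.Random) -> List[Vector]:
    """Draw latent vectors near a seed by adding isotropic Gaussian noise."""
    return [[z + rng.gauss(0.0, sigma) for z in z_seed]
            for _ in range(n_samples)]


def generate_analogues(z_seed: Sequence[float],
                       decode: Callable[[Sequence[float]], str],
                       n_samples: int = 10, sigma: float = 0.1,
                       seed: int = 0) -> List[str]:
    """Decode perturbed latent vectors and keep each distinct structure once.

    `decode` is an assumed stand-in for a trained decoder p(x|z) that
    returns some structure string (for example SMILES).
    """
    rng = random.Random(seed)
    unique: List[str] = []
    seen = set()
    for z in sample_around_seed(z_seed, n_samples, sigma, rng):
        structure = decode(z)
        if structure not in seen:  # drop duplicates, preserve order
            seen.add(structure)
            unique.append(structure)
    return unique


if __name__ == "__main__":
    # Toy decoder (hypothetical): buckets the first latent coordinate.
    toy_decode = lambda z: f"structure_{round(z[0], 1)}"
    print(generate_analogues([0.5, -1.2], toy_decode, n_samples=5, sigma=0.2))
```

A small `sigma` keeps samples close to the seed (analogue generation); a larger value explores further from it, at the cost of less similar structures.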
7. Patent Space
• Target A compounds in latent space
• Competitor 1, Competitor 2, Competitor 3
• Interesting Area
• Scale: potent to weak
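One way to read the "Interesting Area" on this slide is as a latent-space region that no competitor's compounds occupy. The sketch below ranks candidate latent points by their distance to the nearest competitor compound. The scoring rule, function names, and coordinates are assumptions made for illustration; the slide does not say how the area was identified, and a real analysis would also weigh potency.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def min_distance_to_known(point: Sequence[float],
                          known: List[Sequence[float]]) -> float:
    """Euclidean distance from a point to its nearest known compound."""
    return min(math.dist(point, k) for k in known)


def rank_open_regions(candidates: List[Sequence[float]],
                      competitor_points: Dict[str, List[Sequence[float]]]
                      ) -> List[Tuple[float, Point]]:
    """Sort candidates so the ones farthest from all competitors come first."""
    known = [p for points in competitor_points.values() for p in points]
    scored = [(min_distance_to_known(c, known), tuple(c)) for c in candidates]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Hypothetical 2-D latent coordinates for patented compounds.
    competitors = {
        "Competitor 1": [(0.0, 0.0)],
        "Competitor 2": [(4.0, 0.0)],
    }
    for distance, point in rank_open_regions([(2.0, 0.0), (0.5, 0.0)],
                                             competitors):
        print(point, distance)
```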
8. Chemical Space Navigation
• Chemical Space ~ Map
• Known scaffolds ~ POIs
• Information-rich space (ChEMBL, PubChem Bioassays, etc.)
• Novel scaffold ~ New POI
• El Dorado
• Patent
• Markush structure: how to protect as wide an area as possible
• Exemplified compounds: boundary stones
9. Using ChemCurator
• Project types
• Google Patents (most cases)
• PDF files (do not use PDF files!)
• Text files (when Google OCR is not good)
12. OCR (and chemical OCR)
• Lessons
• Google Patents is reliable in most cases
• It even provides a compound table, though a very primitive one
• Professional OCR software can give better results
• Convert PDF files to plain text with chemical names
• Complex tables
• Image (not OCRed) tables (next 3 slides)
• Chemical OCR engine helps a lot
• Text-image comparison
• Chemical OCR engines
• CLiDE (recommended, proprietary)
• OSRA (open-source, recommended on Linux machines)
• Imago (I have no experience)
• Unsupported engines (like ChemGrapher, https://pubs.acs.org/doi/10.1021/acs.jcim.0c00459)
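The "text-image comparison" lesson can be sketched as a simple cross-check: for each compound, compare the structure derived from the chemical name in the text with the structure recognized from the drawing, and send disagreements to manual review. The sketch assumes both sources have already been reduced to comparable identifier strings (for example InChIKeys) by a name-to-structure tool and a chemical OCR engine. The function name, the report layout, and the placeholder keys are hypothetical.

```python
from typing import Dict, List, Optional


def compare_text_and_image(from_text: Dict[str, Optional[str]],
                           from_image: Dict[str, Optional[str]]
                           ) -> Dict[str, List[str]]:
    """Group compound IDs by whether the text and image structures agree.

    Both inputs map a compound ID (e.g. an example number) to an
    identifier string, or None when that source produced no structure.
    """
    report: Dict[str, List[str]] = {"agree": [], "disagree": [], "missing": []}
    for compound_id in sorted(set(from_text) | set(from_image)):
        text_key = from_text.get(compound_id)
        image_key = from_image.get(compound_id)
        if text_key is None or image_key is None:
            report["missing"].append(compound_id)   # only one source available
        elif text_key == image_key:
            report["agree"].append(compound_id)
        else:
            report["disagree"].append(compound_id)  # needs manual review
    return report


if __name__ == "__main__":
    # Placeholder identifiers, not real InChIKeys.
    text = {"Ex1": "KEY_A", "Ex2": "KEY_B", "Ex3": None}
    image = {"Ex1": "KEY_A", "Ex2": "KEY_X", "Ex3": "KEY_C"}
    print(compare_text_and_image(text, image))
```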
17. Markush Structures
• Very expressive
• The same set of compounds can be written in very different forms
• Not well-validated
• ChemCurator helps
• Extracting example compounds
• Matching them to the Markush structure
• Requires manual correction
• Sentence to chemical groups
• Ambiguous/incomplete R-group definitions
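Matching example compounds to a Markush structure amounts to checking, R-group by R-group, that each substituent of the example is allowed by the claim. The toy sketch below shows that check and flags the positions that would need manual correction. It assumes the R-groups have already been extracted and normalized to names; real matching works on structures and is what a tool such as ChemCurator assists with. All names and data here are hypothetical.

```python
from typing import Dict, List, Set


def markush_size(markush: Dict[str, Set[str]]) -> int:
    """Number of compounds covered if every R-group varies independently."""
    size = 1
    for substituents in markush.values():
        size *= len(substituents)
    return size


def unmatched_r_groups(markush: Dict[str, Set[str]],
                       example: Dict[str, str]) -> List[str]:
    """Return the R-group positions where the example is not covered."""
    problems = []
    for position in sorted(markush):
        substituent = example.get(position)
        if substituent is None or substituent not in markush[position]:
            problems.append(position)  # undefined or outside the definition
    return problems


if __name__ == "__main__":
    claim = {"R1": {"H", "methyl", "ethyl"}, "R2": {"F", "Cl"}}
    print(markush_size(claim))                                       # 3 * 2
    print(unmatched_r_groups(claim, {"R1": "methyl", "R2": "Br"}))   # R2 fails
```

An example compound that fails at some position points either to an extraction error or to an ambiguous or incomplete R-group definition in the patent text.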
18. AI can help
• Reduction of frequent text OCR errors
• NLP techniques can correct frequent OCR errors
• The availability of a large training set is important
• Extraction of relevant data
• Biological activities
• Analytical data
• Chemical OCR can be improved
• AI can do image recognition very well
• Different drawing styles can be managed
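As a baseline for the OCR-correction point, the sketch below fixes common character confusions (such as "1" for "l") in chemical name fragments by trying substitutions and accepting the first spelling found in a vocabulary. This is a rule-based stand-in, not the learned NLP approach the slide has in mind; the confusion table, the vocabulary, and the function names are illustrative assumptions.

```python
import re
from itertools import product
from typing import Iterable, Set

# Character confusions that are common in OCR output (illustrative only).
CONFUSIONS = {"1": "1l", "l": "l1", "0": "0o", "o": "o0", "5": "5s", "s": "s5"}

SEPARATORS = re.compile(r"([-,\s]+)")


def variants(token: str) -> Iterable[str]:
    """Yield every spelling reachable by swapping confusable characters.

    The count doubles with each confusable character, so this is only
    suitable for short tokens such as name fragments.
    """
    options = [CONFUSIONS.get(ch, ch) for ch in token]
    for combo in product(*options):
        yield "".join(combo)


def correct_token(token: str, vocabulary: Set[str]) -> str:
    """Return a vocabulary spelling of the token, or the token unchanged."""
    lowered = token.lower()
    if lowered in vocabulary:
        return token
    for candidate in variants(lowered):
        if candidate in vocabulary:
            return candidate  # note: corrected tokens come back lowercase
    return token  # unknown tokens (including locants) are left as they are


def correct_name(name: str, vocabulary: Set[str]) -> str:
    """Correct each fragment of a name, keeping hyphens, commas, and spaces."""
    parts = SEPARATORS.split(name)
    return "".join(
        part if not part or SEPARATORS.fullmatch(part)
        else correct_token(part, vocabulary)
        for part in parts
    )


if __name__ == "__main__":
    vocab = {"chloro", "methyl", "pyridine"}
    print(correct_name("2-ch1oro-4-methy1-pyridine", vocab))
```

A learned model trained on a large set of OCR and ground-truth pairs could replace `correct_token` while keeping the same interface.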