Exploring SAR between Patents and PubChem

Using Chemicalize.org with Other Open
Resources to Extract SAR from Patents and
Explore Intersects in PubChem

Christopher Southan

ChrisDS Consulting, Göteborg, Sweden

Prepared for the ChemAxon UGM, May 2012, version 2nd May

[1]

Key Relationships in Patents and Papers
MAQALPWLLLWMGAGVLPAHGTQHGIRLPLRSGLGGA
PLGLRLPRETDEEPEEPGRRGSFVEMVDNLRGKSGQGY
YVEMTVGSPPQTLNILVDTGSSNFAVGAAPHPFLHRYYQ
RQLSSTYRDLRKGVYVPYTQGKWEGELGTDLVSIPHGP
NVTVRANIAAITESDKFFINGSNWEGILGLAYAEIARPDD
SLEPFFDSLVKQTHVPNLFSLQLCGAGFPLNQSEVLASV
GGSMIIGGIDHSLYTGSLWYTPIRREWYYEVIIVRVEINGQ
DLKMDCKEYNYDKSIVDSGTTNLRLPKKVFEAAVKSIKA
ASSTEKFPDGFWLGEQLVCWQAGTTPWNIFPVISLYLM
GEVTNQSFRITILPQQYLRPVEDVATSQDDCYKFAISQSS
TGTVMGAVIMEGFYVVFDRARKRIGFAVSACHVHDEFRT
AAVEGPFVTLDMEDCGYNIPQTDESTLMTIAYVMAAICAL
FMLPLCLMVCQWRCLRCLRQQHDDFADDISLLK

Document Assay Result Compound Target

Discerning and mapping these relationships from documents is crucial and demanding

Chemicalize.org is a significant advance in open chemistry extraction

2011 http://www.ncbi.nlm.nih.gov/pubmed/21569515
2010 http://www.citeulike.org/user/cdsouthan/article/8637426
2012 http://www.slideshare.net/cdsouthan/southan-bio-it2012patents
[2]

Practical Utilities

• Name-to-struc (n>s) for selected or batch conversions from
patents, papers, abstracts, web pages and other sources
• Intersect different content at identity or similarity level
• Molecular properties and bulk download
• Extracted structures archived, searchable and sharable
• Similarity display of analogue series from a document
• Bulk upload to PubChem for intersects and triage
• Result display in JChem for Excel
• Can iterate with OPSIN for IUPAC fixes
[3]

Chemicalize.org Exploitation Challenges

• Specific retrieval of patent or other source (e.g. target recall)
• Working different sources (e.g. CiteXplore/espace/SciBite for retrieval,
Google for cross-checks, WIPO for images and tables,
FreePatentsOnline for deeper queries)
• Eyeballing original documents for relevant sections
• Locating exemplified drug-relevant/lead-like structures with data links
• For many patents examples >> activity data links > potent structures
• Selecting best sources/family members for optimal IUPAC extraction
quality (e.g. US pats and FPO)
• Filtering novel structures from common chemistry
• Need to be PubChem cognisant for effective triage
• For a variety of reasons some documents have low extraction rates
• Tricks and workarounds enhance exploitation

[4]

Target Recall: CiteXplore

• Title only "DPPIV" Medline = 37 Patents = 31
• Title + abstract "DPPIV" Medline = 402 Patents = 144
• Title + abstract "dipeptidyl peptidase" Medline = 4,838 Patents = 1,520
• Title + abstract "inhibitor" Medline = 772,053 Patents = 124,516
• Title + abstract "diabetes" Medline = 431,299 Patents = 36,792
• Title + abstract "DPPIV OR dipeptidyl peptidase AND inhibitor AND
diabetes" Medline = 1,105 Patents = 604

CiteXplore is restricted to EBI patent abstracts, so higher recall is available
from full-text sources such as SureChemOpen, EPO/espace, WIPO and FPO
(although these cannot search Medline in parallel)
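
The hit counts above can be compared as patents-per-Medline-hit ratios. The following is only a minimal sketch: the counts are transcribed from the bullets, while the function names and the short query labels are illustrative assumptions, not part of CiteXplore.

```python
# Hit counts transcribed from the slide: (query label, Medline hits, patent hits).
# The labels are shorthand chosen for this sketch.
RECALL_COUNTS = [
    ("title: DPPIV", 37, 31),
    ("title+abstract: DPPIV", 402, 144),
    ("title+abstract: dipeptidyl peptidase", 4838, 1520),
    ("title+abstract: inhibitor", 772053, 124516),
    ("title+abstract: diabetes", 431299, 36792),
    ("title+abstract: combined Boolean query", 1105, 604),
]


def patent_to_medline_ratio(medline_hits: int, patent_hits: int) -> float:
    """Return the number of patent hits per Medline hit for one query."""
    return patent_hits / medline_hits


def recall_table(counts):
    """Return (query, medline, patents, ratio) rows, most patent-rich query first."""
    rows = [(q, m, p, patent_to_medline_ratio(m, p)) for q, m, p in counts]
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    for query, medline, patents, ratio in recall_table(RECALL_COUNTS):
        print(f"{query:42s} Medline={medline:>8,d} Patents={patents:>8,d} ratio={ratio:.2f}")
```

Sorting by ratio makes it easy to see which query terms are relatively enriched in patent abstracts compared with Medline.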

[5]

Target Alerts: SciBite

US2012040982
DPPIV
Boehringer Ingelheim
Feb 2012
[6]

Slicing and Dicing US2012040982 (I)

• Chemicalize converted 1,390 structures from the FreePatentsOnline (FPO) URL
• From the 497 examples 486 converted
• Need to scan the document and iterate with scroll bar to spot lead-like structures
[7]

Slicing and Dicing US2012040982 (II)

• OPSIN picks up some of what chemicalize misses (e.g. 389 above) but not all
• OPSIN error reports may help fix a series for Chemicalize (e.g. 1 vs. L)
• Practically more important if that example has potent activity

[8]

Slicing and Dicing US2012040982 (III)

• Similarity display clearly picks out the lead-like analog series (top)
• Select via FPO text > example list only, > Word > PDF > chemicalize
upload > SDF download 486 structures (bottom)
• However, from the partial descriptions these may include prophetics
• Also download 28 claimed examples via PDF
[9]

Slicing and Dicing US2012040982 (IV)

• Can locate an SAR table with 11 point IC50s
• But.... only 9 examples below 100 nM, example 25 is 56 nM
• The designation of series 1 and 2 obfuscates their example identity
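
Once IC50 values have been transcribed from such a table, a potency cut-off is easy to apply. This is a hedged sketch only: example 25 at 56 nM comes from the slide, whereas the other entries, the function names and the data layout are hypothetical placeholders for illustration.

```python
import math


def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 (-log10 of the molar IC50)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def potent_examples(ic50_by_example, cutoff_nm=100.0):
    """Return (example, IC50 nM, pIC50) tuples below the cut-off, most potent first."""
    hits = [
        (example, ic50, pic50_from_nm(ic50))
        for example, ic50 in ic50_by_example.items()
        if ic50 < cutoff_nm
    ]
    return sorted(hits, key=lambda hit: hit[1])


# Only example 25 (56 nM) is taken from the slide; the other two
# entries are hypothetical placeholders, not data from the patent.
sar_table = {"25": 56.0, "hypothetical_A": 850.0, "hypothetical_B": 4200.0}

if __name__ == "__main__":
    for example, ic50, pic50 in potent_examples(sar_table):
        print(f"example {example}: IC50 = {ic50:g} nM, pIC50 = {pic50:.2f}")
```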
[10]

PubChem Triage of Chemicalize Output (I)

• Example 25 SMILES > neither an exact match nor tautomer – thus novel
• Repeat search at 95% Tanimoto > 289 neighbors > cluster
• Closest PubChem analog > ChemSpider > SureChem > Novo Nordisk DPPIV
patent from 2005
[11]
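
The identity and 95% similarity searches above were run through the PubChem web interface. As a rough programmatic sketch, and assuming the present-day PubChem PUG REST fast-search endpoints (not something shown in the slides), the request URLs could be built as below. The helper names are hypothetical, the SMILES of example 25 is not given here so a trivial placeholder is used, and SMILES containing awkward characters may be better sent by POST.

```python
from urllib.parse import quote

# Base address of PubChem PUG REST (assumed for this sketch).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def identity_search_url(smiles: str, identity_type: str = "same_tautomer") -> str:
    """Build a fast identity-search URL returning matching CIDs as JSON."""
    encoded = quote(smiles, safe="")  # percent-encode '#', '=', '(' and similar
    return (
        f"{PUG_REST}/compound/fastidentity/smiles/{encoded}"
        f"/cids/JSON?identity_type={identity_type}"
    )


def similarity_search_url(smiles: str, threshold: int = 95) -> str:
    """Build a 2D Tanimoto similarity-search URL at the given percent threshold."""
    encoded = quote(smiles, safe="")
    return (
        f"{PUG_REST}/compound/fastsimilarity_2d/smiles/{encoded}"
        f"/cids/JSON?Threshold={threshold}"
    )


if __name__ == "__main__":
    placeholder = "CC#N"  # placeholder structure, NOT example 25
    print(identity_search_url(placeholder))
    print(similarity_search_url(placeholder, threshold=95))
```

No network call is made here; the URLs can be pasted into a browser or fetched with any HTTP client.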

PubChem Triage of Chemicalize Output (II)

• Total extraction from US2012040982 > 1,390 SDFs > 1387 uploaded > 7 “failed”
• 493 exact matches (= preexisting PubChem CIDs)
• 486 example-only SDFs > upload > 21 exact-match CIDs
• 34 claims-only give 9 exact-match CIDs, primary sources were:
• 5 from ChEMBL from a Boehringer Ingelheim 2007 publication
• 7 from Thomson Pharma
• 2 from ChemSpider with SureChem links to Boehringer Ingelheim patents
• Thus 461 examples chemicalized from US2012040982 are “novel” structures
• However, cannot check enantiomeric or tautomeric inexact matches from
PubChem interface (only for existing CIDs)
[12]

PubChem Triage of Chemicalize Output (III)

• Chemicalize examples-plus-claims US2012040982 = 29 CIDs (search 36 above)
• Thomson Pharma/DiscoveryGate intersect is ~ Derwent WPI (search 31)
• This matched 20 from the 29 (search 36), presumably DWPI extractions
• ChEMBL (7) matched 6 from 29 (i.e. extracted from papers)
• SLING matched 8 from 29 (i.e. extracted from EPO patents)
• It was thus possible to intersect the chemicalize extractions from this patent with
four independent primary sources in PubChem from patents and publications
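
The source-by-source intersects described above amount to set operations on CID lists. The sketch below assumes the CIDs have already been downloaded as plain integer lists; the CID values and function names are hypothetical and chosen only to illustrate the logic.

```python
def intersect_report(query_cids, source_cids):
    """Map each source name to the sorted query CIDs that the source also contains."""
    query = set(query_cids)
    return {name: sorted(query & set(cids)) for name, cids in source_cids.items()}


def unmatched(query_cids, source_cids):
    """Return query CIDs that are not found in any of the sources."""
    covered = set().union(*source_cids.values())
    return sorted(set(query_cids) - covered)


# Hypothetical CID lists for illustration only (not real PubChem identifiers
# from the patent or from the named depositors).
patent_cids = [101, 102, 103, 104, 105]
sources = {
    "Thomson Pharma": [101, 102, 103, 900],
    "ChEMBL": [103, 104],
    "SLING": [102, 901],
}

if __name__ == "__main__":
    for name, shared in intersect_report(patent_cids, sources).items():
        print(f"{name}: {len(shared)} shared CIDs {shared}")
    print("in no source:", unmatched(patent_cids, sources))
```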

[13]

Patent "Walking" from Chemicalize
similarity results (I)

• The similarity results from one example gave 1,734 matches out to Tanimoto 0.5,
extending "beyond" the example space of US2012040982
• Scrolling these shows that matches at Tanimoto 0.6, with shared substructures in
blue, connect to a different, older patent, US7772226, also for DPPIV, from Eisai
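
For readers unfamiliar with the Tanimoto thresholds quoted above, the coefficient is the number of shared fingerprint features divided by the number of features present in either structure. The following is a minimal sketch on plain Python sets; the feature sets and names are hypothetical, and Chemicalize's own fingerprints are not reproduced here.

```python
def tanimoto(features_a, features_b) -> float:
    """Tanimoto coefficient: shared features divided by features in either set."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: define similarity as zero
    return len(a & b) / len(union)


def neighbours(query, library, threshold):
    """Return (name, score) pairs scoring at or above the threshold, best first."""
    scored = [(name, tanimoto(query, feats)) for name, feats in library.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprint bit positions, for illustration only.
query_bits = {1, 2, 3, 4}
library_bits = {
    "same_series": {1, 2, 3, 4, 5},
    "older_patent": {1, 2, 3, 6, 7},
    "unrelated": {8, 9},
}

if __name__ == "__main__":
    for name, score in neighbours(query_bits, library_bits, threshold=0.5):
        print(f"{name}: Tanimoto {score:.2f}")
```

Lowering the threshold widens the neighbourhood, which is what allows the search to reach structures from a different document.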

[14]

Patent "Walking" from Chemicalize
similarity results (II)

• US7772226 from FPO converted 1127 (i.e. more than the 992 from PatBase)
• 680 matched PubChem CIDs
• Example 228 CC#CCN1C(=NC2=C1C(=O)NC(OC1=CC=C(C=C1)C(=O)OC(=O)C(F)(F)F)=N2)N1CCNCC1
had a 12 nM IC50 for DPPIV
• Can even "walk" to a third DPPIV patent WO2007071738 from Novartis
[15]

Extracting from CiteXplore ChEMBL

• CiteXplore lists ChEMBL IUPACs and IDs
• Can chemicalize all ChEMBL structures from one paper
• Difficult to ID these in ChEMBL
• Upload 8 structures to PubChem
• 7 match ChEMBL IDs
• Only one matches the 29 from US2012040982
• Thus the paper probably draws on multiple patents
[16]

Mining PubMed Central Full-text Papers (I)

• Only a few examples converted directly
• So > wordpad > direct chemicalize (iterate) > web page (Google sites)
• Download > Upload to JChem for Excel
• Add in IC50 values from paper
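
Adding the IC50 values is essentially a join of the extracted structures with the activity data on the compound label. The sketch below assumes a simple CSV with a SMILES column as the hand-off format; the labels, structures, values and function names are hypothetical, and the actual JChem for Excel import route is not shown in the slides.

```python
import csv
import io


def merge_sar(structures, ic50_nm):
    """Join extracted structures with IC50 values on the compound label."""
    return [
        {"label": label, "smiles": smiles, "ic50_nm": ic50_nm.get(label)}
        for label, smiles in structures.items()
    ]


def to_csv(rows) -> str:
    """Serialise merged rows to CSV text; a missing IC50 becomes an empty cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["label", "smiles", "ic50_nm"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical labels, structures and values for illustration only.
    extracted = {"cpd1": "CCO", "cpd2": "CCN"}
    activities = {"cpd1": 12.5}
    print(to_csv(merge_sar(extracted, activities)))
```

Compounds without a reported value are kept with an empty activity cell, so gaps in the SAR remain visible in the table.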

[17]

Mining PubMed Central Full-text Papers (II)

• Add the SAR data from the paper into the structure table
• These had no exact matches in PubChem

[18]

Chemicalizing the DrugBank Entry for DPPIV

41 conversions of inhibitors, many are PDB ligands

[19]

Can Even Extract Catalogues that have no
SMILES or InChIs....

Tocris DPPIV inhibitor > chemicalize > PubChem > 6 analogs

[20]

Conclusions
• Chemicalize.org is powerful, flexible and free, as in beer....
• Significantly enables small-scale roll-your-own patent mining
• Ditto for journal article/abstract mining (e.g. for papers not captured in ChEMBL)
• You still need perspicacity to discern SAR details
• Complementary to commercial patent databases populated by manual extraction
(e.g. you can extract more structures)
• Commercial automated patent extraction databases typically combine ChemAxon
n>s with other algorithms (e.g.
http://www.chemaxon.com/library/benchmarking-chemaxon%E2%80%99s-name-to-structure-batch-tool-on-patent-text/)
• While they thus out-perform chemicalize, it is still very useful for intersecting
journal articles or other sources against any databases
• Significant novel content (w.r.t. public databases) is accumulating via "default
crowdsourcing" in the chemicalize archive, which becomes an important
cross-check source and can be "walked" between documents
• Combined with OPSIN and OSRA, structures from most sources are extractable
• Synergies with sources such as PubChem, PubMed Central, ChEMBL and
SureChemOpen will advance academic drug discovery and chemical biology

[21]

Questions Welcome

ChrisDS Consulting: http://www.cdsouthan.info/Consult/CDS_cons.htm
Mobile: +46(0)702-530710
Skype: cdsouthan
Email: cdsouthan – at - hotmail.com
Twitter: http://twitter.com/#!/cdsouthan
Blog: http://cdsouthan.blogspot.com/ (includes postings on patent themes)
LinkedIN: http://www.linkedin.com/in/cdsouthan
Website: http://www.cdsouthan.info/CDS_prof.htm
Publications: http://www.citeulike.org/user/cdsouthan/publications/order/year
Citations: http://scholar.google.com/citations?user=y1DsHJ8AAAAJ&hl=en
Presentations: http://www.slideshare.net/cdsouthan

[22]
