1. Pros and cons of 23 million
patent-extracted structures in PubChem
Christopher Southan, Senior Cheminformatician, IUPHAR/BPS Guide to
Pharmacology, Discovery Brain Sciences, University of Edinburgh, UK.
ACS Boston, Sunday Aug 19th 2018, Chemical Structure Searching for Patent Information
Session, 2:15 PM - 2:45 PM, Harbor Ballroom III - Westin Boston Waterfront
https://www.slideshare.net/cdsouthan
2. Abstract (will not be shown)
As of March 2018, the major automated patent chemistry extractions (in ascending size, NextMove,
SCRIPDB, IBM and SureChEMBL) cover 22.17 million CIDs from the PubChem total of 94.7 million. These have
become hugely enabling, with advantages including: a) the majority of patent-exemplified structures of
medicinal chemistry interest are now in PubChem, b) first-filings of lead series and clinical candidates
can be tracked, c) the PubChem toolbox has features difficult to match in commercial sources, d) many
structures can be associated with bioactivity data, e) connections between papers and patents can be
made via ChEMBL entries, f) BindingDB has accumulated a valuable collection of manual SAR
extraction from US patents that can be intersected with the automatically extracted structures and g)
coverage for some documents approaches that of SciFinder. However, there is a range of
disadvantages and caveats associated with automated extraction. These include: a) coverage is
compromised by dense image tables, Markush nesting and poor OCR quality of WO documents, b)
although SureChEMBL is the major pipeline, its PubChem submissions can lag the in situ database by some months, c)
automated extraction generates structural “noise” that degrades chemistry quality, mainly from the
conversion of split IUPAC strings, d) PubChem patent document indexing is patchy, e) nothing in the
records actually indicates IP status, f) continual re-extraction of common chemistry results in irrelevant
structure-to-document associations (e.g. 126,949 patents for aspirin), g) authentic compounds are
contaminated with spurious mixtures of various types as well as never-made virtuals (surprisingly, these
include 44K deuterated drug analogues) and h) outside the BindingDB set, linking between SAR data and
targets from recent filings is still a manual exercise, but examples will be shown of how this can be done.
Searching that uses SureChEMBL as an entry portal and moves from intra-document chemistry
exemplifications out to PubChem, including the advantages of structure clustering, will be
demonstrated. Balancing the pros and cons indicates that the PubChem patent extraction “big bang”
over the last five years presents users with the best of both worlds. Academics can now patent mine
extensively and PubChem has become an essential adjunct to commercial sources of patent chemistry
and associated bio entities such as diseases and drug targets.
3. Introduction and outline
• I assume general awareness of patent chemistry value, database chemistry
searching and SAR mapping to targets (background refs in final slide)
• Since PubChem is free there are no serious “cons”, so these slides are better
classified as caveats and gotchas
• Note these related presentations: “Structure searching for patent information:
The need for speed” (May) 13:45 CINF 35; “Automating chemical structure and
inhibition data extraction from patents” 2:45 CINF 37, Hinton; “Searching for
patent information in PubChem” Kim et al., 3:30 CINF 38; “Beyond journal
articles – extracting bioactivity data from patents” Gaulton et al., 9:00 CINF 116
• These slides will cover: source numbers, source intersects, fragmentation,
virtuals, clustering, relative coverage of drug sources, BindingDB, common
chemistry, loose ends of pros and cons, summary and further info
4. Snapshot Aug 2018: PubChem 96.5 mill
• Major sources are Chemical Named Entity Recognition (CNER) pipelines
• Thomson Pharma (2006-2016 R.I.P.) manual extraction of 4.3 million CIDs from
patents and papers, would probably add ~ 1.0 mill patent structures
• 24% of PubChem CIDs include at least one patent extraction SID
• There are 49% single-source CIDs in PubChem
• 26% (12.6 mill) of these come from patent sources
• ~ 1.2 SID:CID ratio
• Note NextMove SIDs have had synthesis data extracted (PMID: 27028220)
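As a rough cross-check, the percentages on this slide can be converted back into approximate counts. This is only a sketch using the rounded figures quoted above; the variable names are illustrative and not part of any PubChem tooling.

```python
# Figures quoted on the slide (Aug 2018 snapshot); all counts in millions.
PUBCHEM_TOTAL = 96.5           # total CIDs
PATENT_FRACTION = 0.24         # CIDs with at least one patent-extraction SID
SINGLE_SOURCE_FRACTION = 0.49  # CIDs that come from a single source
PATENT_SHARE_OF_SINGLE = 0.26  # share of single-source CIDs from patent sources

patent_cids = PUBCHEM_TOTAL * PATENT_FRACTION
single_source_cids = PUBCHEM_TOTAL * SINGLE_SOURCE_FRACTION
patent_single_source_cids = single_source_cids * PATENT_SHARE_OF_SINGLE

print(f"CIDs with a patent-extraction SID: ~{patent_cids:.1f} mill")
print(f"Single-source CIDs:                ~{single_source_cids:.1f} mill")
print(f"Single-source CIDs from patents:   ~{patent_single_source_cids:.1f} mill")
```

The first value (~23.2 mill) matches the 23 million of the title. The last comes out at ~12.3 mill against the 12.6 mill quoted, presumably because the slide percentages are rounded.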
6. Patent CIDs by year (cumulative)
• SureChEMBL is the only major source regularly updating
• But gotchas in exact load times (e.g. as of 04 Aug):
– In situ; WO chemistry downloadable ~ 1 week post-publication
– In UniChem, 27 July 2018 update = 19,648,678
– In PubChem, load date 23 June 2018 = 18,415,971 CIDs
• Will there be a post-2017 IBM refresh?
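The size of the SureChEMBL loading gap can be worked out from the two counts and dates above. A minimal sketch, with the caveat that UniChem structure counts and PubChem CID counts follow different standardisation rules, so the difference is only indicative:

```python
from datetime import date

# Counts and dates as quoted on the slide.
unichem_count = 19_648_678   # SureChEMBL structures in UniChem, 27 July 2018 update
pubchem_count = 18_415_971   # SureChEMBL CIDs in PubChem, load date 23 June 2018

gap_structures = unichem_count - pubchem_count
gap_days = (date(2018, 7, 27) - date(2018, 6, 23)).days

print(f"UniChem is ahead by {gap_structures:,} structures")
print(f"The two snapshots are {gap_days} days apart")
```

On these figures the gap is about 1.2 million structures between snapshots taken roughly five weeks apart.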
7. Pro: divergence, cons: this has ceased
and remains largely unexplained
[Venn diagram of patent-source CID overlaps, in millions]
IBM = 10.7 mill
SCRIPDB = 4.0 mill a one-off from
SureChEMBL = 17.6 mill
Segment values as extracted: 2.9, 2.4, 4.7, 10.1, 0.6, 0.4, 0.50
Union = 21.7
3-way = 2.4
3 + 2-way = 8.1
Unique = 13.5
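The summary figures can be checked against the individual Venn segments. The extracted text does not say which segment value belongs to which region, so the assignment below is an assumption: it is the one that best reproduces the three source totals, and it should be verified against the original figure.

```python
# ASSUMED assignment of the seven segment values (millions) to Venn regions.
only = {"IBM": 2.9, "SCRIPDB": 0.5, "SureChEMBL": 10.1}
pair = {
    ("IBM", "SureChEMBL"): 4.7,
    ("IBM", "SCRIPDB"): 0.6,
    ("SCRIPDB", "SureChEMBL"): 0.4,
}
three_way = 2.4

unique = sum(only.values())               # CIDs found in exactly one source
shared = sum(pair.values()) + three_way   # CIDs found in two or three sources
union = unique + shared                   # all patent-extracted CIDs


def source_total(name):
    """Sum every Venn region that belongs to one source."""
    pairs = sum(value for names, value in pair.items() if name in names)
    return only[name] + pairs + three_way


print(f"Unique = {unique:.1f}, 3 + 2-way = {shared:.1f}, Union = {union:.1f}")
for name in only:
    print(f"{name} = {source_total(name):.1f}")
```

Under this assignment the unique (13.5) and 3 + 2-way (8.1) figures are reproduced exactly, while the union and the IBM and SCRIPDB totals come out 0.1 below the slide values, which is consistent with rounding of the segments.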
8. Con: CNER fragmentation and mixtures
[Chart: ChEMBL + Thomson Pharma manual extraction vs. patent CNER sources]
• Low shoulder includes split IUPAC names, Markush fragments, synthetic schemes from single
images and mixture splits
• High shoulder: peptide drop-off?
10. Pro: PubChem “slice ‘n dice” features
• Some PubChem functionality may be difficult to mimic in commercial databases
• Powerful similarity “walking” between patents, papers, BioAssays, structures, vendors, etc.
11. Pro: manual SAR extraction > BindingDB > PubChem
• 151,314 structures from 2098 USPTO patents, 2013 - 2018 (via CWUs)
• 146,751 patent-only
• Subsumed by ChEMBL at release time (e.g. release 24 has 74,050 of these)
12. Common-chemistry-to-many-documents
(futile indexing)
• PubChem aspirin (CID 2244) linked to 134,286 patent documents
• SureChEMBL aspirin structure search gives 401,341 document matches
• SureChEMBL 78,351 document links for aspirin name search
• SureChEMBL aspirin structure search, restricted to WO-only, claims
section and 2018, gives 152 documents
• SciFinder 8,985 patent references for aspirin by name or structure
• Below: corpus count (x-axis) vs compounds (y-axis) for US9181236
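One practical response to futile indexing is to use the corpus count as a filter, setting aside structures that occur in very many documents before looking at the chemistry specific to one patent. The following is a minimal sketch of that idea. The function name, the threshold and all entries except the aspirin count are hypothetical.

```python
def filter_by_corpus_count(corpus_counts, max_documents):
    """Split compounds into (kept, dropped) by the number of linked documents."""
    kept = {c: n for c, n in corpus_counts.items() if n <= max_documents}
    dropped = {c: n for c, n in corpus_counts.items() if n > max_documents}
    return kept, dropped


# Only the aspirin count comes from the slide; the rest is made-up toy data.
example_counts = {
    "aspirin (CID 2244)": 134_286,
    "hypothetical exemplified compound A": 1,
    "hypothetical intermediate B": 3,
    "hypothetical common reagent C": 25_000,
}

# The threshold of 100 documents is an arbitrary illustration, not a recommendation.
kept, dropped = filter_by_corpus_count(example_counts, max_documents=100)
print("Kept:", sorted(kept))
print("Dropped:", sorted(dropped))
```

In practice the threshold would need tuning per patent family, since a lead series can legitimately recur across a handful of related filings.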
13. BISTS (BIg Strange ThingS) from patents:
the infamous “Chessbordanes”
• Mainly a SCRIPDB legacy from CWUs
• Still there but more amusing than a serious Con
14. Con: virtuals
• “Deuterogate”: an example of 1000s of enumerations without reduction to practice
(i.e. no data)
• Unforeseen consequences of the flow from patents into PubChem
• US20080045558: 506 deuterated codeines (CID 5284371), 206 deuterated
oxycodones (CID 5284603), 3,251 SIDs from SureChEMBL, SCRIPDB and IBM (all CWUs)
• SciFinder extracted 1014 isotopic substances under “biological study”
Preparation and utility of opioid analgesics, Auspex
15. Comparative coverage (1) single patent
Pro: overlaps, Con: divergence
• US9181236B1, 2015, “2-spiro-substituted iminothiazines and their mono- and
dioxides as BACE inhibitors”
• 173 BindingDB CIDs curated from PubChem
• 405 substances SDF from SciFinder > OpenBabel > 391 InChIKeys (IK) > 362 CIDs
• 1657 rows > 834 SureChEMBL IDs > 664 CIDs
• https://pubchem.ncbi.nlm.nih.gov/patent/US918123 gives 742 CIDs
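Once each source has been reduced to a CID list for the same patent, overlap and divergence can be quantified with plain set operations. This is a sketch only: the CID values are toy integers, not the real lists for US9181236, and the function name is illustrative.

```python
from itertools import combinations


def pairwise_overlap(cid_sets):
    """Return shared and source-specific CID counts for every pair of sources."""
    result = {}
    for (name_a, set_a), (name_b, set_b) in combinations(cid_sets.items(), 2):
        result[(name_a, name_b)] = {
            "shared": len(set_a & set_b),
            "only_first": len(set_a - set_b),
            "only_second": len(set_b - set_a),
        }
    return result


# Toy data standing in for the per-source CID lists of one patent.
cid_sets = {
    "BindingDB": {1, 2, 3},
    "SciFinder": {2, 3, 4, 5},
    "SureChEMBL": {3, 4, 5, 6, 7},
}

for pair_names, counts in pairwise_overlap(cid_sets).items():
    print(pair_names, counts)

# CIDs present in every source (the consensus core).
core = set.intersection(*cid_sets.values())
print("In all sources:", sorted(core))
```

The source-specific CIDs are the ones worth inspecting by eye, since they are where CNER fragments and missed structures tend to show up.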
16. Comparative coverage (II): patents vs papers
• Intersect of ~0.5 mill CIDs is a Pro, but there are caveats
• ChEMBL extraction from papers is 1.3 mill with the rest confirmed BioAssays
from mostly MLSCN compounds
• Patents include extractions from PubMed abstracts by IBM
• ChEMBL includes the patent extractions of BindingDB (but only 73K)
17. Comparative coverage (III): drug source matches
• Chart is “look back” cumulative CID coverage of INN and Guide to Pharmacology
• From 9479 INNs, 87% have a patent match (n.b. 82% have a ChEMBL match)
• From 7159 in GtoPdb 79% have a patent match
• From 9767 in DrugBank 72% have a patent match
• Caveat: some matches may be from secondary patents (i.e. not first-filings)
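The coverage percentages can be converted into approximate compound counts. A minimal sketch using only the totals and rounded percentages quoted above, so the results are approximate to within the rounding of each percentage:

```python
def approx_matches(total, fraction):
    """Approximate number of matched entries from a total and a rounded fraction."""
    return round(total * fraction)


# (total entries, fraction with a patent match), as quoted on the slide.
drug_sources = {
    "INN": (9479, 0.87),
    "GtoPdb": (7159, 0.79),
    "DrugBank": (9767, 0.72),
}

for name, (total, fraction) in drug_sources.items():
    matched = approx_matches(total, fraction)
    print(f"{name}: ~{matched} of {total} with a patent match, ~{total - matched} without")
```

For INNs, for example, this gives roughly 8,250 with a patent match and roughly 1,230 without.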
18. Pro and con loose ends
• CNER is confounded by dense image tables and poor OCR (e.g. WO PDFs)
• CNER is brainless compared to manual extraction (e.g. CID 2791850)
• CNER pipelines are divergent
• No Markush handling
• Peptide capture is patchy
• Can only filter “in claims” via IBM SID tags
• In bioactivity and SAR terms there are probably no more than ~ 50K A61/C07
quality documents with useful data from last decade
• These cover only ~ 3.5 million bioactives (but ~2x the literature)
• So we could have an overhead of ~20 million non-bioactives
19. The security “con”
• Drug discovery organisations that file may prohibit the open searching of
proprietary structures via the PubChem interface outside the firewall
• Notwithstanding, there is no patent case-law precedent for composition-of-
matter claims being challenged on the basis of structures intercepted from an
open server
• Ipso facto prohibition of open searching constitutes a major nailing-of-feet-to-
the-floor
• You can do initial scoping searches from home or your phone anyway
• You can do an InChIKey inner layer search, including against UniChem at 156
mill and Google (~200 mill?) but this is skeleton exact match
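The skeleton search in the last bullet can be sketched as follows, assuming “inner layer” refers to the first 14-character block of the InChIKey (the connectivity hash). Keys sharing that block have the same skeleton but may differ in stereochemistry or isotopic labelling. The helper names are illustrative, and the second key below is made up for demonstration.

```python
from collections import defaultdict


def inchikey_skeleton(inchikey):
    """Return the first (connectivity) block of a standard InChIKey."""
    parts = inchikey.strip().upper().split("-")
    if len(parts) != 3 or [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"Not a well-formed InChIKey: {inchikey!r}")
    return parts[0]


def group_by_skeleton(inchikeys):
    """Group InChIKeys that share the same connectivity block."""
    groups = defaultdict(list)
    for key in inchikeys:
        groups[inchikey_skeleton(key)].append(key)
    return dict(groups)


ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
# Hypothetical key: same skeleton block, invented second block (not a real compound).
MADE_UP_VARIANT = "BSYNRYMUTXBXSQ-XXXXXXXXSA-N"

print(inchikey_skeleton(ASPIRIN))
print(group_by_skeleton([ASPIRIN, MADE_UP_VARIANT]))
```

The returned block can then be pasted into a text search box, which is why the match is limited to exact skeletons rather than similarity or substructure.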
20. Conclusions
• PubChem open patent chemistry has more Pros than Cons
• Extensive synergy with SureChEMBL as the largest maintained source
• This may be a better first-stop shop for metadata slicing
• Users need to understand CNER quirks, pitfalls
• Difficult to get hard comparative coverage stats but indication is that PubChem
has the majority of exemplified structures from patents
• The non-redundant corpus of quality Med. Chem. patents is not only surprisingly
small but also fully open for text mining
• Those without commercial sources are well enabled for open patent mining
• However, they should be circumspect about relying on it for comprehensive prior-
art and due-diligence checking
• Those with commercial sources now have to perform open searching in parallel anyway
21. Further reading and COI
https://www.ncbi.nlm.nih.gov/pubmed/29451740
https://www.researchgate.net/publication/313264567_Examples_of_SAR-Centric_Patent_Mining_Using_Open_Resources
https://sites.google.com/view/tw2informatics/home
Conflict of interest (minor): has done patent analysis consulting