Pros and cons of 23 million
patent-extracted structures in PubChem
Christopher Southan, Senior Cheminformatician, IUPHAR/BPS Guide to
Pharmacology, Discovery Brain Sciences, University of Edinburgh, UK.
ACS Boston, Sunday Aug 19th 2018, Chemical Structure Searching for Patent Information
Session, 2:15 PM - 2:45 PM, Harbor Ballroom III - Westin Boston Waterfront
https://www.slideshare.net/cdsouthan
Abstract (will not be shown)
As of March 2018, the major automated patent chemistry extractions (in ascending size: NextMove,
SCRIPDB, IBM and SureChEMBL) cover 22.17 million CIDs from the PubChem total of 94.7 million. These
have become hugely enabling, with advantages including: a) the majority of patent-exemplified
structures of medicinal chemistry interest are now in PubChem, b) first filings of lead series and
clinical candidates can be tracked, c) the PubChem toolbox has features difficult to match in
commercial sources, d) many structures can be associated with bioactivity data, e) connections
between papers and patents can be made via ChEMBL entries, f) BindingDB has accumulated a valuable
collection of manual SAR extractions from US patents that can be intersected with the automatically
extracted structures and g) coverage for some documents approaches that of SciFinder. However,
there is a range of disadvantages and caveats associated with automated extraction. These include:
a) coverage is compromised by dense image tables, Markush nesting and poor OCR quality of WO
documents, b) as the major pipeline, SureChEMBL can have a PubChem updating lag of some months
relative to its in situ content, c) automated extraction generates structural “noise” that degrades
chemistry quality, mainly from the conversion of split IUPAC strings, d) PubChem patent document
indexing is patchy, e) nothing in the records actually indicates IP status, f) continual
re-extraction of common chemistry results in irrelevant structure-to-document associations (e.g.
126,949 patents for aspirin), g) authentic compounds are contaminated with spurious mixtures of
various types as well as never-made virtuals (surprisingly, these include 44K deuterated drug
analogues) and h) outside the BindingDB set, linking between SAR data and targets from recent
filings is still a manual exercise, but examples will show how this can be done. Searching using
SureChEMBL as an entry portal and moving from intra-document chemistry exemplifications out to
search PubChem, including the advantages of structure clustering, will be demonstrated. Balancing
the pros and cons indicates that the PubChem patent extraction “big bang” over the last five years
presents users with the best of both worlds. Academics can now patent mine extensively and PubChem
has become an essential adjunct to commercial sources of patent chemistry and associated bio
entities such as diseases and drug targets.
Introduction and outline
• I assume general awareness of patent chemistry value, database chemistry
searching and SAR mapping to targets (background refs in final slide)
• Since PubChem is free there are no serious “cons”, so these slides are better
classified as caveats and gotchas
• Note these related presentations: “Structure searching for patent information:
The need for speed” (May) 13:45 CINF 35; “Automating chemical structure and
inhibition data extraction from patents” 2:45 CINF 37, Hinton; “Searching for
patent information in PubChem” Kim et al., 3:30 CINF 38; “Beyond journal
articles – extracting bioactivity data from patents” Gaulton et al., 9:00 CINF 116
• These slides will cover: source numbers, source intersects, fragmentation,
virtuals, clustering, relative coverage of drug sources, BindingDB, common
chemistry, loose ends of pros and cons, summary and further info
Snapshot Aug 2018: PubChem 96.5 mill
• Major sources are Chemical Named Entity Recognition (CNER) pipelines
• Thomson Pharma (2006-2016 R.I.P.) manual extraction of 4.3 million CIDs from
patents and papers, would probably add ~ 1.0 mill patent structures
• 24% of PubChem CIDs include at least one patent extraction SID
• 49% of PubChem CIDs are single-source
• 26% (12.6 mill) of these come from patent sources
• ~ 1.2 SID:CID ratio
• Note NextMove SIDs have had synthesis data extracted (PMID: 27028220)
CID property splits by patent source
Patent CIDs by year (cumulative)
• SureChEMBL is the only major source regularly updating
• But gotchas in exact load times (e.g. as of 04 Aug):
– In situ; WO chemistry downloadable ~ 1 week post-publication
– In UniChem, 27 July 2018 update = 19,648,678
– In PubChem, load date 23 June 2018 = 18,415,971 CIDs
• Will there be a post-2017 IBM refresh?
Pro: divergence, cons: this has ceased
and remains largely unexplained
[Figure: three-way Venn diagram of patent-source CIDs, in millions]
IBM = 10.7 mill
SCRIPDB = 4.0 mill (a one-off)
SureChEMBL = 17.6 mill
Region values: 2.9, 2.4, 4.7, 10.1, 0.6, 0.4, 0.50
Union = 21.7
3-way = 2.4
3 + 2-way = 8.1
Unique = 13.5
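The Venn totals above can be cross-checked by summing regions. The sketch below is only a consistency check: the slide text does not say which number belongs to which region, so the assignment used here is an assumption inferred from the stated source totals.

```python
# Venn region sizes in millions of CIDs, as read from the slide.
# ASSUMPTION: the mapping of each number to a region is inferred from the
# stated totals; the slide text itself does not label the regions.
regions = {
    frozenset({"SureChEMBL"}): 10.1,
    frozenset({"IBM"}): 2.9,
    frozenset({"SCRIPDB"}): 0.5,
    frozenset({"IBM", "SureChEMBL"}): 4.7,
    frozenset({"IBM", "SCRIPDB"}): 0.6,
    frozenset({"SureChEMBL", "SCRIPDB"}): 0.4,
    frozenset({"IBM", "SureChEMBL", "SCRIPDB"}): 2.4,
}


def total_where(predicate):
    """Sum the regions whose member set satisfies the predicate (1 decimal)."""
    return round(sum(size for members, size in regions.items() if predicate(members)), 1)


unique = total_where(lambda m: len(m) == 1)   # CIDs seen in one source only
shared = total_where(lambda m: len(m) >= 2)   # 3-way plus 2-way overlaps
union = total_where(lambda m: True)           # all regions together
per_source = {
    s: total_where(lambda m, s=s: s in m)
    for s in ("IBM", "SCRIPDB", "SureChEMBL")
}

print("unique:", unique, "shared:", shared, "union:", union)
print(per_source)
```

Under this assignment the unique (13.5) and 3 + 2-way (8.1) totals match the slide exactly, while the union, IBM and SCRIPDB totals come out 0.1 lower than the slide values, presumably because the region numbers are rounded.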
Con: CNER fragmentation and mixtures
[Figure: distributions for two groups of sources: ChEMBL + Thomson Pharma manual extraction vs. patent CNER sources]
• Low shoulder includes split IUPACs, Markush bits, synthetic schemes from single
images and mixture splits
• High shoulder peptide drop-off?
Con and Pro: intermediates from US-20150291621
Pro: PubChem “slice ‘n dice” features
• Some PubChem functionality may be difficult to mimic in commercial databases
• Powerful similarity “walking” between patents, papers, BioAssays, structures, vendors, etc.
Pro: manual SAR extraction > BindingDB > PubChem
• 151,314 structures from 2098 USPTO patents, 2013 - 2018 (via CWUs)
• 146,751 patent-only
• Subsumed by ChEMBL at release time (e.g. release 24 has 74,050 of these)
Common-chemistry-to-many-documents
(futile indexing)
• PubChem aspirin (CID 2244) linked to 134,286 patent documents
• SureChEMBL aspirin structure search gives 401,341 document matches
• SureChEMBL 78,351 document links for aspirin name search
• SureChEMBL aspirin structure search, restricted to WO-only, claims
section and 2018, gives 152 documents
• SciFinder 8,985 patent references for aspirin by name or structure
• Chart: corpus count (x-axis) vs compounds (y-axis) for US9181236
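The PubChem side of the aspirin comparison can be repeated programmatically. This is a minimal sketch, assuming the PUG REST `xrefs/PatentID` operation and its `InformationList` / `Information` JSON layout; the function names are illustrative, and current counts will differ from the 2018 figures on this slide.

```python
import json
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def patent_xref_url(cid):
    """Build the PUG REST URL listing patent identifiers linked to a CID."""
    return f"{PUG_REST}/compound/cid/{int(cid)}/xrefs/PatentID/JSON"


def patent_ids_from_response(payload):
    """Collect de-duplicated patent IDs from a parsed response.

    ASSUMPTION: the response has the shape
    {"InformationList": {"Information": [{"CID": ..., "PatentID": [...]}]}}.
    """
    ids = []
    for info in payload.get("InformationList", {}).get("Information", []):
        ids.extend(info.get("PatentID", []))
    return sorted(set(ids))


def fetch_patent_ids(cid, timeout=60):
    """Download and parse the patent cross-references for one CID (needs network)."""
    with urlopen(patent_xref_url(cid), timeout=timeout) as resp:
        return patent_ids_from_response(json.load(resp))


if __name__ == "__main__":
    # CID 2244 is aspirin, as on the slide.
    print(len(fetch_patent_ids(2244)), "patent documents linked to CID 2244")
```

A count of this size mostly reflects re-extraction of a common structure rather than documents that are about aspirin, which is the point of the slide.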
BISTS (BIg Strange ThingS) from patents:
the infamous “Chessboardanes”
• Mainly a SCRIPDB legacy from CWUs
• Still there but more amusing than a serious Con
Con: virtuals
• “Deuterogate” example of 1000s of enumerations without reduction to
practice (i.e. no data)
• Unforeseen consequences of the patents > PubChem flow
• US20080045558: 506 deuterated codeines (CID 5284371), 206 deuterated
oxycodones (CID 5284603), 3,251 SIDs: SureChEMBL, SCRIPDB, IBM (all CWUs)
• SciFinder extracted 1014 isotopic substances under “biological study”
Preparation and utility of opioid analgesics, Auspex
Comparative coverage (I): single patent
Pro: overlaps, Con: divergence
• US9181236B1, 2015, “2-spiro-substituted iminothiazines and their mono- and
dioxides as BACE inhibitors”
• 173 BindingDB CIDs curated from PubChem
• 405 substances as SDF from SciFinder > OpenBabel > 391 InChIKeys (IK) > 362 CIDs
• 1657 rows > 834 SureChEMBL IDs > 664 CIDs
• https://pubchem.ncbi.nlm.nih.gov/patent/US918123 gives 742 CIDs
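The two conversion chains above lose records at each step. As a small worked example, the sketch below expresses each step as a percentage of its starting count, using only the numbers on the slide; the labels and the `retention` helper are illustrative.

```python
# Counts taken from the slide for US9181236B1.
funnels = {
    "SciFinder": [("substances in SDF", 405), ("InChIKeys", 391), ("CIDs", 362)],
    "SureChEMBL": [("rows", 1657), ("SureChEMBL IDs", 834), ("CIDs", 664)],
}


def retention(steps):
    """Return (label, count, percent of the first step) for each funnel step."""
    start = steps[0][1]
    return [(label, n, round(100 * n / start, 1)) for label, n in steps]


for source, steps in funnels.items():
    for label, n, pct in retention(steps):
        print(f"{source:10s} {label:18s} {n:5d} {pct:5.1f}%")
```

On these numbers about 89% of the SciFinder substances and about 40% of the SureChEMBL rows end up as PubChem CIDs. The slide does not say why records drop out at each step, so the percentages should be read as descriptive only.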
Comparative coverage (II): patents vs papers
• Intersect of ~0.5 mill CIDs is a Pro, but there are caveats
• ChEMBL extraction from papers is 1.3 mill with the rest confirmed BioAssays
from mostly MLSCN compounds
• Patents include extractions from PubMed abstracts by IBM
• ChEMBL includes the patent extractions of BindingDB (but only 73K)
Comparative coverage (III): drug source matches
• Chart is “look back” cumulative CID coverage of INN and Guide to Pharmacology
• From 9479 INNs, 87% have a patent match (n.b. 82% have a ChEMBL match)
• From 7159 in GtoPdb 79% have a patent match
• From 9767 in DrugBank 72% have a patent match
• Caveat: some matches may be from secondary patents (i.e. not first-filings)
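The percentages above translate into approximate compound counts. This is a back-of-envelope sketch using only the totals and rounded percentages on the slide, so each count is accurate to roughly half a percent of its total; the `approx_matches` helper is illustrative.

```python
# (source, total entries, percent with a patent-source match), from the slide.
coverage = [("INN", 9479, 87), ("GtoPdb", 7159, 79), ("DrugBank", 9767, 72)]


def approx_matches(total, percent):
    """Approximate number of entries with a patent match, from a rounded percent."""
    return round(total * percent / 100)


for name, total, pct in coverage:
    matched = approx_matches(total, pct)
    print(f"{name:9s} ~{matched} matched, ~{total - matched} unmatched")
```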
Pro and con loose ends
• CNER is confounded by dense image tables and poor OCR (e.g. WO PDFs)
• CNER is brainless compared to manual extraction (e.g. CID 2791850)
• CNER pipelines are divergent
• No Markush handling
• Peptide capture is patchy
• Can only filter “in claims” via IBM SID tags
• In bioactivity and SAR terms there are probably no more than ~ 50K A61/C07
quality documents with useful data from last decade
• These cover only ~ 3.5 million bioactives (but ~2x the literature)
• So we could have an overhead of ~ 20 million non-bioactives
The security “con”
• Drug discovery organisations that file may prohibit the open searching of
proprietary structures via the PubChem interface outside the firewall
• Notwithstanding, there is no patent case-law precedent for composition-of-matter
claims being challenged on the basis of structures intercepted from an open server
• Ipso facto, prohibition of open searching constitutes a major
nailing-of-feet-to-the-floor
• You can do initial scoping searches from home or your phone anyway
• You can do an InChIKey inner layer search, including against UniChem at 156
mill and Google (~200 mill?) but this is skeleton exact match
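The InChIKey inner-layer search in the last bullet uses only the first 14-character block of the key, which is why it behaves as a skeleton exact match. A minimal sketch of extracting that block follows; the function names are illustrative, and the example key is the standard InChIKey for aspirin.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
_INCHIKEY = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def skeleton_block(inchikey):
    """Return the first (connectivity) block of an InChIKey for an inner-layer search."""
    key = inchikey.strip().upper()
    if not _INCHIKEY.match(key):
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return key.split("-")[0]


def same_skeleton(key_a, key_b):
    """True if two InChIKeys share the first block, whatever the later blocks say."""
    return skeleton_block(key_a) == skeleton_block(key_b)


aspirin = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
print(skeleton_block(aspirin))  # the string to paste into UniChem or a web search
```

Because the later blocks are dropped, hits can differ from the query in the features those blocks encode, such as stereochemistry or isotopic labelling, so each match still needs checking.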
Conclusions
• PubChem open patent chemistry has more Pros than Cons
• Extensive synergy with SureChEMBL as the largest maintained source
• This may be a better first-stop shop for metadata slicing
• Users need to understand CNER quirks, pitfalls
• Difficult to get hard comparative coverage stats but indication is that PubChem
has the majority of exemplified structures from patents
• The non-redundant corpus of quality Med. Chem. patents is not only surprisingly
small but also fully open for text mining
• Those without commercial sources are well enabled for open patent mining
• However, they should be circumspect about relying on it for comprehensive
prior-art and due-diligence checking
• Those with commercial sources now have to perform open searching in parallel anyway
Further reading and COI
https://www.ncbi.nlm.nih.gov/pubmed/29451740
https://www.researchgate.net/publication/313264567_Examples_of_SAR-Centric_Patent_Mining_Using_Open_Resources
https://sites.google.com/view/tw2informatics/home
Conflict of interest (minor): has done patent analysis consulting

Weitere ähnliche Inhalte

Was ist angesagt?

CINF 55: SureChEMBL: An open patent chemistry resource
CINF 55: SureChEMBL: An open patent chemistry resourceCINF 55: SureChEMBL: An open patent chemistry resource
CINF 55: SureChEMBL: An open patent chemistry resourceGeorge Papadatos
 
Searching for patent information in PubChem
Searching for patent information in PubChem Searching for patent information in PubChem
Searching for patent information in PubChem Sunghwan Kim
 
GPU-accelerated Virtual Screening
GPU-accelerated Virtual ScreeningGPU-accelerated Virtual Screening
GPU-accelerated Virtual ScreeningOlexandr Isayev
 
IUPHAR/BPS Guide to Pharmacology: concise mapping of chemistry, data, and tar...
IUPHAR/BPS Guide to Pharmacology: concise mapping of chemistry, data, and tar...IUPHAR/BPS Guide to Pharmacology: concise mapping of chemistry, data, and tar...
IUPHAR/BPS Guide to Pharmacology: concise mapping of chemistry, data, and tar...Chris Southan
 
SureChEMBL and Open PHACTS
SureChEMBL and Open PHACTSSureChEMBL and Open PHACTS
SureChEMBL and Open PHACTSGeorge Papadatos
 
SureChEMBL patent annotations in Open PHACTS
SureChEMBL patent annotations in Open PHACTSSureChEMBL patent annotations in Open PHACTS
SureChEMBL patent annotations in Open PHACTSGeorge Papadatos
 
The open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveatsThe open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveatsDr. Haxel Consult
 
Causes and consequences of automated extraction of patent-specified virtual d...
Causes and consequences of automated extraction of patent-specified virtual d...Causes and consequences of automated extraction of patent-specified virtual d...
Causes and consequences of automated extraction of patent-specified virtual d...Chris Southan
 
Digging out Structures for Repurposing: Non-competitive Intelligence ...
Digging out Structures for Repurposing: Non-competitive Intelligence        ...Digging out Structures for Repurposing: Non-competitive Intelligence        ...
Digging out Structures for Repurposing: Non-competitive Intelligence ...Chris Southan
 
Curatorial data wrangling for the Guide to PHARMACOLGY
Curatorial data wrangling for the Guide to PHARMACOLGY Curatorial data wrangling for the Guide to PHARMACOLGY
Curatorial data wrangling for the Guide to PHARMACOLGY Chris Southan
 
Antimalarial drug dscovery data disclosure
Antimalarial drug dscovery data disclosureAntimalarial drug dscovery data disclosure
Antimalarial drug dscovery data disclosureChris Southan
 
PubChem for chemical information literacy training
PubChem for chemical information literacy trainingPubChem for chemical information literacy training
PubChem for chemical information literacy trainingSunghwan Kim
 
Capturing BIA-10-2474 and related FAAH inhibitor data
Capturing BIA-10-2474 and related FAAH inhibitor dataCapturing BIA-10-2474 and related FAAH inhibitor data
Capturing BIA-10-2474 and related FAAH inhibitor dataChris Southan
 
Guide to PHARMACOLOGY: a web-Based Compendium for Research and Education
Guide to PHARMACOLOGY: a web-Based Compendium for Research and EducationGuide to PHARMACOLOGY: a web-Based Compendium for Research and Education
Guide to PHARMACOLOGY: a web-Based Compendium for Research and EducationChris Southan
 
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...ChemAxon
 
The Patent Chemistry “Big Bang” In Pubchem
The Patent Chemistry “Big Bang” In PubchemThe Patent Chemistry “Big Bang” In Pubchem
The Patent Chemistry “Big Bang” In PubchemChris Southan
 
Semantic Technology: The Basics
Semantic Technology: The BasicsSemantic Technology: The Basics
Semantic Technology: The BasicsPeter Berger
 

Was ist angesagt? (20)

Overview of SureChEMBL
Overview of SureChEMBLOverview of SureChEMBL
Overview of SureChEMBL
 
ChEMBL+KNIME
ChEMBL+KNIMEChEMBL+KNIME
ChEMBL+KNIME
 
CINF 55: SureChEMBL: An open patent chemistry resource
CINF 55: SureChEMBL: An open patent chemistry resourceCINF 55: SureChEMBL: An open patent chemistry resource
CINF 55: SureChEMBL: An open patent chemistry resource
 
Searching for patent information in PubChem
Searching for patent information in PubChem Searching for patent information in PubChem
Searching for patent information in PubChem
 
GPU-accelerated Virtual Screening
GPU-accelerated Virtual ScreeningGPU-accelerated Virtual Screening
GPU-accelerated Virtual Screening
 
IUPHAR/BPS Guide to Pharmacology: concise mapping of chemistry, data, and tar...
IUPHAR/BPS Guide to Pharmacology: concise mapping of chemistry, data, and tar...IUPHAR/BPS Guide to Pharmacology: concise mapping of chemistry, data, and tar...
IUPHAR/BPS Guide to Pharmacology: concise mapping of chemistry, data, and tar...
 
SureChEMBL and Open PHACTS
SureChEMBL and Open PHACTSSureChEMBL and Open PHACTS
SureChEMBL and Open PHACTS
 
SureChEMBL patent annotations in Open PHACTS
SureChEMBL patent annotations in Open PHACTSSureChEMBL patent annotations in Open PHACTS
SureChEMBL patent annotations in Open PHACTS
 
The open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveatsThe open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveats
 
Causes and consequences of automated extraction of patent-specified virtual d...
Causes and consequences of automated extraction of patent-specified virtual d...Causes and consequences of automated extraction of patent-specified virtual d...
Causes and consequences of automated extraction of patent-specified virtual d...
 
Digging out Structures for Repurposing: Non-competitive Intelligence ...
Digging out Structures for Repurposing: Non-competitive Intelligence        ...Digging out Structures for Repurposing: Non-competitive Intelligence        ...
Digging out Structures for Repurposing: Non-competitive Intelligence ...
 
Curatorial data wrangling for the Guide to PHARMACOLGY
Curatorial data wrangling for the Guide to PHARMACOLGY Curatorial data wrangling for the Guide to PHARMACOLGY
Curatorial data wrangling for the Guide to PHARMACOLGY
 
Knowledge is Property- All YOU need to know ABC of Patent Searching
Knowledge is Property- All YOU need to know ABC of Patent SearchingKnowledge is Property- All YOU need to know ABC of Patent Searching
Knowledge is Property- All YOU need to know ABC of Patent Searching
 
Antimalarial drug dscovery data disclosure
Antimalarial drug dscovery data disclosureAntimalarial drug dscovery data disclosure
Antimalarial drug dscovery data disclosure
 
PubChem for chemical information literacy training
PubChem for chemical information literacy trainingPubChem for chemical information literacy training
PubChem for chemical information literacy training
 
Capturing BIA-10-2474 and related FAAH inhibitor data
Capturing BIA-10-2474 and related FAAH inhibitor dataCapturing BIA-10-2474 and related FAAH inhibitor data
Capturing BIA-10-2474 and related FAAH inhibitor data
 
Guide to PHARMACOLOGY: a web-Based Compendium for Research and Education
Guide to PHARMACOLOGY: a web-Based Compendium for Research and EducationGuide to PHARMACOLOGY: a web-Based Compendium for Research and Education
Guide to PHARMACOLOGY: a web-Based Compendium for Research and Education
 
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...
 
The Patent Chemistry “Big Bang” In Pubchem
The Patent Chemistry “Big Bang” In PubchemThe Patent Chemistry “Big Bang” In Pubchem
The Patent Chemistry “Big Bang” In Pubchem
 
Semantic Technology: The Basics
Semantic Technology: The BasicsSemantic Technology: The Basics
Semantic Technology: The Basics
 

Ähnlich wie Patents in PubChem

Pros and cons of patent-extracted structures in PubChem
Pros and cons of patent-extracted structures in PubChemPros and cons of patent-extracted structures in PubChem
Pros and cons of patent-extracted structures in PubChemChris Southan
 
Navigatingbetween patents, papers, abstracts and databases using public sourc...
Navigatingbetween patents, papers, abstracts and databases using public sourc...Navigatingbetween patents, papers, abstracts and databases using public sourc...
Navigatingbetween patents, papers, abstracts and databases using public sourc...Sean Ekins
 
Connecting Bioactive Chemistry Across Documents and Databases
Connecting Bioactive Chemistry Across Documents and Databases Connecting Bioactive Chemistry Across Documents and Databases
Connecting Bioactive Chemistry Across Documents and Databases Chris Southan
 
PubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistryPubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistrySunghwan Kim
 
Connectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivityConnectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivityChris Southan
 
FAIR connectivity for DARCP
FAIR  connectivity for DARCPFAIR  connectivity for DARCP
FAIR connectivity for DARCPChris Southan
 
Multiplexing analysis of 1000 approved drugs in PubChem
Multiplexing analysis of 1000 approved drugs in PubChemMultiplexing analysis of 1000 approved drugs in PubChem
Multiplexing analysis of 1000 approved drugs in PubChemChris Southan
 
PubChem and Big Data Chemistry
PubChem and Big Data ChemistryPubChem and Big Data Chemistry
PubChem and Big Data ChemistrySunghwan Kim
 
Mining Drug Targets, Structures and Activity Data
Mining Drug Targets, Structures and Activity DataMining Drug Targets, Structures and Activity Data
Mining Drug Targets, Structures and Activity DataChris Southan
 
Chemicalize.org: User-Selected PubChem Source of Structures from Text
Chemicalize.org: User-Selected PubChem Source of Structures from TextChemicalize.org: User-Selected PubChem Source of Structures from Text
Chemicalize.org: User-Selected PubChem Source of Structures from TextChris Southan
 
Quality and noise in big chemistry databases
Quality and noise in big chemistry databasesQuality and noise in big chemistry databases
Quality and noise in big chemistry databasesChris Southan
 
PubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biologyPubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biologyChris Southan
 
Exploiting PubChem for Drug Discovery
Exploiting PubChem for Drug DiscoveryExploiting PubChem for Drug Discovery
Exploiting PubChem for Drug DiscoverySunghwan Kim
 
Patent annotations: From SureChEMBL to Open PHACTS
Patent annotations: From SureChEMBL to Open PHACTSPatent annotations: From SureChEMBL to Open PHACTS
Patent annotations: From SureChEMBL to Open PHACTSopen_phacts
 
Connecting chemistry-to-biology
Connecting chemistry-to-biology Connecting chemistry-to-biology
Connecting chemistry-to-biology Chris Southan
 
PubChem as a resource for chemical information training
PubChem as a resource for chemical information trainingPubChem as a resource for chemical information training
PubChem as a resource for chemical information trainingSunghwan Kim
 
PubChem as a source of systems biology perturbagens
PubChem as a source of  systems biology perturbagensPubChem as a source of  systems biology perturbagens
PubChem as a source of systems biology perturbagensChris Southan
 
PubChem for drug discovery in the age of big data and artificial intelligence
PubChem for drug discovery in the age of big data and artificial intelligencePubChem for drug discovery in the age of big data and artificial intelligence
PubChem for drug discovery in the age of big data and artificial intelligenceSunghwan Kim
 
BOUNCER: A Privacy-aware Query Processing Over Federations of RDF Datasets
BOUNCER: A Privacy-aware Query Processing Over Federations of RDF DatasetsBOUNCER: A Privacy-aware Query Processing Over Federations of RDF Datasets
BOUNCER: A Privacy-aware Query Processing Over Federations of RDF DatasetsKemele M. Endris
 

Ähnlich wie Patents in PubChem (20)

Pros and cons of patent-extracted structures in PubChem
Pros and cons of patent-extracted structures in PubChemPros and cons of patent-extracted structures in PubChem
Pros and cons of patent-extracted structures in PubChem
 
Navigatingbetween patents, papers, abstracts and databases using public sourc...
Navigatingbetween patents, papers, abstracts and databases using public sourc...Navigatingbetween patents, papers, abstracts and databases using public sourc...
Navigatingbetween patents, papers, abstracts and databases using public sourc...
 
Connecting Bioactive Chemistry Across Documents and Databases
Connecting Bioactive Chemistry Across Documents and Databases Connecting Bioactive Chemistry Across Documents and Databases
Connecting Bioactive Chemistry Across Documents and Databases
 
PubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistryPubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistry
 
Connectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivityConnectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivity
 
FAIR connectivity for DARCP
FAIR  connectivity for DARCPFAIR  connectivity for DARCP
FAIR connectivity for DARCP
 
Multiplexing analysis of 1000 approved drugs in PubChem
Multiplexing analysis of 1000 approved drugs in PubChemMultiplexing analysis of 1000 approved drugs in PubChem
Multiplexing analysis of 1000 approved drugs in PubChem
 
PubChem and Big Data Chemistry
PubChem and Big Data ChemistryPubChem and Big Data Chemistry
PubChem and Big Data Chemistry
 
Mining Drug Targets, Structures and Activity Data
Mining Drug Targets, Structures and Activity DataMining Drug Targets, Structures and Activity Data
Mining Drug Targets, Structures and Activity Data
 
Chemicalize.org: User-Selected PubChem Source of Structures from Text
Chemicalize.org: User-Selected PubChem Source of Structures from TextChemicalize.org: User-Selected PubChem Source of Structures from Text
Chemicalize.org: User-Selected PubChem Source of Structures from Text
 
Does bigger mean better in the world of chemistry databases?
Does bigger mean better in the world of chemistry databases? Does bigger mean better in the world of chemistry databases?
Does bigger mean better in the world of chemistry databases?
 
Quality and noise in big chemistry databases
Quality and noise in big chemistry databasesQuality and noise in big chemistry databases
Quality and noise in big chemistry databases
 
PubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biologyPubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biology
 
Exploiting PubChem for Drug Discovery
Exploiting PubChem for Drug DiscoveryExploiting PubChem for Drug Discovery
Exploiting PubChem for Drug Discovery
 
Patent annotations: From SureChEMBL to Open PHACTS
Patent annotations: From SureChEMBL to Open PHACTSPatent annotations: From SureChEMBL to Open PHACTS
Patent annotations: From SureChEMBL to Open PHACTS
 
Connecting chemistry-to-biology
Connecting chemistry-to-biology Connecting chemistry-to-biology
Connecting chemistry-to-biology
 
PubChem as a resource for chemical information training
PubChem as a resource for chemical information trainingPubChem as a resource for chemical information training
PubChem as a resource for chemical information training
 
PubChem as a source of systems biology perturbagens
PubChem as a source of  systems biology perturbagensPubChem as a source of  systems biology perturbagens
PubChem as a source of systems biology perturbagens
 
PubChem for drug discovery in the age of big data and artificial intelligence
PubChem for drug discovery in the age of big data and artificial intelligencePubChem for drug discovery in the age of big data and artificial intelligence
PubChem for drug discovery in the age of big data and artificial intelligence
 
BOUNCER: A Privacy-aware Query Processing Over Federations of RDF Datasets
BOUNCER: A Privacy-aware Query Processing Over Federations of RDF DatasetsBOUNCER: A Privacy-aware Query Processing Over Federations of RDF Datasets
BOUNCER: A Privacy-aware Query Processing Over Federations of RDF Datasets
 

Mehr von Chris Southan

Peptide tribulations
Peptide tribulationsPeptide tribulations
Peptide tribulationsChris Southan
 
Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2 Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2 Chris Southan
 
Guide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updaeGuide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updaeChris Southan
 
In silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug DevelopmentIn silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug DevelopmentChris Southan
 
Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?Chris Southan
 
Desperately seeking DARCP
Desperately seeking DARCPDesperately seeking DARCP
Desperately seeking DARCPChris Southan
 
Seeking glimmers of light in Pharos “Tdark” proteins
Seeking glimmers of light in  Pharos “Tdark” proteinsSeeking glimmers of light in  Pharos “Tdark” proteins
Seeking glimmers of light in Pharos “Tdark” proteinsChris Southan
 
5HT2A modulators update for SAFER
5HT2A modulators update for SAFER5HT2A modulators update for SAFER
5HT2A modulators update for SAFERChris Southan
 
GtoPdb June 2019 poster
GtoPdb June 2019 posterGtoPdb June 2019 poster
GtoPdb June 2019 posterChris Southan
 
Will the real proteins please stand up
Will the real proteins please stand upWill the real proteins please stand up
Will the real proteins please stand upChris Southan
 
Peptide Tribulations
Peptide TribulationsPeptide Tribulations
Peptide TribulationsChris Southan
 
Looking at chemistry - protein - papers connectivity in ELIXIR
Looking at chemistry - protein - papers connectivity in ELIXIRLooking at chemistry - protein - papers connectivity in ELIXIR
Looking at chemistry - protein - papers connectivity in ELIXIRChris Southan
 
Guide to Immunopharmacology update
Guide to Immunopharmacology updateGuide to Immunopharmacology update
Guide to Immunopharmacology updateChris Southan
 
Druggable Proteome sources in UniProt
Druggable Proteome sources in UniProtDruggable Proteome sources in UniProt
Druggable Proteome sources in UniProtChris Southan
 
Peptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbPeptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbChris Southan
 
Pub Med to PubChem Connectivity
Pub Med to PubChem ConnectivityPub Med to PubChem Connectivity
Pub Med to PubChem ConnectivityChris Southan
 
The IUPHAR/MMV Guide to Malaria Pharmacology
The  IUPHAR/MMV Guide to Malaria Pharmacology  The  IUPHAR/MMV Guide to Malaria Pharmacology
The IUPHAR/MMV Guide to Malaria Pharmacology Chris Southan
 
The big data join in pharmacology
The big data join in pharmacologyThe big data join in pharmacology
The big data join in pharmacologyChris Southan
 
Linking GtoP <> PubChem <> PubMed
Linking GtoP <> PubChem <> PubMed Linking GtoP <> PubChem <> PubMed
Linking GtoP <> PubChem <> PubMed Chris Southan
 
Druggable genome in GtoPdb and other dbs
Druggable genome in GtoPdb and other dbsDruggable genome in GtoPdb and other dbs
Druggable genome in GtoPdb and other dbsChris Southan
 

Patents in PubChem

  • 1. Pros and cons of 23 million patent-extracted structures in PubChem. Christopher Southan, Senior Cheminformatician, IUPHAR/BPS Guide to Pharmacology, Discovery Brain Sciences, University of Edinburgh, UK. ACS Boston, Sunday Aug 19th 2018, Chemical Structure Searching for Patent Information Session, 2:15 PM - 2:45 PM, Harbor Ballroom III - Westin Boston Waterfront. https://www.slideshare.net/cdsouthan
  • 2. Abstract (will not be shown) As of March 2018, the major automated patent chemistry extractions (in ascending size: NextMove, SCRIPDB, IBM and SureChEMBL) cover 22.17 million CIDs from the PubChem total of 94.7 million. These have become hugely enabling, with advantages including a) the majority of patent-exemplified structures of medicinal chemistry interest are now in PubChem, b) first-filings of lead series and clinical candidates can be tracked, c) the PubChem toolbox has features difficult to match in commercial sources, d) many structures can be associated with bioactivity data, e) connections between papers and patents can be made via ChEMBL entries, f) BindingDB has accumulated a valuable collection of manual SAR extractions from US patents that can be intersected with the automatically extracted structures and g) coverage for some documents approaches that of SciFinder. However, there is a range of disadvantages and caveats associated with automated extraction. These include a) coverage compromised by dense image tables, Markush nesting and poor OCR quality of WO documents, b) although SureChEMBL is the major pipeline, its PubChem submissions can lag the in situ database by some months, c) automated extraction generates structural “noise” that degrades chemistry quality, mainly from the conversion of split IUPAC strings, d) PubChem patent document indexing is patchy, e) nothing in the records actually indicates IP status, f) continual re-extraction of common chemistry results in irrelevant structure-to-document associations (e.g. 126,949 patents for aspirin), g) authentic compounds are contaminated with spurious mixtures of various types as well as never-made virtuals (surprisingly, these include 44K deuterated drug analogues) and h) outside the BindingDB set, linking between SAR data and targets from recent filings is still a manual exercise, but examples will be shown of how this can be done.
Searching that uses SureChEMBL as an entry portal and moves from intra-document chemistry exemplifications out to PubChem, including the advantages of structure clustering, will be demonstrated. Balancing the pros and cons indicates that the PubChem patent extraction “big bang” over the last five years presents users with the best of both worlds. Academics can now patent mine extensively, and PubChem has become an essential adjunct to commercial sources of patent chemistry and associated bio entities such as diseases and drug targets.
  • 3. Introduction and outline • I assume general awareness of patent chemistry value, database chemistry searching and SAR mapping to targets (background refs in final slide) • Since PubChem is free there are no serious “cons”, so these slides are better classified as caveats and gotchas • Note these related presentations: “Structure searching for patent information: The need for speed” (May), 13:45, CINF 35; “Automating chemical structure and inhibition data extraction from patents”, Hinton, 2:45, CINF 37; “Searching for patent information in PubChem”, Kim et al., 3:30, CINF 38; “Beyond journal articles – extracting bioactivity data from patents”, Gaulton et al., 9:00, CINF 116 • These slides will cover: source numbers, source intersects, fragmentation, virtuals, clustering, relative coverage of drug sources, BindingDB, common chemistry, loose ends of pros and cons, summary and further info
  • 4. Snapshot Aug 2018: PubChem 96.5 mill • Major sources are Chemical Named Entity Recognition (CNER) pipelines • Thomson Pharma (2006-2016 R.I.P.) manual extraction of 4.3 million CIDs from patents and papers would probably add ~1.0 mill patent structures • 24% of PubChem CIDs include at least one patent extraction SID • 49% of PubChem CIDs are single-source • 26% (12.6 mill) of these come from patent sources • ~1.2 SID:CID ratio • Note NextMove SIDs have had synthesis data extracted (PMID: 27028220)
  • 5. CID property splits by patent source
  • 6. Patent CIDs by year (cumulative) • SureChEMBL is the only major source regularly updating • But there are gotchas in exact load times (e.g. as of 04 Aug): – In situ: WO chemistry downloadable ~1 week post-publication – In UniChem, 27 July 2018 update = 19,648,678 – In PubChem, load date 23 June 2018 = 18,415,971 CIDs • Will there be a post-2017 IBM refresh?
  • 7. Pro: divergence, cons: this has ceased and remains largely unexplained • Venn diagram of patent-source CIDs (millions): IBM = 10.7, SCRIPDB = 4.0 (a one-off), SureChEMBL = 17.6 • Segment values: 2.9, 2.4, 4.7, 10.1, 0.6, 0.4, 0.50 • Union = 21.7; 3-way = 2.4; 3-way + 2-way = 8.1; Unique = 13.5
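The seven segment values on the Venn slide can be cross-checked against the stated totals. The sketch below is only a consistency check: the transcript does not say which number belongs to which region, so the region assignment used here is an assumption inferred from the arithmetic (it is the assignment that reproduces the per-source totals to within rounding).

```python
# Venn segment values from the slide, in millions of CIDs.
# ASSUMPTION: the mapping of each value to a region is inferred from the
# stated totals; the slide transcript does not label the regions.
segments = {
    frozenset({"IBM"}): 2.9,
    frozenset({"SureChEMBL"}): 10.1,
    frozenset({"SCRIPDB"}): 0.5,
    frozenset({"IBM", "SureChEMBL"}): 4.7,
    frozenset({"IBM", "SCRIPDB"}): 0.6,
    frozenset({"SureChEMBL", "SCRIPDB"}): 0.4,
    frozenset({"IBM", "SureChEMBL", "SCRIPDB"}): 2.4,
}


def source_total(source):
    """Sum every region that includes the given source."""
    return round(sum(v for k, v in segments.items() if source in k), 1)


def total_by_overlap(n_sources):
    """Sum every region shared by exactly n_sources sources."""
    return round(sum(v for k, v in segments.items() if len(k) == n_sources), 1)


unique = total_by_overlap(1)                                # single-source regions
shared = round(total_by_overlap(2) + total_by_overlap(3), 1)  # 2-way plus 3-way
union = round(unique + shared, 1)

for name in ("IBM", "SCRIPDB", "SureChEMBL"):
    print(name, source_total(name))
print("unique", unique, "shared", shared, "union", union)
```

Under this assignment the unique and shared totals match the slide exactly, while the IBM, SCRIPDB and union figures come out 0.1 million lower than stated, which is what one would expect from summing values already rounded to one decimal place.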
  • 8. Con: CNER fragmentation and mixtures • Chart compares ChEMBL + Thomson Pharma manual extraction with patent CNER sources • Low shoulder includes split IUPACs, Markush bits, synthetic schema from single images and mixture splits • High shoulder: peptide drop-off?
  • 9. Con and Pro: intermediates from US-20150291621
  • 10. Pro: PubChem “slice ‘n dice” features • Some PubChem functionality may be difficult to mimic in commercial databases • Powerful similarity “walking” between patents, papers, BioAssays, structures, vendors, etc.
  • 11. Pro: manual SAR extraction > BindingDB > PubChem • 151,314 structures from 2098 USPTO patents, 2013 - 2018 (via CWUs) • 146,751 patent-only • Subsumed by ChEMBL at release time (e.g. release 24 has 74,050 of these)
  • 12. Common-chemistry-to-many-documents (futile indexing) • PubChem aspirin (CID 2244) linked to 134,286 patent documents • SureChEMBL aspirin structure search gives 401,341 document matches • SureChEMBL gives 78,351 document links for an aspirin name search • SureChEMBL aspirin structure search, restricted to WO-only, claims section and 2018, gives 152 documents • SciFinder gives 8,985 patent references for aspirin by name or structure • Below: corpus count (x-axis) vs compounds (y-axis) for US9181236
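A count like the PubChem aspirin figure can in principle be re-derived programmatically. The sketch below assumes the PubChem PUG REST cross-reference route (`/compound/cid/<CID>/xrefs/PatentID/JSON`) and an `InformationList` / `Information` / `PatentID` response layout; both are recollections of the public service rather than anything stated in the slides, so check them against the PUG REST documentation before relying on this. The helper names are hypothetical, and any count retrieved today will differ from the 2018 snapshot.

```python
import json
from urllib.request import urlopen

# ASSUMPTION: base URL and xrefs route as recalled from PubChem PUG REST.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def patent_xref_url(cid):
    """Build the (assumed) PUG REST URL listing patent IDs linked to a CID."""
    return f"{PUG_REST}/compound/cid/{int(cid)}/xrefs/PatentID/JSON"


def count_patent_ids(payload):
    """Count distinct patent IDs in an already-parsed xrefs response.

    ASSUMPTION: the payload looks like
    {"InformationList": {"Information": [{"CID": ..., "PatentID": [...]}]}}.
    """
    ids = set()
    for info in payload.get("InformationList", {}).get("Information", []):
        ids.update(info.get("PatentID", []))
    return len(ids)


def fetch_patent_count(cid, timeout=60):
    """Fetch and count patent links for one CID (needs network access)."""
    with urlopen(patent_xref_url(cid), timeout=timeout) as resp:
        return count_patent_ids(json.load(resp))
```

Calling `fetch_patent_count(2244)` would request the aspirin links. For heavily indexed compounds the response is large, which is itself a small illustration of the futile-indexing problem described in the slide.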
  • 13. BISTS (BIg Strange ThingS) from patents: the infamous “Chessboardanes” • Mainly a SCRIPDB legacy from CWUs • Still there, but more amusing than a serious Con
  • 14. Con: virtuals • “Deuterogate”: an example of 1000s of enumerations without reduction to practice (i.e. no data) • Unforeseen consequences of the patents > PubChem flow • US20080045558 (“Preparation and utility of opioid analgesics”, Auspex): 506 deuterated codeines (CID 5284371), 206 deuterated oxycodones (CID 5284603), 3,251 SIDs from SureChEMBL, SCRIPDB and IBM (all CWUs) • SciFinder extracted 1014 isotopic substances under “biological study”
  • 15. Comparative coverage (I): single patent. Pro: overlaps, Con: divergence • US9181236B1, 2015, “2-spiro-substituted iminothiazines and their mono- and dioxides as BACE inhibitors” • 173 BindingDB CIDs curated from PubChem • 405 substances as SDF from SciFinder > OpenBabel > 391 IK > 362 CIDs • 1657 rows > 834 SureChEMBL IDs > 664 CIDs • https://pubchem.ncbi.nlm.nih.gov/patent/US918123 gives 742 CIDs
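The “>” chains on this slide describe attrition as records from each source are converted towards PubChem CIDs. A short sketch of the step-by-step retention follows; the step labels are my reading of the slide's shorthand (in particular, “IK” is assumed to mean InChIKey), and the function name is hypothetical.

```python
def retention(steps):
    """Return (label, count, fraction kept from the previous step) per step.

    The first step has no predecessor, so its fraction is None.
    """
    rows = []
    previous = None
    for label, count in steps:
        fraction = None if previous is None else count / previous
        rows.append((label, count, fraction))
        previous = count
    return rows


# Counts from the slide for US9181236; labels are ASSUMED interpretations.
scifinder = [("SciFinder substances (SDF)", 405), ("InChIKeys", 391), ("PubChem CIDs", 362)]
surechembl = [("SureChEMBL rows", 1657), ("SureChEMBL IDs", 834), ("PubChem CIDs", 664)]

for name, chain in (("SciFinder", scifinder), ("SureChEMBL", surechembl)):
    print(name)
    for label, count, fraction in retention(chain):
        kept = "" if fraction is None else f" ({fraction:.0%} of previous step)"
        print(f"  {label}: {count}{kept}")
```

The same function works for any extraction chain, which makes it easy to see where most structures drop out (for SureChEMBL, the rows-to-IDs step roughly halves the count).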
  • 16. Comparative coverage (II): patents vs papers • Intersect of ~0.5 mill CIDs is a Pro, but there are caveats • ChEMBL extraction from papers is 1.3 mill, with the rest confirmed BioAssays from mostly MLSCN compounds • Patents include extractions from PubMed abstracts by IBM • ChEMBL includes the patent extractions of BindingDB (but only 73K)
  • 17. Comparative coverage (III): drug source matches • Chart is “look back” cumulative CID coverage of INN and Guide to Pharmacology • From 9479 INNs, 87% have a patent match (n.b. 82% have a ChEMBL match) • From 7159 in GtoPdb, 79% have a patent match • From 9767 in DrugBank, 72% have a patent match • Caveat: some matches may be from secondary patents (i.e. not first-filings)
  • 18. Pro and con loose ends • CNER is confounded by dense image tables and poor OCR (e.g. WO PDFs) • CNER is brainless compared to manual extraction (e.g. CID 2791850) • CNER pipelines are divergent • No Markush handling • Peptide capture is patchy • Can only filter “in claims” via IBM SID tags • In bioactivity and SAR terms there are probably no more than ~50K A61/C07 quality documents with useful data from the last decade • These cover only ~3.5 million bioactives (but ~2x the literature) • So we could have an overhead of ~20 million non-bioactives
  • 19. The security “con” • Drug discovery organisations that file may prohibit the open searching of proprietary structures via the PubChem interface outside the firewall • Notwithstanding, there is no patent case-law precedent for composition-of-matter claims being challenged on the basis of structures intercepted from an open server • Ipso facto, prohibition of open searching constitutes a major nailing-of-feet-to-the-floor • You can do initial scoping searches from home or your phone anyway • You can do an InChIKey inner layer search, including against UniChem at 156 mill and Google (~200 mill?), but this is a skeleton exact match
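The InChIKey inner layer search mentioned in the last bullet relies on the first 14-character block of a standard InChIKey, which hashes the molecular skeleton (formula and connectivity). A minimal sketch of a skeleton-level exact match is shown below; the helper names are hypothetical, the aspirin key is the real one, and the second key is a made-up placeholder used only to show two keys that share a first block.

```python
import re
from collections import defaultdict

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def skeleton(inchikey):
    """Return the first (connectivity) block of an InChIKey."""
    key = inchikey.strip().upper()
    if not INCHIKEY_RE.match(key):
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return key[:14]


def group_by_skeleton(inchikeys):
    """Group full InChIKeys by their shared first block."""
    groups = defaultdict(list)
    for key in inchikeys:
        groups[skeleton(key)].append(key)
    return dict(groups)


ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
# HYPOTHETICAL placeholder, not a real compound: same first block as aspirin
# with an invented second block, standing in for e.g. an isotopic variant.
PLACEHOLDER_VARIANT = "BSYNRYMUTXBXSQ-ZZZZZZZZSA-N"

groups = group_by_skeleton([ASPIRIN, PLACEHOLDER_VARIANT])
print(groups)
```

Because only the first block is compared, this finds records that share a skeleton but differ in stereochemistry or isotopic labelling; it is still an exact match on the skeleton, not a similarity search, which is the limitation the slide points out.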
  • 20. Conclusions • PubChem open patent chemistry has more Pros than Cons • Extensive synergy with SureChEMBL as the largest maintained source • This may be a better first-stop shop for metadata slicing • Users need to understand CNER quirks and pitfalls • It is difficult to get hard comparative coverage stats, but the indication is that PubChem has the majority of exemplified structures from patents • The non-redundant corpus of quality Med. Chem. patents is not only surprisingly small but also fully open for text mining • Those without commercial sources are well enabled for open patent mining • However, they should be circumspect about relying on it for comprehensive prior-art and due-diligence checking • Those with commercial sources now have to perform open searching in parallel anyway
  • 21. Further reading and COI • https://www.ncbi.nlm.nih.gov/pubmed/29451740 • https://www.researchgate.net/publication/313264567_Examples_of_SAR-Centric_Patent_Mining_Using_Open_Resources • https://sites.google.com/view/tw2informatics/home • Conflict of interest (minor): has done patent analysis consulting