www.guidetopharmacology.org
The open patent chemistry “big bang”:
large opportunities for small enterprises
Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY,
Centre for Integrative Physiology, University of Edinburgh
ACS Mon, Mar 14 CINF: Division of Chemical Information, 79
SESSION: Chemical Information for Small Businesses & Startups
1:00 PM - 4:55 PM, Room 24C; 4:25 PM - 4:50 PM
http://www.slideshare.net/cdsouthan/patent-chemisty-big-bang-utilities-for-smes
Abstract (will be skipped for presentation)
In 2012, after the first IBM open deposition of 2.5 million structures, few would have
predicted that PubChem compounds that include patent-extracted submissions would
approach 20 million by 2015 (PMID 26194581). The current major open patent
chemistry feeds (in ascending size order) are NextMove, SCRIPDB, Thomson Pharma, IBM and
SureChEMBL. The comparative statistics of these sources will be presented, along with the
arguments that the coverage probability of lead compound prior-art structures is now
very high. The consequence is that the academic community and small companies
can now patent-mine extensively in PubChem and SureChEMBL, possibly even
without needing commercial sources to support their own filings. Other recent major
enabling aspects for small institutions include a) the open availability of patent full text
for querying, b) a range of free tools for DIY chemistry extraction (PMID 23618056)
and c) automatic bioentity mark-up in patent text (e.g. protein names) from the
SureChEMBL/SciBite collaboration. Examples of DIY analysis of newly published
patents will be shown. Even for small enterprises not filing directly, open patent
chemistry presents a big expansion in accessible SAR space, and aspects of mining
this will be exemplified. However, open chemistry extraction does bring in a variety of
artefacts that add confounding structural “noise”. These include a) permutations of
mixtures and chiral exemplifications, b) virtual structures, c) the inability of extractions
from documents to indicate IP status directly and d) “common chemistry” swamping.
These problems and some partial solutions using PubChem filters will be discussed.
Encouraging preface
Outline
• Balancing IP against bioactivity mining
• Source coverage for patent extraction
• Caveats with automated extraction
• The example of US9056843
• Source extraction comparisons
• DIY extraction
• Questions on open searching
• Conclusions
• References
IP vs SAR from open patent mining
IP assessment
• Essential source of prior art chemistry
• De facto adjunct to commercial sources
• Improved portals (EPO, WIPO, FPOL)
• SureChEMBL, TRP & BindingDB active
• PubChem content is chemistry from patents, not patented chemistry
• CNER brainless compared to expert IP-relevance selection
• Claim section extraction often weak
• Extracted artefacts confounding (e.g. mixtures & virtuals)
• Dense image tables still a coverage gap
• IBM and SCRIPDB static in PubChem
• Asian chemistry shortfall
• The “common chemistry” problem
• Patent blitzing for drug candidates
Bioactivity data mining
• Circa 5x more SAR than literature
• Patent families collapse to < 100K C07D primary documents
• Advanced query options in SureChEMBL
• Bulk synthesis extraction (NextMove)
• Valuable intersects with papers, authors and targets via ChEMBL
• Easy intersecting with DIY chemistry extraction from any document
• Obfuscation in example > assay data
• Challenge of judging scientific quality
• Only ~ 5 million structures potentially linkable to bioactivity data
• Thus ~ 15 million have marginal utility
• CNER > structural multiplexing
Big chemistry: prior art statistics
March 2016 snapshots
• GDB-13: 907 million virtual structures (similarity search)
• Google InChIKey: 120+? million (exact match search)
• EBI UniChem: 110.7 million from 27 sources (exact match search)
• CAS: 109 million substances (commercial, similarity search)
• PubChem: 89 million from 390 sources (similarity search)
• ChemSpider: 43 million from 510 sources (similarity search)
• SureChEMBL: 16.8 million (similarity search)
• GVKBio: 6.2 million (commercial bioactivity capture from patents and papers, similarity search)
History of patent chemistry feeds into PubChem
• 2006 - Thomson (Reuters) Pharma (TRP) manual extractions from patents and papers (now 4.3 million, ~40% patents)
• 2011 - IBM phase 1 chemical named entity recognition (CNER), 2.5 million
  - SLING Consortium EPO extraction, 0.1 million
• 2012 - SCRIPDB, CNER plus Complex Work Units (CWU), 4.0 million
• 2013 - SureChem, CNER + images, 9.0 million
• 2014 - BindingDB USPTO assay extraction (now 0.08 million)
• 2015 - (CNER + images + CWU)
  - SureChEMBL, 13.0 million
  - IBM phase 2, 7.0 million
  - NextMove Software, 1.4 million, synthesis mapping
• 2016 - SureChEMBL, 15.8 million
• CIDs from CNER extractions, 19.1 million (from 88.8 million, 4 March)
• Total patent chemistry with estimate from TRP, ~ 20.5 million
CNER patent sources vs. patent and paper curation: corroboration and divergence
[Venn diagram. Set totals: CNER sources (IBM + SCRIPDB + SureChEMBL + NextMove) = 19.01; ChEMBL20 = 1.45; Thomson Pharma = 4.3. Region counts shown on the slide: 17.3, 0.18, 1.4, 2.5, 0.12, 0.25, 0.9. Counts are PubChem Compound Identifiers (CIDs) in millions.]
CNER caveats (I) fragmentation: Mw plots
Can be partially ameliorated by using Mw ranking as a filter
CNER caveats (II) the bioactivity-gap: majority of patent chemistry has no linked assay data
CNER caveats (III): strange patent-unique structures
• Weird stuff generally non-biological chemistry (i.e. not A61)
• For the record: C07D = 10.9, A61K = 0.9, (C07D + A61K) = 0.81 million CIDs
CNER caveats (IV): mixture extractions (a mixed blessing)
• Mostly TFA or HCl salts
• Includes combination claims and reactant mixtures
• Causes sources to appear more divergent by exact match statistics
• PubChem splits to component CIDs while maintaining the back-mapping
• Can normalise with “CovalentUnitCount =1” filter
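The effect of a single-covalent-unit filter can be approximated offline. The following is a minimal sketch, assuming the extracted structures are held as SMILES strings, in which disconnected components are separated by "."; counting those components is used here as a stand-in for PubChem's covalent unit count. The function names and the example records are hypothetical.

```python
def covalent_unit_count(smiles: str) -> int:
    """Count the dot-separated (disconnected) components of a SMILES string."""
    return len([part for part in smiles.split(".") if part])


def single_unit_only(records):
    """Keep only (label, SMILES) records made of one covalent unit."""
    return [(label, smi) for label, smi in records if covalent_unit_count(smi) == 1]


# Hypothetical records: a free base plus its HCl and TFA salt forms.
records = [
    ("free base", "CCN"),
    ("HCl salt", "CCN.Cl"),
    ("TFA salt", "CCN.OC(=O)C(F)(F)F"),
]

for label, smi in records:
    print(label, covalent_unit_count(smi))

print(single_unit_only(records))
```

Dropping the multi-component records removes salts and reactant mixtures from exact-match comparisons between sources, at the cost of also discarding genuine combination claims.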
An example
“Trifluoromethyl-oxadiazole derivatives and their use in the treatment of disease” (Novartis)
PCT for the patent family: WO2013008162, 2013-01-17
SAR table
All three data sets extracted and example-numbered in BindingDB
PubChem retrieval by patent number -> series cluster
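Retrieval by patent number can also be scripted. The sketch below assumes the PubChem PUG REST `xref/PatentID` route and assumes that the identifier is accepted in the plain form `US9056843`; the accepted identifier format may differ (for example with a kind code), so treat both as assumptions to check against the PUG REST documentation. The function names are hypothetical, and the fetch step is defined but not run here.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def patent_cids_url(patent_id: str) -> str:
    """Build a PUG REST URL asking for the CIDs cross-referenced to a patent."""
    return f"{PUG_REST}/compound/xref/PatentID/{quote(patent_id, safe='')}/cids/JSON"


def fetch_patent_cids(patent_id: str, timeout: float = 30.0) -> list:
    """Fetch the CID list; assumes a JSON reply shaped as IdentifierList -> CID."""
    with urlopen(patent_cids_url(patent_id), timeout=timeout) as response:
        payload = json.load(response)
    return payload.get("IdentifierList", {}).get("CID", [])


# Only the URL is built here; call fetch_patent_cids() to query the server.
print(patent_cids_url("US9056843"))
```

The returned CIDs could then be clustered by similarity in PubChem to pull out the exemplified series.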
Extraction splits by source, date and isomeric connectivity (it can get complicated...)
Different sources (SIDs) for same structure (CID)
Different CID isomers with same core connectivity
Impressive SureChEMBL family extraction
4830 rows 648 IDs mapped to 511 PubChem CIDs
Extraction source selectivity
• 151 BindingDB CIDs direct from PubChem
• 93 Thomson Pharma CIDs (within the 151 above)
• 296 SDFs from SciFinder > 269 CIDs
• 648 SureChEMBL IDs > 511 CIDs
• Numbers are not absolute because of “round tripping” mapping issues, but they illustrate the selectivity and extent of open coverage
Orthogonal entity mark-up (I): Ferret (Chrome plug-in)
Orthogonal entity mark-up (II): SciBite’s Termite (within SureChEMBL)
Roll-your-own extraction (II): OSRA
Roll-your-own extraction (I): ChemAxon chemicalize.org
Recent comparative analysis
• Compared SureChEMBL and IBM with SciFinder and Reaxys for a small patent set (i.e. open vs commercial)
• Concluded: “50–66 % of the relevant content from the latter was also found in the former”
• Equivalent comparisons executed in the latest PubChem with all patent sources would probably record a higher overlap
Senger et al., “Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents”, J. Cheminf. 2015, 7:49, doi:10.1186/s13321-015-0097-z (GSK and SureChEMBL)
http://www.ncbi.nlm.nih.gov/pubmed/26457120
First 64K$ Q:
can you search your novel chemistry in open dbs?
• The InChIKey connectivity layer already facilitates blinded exact match
(isomer-agnostic) searching anywhere, including Google
• PubChem and SureChEMBL default to https; so searching is secure
• There is no (and never will be?) patent case law where novelty was challenged
in court based on structures intercepted from public servers
• Without metadata (e.g. target & disease) interception per se not much use
• As for sequence data, hard evidence of serious competitive damage via
query interception remains zero (after 20+ years)
• Commercial dbs cannot capture all prior art, so need open check anyway
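The isomer-agnostic exact match mentioned above relies on the first 14-character block of a standard InChIKey, which is a hash of the connectivity layers. A minimal sketch of isolating that block for use as a search term follows; the function names are hypothetical, and the two example keys are included only to exercise the string handling.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def connectivity_block(inchikey: str) -> str:
    """Return the first (connectivity) block of an InChIKey."""
    key = inchikey.strip().upper()
    if not INCHIKEY_PATTERN.match(key):
        raise ValueError(f"does not look like a standard InChIKey: {inchikey!r}")
    return key.split("-")[0]


def same_connectivity(key_a: str, key_b: str) -> bool:
    """True when two InChIKeys share the same connectivity block."""
    return connectivity_block(key_a) == connectivity_block(key_b)


# Example keys that differ only in the second (stereo/isotope) block.
key_one = "HEFNNWSXXWATRW-UHFFFAOYSA-N"
key_two = "HEFNNWSXXWATRW-JTQLQIEISA-N"

print(connectivity_block(key_one))
print(same_connectivity(key_one, key_two))
```

Because only the hashed block is submitted as the query string, the drawn structure itself is not exposed to the search engine or database.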
Second 64K$ Q:
Can you file based on open-only diligence?
If convinced your novel series < billion$ drug, maybe not - but consider
• Chances of completely missing an overlapping chemical series in
open sources from a competing patent are diminishing
• Prior art is confounded anyway by the 18-month publication shadow
and Markush enumeration
• Filing a 12-month provisional is a low-cost option
• Portal queries allow you to find relevant patents (e.g. by target name)
even if open chemistry extraction was limited
• The searches that really count are the ones the patent examiner does
for you (on payment) using all their sources (including PubChem)
• However, attorney costs for drafting applications need balancing
against savings on commercial patent resources
Conclusions
• The “Big Bang” of open chemistry and full text from patents now makes these
an essential part of IP and bioactivity assessments for SMEs
• The combination of SureChEMBL and other sources within PubChem
provides over 20 million patent-extracted structures and powerful analysis
options
• The gap between open and commercial has narrowed to the point you can at
least consider doing without the latter
• Note also the former has functionality absent from the latter
• Bioactivity identification, mining and target mapping are still challenging but
becoming easier
• It is important to understand patent chemistry automated extraction quirks,
artefacts, and pitfalls so you can filter these
References and questions
http://cdsouthan.blogspot.com/ 19 posts have the tag “patents”
http://www.ncbi.nlm.nih.gov/pubmed/26194581 http://www.ncbi.nlm.nih.gov/pubmed/23506624
(with PubMed Commons data link)
www.ncbi.nlm.nih.gov/pubmed/25415348 http://www.ncbi.nlm.nih.gov/pubmed/23399051
http://www.ncbi.nlm.nih.gov/pubmed/23618056

Weitere ähnliche Inhalte

Was ist angesagt?

Searching for chemical information using PubChem
Searching for chemical information using PubChemSearching for chemical information using PubChem
Searching for chemical information using PubChem
Sunghwan Kim
 
PubChem for drug discovery in the age of big data and artificial intelligence
PubChem for drug discovery in the age of big data and artificial intelligencePubChem for drug discovery in the age of big data and artificial intelligence
PubChem for drug discovery in the age of big data and artificial intelligence
Sunghwan Kim
 

Was ist angesagt? (20)

2015-02-10 The Open PHACTS Discovery Platform: Semantic Data Integration for ...
2015-02-10 The Open PHACTS Discovery Platform: Semantic Data Integration for ...2015-02-10 The Open PHACTS Discovery Platform: Semantic Data Integration for ...
2015-02-10 The Open PHACTS Discovery Platform: Semantic Data Integration for ...
 
Assessing GtoPdb ligand content in PubChem
Assessing GtoPdb ligand content in PubChemAssessing GtoPdb ligand content in PubChem
Assessing GtoPdb ligand content in PubChem
 
Will the correct drugs please stand up?
Will  the correct drugs please stand up?Will  the correct drugs please stand up?
Will the correct drugs please stand up?
 
The IUPHAR/BPS Guide to PHARAMCOLOGY in 2018: new features and updates
The IUPHAR/BPS Guide to PHARAMCOLOGY in 2018: new features and updatesThe IUPHAR/BPS Guide to PHARAMCOLOGY in 2018: new features and updates
The IUPHAR/BPS Guide to PHARAMCOLOGY in 2018: new features and updates
 
BigDataEurope - Big Data & Health
BigDataEurope - Big Data & HealthBigDataEurope - Big Data & Health
BigDataEurope - Big Data & Health
 
Data-driven drug discovery for rare diseases - Tales from the trenches (CINF ...
Data-driven drug discovery for rare diseases - Tales from the trenches (CINF ...Data-driven drug discovery for rare diseases - Tales from the trenches (CINF ...
Data-driven drug discovery for rare diseases - Tales from the trenches (CINF ...
 
How can you access PubChem programmatically?
How can you access PubChem programmatically?How can you access PubChem programmatically?
How can you access PubChem programmatically?
 
Connecting antimalarial data
Connecting antimalarial dataConnecting antimalarial data
Connecting antimalarial data
 
Searching for chemical information using PubChem
Searching for chemical information using PubChemSearching for chemical information using PubChem
Searching for chemical information using PubChem
 
Finding novel lead compounds in pesticide discovery inspired by pharmaceutica...
Finding novel lead compounds in pesticide discovery inspired by pharmaceutica...Finding novel lead compounds in pesticide discovery inspired by pharmaceutica...
Finding novel lead compounds in pesticide discovery inspired by pharmaceutica...
 
De-siloing data and building knowledge graphs outside of drug discovery: Oppo...
De-siloing data and building knowledge graphs outside of drug discovery: Oppo...De-siloing data and building knowledge graphs outside of drug discovery: Oppo...
De-siloing data and building knowledge graphs outside of drug discovery: Oppo...
 
PubChem for chemical information literacy training
PubChem for chemical information literacy trainingPubChem for chemical information literacy training
PubChem for chemical information literacy training
 
GuideToImmunopharmacology_SIF_Nov2019
GuideToImmunopharmacology_SIF_Nov2019GuideToImmunopharmacology_SIF_Nov2019
GuideToImmunopharmacology_SIF_Nov2019
 
2011-10-11 Open PHACTS at BioIT World Europe
2011-10-11 Open PHACTS at BioIT World Europe2011-10-11 Open PHACTS at BioIT World Europe
2011-10-11 Open PHACTS at BioIT World Europe
 
Guide to Malaria Pharmacology, GEMM 2019
Guide to Malaria Pharmacology, GEMM 2019Guide to Malaria Pharmacology, GEMM 2019
Guide to Malaria Pharmacology, GEMM 2019
 
Semantic Technology: The Basics
Semantic Technology: The BasicsSemantic Technology: The Basics
Semantic Technology: The Basics
 
GtoPDB_ELIXIR_UK_AllHands_update_Dec2019
GtoPDB_ELIXIR_UK_AllHands_update_Dec2019GtoPDB_ELIXIR_UK_AllHands_update_Dec2019
GtoPDB_ELIXIR_UK_AllHands_update_Dec2019
 
PubChem for drug discovery in the age of big data and artificial intelligence
PubChem for drug discovery in the age of big data and artificial intelligencePubChem for drug discovery in the age of big data and artificial intelligence
PubChem for drug discovery in the age of big data and artificial intelligence
 
UDM (Unified Data Model) - Enabling Exchange of Comprehensive Reaction Inform...
UDM (Unified Data Model) - Enabling Exchange of Comprehensive Reaction Inform...UDM (Unified Data Model) - Enabling Exchange of Comprehensive Reaction Inform...
UDM (Unified Data Model) - Enabling Exchange of Comprehensive Reaction Inform...
 
2015-04-28 Open PHACTS at Swedish Linked Data Network Meet-up
2015-04-28 Open PHACTS at Swedish Linked Data Network Meet-up2015-04-28 Open PHACTS at Swedish Linked Data Network Meet-up
2015-04-28 Open PHACTS at Swedish Linked Data Network Meet-up
 

Ähnlich wie Patent chemisty big bang: utilities for SMEs

The open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveatsThe open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveats
Dr. Haxel Consult
 
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
Dr. Haxel Consult
 
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
Dr. Haxel Consult
 
Exploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural productsExploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural products
Sunghwan Kim
 
Causes and consequences of automated extraction of patent-specified virtual d...
Causes and consequences of automated extraction of patent-specified virtual d...Causes and consequences of automated extraction of patent-specified virtual d...
Causes and consequences of automated extraction of patent-specified virtual d...
Chris Southan
 
Does bigger mean better in the world of chemistry databases?
Does bigger mean better in the world of chemistry databases? Does bigger mean better in the world of chemistry databases?
Does bigger mean better in the world of chemistry databases?
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...
The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...
The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
USUGM 2014 - Gerald Wyckoff (Chemalytics): Development of the Chemalytics Pl...
USUGM 2014 -  Gerald Wyckoff (Chemalytics): Development of the Chemalytics Pl...USUGM 2014 -  Gerald Wyckoff (Chemalytics): Development of the Chemalytics Pl...
USUGM 2014 - Gerald Wyckoff (Chemalytics): Development of the Chemalytics Pl...
ChemAxon
 

Ähnlich wie Patent chemisty big bang: utilities for SMEs (20)

The Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and CaveatsThe Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
 
20 million public patent structures: looking at the gift horse
20 million public patent structures: looking at the gift horse20 million public patent structures: looking at the gift horse
20 million public patent structures: looking at the gift horse
 
Patents in PubChem
Patents in PubChemPatents in PubChem
Patents in PubChem
 
The open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveatsThe open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveats
 
Pros and cons of patent-extracted structures in PubChem
Pros and cons of patent-extracted structures in PubChemPros and cons of patent-extracted structures in PubChem
Pros and cons of patent-extracted structures in PubChem
 
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
 
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
 
Exploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural productsExploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural products
 
Causes and consequences of automated extraction of patent-specified virtual d...
Causes and consequences of automated extraction of patent-specified virtual d...Causes and consequences of automated extraction of patent-specified virtual d...
Causes and consequences of automated extraction of patent-specified virtual d...
 
PubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistryPubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistry
 
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
 
FAIR connectivity for DARCP
FAIR  connectivity for DARCPFAIR  connectivity for DARCP
FAIR connectivity for DARCP
 
Does bigger mean better in the world of chemistry databases?
Does bigger mean better in the world of chemistry databases? Does bigger mean better in the world of chemistry databases?
Does bigger mean better in the world of chemistry databases?
 
Quality and noise in big chemistry databases
Quality and noise in big chemistry databasesQuality and noise in big chemistry databases
Quality and noise in big chemistry databases
 
Integrating Patents with Research Data
Integrating Patents with Research DataIntegrating Patents with Research Data
Integrating Patents with Research Data
 
PubChem and Big Data Chemistry
PubChem and Big Data ChemistryPubChem and Big Data Chemistry
PubChem and Big Data Chemistry
 
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted AnalysisThe US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
 
The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...
The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...
The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...
 
USUGM 2014 - Gerald Wyckoff (Chemalytics): Development of the Chemalytics Pl...
USUGM 2014 -  Gerald Wyckoff (Chemalytics): Development of the Chemalytics Pl...USUGM 2014 -  Gerald Wyckoff (Chemalytics): Development of the Chemalytics Pl...
USUGM 2014 - Gerald Wyckoff (Chemalytics): Development of the Chemalytics Pl...
 
Overview of SureChEMBL
Overview of SureChEMBLOverview of SureChEMBL
Overview of SureChEMBL
 

Mehr von Chris Southan

Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2 Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2
Chris Southan
 
In silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug DevelopmentIn silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug Development
Chris Southan
 

Mehr von Chris Southan (20)

Connectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivityConnectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivity
 
Peptide tribulations
Peptide tribulationsPeptide tribulations
Peptide tribulations
 
Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2 Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2
 
Guide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updaeGuide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updae
 
In silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug DevelopmentIn silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug Development
 
Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?
 
Desperately seeking DARCP
Desperately seeking DARCPDesperately seeking DARCP
Desperately seeking DARCP
 
Seeking glimmers of light in Pharos “Tdark” proteins
Seeking glimmers of light in  Pharos “Tdark” proteinsSeeking glimmers of light in  Pharos “Tdark” proteins
Seeking glimmers of light in Pharos “Tdark” proteins
 
5HT2A modulators update for SAFER
5HT2A modulators update for SAFER5HT2A modulators update for SAFER
5HT2A modulators update for SAFER
 
Connecting chemistry-to-biology
Connecting chemistry-to-biology Connecting chemistry-to-biology
Connecting chemistry-to-biology
 
GtoPdb June 2019 poster
GtoPdb June 2019 posterGtoPdb June 2019 poster
GtoPdb June 2019 poster
 
PubChem as a source of systems biology perturbagens
PubChem as a source of  systems biology perturbagensPubChem as a source of  systems biology perturbagens
PubChem as a source of systems biology perturbagens
 
PubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biologyPubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biology
 
Will the real proteins please stand up
Will the real proteins please stand upWill the real proteins please stand up
Will the real proteins please stand up
 
Peptide Tribulations
Peptide TribulationsPeptide Tribulations
Peptide Tribulations
 
Looking at chemistry - protein - papers connectivity in ELIXIR
Looking at chemistry - protein - papers connectivity in ELIXIRLooking at chemistry - protein - papers connectivity in ELIXIR
Looking at chemistry - protein - papers connectivity in ELIXIR
 
Guide to Immunopharmacology update
Guide to Immunopharmacology updateGuide to Immunopharmacology update
Guide to Immunopharmacology update
 
Druggable Proteome sources in UniProt
Druggable Proteome sources in UniProtDruggable Proteome sources in UniProt
Druggable Proteome sources in UniProt
 
Peptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbPeptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdb
 
Pub Med to PubChem Connectivity
Pub Med to PubChem ConnectivityPub Med to PubChem Connectivity
Pub Med to PubChem Connectivity
 

Kürzlich hochgeladen

Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
PirithiRaju
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
gindu3009
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
PirithiRaju
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
AlMamun560346
 

Kürzlich hochgeladen (20)

Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx

Patent chemistry big bang: utilities for SMEs

  • 1. The open patent chemistry “big bang”: large opportunities for small enterprises
      Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY, Centre for Integrative Physiology, University of Edinburgh (www.guidetopharmacology.org)
      ACS, Mon, Mar 14. CINF: Division of Chemical Information, 79. Session: Chemical Information for Small Businesses & Startups, 1:00 PM - 4:55 PM, Room 24C, 4:25 PM - 4:50 PM
      http://www.slideshare.net/cdsouthan/patent-chemisty-big-bang-utilities-for-smes
  • 2. Abstract (will be skipped for presentation)
      In 2012, after the first IBM open deposition of 2.5 million structures, few would have predicted that PubChem compounds that include patent-extracted submissions would approach 20 million by 2015 (PMID 26194581). The current major open patent chemistry feeds (in size order) are NextMove, SCRIPDB, Thomson Pharma, IBM and SureChEMBL. The comparative statistics of these sources will be presented, along with the arguments that the coverage probability of lead compound prior-art structures is now very high. The consequences are that the academic community and small companies can now patent-mine extensively in PubChem and SureChEMBL, possibly even without needing commercial sources to support their own filings. Other recent major enabling aspects for small institutions include a) the open availability of patent full-text for querying, b) a range of free tools for DIY chemistry extraction (PMID 23618056) and c) automatic bioentity mark-up in patent text (e.g. protein names) from the SureChEMBL/SciBite collaboration. Examples of DIY analysis of newly published patents will be shown. Even for small enterprises not filing directly, open patent chemistry presents a big expansion in accessible SAR space, and aspects of mining this will be exemplified. However, open chemistry extraction does bring in a variety of artefacts that add confounding structural “noise”. These include a) permutations of mixtures and chiral exemplifications, b) virtual structures, c) the fact that extractions from documents cannot directly indicate IP status and d) “common chemistry” swamping. These problems and some partial solutions using PubChem filters will be discussed.
  • 4. Outline
      • Balancing IP against bioactivity mining
      • Source coverage for patent extraction
      • Caveats with automated extraction
      • The example of US9056843
      • Source extraction comparisons
      • DIY extraction
      • Questions on open searching
      • Conclusions
      • References
  • 5. IP vs SAR from open patent mining
      IP assessment
      • Essential source of prior art chemistry
      • De facto adjunct to commercial sources
      • Improved portals (EPO, WIPO, FPOL)
      • SureChEMBL, TRP & BindingDB active
      • PubChem content is chemistry from patents, not patented chemistry
      • CNER brainless compared to expert IP-relevance selection
      • Claim section extraction often weak
      • Extracted artefacts confounding (e.g. mixtures & virtuals)
      • Dense image tables still a coverage gap
      • IBM and SCRIPDB static in PubChem
      • Asian chemistry shortfall
      • The “common chemistry” problem
      • Patent blitzing for drug candidates
      Bioactivity data mining
      • Circa 5x more SAR than literature
      • Patent families collapse to < 100K C07D primary documents
      • Advanced query options in SureChEMBL
      • Bulk synthesis extraction (NextMove)
      • Valuable intersects with papers, authors and targets via ChEMBL
      • Easy intersecting with DIY chemistry extraction from any document
      • Obfuscation in example > assay data
      • Challenge of judging scientific quality
      • Only ~5 million structures potentially linkable to bioactivity data
      • Thus ~15 million have marginal utility
      • CNER > structural multiplexing
  • 6. Big chemistry: prior art statistics (March 2016 snapshots)
      • GDB-13: 907 million virtual structures (similarity search)
      • Google InChIKey: 120+? million (exact match search)
      • EBI UniChem: 110.7 million, 27 sources (exact match search)
      • CAS: 109 million substances (commercial, similarity search)
      • PubChem: 89 million, 390 sources (similarity search)
      • ChemSpider: 43 million, 510 sources (similarity search)
      • SureChEMBL: 16.8 million (similarity search)
      • GVKBio: 6.2 million (commercial bioactivity capture from patents and papers, similarity search)
  • 7. History of patent chemistry feeds into PubChem
      • 2006: Thomson (Reuters) Pharma (TRP) manual extractions from patents and papers (now 4.3 million, ~40% patents)
      • 2011: IBM phase 1 chemical named entity recognition (CNER), 2.5 million; SLING Consortium EPO extraction, 0.1 million
      • 2012: SCRIPDB, CNER plus Complex Work Units (CWU), 4.0 million
      • 2013: SureChem, CNER + image, 9.0 million
      • 2014: BindingDB USPTO assay extraction (now 0.08 million)
      • 2015 (CNER + images + CWU): SureChEMBL, 13.0 million; IBM phase 2, 7.0 million; NextMove Software synthesis mapping, 1.4 million
      • 2016: SureChEMBL, 15.8 million
      • CIDs from CNER extractions: 19.1 million (from 88.8 million, 4th March)
      • Total patent chemistry with estimate from TRP: ~20.5 million
  • 8. CNER patent sources vs. patent and paper curation: corroboration and divergence
      [Venn diagram; counts are PubChem Compound Identifiers (CIDs) in millions]
      • IBM + SCRIPDB + SureChEMBL + NextMove = 19.01
      • ChEMBL20 = 1.45
      • Thomson Pharma = 4.3
      • Region counts as listed on the slide: 17.3, 0.18, 1.4, 2.5, 0.12, 0.25, 0.9
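The Venn figure itself is lost in this transcript, so which of the seven listed counts belongs to which region is not stated. The sketch below tries one assignment, which is an assumption and not taken from the slide, and checks whether it reproduces the three stated set totals (in millions of CIDs).

```python
# Set totals stated on the slide (millions of PubChem CIDs).
STATED = {"CNER": 19.01, "ChEMBL20": 1.45, "TRP": 4.3}

# ASSUMED mapping of the seven listed counts to Venn regions.
# The transcript does not say which count belongs to which region.
REGIONS = {
    frozenset({"CNER"}): 17.3,
    frozenset({"CNER", "ChEMBL20"}): 0.18,
    frozenset({"CNER", "TRP"}): 1.4,
    frozenset({"TRP"}): 2.5,
    frozenset({"CNER", "ChEMBL20", "TRP"}): 0.12,
    frozenset({"ChEMBL20", "TRP"}): 0.25,
    frozenset({"ChEMBL20"}): 0.9,
}


def set_total(name):
    """Sum every region that includes the named set."""
    return round(sum(v for members, v in REGIONS.items() if name in members), 2)


def union_total():
    """Sum all regions, i.e. the size of the three-way union."""
    return round(sum(REGIONS.values()), 2)


if __name__ == "__main__":
    for name, stated in STATED.items():
        print(f"{name}: regions sum to {set_total(name)}, slide states {stated}")
    print(f"union of all three: {union_total()}")
```

Under this assumed assignment the region sums agree with the stated totals to within rounding, which makes it plausible but does not confirm it.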
  • 9. CNER caveats (I): fragmentation (Mw plots)
      • Can be partially ameliorated by using Mw ranking as a filter
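A minimal sketch of an Mw filter of this kind, assuming each extracted structure is available as a record with a precomputed molecular weight. The record layout, the identifiers and the 200-800 window are hypothetical choices for illustration, not values from the slides.

```python
def filter_by_mw(records, min_mw=200.0, max_mw=800.0):
    """Keep records whose Mw lies inside the window, ranked by Mw.

    The default window is an illustrative assumption; choose limits that
    suit the chemistry being mined.
    """
    kept = [r for r in records if min_mw <= r["mw"] <= max_mw]
    return sorted(kept, key=lambda r: r["mw"])


# Hypothetical extraction output: two small fragments and two lead-like hits.
EXTRACTED = [
    {"id": "frag-1", "mw": 45.0},
    {"id": "lead-1", "mw": 412.5},
    {"id": "frag-2", "mw": 97.1},
    {"id": "lead-2", "mw": 389.4},
]

if __name__ == "__main__":
    for record in filter_by_mw(EXTRACTED):
        print(record["id"], record["mw"])
```

Ranking by Mw before inspecting results pushes the small fragments that name-to-structure conversion tends to produce to one end of the list, where they are easy to cut.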
  • 10. CNER caveats (II): the bioactivity gap. The majority of patent chemistry has no linked assay data
  • 11. CNER caveats (III): strange patent-unique structures
      • Weird stuff generally non-biological chemistry (i.e. not A61)
      • For the record: C07D = 10.9, A61K = 0.9, (C07D + A61K) = 0.81 million CIDs
  • 12. CNER caveats (IV): mixture extractions (a mixed blessing)
      • Mostly TFA or HCl salts
      • Includes combination claims and reactant mixtures
      • Causes sources to appear more divergent by exact match statistics
      • PubChem splits to component CIDs while maintaining the back-mapping
      • Can normalise with “CovalentUnitCount = 1” filter
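The effect of a single-covalent-unit filter can be approximated locally. The sketch below assumes structures are held as SMILES strings, where a "." separates disconnected components. It is a rough stand-in for illustration, not how PubChem computes its own property, and the function names are hypothetical.

```python
def covalent_unit_count(smiles):
    """Count dot-separated components in a SMILES string."""
    return len([part for part in smiles.split(".") if part])


def single_unit_only(smiles_list):
    """Keep only structures that consist of one covalent unit."""
    return [s for s in smiles_list if covalent_unit_count(s) == 1]


EXAMPLES = [
    "CC(=O)Oc1ccccc1C(=O)O",   # aspirin, one unit
    "CCN.Cl",                  # ethylamine hydrochloride, two units
    "CCN.OC(=O)C(F)(F)F",      # ethylamine TFA salt, two units
]

if __name__ == "__main__":
    for smiles in EXAMPLES:
        print(covalent_unit_count(smiles), smiles)
    print(single_unit_only(EXAMPLES))
```

Dropping the multi-component entries removes the salt and mixture forms that otherwise make sources look more divergent under exact-match comparison.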
  • 13. An example: “Trifluoromethyl-oxadiazole derivatives and their use in the treatment of disease” (Novartis). PCT for the patent family: WO2013008162, 2013-01-17
  • 14. SAR table: all three data sets extracted and example-numbered in BindingDB
  • 15. PubChem retrieval by patent number -> series cluster
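The slide shows this retrieval in the PubChem web interface. A scripted equivalent might look like the sketch below, which assumes PubChem's PUG REST service accepts a patent identifier through its `xref/PatentID` namespace and returns CIDs in an `IdentifierList`. The exact identifier format PubChem stores (for example with hyphens or a kind code) is an assumption to check against the live service, and the helper names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def patent_cids_url(patent_id):
    """Build a PUG REST URL requesting the CIDs linked to a patent ID."""
    return f"{PUG_REST}/compound/xref/PatentID/{quote(patent_id, safe='')}/cids/JSON"


def fetch_cids(patent_id, timeout=30):
    """Fetch the CID list for a patent (needs network access)."""
    with urllib.request.urlopen(patent_cids_url(patent_id), timeout=timeout) as resp:
        payload = json.load(resp)
    # Assumed response shape: {"IdentifierList": {"CID": [...]}}
    return payload.get("IdentifierList", {}).get("CID", [])


if __name__ == "__main__":
    # US9056843 is the example patent named in the outline.
    print(patent_cids_url("US9056843"))
```

Only the URL builder runs offline. `fetch_cids` is left for the reader to call and may need the identifier adjusted to whatever form PubChem indexes.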
  • 16. Extraction splits by source, date and isomeric connectivity (it can get complicated...)
      • Different sources (SIDs) for same structure (CID)
      • Different CID isomers with same core connectivity
  • 17. Impressive SureChEMBL family extraction: 4830 rows; 648 IDs mapped to 511 PubChem CIDs
  • 18. Extraction source selectivity
      • 151 BindingDB CIDs direct from PubChem
      • 93 Thomson Pharma CIDs (within the 151 above)
      • 296 SDFs from SciFinder > 269 CIDs
      • 648 SureChEMBL IDs > 511 CIDs
      • Numbers are not absolute because of “round tripping” mapping issues, but they illustrate the selectivity and extent of open coverage
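As a worked check on these counts, the sketch below turns them into mapping and coverage percentages. The counts come from the slide; the percentages are derived here and are subject to the same “round tripping” caveat.

```python
def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100.0 * part / whole, 1)


# Counts as given on the slide.
SURECHEMBL_IDS, SURECHEMBL_CIDS = 648, 511
SCIFINDER_SDFS, SCIFINDER_CIDS = 296, 269
BINDINGDB_CIDS, THOMSON_WITHIN_BINDINGDB = 151, 93

if __name__ == "__main__":
    print("SureChEMBL IDs mapped to CIDs:", pct(SURECHEMBL_CIDS, SURECHEMBL_IDS), "%")
    print("SciFinder SDFs mapped to CIDs:", pct(SCIFINDER_CIDS, SCIFINDER_SDFS), "%")
    print("BindingDB CIDs also in Thomson Pharma:",
          pct(THOMSON_WITHIN_BINDINGDB, BINDINGDB_CIDS), "%")
```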
  • 19. Orthogonal entity mark-up (I): Ferret (Chrome plug-in)
  • 20. Orthogonal entity mark-up (II): SciBite’s Termite (within SureChEMBL)
  • 22. Roll-your-own extraction (I): ChemAxon chemicalize.org
  • 23. Recent comparative analysis
      • Compared SureChEMBL and IBM with SciFinder and Reaxys for a small patent set (i.e. open vs commercial)
      • Concluded: “50–66 % of the relevant content from the latter was also found in the former”
      • Equivalent comparisons executed in the latest PubChem with all patent sources would probably record a higher overlap
      Reference: Senger, et al. Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents. J. Cheminf. 2015, 7:49, doi:10.1186/s13321-015-0097-z (GSK and SureChEMBL) http://www.ncbi.nlm.nih.gov/pubmed/26457120
  • 24. First 64K$ Q: can you search your novel chemistry in open dbs?
      • The InChIKey connectivity layer already facilitates blinded exact match (isomer-agnostic) searching anywhere, including Google
      • PubChem and SureChEMBL default to https, so searching is secure
      • There is no (and never will be?) patent case law where novelty was challenged in court based on structures intercepted from public servers
      • Without metadata (e.g. target & disease), interception per se is not much use
      • As for sequence data, hard evidence of serious competitive damage via query interception remains zero (after 20+ years)
      • Commercial dbs cannot capture all prior art, so an open check is needed anyway
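A minimal sketch of the isomer-agnostic match, assuming standard 27-character InChIKeys in which the first 14-letter block is a hash of the connectivity. Because that block is a hash, querying with it does not expose the structure itself. The example keys are given as those of L-alanine, D-alanine and aspirin and should be checked against PubChem before reuse; the function names are hypothetical.

```python
import re
from collections import defaultdict

# Layout of a standard InChIKey: 14 letters, 10 letters, 1 letter.
_INCHIKEY = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def connectivity_block(inchikey):
    """Return the first 14-character block (connectivity hash) of an InChIKey."""
    key = inchikey.strip().upper()
    if not _INCHIKEY.match(key):
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return key[:14]


def group_by_connectivity(inchikeys):
    """Group full InChIKeys that share the same connectivity block."""
    groups = defaultdict(list)
    for key in inchikeys:
        groups[connectivity_block(key)].append(key)
    return dict(groups)


KEYS = [
    "QNAYBMKLOCPYGJ-REOHGBFJSA-N",  # L-alanine
    "QNAYBMKLOCPYGJ-UWTATZPHSA-N",  # D-alanine
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",  # aspirin
]

if __name__ == "__main__":
    for block, members in group_by_connectivity(KEYS).items():
        print(block, len(members))
```

Searching an open resource, or Google, with only the 14-character block returns every stereoisomer or isotopic form that shares the same skeleton.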
  • 25. Second 64K$ Q: can you file based on open-only diligence?
      If convinced your novel series < billion$ drug, maybe not, but consider:
      • Chances of completely missing an overlapping chemical series in open sources from a competing patent are diminishing
      • Prior art is confounded anyway by the 18-month publication shadow and Markush enumeration
      • Filing a 12-month provisional is a low-cost option
      • Portal queries allow you to find relevant patents (e.g. by target name) even if open chemistry extraction was limited
      • The searches that really count are the ones the patent examiner does for you (on payment) using all their sources (including PubChem)
      • However, attorney costs for drafting applications need balancing against savings on commercial patent resources
  • 26. Conclusions
      • The “Big Bang” of open chemistry and full text from patents now makes these an essential part of IP and bioactivity assessments for SMEs
      • The combination of SureChEMBL and other sources within PubChem provides over 20 million patent-extracted structures and powerful analysis options
      • The gap between open and commercial has narrowed to the point where you can at least consider doing without the latter
      • Note also that the former has functionality absent from the latter
      • Bioactivity identification, mining and target mapping are still challenging but becoming easier
      • It is important to understand the quirks, artefacts and pitfalls of automated patent chemistry extraction so you can filter them
  • 27. References and questions
      • http://cdsouthan.blogspot.com/ (19 posts have the tag “patents”)
      • http://www.ncbi.nlm.nih.gov/pubmed/26194581
      • http://www.ncbi.nlm.nih.gov/pubmed/23506624 (with PubMed Commons data link)
      • http://www.ncbi.nlm.nih.gov/pubmed/25415348
      • http://www.ncbi.nlm.nih.gov/pubmed/23399051
      • http://www.ncbi.nlm.nih.gov/pubmed/23618056