The document discusses how the InChI identifier is used to connect chemistry databases at the Royal Society of Chemistry (RSC). It describes how InChI allows linking of over 32 million chemicals across 500 data sources in RSC's ChemSpider database. InChI has also enabled text mining of RSC publications to extract chemical data and standardize naming and structures. The RSC is working to further utilize InChIs through developing a repository for standardized chemical data and using InChIs to improve data quality in literature and databases.
How the InChI identifier is used to underpin our online chemistry databases at Royal Society of Chemistry
1. How the InChI identifier is used to
underpin our online chemistry
databases at RSC
Antony Williams, Valery Tkachenko
and Ken Karapetyan
ACS San Francisco
August 2014
9. OPSIN (chemical name to structure) http://opsin.ch.cam.ac.uk/
• InChI support systems…
10. InChI mapping helps a lot!
• We wanted to map together chemical data on
the web
• We knew that chemical name mapping was
difficult but dictionaries were useful
• It is InChI that became the foundation
technology for our database…
• We accepted all the limitations of InChI
• We lived with the “Useful but not ideal”
• And so….
11. • ~32 million chemicals and growing
• Data sourced from >500 different sources
• Crowd sourced curation and annotation
• Ongoing deposition of data from our
journals and our collaborators
• Structure centric hub for web-searching
• …and a really big dictionary!!!
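The InChI-based mapping described above can be illustrated with a minimal sketch. It assumes each source record already carries a Standard InChIKey; the source names, local IDs and helper functions are hypothetical and are not ChemSpider's actual schema or code. The two InChIKeys shown are the standard keys for benzene and ethanol.

```python
from collections import defaultdict

# Hypothetical records from three data sources (names and IDs are invented).
records = [
    {"source": "vendor_catalog", "local_id": "V-001", "name": "Benzene",
     "inchikey": "UHOVQNZJYSORNB-UHFFFAOYSA-N"},
    {"source": "journal_markup", "local_id": "J-417", "name": "Benzol",
     "inchikey": "UHOVQNZJYSORNB-UHFFFAOYSA-N"},
    {"source": "patent_mining", "local_id": "P-93", "name": "Ethanol",
     "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
]


def link_by_inchikey(recs):
    """Group (source, local_id) pairs under the InChIKey they share."""
    linked = defaultdict(list)
    for rec in recs:
        linked[rec["inchikey"]].append((rec["source"], rec["local_id"]))
    return dict(linked)


def build_name_dictionary(recs):
    """Map every lower-cased name to its InChIKey (the 'big dictionary')."""
    return {rec["name"].lower(): rec["inchikey"] for rec in recs}


linked = link_by_inchikey(records)
name_dictionary = build_name_dictionary(records)

# Two sources describe the same structure under different names.
print(linked["UHOVQNZJYSORNB-UHFFFAOYSA-N"])
print(name_dictionary["benzol"])
```

Keying on the structure identifier rather than on names is what lets differently named records collapse into one entry, while the names themselves accumulate into a lookup dictionary.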
20. NEW 15th Edition
*The name THE MERCK INDEX is owned by Merck Sharp & Dohme Corp., a subsidiary of Merck & Co.,
Inc., Whitehouse Station, N.J., U.S.A., and is licensed to The Royal Society of Chemistry for use in the
U.S.A. and Canada.
Where else is RSC using InChIs
24. Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea
prepared in Example 6, thionyl chloride (5 ml) and benzene (50 ml) were
charged into a glass reaction vessel equipped with a mechanical stirrer,
thermometer and reflux condenser.
The reaction mixture was heated at reflux with stirring for a
period of about one-half hour.
After this time the benzene and unreacted thionyl chloride
were stripped from the reaction mixture under reduced
pressure to yield the desired product
N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea
as a solid residue.
27. Extracting our Archive
• What could we get from our archive?
• Find chemical names and generate structures
• Find chemical images and generate structures
• Find reactions
• Find data (MP, BP, LogP) and deposit
• Find figures and database them
• Find spectra (and link to structures)
• And of course InChIfy the entire collection
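As an illustration of the "find data (MP, BP, LogP)" step, the following is a minimal sketch of pulling melting points out of article text with a regular expression. The pattern, the function name and the sample sentence are assumptions made for illustration; the text mining actually used on the archive is not described at this level of detail in the slides.

```python
import re

# Matches "mp 132-134 °C", "m.p. 101 °C" or "melting point: 78.5 °C".
# \u2013 is an en dash and \u00b0 is the degree sign.
MP_PATTERN = re.compile(
    r"\b(?:m\.?p\.?|melting point)\s*[:=]?\s*"
    r"(?P<low>\d+(?:\.\d+)?)"
    r"(?:\s*[-\u2013]\s*(?P<high>\d+(?:\.\d+)?))?"
    r"\s*\u00b0\s*C",
    re.IGNORECASE,
)


def extract_melting_points(text):
    """Return (low, high) ranges in degrees Celsius found in the text.

    A single value is reported as a range whose low and high are equal.
    """
    results = []
    for match in MP_PATTERN.finditer(text):
        low = float(match.group("low"))
        high = float(match.group("high")) if match.group("high") else low
        results.append((low, high))
    return results


sample = "White solid, mp 132-134 °C. The by-product had m.p. 101 °C."
melting_points = extract_melting_points(sample)
print(melting_points)  # [(132.0, 134.0), (101.0, 101.0)]
```

A production pipeline would also need to associate each value with the compound it describes before depositing it, which this sketch does not attempt.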
31. Progress to date
• We have text-mined all 21st century articles…
>100k articles from 2000-2013
• Marked up with XML and published onto the
HTML forms of the articles
• Required multiple iterations based on
dictionaries, markup, text mining iterations
• New visualization tools in development – not
just chemical names. Add chemical and
biomedical terms markup also!
35. InChIs under our “repository”
• Scientific publications are a summary of work
• Is all work reported?
• How much science is lost to pruning?
• What of value sits in notebooks and is lost?
• Publications offering access to “real data”?
• How much data is lost?
• How many compounds never reported?
• How many syntheses fail or succeed?
• How many characterization measurements?
37. What are we building?
• We are building the “RSC Data Repository”
• Containers for compounds, reactions, analytical
data, tabular data
• Algorithms for data validation and standardization
• Flexible indexing and search technologies
• A platform for modeling data and hosting existing
models and predictive algorithms
38. New Repository Architecture
• Data tier: Compounds, Reactions, Spectra, Materials, Documents
• Data access tier: Compounds API, Reactions API, Spectra API, Materials API, Documents API
• User interface components tier: Compounds Widgets, Reactions Widgets, Spectra Widgets, Materials Widgets, Documents Widgets
• User interface tier (examples): Analytical Laboratory application, Electronic Laboratory Notebook, Chemical Inventory application, paid 3rd party integrations (various platforms – SharePoint, Google, etc)
44. InChIs under the repository
• All compound-based data handling will of
course connect with InChIs
• Compounds
• Reactions
• Compound-spectra matching
• Etc. etc. etc…
45. For Deposition of Data
• Developing systems that provide
feedback to users regarding data quality
• Validate/standardize chemical compounds
• Check for balanced reactions
• Check spectral data
• EXAMPLE Future work
• Properties – compare experimental to predicted
• Automated structure verification - NMR
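The "check for balanced reactions" step can be sketched as an atom count on each side of the reaction. This is a minimal illustration, not the repository's validation code: it assumes molecular formulas are already available as plain strings without parentheses, charges or isotopes, and all function names are hypothetical. The example uses the simpler analogue of the chlorination shown in the text-mining slide (ethanol with thionyl chloride).

```python
import re
from collections import Counter

# An element symbol followed by an optional count, e.g. "C2", "H6", "Cl".
ELEMENT_COUNT = re.compile(r"([A-Z][a-z]?)(\d*)")


def atom_counts(formula):
    """Count atoms in a simple formula such as 'C2H6O' (no parentheses)."""
    counts = Counter()
    for element, number in ELEMENT_COUNT.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def side_counts(formulas):
    """Sum the atom counts over all species on one side of a reaction."""
    total = Counter()
    for formula in formulas:
        total += atom_counts(formula)
    return total


def is_balanced(reactants, products):
    """True when both sides contain the same number of each element."""
    return side_counts(reactants) == side_counts(products)


# C2H6O + SOCl2 -> C2H5Cl + SO2 + HCl
balanced = is_balanced(["C2H6O", "SOCl2"], ["C2H5Cl", "SO2", "HCl"])
# The same reaction with the HCl by-product omitted.
unbalanced = is_balanced(["C2H6O", "SOCl2"], ["C2H5Cl", "SO2"])
print(balanced, unbalanced)  # True False
```

A check of this kind can only give feedback that something is missing; deciding which by-product was left out would need chemical knowledge beyond the atom count.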
46. RSC Cheminformatics Projects
• RSC as a provider of support for grant-based
projects
• Utilizing ChemSpider initially as a platform
• Developing Chemical Registry Service
• Utilizing core architecture and widgets to
serve the projects
51. • ChemSpider IDs and InChIs/InChIKeys
made open and available for linking
• Exposed via the Open PHACTS RDF export
• A structure ID standard to enable further
linking across the semantic web of science
53. Electronic Notebook Data
• Development work integrating chemistry
into the Southampton Labtrove notebook
• Stoichiometry table development
• Analytical data integration
• “ChemTrove” includes chemistry widgets
and InChI as an important data field
62. What needs to happen?
• If we could validate
• Catch errors in databases (and clean)
• Proactively catch errors in publications/patents
• Reduce junk in the ether – improve QUALITY!
• If we standardized
• Interlinking should improve
66. DrugBank (ca. 6000 records)
• 38 records with InChI not matching the
structure, e.g. DB08521, DB08187
• 24 records where names (IUPAC_NAME) did
not match the structure, e.g. DB08346
• 38 records with SMILES not matching the
structure, e.g. DB08293
• 53 records with unusual valence, e.g. DB01983
with boron(V)
67. ChEMBL (1.3 million records)
• 11,020 records with 4 bonds and zero charge,
e.g. CHEMBL501101 or CHEMBL501973
• 271 records with hypervalent oxygen (e.g.
CHEMBL2219679), carbon (e.g. 1005895),
boron, chlorine, iodine or phosphine
• 6,177 records where direction of bond makes
no sense, e.g. CHEMBL12760 and
CHEMBL34704
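Valence problems of the kind counted on the DrugBank and ChEMBL slides can be flagged with a simple rule table. The sketch below is an assumption-laden illustration, not the validation rules actually applied to those databases: the valence table covers only a handful of elements, the charge adjustments are simplified, and the `Atom` class and function names are hypothetical.

```python
from dataclasses import dataclass

# Typical maximum valences for neutral atoms; deliberately small and
# illustrative rather than a complete valence model.
MAX_NEUTRAL_VALENCE = {"H": 1, "B": 3, "C": 4, "N": 3, "O": 2, "F": 1}


@dataclass
class Atom:
    element: str
    bond_order_sum: int  # sum of bond orders to neighbouring atoms
    formal_charge: int = 0


def allowed_valence(atom):
    """Return the maximum valence for this atom, or None if unknown."""
    base = MAX_NEUTRAL_VALENCE.get(atom.element)
    if base is None:
        return None
    if atom.element in ("N", "O"):
        # A positive charge allows one more bond (e.g. ammonium N+).
        return base + atom.formal_charge
    if atom.element == "B":
        # A negative charge allows one more bond (e.g. borate B-).
        return base - atom.formal_charge
    # For H, C and F any charge reduces the number of bonds.
    return base - abs(atom.formal_charge)


def flag_unusual_valence(atoms):
    """Return indices of atoms with more bonds than the rule table allows."""
    flagged = []
    for index, atom in enumerate(atoms):
        limit = allowed_valence(atom)
        if limit is not None and atom.bond_order_sum > limit:
            flagged.append(index)
    return flagged


atoms = [
    Atom("N", 4, 0),  # four bonds and zero charge: flagged
    Atom("N", 4, 1),  # four bonds with a +1 charge: acceptable
    Atom("B", 5, 0),  # pentavalent boron: flagged
    Atom("O", 3, 0),  # hypervalent oxygen: flagged
    Atom("C", 4, 0),  # ordinary carbon: acceptable
]
flagged = flag_unusual_valence(atoms)
print(flagged)  # [0, 2, 3]
```

Elements missing from the table are skipped rather than flagged, so extending coverage means adding entries, not changing the logic.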
68. ChemSpider Standardization
• Entire ChemSpider database will be
standardized using modified FDA rule set
• Original Molfiles will be standardized and all
properties (predicted properties, SMILES,
InChIs, Names) will all be regenerated
• CLEAN’ed database to compounds repository
• Standardization procedures automatically
applied to all future depositions
70. Data Repositories and InChI
• Internet Data
• Commercial Software, Pre-competitive Data, Open Science, Open Data, Publishers, Educators, Open Databases, Chemical Vendors
• Small organic molecules, Undefined materials, Organometallics, Nanomaterials, Polymers, Minerals, Particle bound, Links to Biologicals
71. If InChI was not developed…
• Database linking would suffer dramatically
• The web would not be “structure searchable”
• Cheminformatics tools would likely not be
linking to public domain databases in the
same way
• We wouldn’t be here discussing….
• And ChemSpider would not have been built
72. Acknowledgments
• The InChI team
• The entire RSC cheminformatics team…
• Daniel Lowe for the text mining work
• Igor Tetko for OCHEM modeling
73. Thank you
Email: williamsa@rsc.org
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
Editor's notes
The content of the 15th Edition was produced by the Editorial team and Merck, and the book was published by us earlier this year. It is available for purchase by both individuals and libraries.
In addition to the book, we have produced an online version of The Merck Index, which is available solely through the Royal Society of Chemistry. It is available as a subscription, with a one-year free trial for individual purchasers of the book.