Various online chemistry data resources are free to access for the community. Many of these resources presently host millions of small-molecule chemical compounds together with associated data, accessible via a number of search techniques. Providing a similar platform for “materials” poses many challenges but, if they could be addressed, it would offer a valuable service to the materials community. This presentation will provide an overview of some of the challenges associated with data access and with both closed and open identifiers (e.g. InChI), and of how one particular online resource, ChemSpider, was built. Embracing the diverse world of materials informatics and online data access is difficult, but there may be much to learn from some of the approaches taken with small molecules.
Hosting Public Domain Chemicals Data Online for the Community – the Challenges of Handling Materials
1. Hosting public domain chemicals data online for the community – the challenges of handling materials
Antony Williams
NIST Diffusion/CALPHAD Data Informatics and Tools Workshop
May 14th, 2015
ORCID ID: 0000-0002-2668-4821
3. Many challenges are the same
• The challenges I will discuss for publishers,
public domain databases, curated chemistry
etc. are the same…
• Need capable tools to handle the data
• Need standards for data exchange
• Meshing data without review is dangerous!
• Quality costs – time, effort and money
• Algorithms can help clean data
4. Where is chemistry online?
• Encyclopedic articles (Wikipedia)
• Chemical vendor databases
• Metabolic pathway databases
• Property databases
• Patents with chemical structures
• Drug Discovery data
• Scientific publications
• Compound aggregators
• Blogs/Wikis and Open Notebook Science
5. Chemistry on the Internet…
• Most searching for chemistry on the internet…
• Name searching Google/Bing/Yahoo
• Name searching Wikipedia
• Name searching Wolfram Alpha
• Name, name, name, name…searching
23. Why CAS Numbers are not great
• There is no free resolver service… unlike for DOIs
• The resolver is a “Google Search”
• Maybe we need another “identifier”?
• And thanks to IUPAC/NIST….
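As an illustration of what can and cannot be done with a CAS Registry Number offline, here is a minimal sketch in Python of the publicly documented check-digit rule. The function name `is_valid_cas` is hypothetical, chosen only for this example. The check catches typos in the number itself, but it says nothing about which structure the number refers to, which is the resolver problem raised on this slide.

```python
import re

# CAS Registry Numbers look like NNNNNNN-NN-N:
# 2 to 7 digits, then 2 digits, then a single check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Return True if the string is well formed and its check digit matches.

    Hypothetical helper for illustration; it validates the format only and
    does not look the number up anywhere.
    """
    match = CAS_PATTERN.match(cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... counting from the rightmost digit,
    # sum the products, and compare the sum modulo 10 with the check digit.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


print(is_valid_cas("7732-18-5"))  # water
print(is_valid_cas("7732-18-4"))  # same number with an altered check digit
```

A passing check only means the number is internally consistent; mapping it to a name or structure still requires a lookup service.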
28. InChI
• SINGLE code base managed by IUPAC –
integrated into drawing packages and used by
MANY databases. No variability between
implementations, unlike SMILES
• InChI strings can be converted back to structures –
same problem as with SMILES – no layout
information is retained
• Adopted by the community (databases, blogs,
Wikipedia) – good for searching the internet
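Part of what makes InChI good for text searching is that it is a plain string built from "/"-separated layers. The following is a rough sketch in Python, not the IUPAC implementation, that splits a standard InChI into its layers. The function name `split_inchi_layers` is hypothetical, and the sketch assumes a simple standard InChI in which the second field is the formula and each later layer prefix occurs once.

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into version, formula, and prefixed layers.

    Simplified illustration only: it assumes the second field is the
    chemical formula and that no layer prefix is repeated.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:
            # Each further layer starts with a one-letter prefix,
            # e.g. "c" for connections and "h" for hydrogens.
            layers[part[0]] = part[1:]
    return layers


# Standard InChI for ethanol.
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
for name, value in split_inchi_layers(ethanol).items():
    print(name, value)
```

Generating an InChI from a structure, or a structure from an InChI, requires the IUPAC software or a toolkit that wraps it; this sketch only reads the string.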
48. Data Quality/Standardization
• MANY structures online that are meant to
represent a specific compound are MISREPRESENTED.
• Commonly you will have better success finding
information by name searches than structure –
with many caveats of course…
• Validating chemical structure representations
is laborious work – and it’s shocking to review
data…
63. Can we MAKE Quality Data?
• Systems for everyone to validate and
standardize their data would be useful
• Would improve structure data in publications,
databases etc. and make searching across
resources better
• Collaboration to establish community rules
would be good!
69. ChemSpider limitations
• Supports “small molecules” only – without an
InChI, a compound cannot be registered
• SO MUCH of chemistry is “materials”
• Severe limitation in chemistry coverage:
• Monomers but no polymers
• Inorganic and organometallic handling
• Ambiguous structures – “Markush”
• Nanomaterials
• Minerals
• Bound to beads, surfaces etc.
70. ORGANICS vs. Materials
• Comment – you don’t know all of the
challenges until you start to work in the area!
• We, and cheminformatics companies, have
solved MANY, but not all, of the issues
regarding organic chemistry management
• The majority of our approaches do not map to
materials
• No standard ways to represent compounds
• No InChI for materials
71. Questions to consider…
• Organics are hard enough!
• What are your best dictionaries of materials?
• We have chemical ontologies. What is the
status for materials?
• Is open annotation of your databases possible?
• What standards do you have for materials data
exchange?
73. Known Challenges
• Many materials are non-stoichiometric
• How to represent composite materials (e.g.
supported catalysts)?
• Methods to distinguish novelty in materials
(equivalent to diversity in organic structures)?
• Lots of challenges ahead… a curated
“community dictionary” would be of value…
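To make the non-stoichiometry point concrete, here is a small sketch in Python of a formula parser that accepts fractional element counts. The names `parse_formula` and `is_stoichiometric` are hypothetical, the parser handles flat formulas only (no parentheses, hydrates, or charges), and it does not check that symbols are real elements. It is an illustration of the representation problem, not a proposed standard.

```python
import re

# An element symbol followed by an optional integer or decimal count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_formula(formula: str) -> dict:
    """Parse a flat formula such as 'Fe0.95O' into {element: amount}."""
    composition = {}
    position = 0
    for match in TOKEN.finditer(formula):
        if match.start() != position:
            # Something between tokens was not understood (e.g. a bracket).
            raise ValueError(f"cannot parse {formula!r} at index {position}")
        element, count = match.groups()
        composition[element] = composition.get(element, 0.0) + float(count or 1)
        position = match.end()
    if not formula or position != len(formula):
        raise ValueError(f"cannot parse {formula!r}")
    return composition


def is_stoichiometric(composition: dict) -> bool:
    """True when every element amount is a whole number."""
    return all(amount.is_integer() for amount in composition.values())


for formula in ("H2O", "Fe0.95O"):  # Fe0.95O: an iron-deficient iron oxide
    parsed = parse_formula(formula)
    print(formula, parsed, is_stoichiometric(parsed))
```

A registration system that assumes whole-number atom counts would reject the second formula, which is one reason small-molecule approaches do not map directly onto materials.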