Chemical Database Projects Delivered by RSC eScience
at the FDA Meeting “Development of a Freely Distributable Data System for the Registration of Substances”

Antony Williams
RSC eScience
 What was once just ChemSpider is much more…

   ChemSpider Reactions
   Chemicals Validation and Standardization
    Platform
   Learn Chemistry Wiki
   National Chemical Database Service
   Open PHACTS
   PharmaSea
   Global Chemistry Hub
We are known for ChemSpider…
 The Free Chemical Database

 A central hub for chemists to source information
   >28 million unique chemical records
   Aggregated from >400 data sources
   Chemicals, spectra, CIF files, movies, images,
    podcasts, links to patents, publications,
    predictions

 A central hub for chemists to deposit & curate data
We Want to Answer Questions
 Questions a chemist might ask…
   What is the melting point of n-heptanol?
   What is the chemical structure of Xanax?
   Chemically, what is phenolphthalein?
   What are the stereocenters of cholesterol?
   Where can I find publications about xylene?
   What are the different trade names for Ketoconazole?
   What is the NMR spectrum of Aspirin?
   What are the safety handling issues for Thymol Blue?
I want to know about “Vincristine”
Vincristine: Identifiers and Properties
Vincristine: Vendors and Sources
Vincristine: Articles
How did we build it?
 We deal in Molfiles or SDF files – with coordinates
 Deposit anything that has an InChI – we support
  what InChI can handle, good and bad
 Standardization based on “InChI standardization”
 InChIs aggregate (certain) tautomers
The InChI Identifier
Downsides of InChI

 Good for small molecules – but no polymers,
  issues with inorganics, organometallics, imperfect
  stereochemistry. ChemSpider is “small molecules”

 InChI used as the “deduplicator” – FIRST version
  of a compound into the database becomes THE
  structure to deduplicate against…
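
The "first in wins" behaviour described above can be sketched as follows. This is a minimal illustration, not ChemSpider's implementation: the `Registry` class, its method names and the placeholder molfile strings are all hypothetical, and the only assumption taken from the slide is that the InChI string is the deduplication key.

```python
from typing import Dict, Tuple


class Registry:
    """Toy registry that deduplicates on the InChI string (hypothetical)."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}          # InChI -> record ID
        self._structures: Dict[int, str] = {}   # record ID -> first structure seen

    def deposit(self, inchi: str, molfile: str) -> Tuple[int, bool]:
        """Return (record_id, created) for a deposited structure."""
        if inchi in self._ids:
            # Duplicate InChI: keep the structure that arrived first
            # and ignore the depiction that came with this deposition.
            return self._ids[inchi], False
        record_id = len(self._ids) + 1
        self._ids[inchi] = record_id
        self._structures[record_id] = molfile
        return record_id, True

    def structure(self, record_id: int) -> str:
        """Return the structure stored for a record."""
        return self._structures[record_id]


registry = Registry()
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
first = registry.deposit(ETHANOL, "molfile from source A")   # placeholder text
second = registry.deposit(ETHANOL, "molfile from source B")  # placeholder text
print(first, second, registry.structure(1))
```

Both depositions resolve to the same record, and the structure kept is the one from source A, which is why a poor first depiction can persist.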
Side Effects of InChI Usage
SMILES by comparison…
Side Effects of InChI Usage
Searches: The INTERNET
Search by InChI
ChemSpider Google Search
http://www.chemspider.com/google/
How did we build it?
 We deal with “various forms” of data
Crowdsourced “Annotations”
 Users can add
   Descriptions/Syntheses/Commentaries
   Links to PubMed articles
   Links to articles via DOIs
   Add spectral data
   Add Crystallographic Information Files
   Add photos
   Add MP3 files
   Add Videos
ChemSpider : Spectra Linked
ChemSpider ID 24528095: 1H NMR
ChemSpider ID 24528095: HH COSY
How did we build it?
 We are challenged with the complexities of
  chemical names
Antony Williams vs Identifiers
[Figure: one person, many identifiers]
• Dad, Tony, others
• 5 email addresses
• ChemSpiderman (blog, Twitter account, Facebook, Friendfeed)
• OpenID
• Passport ID
• License
• SSN
• Green Card
• ….
Aspirin names and synonyms

            • Text searches depend on
              correct association

            • >300 suggested identifiers for
              Aspirin just on PubChem

            • Disambiguation dictionaries
              are necessary, not just for
              authors!
The Final Search Strategy
All Those Names, One Structure
Curated Dictionaries Matter
Crowdsourcing ChemSpider

 ChemSpider is crowdsourced

 Community deposition, annotation
  and curation

 Anyone can “Leave Feedback”

 Registered users can add data
“Curate” Identifiers
Success Depends on Dictionaries
Vincristine: Identifiers and Properties
Vincristine: Patents
Linked by Name
Validated Names for Searching…
And yes… there are challenges
http://www.cas.org/legal/infopolicy
Licensing Data is Tough…
Data Licensing, Open Data
 The use of CAS data in third party Data Mining
  Tools is permitted as long as CAS Records are
  downloaded via STN® AnaVist™. All of these new
  "freedoms" are aimed at further enabling the
  dissemination of scientific information and the
  advancement of scientific research.

 CAS does not permit the building of Databases
  that have wide and general availability and no
  longer fulfill the purpose of individual or team
  research that CAS permits but instead serve as a
  substitute for the use of CAS Databases.
A Comment on Quality
 For >28 million chemical compounds there are
  some errors:

     “Incorrect” structure representations
     Mismatched name-structure relationships
     Experimental properties (the values, the units)
     Real vs. virtual compounds – text-mining and
      conversion

   We have deprecated a LOT of data…
Identifier Dictionaries
 Reciprocal curation processes…share curation
  with each other.

 If a database has a compound already then use
  InChIKeys to match “suggested” validation
  against the compound.

 A series of “added” and “removed” synonyms
  against InChIKeys for matching.
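
One way to picture the feed of “added” and “removed” synonyms is the sketch below. The function name, the tuple layout of the feed and the sample entries are assumptions made for illustration; the slide does not define a feed format. The two InChIKeys are the standard keys for aspirin and ethanol.

```python
from typing import Dict, Iterable, List, Set, Tuple

FeedEntry = Tuple[str, str, str]  # (InChIKey, action, synonym), hypothetical layout


def apply_synonym_feed(dictionary: Dict[str, Set[str]],
                       feed: Iterable[FeedEntry]) -> List[FeedEntry]:
    """Apply curation actions to compounds already in the dictionary.

    Entries whose InChIKey is not in the dictionary are returned untouched.
    """
    skipped: List[FeedEntry] = []
    for inchikey, action, synonym in feed:
        if inchikey not in dictionary:
            skipped.append((inchikey, action, synonym))
            continue
        if action == "added":
            dictionary[inchikey].add(synonym)
        elif action == "removed":
            dictionary[inchikey].discard(synonym)
        else:
            raise ValueError(f"unknown action: {action!r}")
    return skipped


ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
ETHANOL = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"

# Local dictionary with one wrong synonym attached to aspirin.
synonyms = {ASPIRIN: {"Aspirin", "Salicylic acid"}}
feed = [
    (ASPIRIN, "removed", "Salicylic acid"),
    (ASPIRIN, "added", "Acetylsalicylic acid"),
    (ETHANOL, "added", "Ethyl alcohol"),  # compound not held locally
]
skipped = apply_synonym_feed(synonyms, feed)
print(sorted(synonyms[ASPIRIN]), skipped)
```

Keying the feed on InChIKeys means each database only acts on compounds it already holds and can ignore the rest.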
Federated Data Curation Sharing
Who wants to work with us?
Structure Validation using feed
 Look for approved synonyms

 Compare feed InChIKey with database InChIKey

 If different, flag for inspection
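
The three steps above can be sketched as a small comparison routine. The function name, the case-insensitive name matching and the sample feed are assumptions for illustration only; the InChIKeys used are the standard keys for aspirin, ethanol, methanol and water.

```python
from typing import Dict, Iterable, List, Tuple


def flag_mismatches(approved: Dict[str, str],
                    feed: Iterable[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (name, database_key, feed_key) for every approved synonym
    whose feed InChIKey differs from the database InChIKey."""
    flagged = []
    for name, feed_key in feed:
        # Assumption: names are matched case-insensitively.
        db_key = approved.get(name.strip().lower())
        if db_key is None:
            continue  # not an approved synonym in this database
        if db_key != feed_key:
            flagged.append((name, db_key, feed_key))
    return flagged


ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
ETHANOL = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"
METHANOL = "OKKJLVBELUTLKV-UHFFFAOYSA-N"
WATER = "XLYOFNOQVPJJNP-UHFFFAOYSA-N"

approved = {"aspirin": ASPIRIN, "ethanol": ETHANOL}
feed = [
    ("Aspirin", ASPIRIN),     # agrees with the database
    ("Ethanol", METHANOL),    # wrong structure behind the name: flag it
    ("Oxidane", WATER),       # name not approved here: ignored
]
flagged = flag_mismatches(approved, feed)
print(flagged)
```

A flagged entry is only a prompt for a curator to look; the sketch does not decide which side is correct.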
Many Problems Can be Solved…
 Clean up databases – structure validation,
  structure standardization

 Warn about
   Valency, charge balance, depiction issues,
    bond types, absent stereo, and another 100
    rules (or so…)

 Standardize
   Agree on community rules to “Standardize”
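
As an illustration of a single "warn about" rule, here is a minimal valence check. It is a sketch under stated assumptions: the allowed-valence table is a small illustrative subset for neutral atoms, charged atoms are skipped rather than checked, and the input format (label, element, charge, total bond order including hydrogens) is hypothetical. It is not the rule set of any validation platform.

```python
from typing import Dict, List, Set, Tuple

# Illustrative subset of common valences for NEUTRAL atoms only.
ALLOWED_VALENCES: Dict[str, Set[int]] = {
    "H": {1},
    "B": {3},
    "C": {4},
    "N": {3},
    "O": {2},
    "S": {2, 4, 6},
    "Mg": {2},
}

Atom = Tuple[str, str, int, int]  # (label, element, formal charge, valence)


def valence_warnings(atoms: List[Atom]) -> List[str]:
    """Return labels of neutral atoms whose valence is not in the table."""
    warnings = []
    for label, element, charge, valence in atoms:
        if charge != 0 or element not in ALLOWED_VALENCES:
            continue  # outside the scope of this simple rule
        if valence not in ALLOWED_VALENCES[element]:
            warnings.append(label)
    return warnings


atoms = [
    ("O1", "O", 0, 3),    # trivalent neutral oxygen
    ("S1", "S", 0, 5),    # pentavalent sulfur
    ("C1", "C", 0, 4),    # fine
    ("Mg1", "Mg", 0, 4),  # tetravalent magnesium
    ("N1", "N", 1, 4),    # charged, skipped by this rule
]
unusual = valence_warnings(atoms)
print(unusual)
```

A real rule set would also cover charged atoms, charge balance and the other checks listed above.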
Structure Validation
Structure Validation - Fixed
What needs to happen?
 If we could validate
    Catch errors in databases (and clean)
    Proactively catch errors in publications/patents
    Reduce junk in the ether – improve QUALITY!

 If we standardized
    Interlinking should improve
CVSP: result of processing
NCATS Dataset
DrugBank dataset (6516 records)
 Marked as Errors (arbitrary)

    2 records with query bonds
    3 records with invalid atoms (asterisk in polymers)
    Unusual valence: ~70 (oxygen 3, sulfur 3 and 5, Mg 4, B 5, etc.)

 Warnings
   InChI not matching structure (100+)
   SMILES not matching structure (100+)
 DrugBank ID: DB00755
 InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+
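
To read an identifier like the one above, it helps to split it into its layers. The sketch below is a simplification: it assumes a standard InChI with a formula layer and one occurrence of each layer prefix, and the function name is hypothetical. The string is the DB00755 InChI from this slide.

```python
from typing import Dict


def split_inchi(inchi: str) -> Dict[str, str]:
    """Split an InChI into version, formula and single-letter layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    if len(parts) < 2:
        raise ValueError("InChI has no formula layer")
    layers = {"version": parts[0], "formula": parts[1]}
    for layer in parts[2:]:
        # First character names the layer: c = connections,
        # h = hydrogens, b = double-bond stereo, and so on.
        layers[layer[0]] = layer[1:]
    return layers


DB00755 = (
    "InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-"
    "17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,"
    "(H,21,22)/b9-6+,12-11+,15-8+,16-14+"
)
layers = split_inchi(DB00755)
print(layers["formula"], layers["b"])
```

Here the `b` layer carries the double-bond stereochemistry, so two records that differ only in that layer share connectivity but not geometry.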




 DrugBank ID: DB00614
Connecting Chemistry across the web
 So much of what is seen on ChemSpider is
  retrieved in real time using services
Connecting Chemistry across the web
Online Predictions
Web Services Open Up Collaboration
 Agilent, Bruker, Waters and Thermo all use our
  web-based services for compound lookup

 Many academic sites integrating directly –
  metabonomics, name lookup, semantic markup

 Mobile app integration

 Commercial structure drawing packages
Web Services
ChemSpider Everywhere: Spectral Game
ChemSpider Everywhere
Crowdsourced Curation of Spectra
Web Services Integrate INTERNAL Projects
 Integration between ChemSpider and…

     Our publishing platform for structure display
     ChemSpider SyntheticPages
     LearnChemistry Wiki
     National Chemical Database Service
     And….a growing list….
What ChemSpider Does Not Handle
   Polymers
   Markush structures
   Organometallics
   Many Inorganics
   Materials
   Reactions…but….
ChemSpider Reactions
DERA
 Digitally Enabling the RSC Archive…back to 1841


   Extracting data and making available via
    appropriate platform
      Chemicals
      Reactions
      Analytical Data
      Figures
Chemical Database Service
Data for life sciences
[Figure: questions a drug discovery researcher asks of the data]
• IP?
• What’s the structure?
• Are they in our file?
• What’s similar?
• What’s the target?
• Pharmacology data?
• Known pathways?
• Competitors?
• Working on now?
• Connections to disease?
• Expressed in right cell type?
OpenPHACTS
 Open PHACTS is an Innovative Medicines Initiative
  (IMI) project, running for 3 years

 To reduce the barriers to drug discovery in industry,
  academia and for small businesses

 To build an open platform, integrating chemistry and
  biology data from public domain resources

 Semantic web platform

 Open Standards, Open Data and Open Source
 Crowdsourcing across drug discovery
 Open PHACTS : partnership between European
  Community and European Pharma Companies
 22 partners, 8 pharmaceutical companies, 3
  biotechs working together for 3 years

 Freely accessible for knowledge discovery and
  verification.
    Data on chemistry and biology
    Pharmacological profiles
    Proprietary and public data sources.
PharmaSea
• FP7 Initiative. PharmaSea: increasing value and flow
  in the marine biodiscovery pipeline (2012-2017)

• Improve the quality, volume and value of active
  agents discovered in the marine environment and
  increase the speed at which they can be delivered

• RSC: Providing dereplication via ChemSpider,
  analytical data algorithms, integration with
  computer-assisted structure elucidation algorithms
Conclusions
 RSC eScience supports an increasing number of
  grant-based projects
 ChemSpider grows daily – community depositions
  and data from RSC Content with a focus on
  expanding data while improving quality
 ChemSpider is an integration platform for MANY
  projects through web services
 CVSP processing is available to use and to give
  feedback on; it will also be available as a service
 We believe in curation sharing - who wants to
  collaborate?
Thank you

Email: williamsa@rsc.org
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams

Chemical Database Projects Delivered by RSC eScience

  • 1. Chemical Database Projects Delivered by RSC eScience at the FDA Meeting “Development of a Freely Distributable Data System for the Registration of Substances” Antony Williams
  • 2. RSC eScience  What was once just ChemSpider is much more…  ChemSpider Reactions  Chemicals Validation and Standardization Platform  Learn Chemistry Wiki  National Chemical Database Service  Open PHACTS  PharmaSea  Global Chemistry Hub
  • 3. We are known for ChemSpider…  The Free Chemical Database  A central hub for chemists to source information  >28 million unique chemical records  Aggregated from >400 data sources  Chemicals, spectra, CIF files, movies, images, podcasts, links to patents, publications, predictions  A central hub for chemists to deposit & curate data
  • 4. We Want to Answer Questions  Questions a chemist might ask…  What is the melting point of n-heptanol?  What is the chemical structure of Xanax?  Chemically, what is phenolphthalein?  What are the stereocenters of cholesterol?  Where can I find publications about xylene?  What are the different trade names for Ketoconazole?  What is the NMR spectrum of Aspirin?  What are the safety handling issues for Thymol Blue?
  • 5. I want to know about “Vincristine”
  • 9. How did we build it?  We deal in Molfiles or SDF files – with coordinates  Deposit anything that has an InChI – we support what InChI can handle, good and bad  Standardization based on “InChI standardization”  InChIs aggregate (certain) tautomers
  • 11. Downsides of InChI  Good for small molecules – but no polymers, issues with inorganics, organometallics, imperfect stereochemistry. ChemSpider is “small molecules”  InChI used as the “deduplicator” – FIRST version of a compound into the database becomes THE structure to deduplicate against…
  • 12. Side Effects of InChI Usage
  • 14. Side Effects of InChI Usage
  • 18. How did we build it?  We deal in Molfiles or SDF files – with coordinates  Deposit anything that has an InChI – we support what InChI can handle, good and bad  Standardization based on “InChI standardization”  InChIs aggregate (certain) tautomers  We deal with “various forms” of data
  • 19. Crowdsourced “Annotations”  Users can add  Descriptions/Syntheses/Commentaries  Links to PubMed articles  Links to articles via DOIs  Add spectral data  Add Crystallographic Information Files  Add photos  Add MP3 files  Add Videos
  • 20.
  • 24. How did we build it?  We deal in Molfiles or SDF files – with coordinates  Deposit anything that has an InChI – we support what InChI can handle, good and bad  Standardization based on “InChI standardization”  InChIs aggregate (certain) tautomers  We deal with “various forms” of data  We are challenged with the complexities of chemical names
  • 25. Antony Williams vs Identifiers: Passport ID; Dad, Tony, others; 5 email addresses; License; ChemSpiderman (blog, Twitter account, Facebook, Friendfeed); SSN; OpenID; Green Card; …
  • 26. Aspirin names and synonyms • Text searches depend on correct association • >300 suggested identifiers for Aspirin just on PubChem • Disambiguation dictionaries are necessary, not just for authors!
  • 28. The Final Search Strategy
  • 29. All Those Names, One Structure
  • 31. Crowdsourcing ChemSpider  ChemSpider is crowdsourced  Community deposition, annotation and curation  Anyone can “Leave Feedback”  Registered users can add data
  • 35. Success Depends on Dictionaries
  • 38. Validated Names for Searching…
  • 39. And yes… there are challenges http://www.cas.org/legal/infopolicy
  • 40. Licensing Data is Tough…
  • 41. Data Licensing, Open Data  The use of CAS data in third party Data Mining Tools is permitted as long as CAS Records are downloaded via STN® AnaVist™. All of these new "freedoms" are aimed at further enabling the dissemination of scientific information and the advancement of scientific research.  CAS does not permit the building of Databases that have wide and general availability and no longer fulfill the purpose of individual or team research that CAS permits but instead serve as a substitute for the use of CAS Databases.
  • 42. A Comment on Quality  For >28 million chemical compounds there are some errors:  “Incorrect” structure representations  Mismatched name-structure relationships  Experimental properties (the values, the units)  Real vs. virtual compounds – text-mining and conversion  We have deprecated a LOT of data…
  • 43. Identifier Dictionaries  Reciprocal curation processes… share curation with each other.  If a database already has a compound, use InChIKeys to match “suggested” validation against the compound.  A series of “added” and “removed” synonyms against InChIKeys for matching.
  • 44. Federated Data Curation Sharing Who wants to work with us?
  • 45. Structure Validation using feed  Look for approved synonyms  Compare feed InChIKey with database InChIKey  If different, flag for inspection
  • 46. Many Problems Can be Solved…  Clean up databases – structure validation, structure standardization  Warn about  Valency, charge balance, depiction issues, bond types, absent stereo, and another 100 rules (or so…)  Standardize  Agree community rules to “Standardize”
  • 49. What needs to happen?  If we could validate  Catch errors in databases (and clean)  Proactively catch errors in publications/patents  Reduce junk in the ether – improve QUALITY!  If we standardized  Interlinking should improve
  • 50. CVSP: result of processing
  • 53. DrugBank dataset (6516 records)  Marked as Errors (arbitrary)  2 records with query bonds  3 records with invalid atoms (asterisk in polymers)  Unusual valence: ~70 (oxygen 3, sulfur 3 and 5, Mg 4, B 5, etc.)  Warnings  InChI not matching structure (100+)  SMILES not matching structure (100+)
  • 54.  DrugBank ID: DB00755  InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+  DrugBank ID: DB00614
  • 55. Connecting Chemistry across the web  So much of what is seen on ChemSpider is retrieved in real time using services
  • 58. Web Services Open Up Collaboration  Agilent, Bruker, Waters and Thermo all use our web-based services for compound lookup  Many academic sites integrating directly – metabonomics, name lookup, semantic markup  Mobile app integration  Commercial structure drawing packages
  • 62. Web Services Integrate INTERNAL Projects  Integration between ChemSpider and…  Our publishing platform for structure display  ChemSpider SyntheticPages  LearnChemistry Wiki  National Chemical Database Service  And….a growing list….
  • 63. What ChemSpider Does Not Handle  Polymers  Markush structures  Organometallics  Many Inorganics  Materials  Reactions…but….
  • 66. DERA  Digitally Enabling the RSC Archive…back to 1841  Extracting data and making available via appropriate platform  Chemicals  Reactions  Analytical Data  Figures
  • 69. Data for life sciences: IP? What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known pathways? Competitors? Working on now? Connections to disease? Expressed in right cell type?
  • 70. Open PHACTS  Open PHACTS is an Innovative Medicines Initiative (IMI) 3-year project  To reduce the barriers to drug discovery in industry, academia and for small businesses  To build an open platform, integrating chemistry and biology data from public domain resources  Semantic web platform  Open Standards, Open Data and Open Source
  • 71.  Crowdsourcing across drug discovery  Open PHACTS: partnership between European Community and European Pharma Companies  22 partners, 8 pharmaceutical companies, 3 biotechs working together for 3 years  Freely accessible for knowledge discovery and verification.  Data on chemistry and biology  Pharmacological profiles  Proprietary and public data sources.
  • 74. PharmaSea • FP7 Initiative. PharmaSea: increasing value and flow in the marine biodiscovery pipeline (2012-2017) • Improve the quality, volume and value of active agents discovered in the marine environment and increase the speed at which they can be delivered • RSC: Providing dereplication via ChemSpider, analytical data algorithms, integration with computer-assisted structure elucidation algorithms
  • 75. Conclusions  RSC eScience supporting increasing number of grant-based projects  ChemSpider grows daily – community depositions and data from RSC Content with a focus on expanding data while improving quality  ChemSpider is an integration platform for MANY projects through web services  CVSP processing is available to use and provide feedback – will be available as a service also  We believe in curation sharing - who wants to collaborate?
  • 76. Thank you Email: williamsa@rsc.org Twitter: ChemConnector Blog: www.chemspider.com/blog Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams
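
The feed-based validation outlined on slides 43 and 45 (approved synonyms keyed by InChIKey, compared against the local database, with mismatches flagged for inspection) could be sketched roughly as follows. This is a minimal illustration only, not the RSC implementation: the `FeedEntry` record, the feed layout, and both function names are assumptions made for this example, and the local database is assumed to be a plain mapping from lower-cased synonym to InChIKey.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedEntry:
    """One line of a hypothetical curation feed: a synonym change for a compound."""
    inchikey: str
    synonym: str
    action: str  # "added" or "removed"


def approved_synonyms(feed):
    """Replay the feed in order and return {inchikey: set of approved synonyms}."""
    approved = {}
    for entry in feed:
        names = approved.setdefault(entry.inchikey, set())
        name = entry.synonym.casefold()  # compare names case-insensitively
        if entry.action == "added":
            names.add(name)
        elif entry.action == "removed":
            names.discard(name)
        else:
            raise ValueError(f"unknown action: {entry.action!r}")
    return approved


def flag_mismatches(feed, database):
    """Return (synonym, feed_key, db_key) tuples where the two InChIKeys differ.

    `database` maps a lower-cased synonym to the InChIKey stored locally.
    Synonyms the local database does not know are skipped, not flagged.
    """
    flagged = []
    for feed_key, names in approved_synonyms(feed).items():
        for name in sorted(names):
            db_key = database.get(name)
            if db_key is not None and db_key != feed_key:
                flagged.append((name, feed_key, db_key))
    return flagged


ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
PLACEHOLDER = "AAAAAAAAAAAAAA-UHFFFAOYSA-N"  # made-up key standing in for a wrong structure

feed = [
    FeedEntry(ASPIRIN, "Aspirin", "added"),
    FeedEntry(ASPIRIN, "Acetylsalicylic acid", "added"),
    FeedEntry(ASPIRIN, "Salicylic acid", "added"),
    FeedEntry(ASPIRIN, "Salicylic acid", "removed"),  # a curator withdrew this name
]
database = {"aspirin": ASPIRIN, "acetylsalicylic acid": PLACEHOLDER}

for name, feed_key, db_key in flag_mismatches(feed, database):
    print(f"inspect {name!r}: feed has {feed_key}, database has {db_key}")
```

Flagging rather than overwriting keeps a human in the loop, which matches the slide's "flag for inspection" step; which side is correct is left to a curator.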

Editor's Notes

  1. The result of processing is a list of records with validation messages in the middle. If a record was standardized, the “Standardized” column is present and shows the structure.
  2. Here is the larger DrugBank dataset we have processed. Some warnings are shown in the dropdown list: warnings about metals, stereo, enol presence, etc.
  3. The DrugBank dataset was used as a guinea pig in this case: it was passed through CVSP and several issues were found. Of course, issues are marked as warnings or errors arbitrarily, but the marking correlates with the level of attention that depositors need to pay when reviewing their submission. There were 2 records where non-aromatic query bonds were used and 3 records with invalid atoms (e.g. “A” in polymers). About 70 records had unusual valences, mostly radicals, inorganics, and coordination compounds. These are records that experts in inorganic or coordination chemistry need to look at and give their verdict on.
  4. The first example is where the InChI shows known stereo on double bonds but the structure doesn’t. The second example is where neither the InChI nor the name matches the structure. This is an example where inconsistencies in the primary data (in this case the original DrugBank dataset) have propagated to a few big databases. It would be beneficial if the original dataset owners passed their data through validation and fixed them. We would be happy to assist anybody who is interested in the quality of their datasets.
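
The first case in note 4 (an InChI that carries double-bond stereo which the drawn structure lacks) can be illustrated with a small layer parser. This is only a sketch under stated assumptions, not how CVSP does it: the function names are invented for the example, the parser handles only the simple case of an InChI with a formula segment and at most one layer per prefix letter, and the number of stereo-defined double bonds in the structure is assumed to come from whatever toolkit reads the Molfile.

```python
def inchi_layers(inchi):
    """Split an InChI string into {prefix: layer body}.

    The version and formula segments carry no prefix letter, so they are
    stored under the made-up keys "version" and "formula". If a prefix
    repeats, only its first occurrence is kept.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    segments = inchi[len("InChI="):].split("/")
    layers = {"version": segments[0]}
    if len(segments) > 1:
        layers["formula"] = segments[1]
    for segment in segments[2:]:
        if segment:
            layers.setdefault(segment[0], segment[1:])
    return layers


def double_bond_stereo(inchi):
    """Return the entries of the /b (double-bond stereo) layer, or [] if absent."""
    body = inchi_layers(inchi).get("b", "")
    return [item for item in body.split(",") if item]


def stereo_needs_inspection(inchi, defined_in_structure):
    """True when the InChI and the drawn structure disagree on how many
    double bonds have defined stereo. `defined_in_structure` is supplied
    by the caller (assumed to be counted from the Molfile)."""
    return len(double_bond_stereo(inchi)) != defined_in_structure


# InChI quoted on slide 54 for DrugBank ID DB00755.
DB00755 = (
    "InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5"
    "/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+"
)

print(double_bond_stereo(DB00755))
print(stereo_needs_inspection(DB00755, defined_in_structure=0))
```

A count comparison is deliberately crude: it says that something disagrees, not which bond, so a flagged record would still need a look by a curator.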