So I have an SD File …
What do I do next?
Rajarshi Guha & Noel O’Boyle
NCATS & NextMove Software
ACS National Meeting, Boston 2015
What do you want to do?
What is the core issue?
• What you see on a
screen isn’t necessarily
what you get in a file
• Need to be aware of
how certain chemical
concepts are handled in
software
Tasks to be considered
• Searching for structures
• Managing inventory
• Linking / merging
structure data to other
data
• Predicting properties or
analysis of bioactivity
data
Which file format for data storage?
● The answer to this question is never XYZ or PDB
○ Don’t use a file format that throws away parts of
your chemical structure (connectivity, bond orders
or formal charges)
○ Software has to guess the missing information
● And probably not InChI
○ Without the ‘AuxInfo’, the chemical structure
obtained from an InChI is not necessarily the same
as the original (e.g. amides to imidic acids)
● SMILES and MOL are your go-to formats
○ Widely supported (i.e. portable), can recreate the
original structure
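A minimal Python sketch of why XYZ loses information. The `Atom` tuple, the `to_xyz` function and the coordinates are illustrative assumptions, not taken from any toolkit or from the talk. An XYZ record holds an atom count, a comment line, and one element symbol plus three coordinates per atom, so the bond list and formal charges below have nowhere to go.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    element: str
    x: float
    y: float
    z: float
    formal_charge: int = 0


def to_xyz(atoms, title=""):
    """Write atoms as XYZ text: atom count, comment line, then element and coordinates."""
    lines = [str(len(atoms)), title]
    for a in atoms:
        # Only the element symbol and position are written; charge is dropped.
        lines.append(f"{a.element} {a.x:.4f} {a.y:.4f} {a.z:.4f}")
    return "\n".join(lines) + "\n"


# Heavy atoms of an acetate anion; coordinates are made up for illustration.
atoms = [
    Atom("C", 0.0, 0.0, 0.0),
    Atom("C", 1.5, 0.0, 0.0),
    Atom("O", 2.1, 1.1, 0.0),
    Atom("O", 2.1, -1.1, 0.0, formal_charge=-1),
]
# (atom index, atom index, bond order): this list is never written to the XYZ text.
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]

xyz_text = to_xyz(atoms, "acetate heavy atoms")
```

Any program reading `xyz_text` back has to guess the bonds, bond orders and the negative charge from distances alone.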
The question of identity
● A file format is not the same as an identifier
○ The same molecule can be represented in different
ways, even in the same format
● A “canonical” representation is required
○ To check identity, find or avoid duplicates, find overlap
of two databases or check that a structure remains
unchanged (e.g. after some transformation)
● Only InChI (and IUPAC names) are canonical by
definition, but canonical versions of other
formats can be generated
Ethanol can be represented in SMILES format as CCO or OCC (among others)
Canonical SMILES
● Atom order is the same whatever the input
● BUT, every toolkit has its own canonicalization
algorithm (which may change over time)
○ Consistent within the toolkit, not necessarily
outside
● Don’t assume that a given SMILES is in a
canonical form
○ If necessary, canonicalize them yourself
Ethanol as CCO, OCC, C(O)C all converted to CCO (by Toolkit#1)
Ethanol as CCO, OCC, C(O)C all converted to OCC (by Toolkit#2)
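A minimal Python sketch of duplicate detection by canonical form. The `canonicalize` argument stands for whatever canonical-SMILES function your toolkit provides; `toy_canonicalize` is a hypothetical lookup table covering only this example, not a real algorithm. Use one toolkit (and one version) for every structure you compare.

```python
from collections import defaultdict


def group_duplicates(smiles_list, canonicalize):
    """Group input SMILES by the canonical string returned by `canonicalize`."""
    groups = defaultdict(list)
    for smi in smiles_list:
        groups[canonicalize(smi)].append(smi)
    return dict(groups)


# Stand-in for a real toolkit call (assumption for illustration only).
TOY_CANONICAL = {"CCO": "CCO", "OCC": "CCO", "C(O)C": "CCO", "CC": "CC"}


def toy_canonicalize(smi):
    return TOY_CANONICAL[smi]


groups = group_duplicates(["CCO", "OCC", "C(O)C", "CC"], toy_canonicalize)
```

Here the three ethanol strings end up in one group. With a different toolkit the group key might be OCC instead of CCO, which is why keys from different toolkits should not be mixed.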
Depictions vs computers
● Are your structures drawn for humans or computers?
○ There are 2D depictions of stereochemistry that are instantly
interpretable by a human but which are commonly
misinterpreted by software
● Chirality of (a) is opposite to (c)
○ But what is the chirality of (b)?
● Possibilities:
○ Undefined (according to InChI, if close to 180°)
○ Same as (a) or (c) depending on which side of 180°
Rings with ‘implicit’ 3D
[Figure panels: You drew / You meant / You may get]
Tetrahedral stereo gotchas
● R/S in IUPAC names, @/@@ in SMILES, 1/2 in
MOL files, +/- in InChIs
● None of these directly correspond to another
○ SMILES and Mol files describe stereo in terms of atom
order, but differ in where implicit hydrogens are
located
○ InChI and IUPAC names both use a complex algorithm
to determine the symbol
● Only two of these formats may always be used to
compare two structures:
○ R/S and /m layer (InChI)
○ Also @/@@, but only if canonical
Illuminating the black box
● Important to know what operations are being done
implicitly and what needs to be done explicitly
○ Are the error rates acceptable?
● Parse structure
○ Read list of atoms and bonds (incl. charges and isotopes)
○ [Mol, Mol2, Smi] Apply valence model
● Perceive aromaticity (or preserve from input)
● Perceive stereochemistry (or preserve from input)
● Optional: recognize atom / bond types, partial charges,
generate coordinates
c1ccccc1C(=O)Cl
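A minimal Python sketch of the "apply valence model" step, using the benzoyl chloride example above written in Kekulé form. The `DEFAULT_VALENCE` table and `implicit_hydrogens` function are simplified assumptions: real toolkits also handle charges, elements with several allowed valences and aromatic bonds, and they do not all agree.

```python
# Simplified default valences for neutral atoms (assumption, not a toolkit's table).
DEFAULT_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}


def implicit_hydrogens(element, bond_order_sum):
    """Hydrogens needed to fill the default valence of a neutral atom."""
    return max(0, DEFAULT_VALENCE[element] - bond_order_sum)


# Heavy atoms of benzoyl chloride: ring carbons 0-5, carbonyl carbon 6, O 7, Cl 8.
atoms = ["C", "C", "C", "C", "C", "C", "C", "O", "Cl"]
# (atom index, atom index, bond order) with alternating ring bonds.
bonds = [
    (0, 1, 2), (1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 2), (5, 0, 1),
    (5, 6, 1), (6, 7, 2), (6, 8, 1),
]

# Sum the explicit bond orders at each atom.
order_sum = [0] * len(atoms)
for i, j, order in bonds:
    order_sum[i] += order
    order_sum[j] += order

h_counts = [implicit_hydrogens(el, s) for el, s in zip(atoms, order_sum)]
total_h = sum(h_counts)
```

The five unsubstituted ring carbons each receive one hydrogen, giving C7H5ClO. If a bond order or charge is read incorrectly, the hydrogen count silently changes, which is one reason to know what the parser does implicitly.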
Aromaticity
● Cheminformatics aromaticity not quite the
same as chemical aromaticity
○ Mainly a convenience for handling the fact that
the single/double bonds in Kekulé systems
may be set differently
● Usually a good idea to export structures in
Kekulé form
○ More portable - tools may reject some SMILES in
aromatic form if they cannot kekulize them
○ Allows tools to apply their own aromaticity model
○ Faster if detection of aromaticity can be avoided
2D or 3D?
[Figure panels: No Geometry / No Geometry / 2D Geometry / 3D Geometry]
CN1C2=C(C(C3=CC=CC=C3)=NCC1=O)C=C(Cl)C=C2
Going from 2D to 3D
● Key point - easy to get a 3D structure, but is it
the 3D structure you want (or need)?
○ Do you need a single ‘reasonable’ structure or a
large number of conformations?
● Many tools to generate an acceptable 3D
structure from a 2D format
○ Usually a low energy conformation obtained via
molecular mechanics
● Conformer generators
○ Important to think about appropriate energy
and/or RMSD cutoffs
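A minimal Python sketch of how energy and RMSD cutoffs interact when pruning a conformer set. The function names, the data layout (a list of `(energy, coordinates)` pairs) and the cutoff values are assumptions for illustration. The RMSD here is computed without superposition or symmetry handling, which a real conformer generator would normally include.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally ordered coordinate lists, with no alignment."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    total = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(total / len(coords_a))


def select_conformers(conformers, energy_window, rmsd_cutoff):
    """Keep conformers within `energy_window` of the lowest energy that are
    at least `rmsd_cutoff` away from every conformer already kept."""
    if not conformers:
        return []
    ordered = sorted(conformers, key=lambda c: c[0])
    lowest = ordered[0][0]
    kept = []
    for energy, coords in ordered:
        if energy - lowest > energy_window:
            break  # everything after this is even higher in energy
        if all(rmsd(coords, k[1]) >= rmsd_cutoff for k in kept):
            kept.append((energy, coords))
    return kept


# Toy two-atom "conformers"; energies and coordinates are made up.
conformers = [
    (0.0, [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]),
    (0.5, [(0.0, 0.0, 0.0), (1.0, 0.01, 0.0)]),  # near-duplicate of the first
    (1.0, [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]),
    (10.0, [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]),  # outside the energy window
]
kept = select_conformers(conformers, energy_window=3.0, rmsd_cutoff=0.5)
```

A tighter RMSD cutoff keeps more near-duplicates; a wider energy window keeps more high-energy geometries. Both choices change what downstream tools see.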
Moving from files to a database
● If you’re going beyond hundreds of molecules, consider
using a chemically-aware database
○ Instant JChem
○ MolEditor
● Not too difficult to roll your own using Open Source
but requires programming skills
● Don’t use Excel (even with ChemDraw)
○ Missing data is not handled consistently
○ Can mangle identifiers (parse them as dates)
○ Complicates workflows
○ Formatting can hinder efficient data analyses
○ Difficult to have multiple users
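A minimal Python sketch for spotting identifiers that a spreadsheet might reinterpret as dates. The pattern is a rough heuristic written for this example; it is not a description of Excel's actual conversion rules, and the identifiers are made up. It will flag some harmless strings and miss others.

```python
import re

_MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|sept|oct|nov|dec"

# Heuristic: digit groups separated by '-' or '/', or a month name next to a number.
_DATE_LIKE = re.compile(
    rf"^(?:\d{{1,4}}[-/]\d{{1,2}}(?:[-/]\d{{1,4}})?"
    rf"|(?:{_MONTHS})[-/ ]?\d{{1,2}}"
    rf"|\d{{1,2}}[-/ ]?(?:{_MONTHS}))$",
    re.IGNORECASE,
)


def looks_date_like(identifier):
    """Return True if the identifier resembles something a spreadsheet may parse as a date."""
    return bool(_DATE_LIKE.match(identifier.strip()))


ids = ["CHEMBL25", "MAR1", "1-2", "3/4/05", "NCGC00015001"]
flagged = [i for i in ids if looks_date_like(i)]
```

Running a check like this before and after a round trip through a spreadsheet is a cheap way to catch mangled identifiers.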
Verifying data quality
● This is all good if it’s your own compounds
● What about structures from someone else?
○ Need to check (& try to fix) nonsensical chemistry
● Check for
○ invalid valences, nonsense stereo, fragments
○ weird/invalid atoms, multiple radical centers
● Consider http://cvsp.chemspider.com/
Karapetyan et al, J. Cheminf, 2015
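A minimal Python sketch of one of the checks above: counting fragments (disconnected components) from an atom count and a bond list, such as you would read from a MOL record. The `count_fragments` name and the input layout are assumptions for illustration; the technique is a plain union-find over atom indices.

```python
def count_fragments(n_atoms, bonds):
    """Count connected components given atom count and (i, j) bonded index pairs."""
    parent = list(range(n_atoms))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    return len({find(i) for i in range(n_atoms)})


# Heavy atoms of sodium acetate: C, C, O, O bonded together, Na on its own.
n_fragments = count_fragments(5, [(0, 1), (1, 2), (1, 3)])
```

A result greater than one flags salts, solvates or mixtures, which you may want to strip or split depending on the task.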
Structures are good. Are they useful?
● At this point you likely have a set of
correct (valid) structures
○ Are the structures useful for your purpose?
● A collection may have compounds with
problematic structures
○ Reactive groups, fluorophores, ADMET liabilities, …
● Consider rules & filters such as REOS, PAINS, Lilly
MedChem Rules
○ Implemented in commercial & OSS tools
○ Don’t use them blindly!
● Normalisation?
○ E.g. -N(=O)=O or -[N+](=O)[O-] (or doesn’t matter?)
What are you really looking for?
● Similarity searches are a common task
● What you get depends on
○ How the structure was entered
○ Normalization of structures
● But also on what you’re looking for
○ Connectivity
○ Atom & bond type
○ Shape or pharmacophore features …
● You may be surprised by false negatives
○ Test your query on structures it should find
but may not find
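A minimal Python sketch of a fingerprint similarity search using the Tanimoto coefficient, one common similarity measure (the slides do not name a specific one). Fingerprints are represented here as sets of integer feature IDs; the molecule names and feature IDs are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared features divided by all features present."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0
    return len(a & b) / union


def search(query_fp, library, threshold):
    """Return (name, score) pairs at or above `threshold`, best first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [hit for hit in scored if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy library: feature sets are invented for illustration.
library = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {1, 2, 5},
    "mol_c": {7, 8},
}
hits = search({1, 2, 3}, library, threshold=0.5)
```

If the same compound is entered in two forms that produce different features (for example two nitro representations), the score drops and the hit can fall below the threshold. That is the false-negative risk: always test the query against structures it should find.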
Because we love statistics & M/L
Alexander et al (2015)
Cherkasov et al (2014)
Huang & Fan (2013)
Chirico & Grammatica (2011)
Tropsha (2010)
Jain & Nicholls (2008)
Nicholls (2008)
Hawkins (2004)
Cronin & Schultz (2003)
• Look at your data, plot
your data
• Read up on statistics
• Linear models are a
good start
• Most of this is not
about cheminformatics
• But the notion of
chemical space plays a
key role in this area
Summary
Do
1. Choose appropriate file
formats
2. Check data quality
3. Get involved in the
cheminformatics
community
4. Trust but verify
Don’t
1. Treat chemical software as
a black box
2. Assume geometry
3. Use M/L blindly
4. Did we mention Excel
already?
Acknowledgements
● John May (NextMove Software)
● Adam Yasgar, Madhu Lal-Nag (NCATS)

Editor's Notes

  1. Docking software adjusts dihedral angles to generate conformations but leaves bond angles unchanged. Molecular descriptor software may compute values assuming a ‘flat’ 3D structure.
  2. Applies to inventory maintenance, integrating data from multiple sources
  3. This is more oriented towards biologists than chemists