Exploring Large Chemical Data Sets
Interactive Analysis and Visualization

Kyle Lutz and Marcus D. Hanwell
August 21, 2012
Skolnik Symposium
Overview
● An open-source, cross-platform
  cheminformatics tool
● A general-purpose tool for chemical data
  exploration and analysis
● Interactive, editable and queryable
  database of chemical data on the desktop
● Part of the Open Chemistry application
  suite (Avogadro and MoleQueue)
● Leverages several open-source projects:
  Qt, VTK, Chemkit, Open Babel, MongoDB
Architecture
● Native, cross-platform C++ application built with Qt
● Stores chemical data in a NoSQL MongoDB database
● Uses VTK for 2D and 3D data set visualization
Main Window
Molecule Details
Queries

Supports different
queries:
● Name
● Formula
● InChI
● InChIKey
● Structure and
   Substructure
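
As a rough sketch, the text-based query types could map onto MongoDB filter documents along these lines. Only the `name` field appears elsewhere in this deck; the `formula`, `inchi` and `inchikey` field names and the `buildFilter` helper are assumptions made for illustration. Structure and substructure queries are left out because they need a cheminformatics toolkit rather than a simple field match.

```javascript
// Hypothetical mapping from query type to document field.
// Only "name" is taken from the slides; the other field names are assumed.
const FIELD_FOR_QUERY = new Map([
  ["name", "name"],
  ["formula", "formula"],
  ["inchi", "inchi"],
  ["inchikey", "inchikey"],
]);

// Build a MongoDB-style exact-match filter for one query type.
function buildFilter(queryType, value) {
  const field = FIELD_FOR_QUERY.get(queryType.toLowerCase());
  if (field === undefined) {
    throw new Error(`Unsupported query type: ${queryType}`);
  }
  return { [field]: value };
}

const nameFilter = buildFilter("Name", "ethanol");
console.log(JSON.stringify(nameFilter)); // {"name":"ethanol"}
```

The resulting object has the same shape as the first argument of the `db.molecules.find(...)` call shown later in the deck.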
Similarity Searching
Charts and Plots
● Scatter plot of Polar Surface Area (TPSA) against Volume (VABC)
● Histogram of logP
Multidimensional Analysis
● Provide tools for viewing and analyzing large
  amounts of data with multiple dimensions
   ○ Scatter Plot Matrix
   ○ Parallel Coordinates
   ○ K-Means Clustering
● Interactive charts supporting selection
● Easy to add new chemical descriptors
Scatter Plot Matrix
[Figure: Polar Surface Area vs. logP vs. Mass vs. Rotatable Bonds vs. Volume]
Parallel Coordinates
[Figure: Polar Surface Area vs. logP vs. Mass vs. Rotatable Bonds vs. Volume]
K-Means Clustering
● ~30 numeric molecular descriptors
● 1D, 2D, and 3D visualization
● Selection and extraction of molecules from clusters
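
For readers unfamiliar with the technique, here is a minimal sketch of k-means (Lloyd's algorithm) over rows of numeric descriptors. It is not the application's implementation. The descriptor values are made up for illustration, and seeding the centroids from the first k rows is a simplification chosen to keep the example deterministic.

```javascript
// Squared Euclidean distance between two descriptor vectors.
function squaredDistance(a, b) {
  return a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0);
}

// Index of the centroid closest to the given point.
function nearest(point, centroids) {
  let best = 0;
  for (let c = 1; c < centroids.length; c++) {
    if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
      best = c;
    }
  }
  return best;
}

// Lloyd's algorithm. Initial centroids are the first k rows (an assumption
// made for determinism; random or k-means++ seeding is more usual).
function kMeans(rows, k, maxIterations = 100) {
  let centroids = rows.slice(0, k).map((r) => r.slice());
  let labels = new Array(rows.length).fill(-1);
  for (let iter = 0; iter < maxIterations; iter++) {
    // Assignment step: attach each row to its nearest centroid.
    const newLabels = rows.map((r) => nearest(r, centroids));
    const changed = newLabels.some((label, i) => label !== labels[i]);
    labels = newLabels;
    if (!changed) break; // converged
    // Update step: move each centroid to the mean of its members.
    centroids = centroids.map((old, c) => {
      const members = rows.filter((_, i) => labels[i] === c);
      if (members.length === 0) return old; // keep an empty cluster in place
      return old.map((_, d) => members.reduce((s, m) => s + m[d], 0) / members.length);
    });
  }
  return { labels, centroids };
}

// Made-up [logP, TPSA] rows, not real molecules.
const descriptors = [
  [0.5, 20.0],
  [4.0, 90.0],
  [0.7, 25.0],
  [3.8, 85.0],
  [0.6, 22.0],
];
const { labels, centroids } = kMeans(descriptors, 2);
console.log(labels); // [ 0, 1, 0, 1, 0 ]
```

Descriptors on very different scales (here TPSA dominates logP) would normally be normalized before clustering; that step is omitted to keep the sketch short.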
Similarity Visualization
● Similarity Clustering
● Calculated from fingerprint similarity or structural
  similarity
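
The slides do not say which similarity measure is used. The Tanimoto coefficient is a common choice for binary fingerprints and is assumed in the sketch below; the two fingerprints are toy bit patterns, not the output of any real fingerprinting method.

```javascript
// Number of set bits in one byte (0..255).
function popcount(byte) {
  let count = 0;
  while (byte) {
    count += byte & 1;
    byte >>= 1;
  }
  return count;
}

// Tanimoto coefficient of two equal-length binary fingerprints:
// (bits set in both) / (bits set in either).
function tanimoto(fpA, fpB) {
  if (fpA.length !== fpB.length) {
    throw new Error("Fingerprints must have equal length");
  }
  let both = 0;
  let either = 0;
  for (let i = 0; i < fpA.length; i++) {
    both += popcount(fpA[i] & fpB[i]);
    either += popcount(fpA[i] | fpB[i]);
  }
  // Convention chosen here: two all-zero fingerprints score 0.
  return either === 0 ? 0 : both / either;
}

// Toy 16-bit fingerprints stored as two bytes each.
const fpA = Uint8Array.from([0b10110000, 0b00001111]);
const fpB = Uint8Array.from([0b10010000, 0b00000111]);
console.log(tanimoto(fpA, fpB)); // 5 shared bits / 7 bits in the union
```

A pairwise matrix of such scores is one possible input to the similarity clustering described above.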
Similarity Visualization
[Figure: similarity visualization with labels 60%, 30% and 45%]
ChemicalJSON
                                                           Example: ethane.cjson

●   JSON (JavaScript Object Notation) is
    a "lightweight data-interchange
    format"
●   Store molecular structure, geometry,
    identifiers and descriptors all as a
    single JSON object
●   Benefits:
    ○ More compact than XML/CML
    ○ Native language of MongoDB and
      JSON-RPC
    ○ Easily converted to a binary
      representation (BSON)




Specification available at: http://wiki.openchemistry.org/Chemical_JSON
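
The ethane.cjson example from the slide did not survive as text. As a stand-in, the sketch below shows the idea of holding identifiers, atoms and bonds in a single JSON object. The key names (`chemical json`, `atoms.elements.number`, `bonds.connections.index`, `bonds.order`) are assumptions modeled on the Chemical JSON layout and should be checked against the specification; 3D coordinates and descriptors are left out rather than invented.

```javascript
// Sketch of ethane (C2H6) as one JSON object. Key names are assumptions.
const ethane = {
  "chemical json": 0,
  name: "ethane",
  inchi: "InChI=1S/C2H6/c1-2/h1-2H3",
  formula: "C2H6",
  atoms: {
    // Atomic numbers: two carbons followed by six hydrogens.
    elements: { number: [6, 6, 1, 1, 1, 1, 1, 1] },
  },
  bonds: {
    // Flat list of atom-index pairs: C-C, then three C-H bonds per carbon.
    connections: { index: [0, 1, 0, 2, 0, 3, 0, 4, 1, 5, 1, 6, 1, 7] },
    order: [1, 1, 1, 1, 1, 1, 1],
  },
};

// Serialize to text (what a .cjson file would hold) and read it back.
const text = JSON.stringify(ethane, null, 2);
const roundTripped = JSON.parse(text);
console.log(roundTripped.atoms.elements.number.length); // 8
console.log(roundTripped.bonds.order.length); // 7
```

Because the whole molecule is one object, it can be written to a file, stored as a database document, or sent in a message without any conversion beyond `JSON.stringify`.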
ChemicalJSON in MongoDB
● Nearly identical to what is stored in a file
   ○ A few extra fields stored
     ■ 2D diagram (as PNG)
     ■ Heavy atom count (for substructure searching)
     ■ Binary fingerprints (for similarity searching)
     ■ InChIKey for indexing and as a unique key
     ■ Mongo's OID ("_id") field
● Trivial to write out to a .cjson file:
     db.molecules.find({"name" : "ethanol"},
                       {"diagram" : 0,
                        "heavyAtomCount" : 0,
                        "fp2_fingerprint" : 0,
                        "_id" : 0})
Open Chemistry with ParaViewWeb
● Uses ParaView's client-server architecture
● Interactive 3D rendering
● Runs in any modern web browser




        URL: http://paraviewweb.kitware.com/OpenChemistry/
Open Chemistry with ParaViewWeb
    ChemData
RPC / Avogadro Integration
● Uses JSON-RPC to communicate with other
  applications (most notably Avogadro)
● Visualize data directly from the database
● Uses ChemicalJSON to represent molecular
  structures and transfer molecular information
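
A minimal sketch of what such an exchange might look like on the wire, assuming JSON-RPC 2.0 framing. The method name `loadMolecule`, the `chemicalJson` parameter and both helper functions are hypothetical, and no transport (socket, pipe) is shown.

```javascript
let nextRequestId = 0;

// Build a JSON-RPC 2.0 request object with a fresh id.
function makeRequest(method, params) {
  return { jsonrpc: "2.0", id: nextRequestId++, method, params };
}

// Build the matching success response; the id ties it to the request.
function makeResponse(requestText, result) {
  const request = JSON.parse(requestText);
  return { jsonrpc: "2.0", id: request.id, result };
}

// "loadMolecule" is a hypothetical method name, used only for illustration.
const request = makeRequest("loadMolecule", {
  chemicalJson: {
    name: "ethane",
    atoms: { elements: { number: [6, 6, 1, 1, 1, 1, 1, 1] } },
  },
});

// The sender serializes the request; the receiver parses it and replies.
const wire = JSON.stringify(request);
const response = makeResponse(wire, true);
console.log(response.id === request.id); // true
```

Since the molecule already travels as a JSON object, it can sit directly inside `params` without a separate encoding step.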
Future Directions
● Direct integration with third-party databases
  (PubChem, PDB, ...)
● Broader support for storing and analyzing
  computational job results
   ○ Linked with molecular structures
   ○ Direct from CML or converted/parsed
● Plugins to facilitate extension
   ○ Descriptors
   ○ Visualization
   ○ Chemical file input/output
● Scaling studies, working with multiple data
  servers and terabytes of data
Comments/Questions?
                  Home Page
   http://wiki.openchemistry.org/ChemData

                  Source Code
 https://github.com/OpenChemistry/chemdata

              ParaViewWeb Demo
http://paraviewweb.kitware.com/OpenChemistry
