The document discusses the development of a Common Standard for eXchange (CSX) to standardize the reporting of computational chemistry calculation results. CSX uses XML to capture calculation metadata, details about the molecular system studied, software and methods used, and calculated results in a structured format. This will allow results from different software packages to be more easily compared and facilitate sharing and reuse of computational chemistry data. Future plans include expanding the CSX schema and engaging the computational chemistry community to adopt the standard.
A Standard Data Format for Computational Chemistry: CSX
1. A Standard Data Format for
Computational Chemistry: CSX
Stuart J. Chalk (1,2), Neil Ostlund (1), Mirek Sopek (1), Bing Wang (1)
1) Chemical Semantics Inc., Gainesville FL
2) Department of Chemistry, University of North Florida
schalk@unf.edu
249th ACS Meeting, Denver, CO – March 2015
2. Semantic Annotation of Data
Current DOE Project
Data Transformations
Common Standard for eXchange (CSX)
CSX a Standard Data Format
The CSX Schema
CSX - Publishing Information
CSX - Molecular System Information
CSX - Calculated Result Information
Future Plans
Conclusion
Outline
3. Create a way to ‘teach’ computers what information
means – contextualize the data
Example
What is this? 904-620-1938
A computer just sees it as…
… a string
By using an appropriate semantic definition in RDF (the
Resource Description Framework) we can identify to the
computer that the text is a phone number (using the
Friend of a Friend (FOAF) specification), i.e.
Semantic Annotation of Data
RDF Specification http://www.w3.org/RDF/
FOAF Specification http://xmlns.com/foaf/spec/
<foaf:phone rdf:datatype="#string">904-620-1938</foaf:phone>
4. RDF can be used to relate information as well as
annotate it
The following RDF/XML shows how some information is
related (XML is the eXtensible Markup Language)
Applying this technology to computational chemistry
calculations will allow integration of the calculation and
results with data about chemicals from other sources
Semantic Annotation of Data
<rdf:Description rdf:about="http://example.org/StuartChalk">
<rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
<foaf:knows rdf:resource="http://example.org/NeilOstlund"/>
<foaf:phone rdf:datatype="…#string">904-620-1938</foaf:phone>
</rdf:Description>
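The RDF/XML above can be read back as subject, predicate, object statements. The following is a minimal sketch using only the Python standard library, not a full RDF parser. It assumes the description is wrapped in an rdf:RDF root element with namespace declarations, and it leaves out the rdf:datatype attribute because the slide elides its value. The function name extract_triples is illustrative.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The slide's example wrapped in an rdf:RDF root (an assumption; the slide
# shows only the rdf:Description element). The datatype attribute is omitted.
RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="http://example.org/StuartChalk">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
    <foaf:knows rdf:resource="http://example.org/NeilOstlund"/>
    <foaf:phone>904-620-1938</foaf:phone>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(rdf_xml):
    """Return (subject, predicate, object) tuples from simple rdf:Description blocks."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree tags look like "{namespace}local"; dropping the
            # braces gives the full predicate IRI.
            predicate = prop.tag[1:].replace("}", "")
            resource = prop.get(f"{{{RDF_NS}}}resource")
            # A property either points at another resource or holds literal text.
            obj = resource if resource is not None else (prop.text or "").strip()
            triples.append((subject, predicate, obj))
    return triples


triples = extract_triples(RDF_XML)
for triple in triples:
    print(triple)
```

Each tuple is one statement: the person is of type foaf:Person, knows another person, and has a phone number. A production system would use a dedicated RDF library rather than this hand-rolled reader.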
5. Chemical Semantics is funded by DOE to create a web
portal to collect, organize and make searchable the
results output from computational chemistry (CC)
calculations
This will be freely available and will accept output
from all CC software packages
The intent is to capture calculation results and…
Software used to calculate the results
Input parameters used in the calculation
Methodology by which the calculation was done
Details of the molecular system studied
DOE SBIR Grant
6. The approach Chemical Semantics is taking is to
1. Add code to software packages to generate an XML file
alongside the normal output file –OR–
Parse an existing output file (using a free application) and
generate an XML file
2. Send the XML file to the web portal
3. Convert the XML file to RDF in Turtle format (TTL)
4. Finally, ingest the TTL into a triplestore (Virtuoso)
All the data in Virtuoso can then be searched using SPARQL
(SPARQL Protocol and RDF Query Language)
Data Transformations
Virtuoso http://virtuoso.openlinksw.com/
SPARQL http://www.w3.org/TR/sparql11-query/
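The XML-to-Turtle step can be sketched as follows. This is an illustration under stated assumptions, not the converter used by the portal: the XML element names, the example.org IRIs, the ex: vocabulary, and the energy value are all made up for the example. The SPARQL query is included only as a string to show the kind of search a triplestore would accept; it is not executed here.

```python
import xml.etree.ElementTree as ET

# Hypothetical intermediate XML; element names and values are placeholders,
# not taken from the CSX schema or from a real calculation.
CALC_XML = """<calculation id="calc1">
  <software>ExamplePackage</software>
  <method>HF</method>
  <energy>-1.0</energy>
</calculation>"""

BASE = "http://example.org/calc/"    # hypothetical IRI prefix for calculations
VOCAB = "http://example.org/vocab#"  # hypothetical vocabulary namespace


def xml_to_turtle(xml_text):
    """Turn each child element into one Turtle statement about the calculation."""
    root = ET.fromstring(xml_text)
    subject = f"<{BASE}{root.get('id')}>"
    statements = []
    for child in root:
        # Escape backslashes and double quotes so the literal stays valid Turtle.
        value = (child.text or "").strip().replace("\\", "\\\\").replace('"', '\\"')
        statements.append(f'    ex:{child.tag} "{value}"')
    lines = [f"@prefix ex: <{VOCAB}> .", "", subject, " ;\n".join(statements) + " ."]
    return "\n".join(lines)


# A query of the kind that could be sent to the triplestore (shown, not run).
QUERY = """PREFIX ex: <http://example.org/vocab#>
SELECT ?calc ?energy
WHERE { ?calc ex:method "HF" ; ex:energy ?energy . }"""

turtle = xml_to_turtle(CALC_XML)
print(turtle)
```

All values are emitted as plain string literals to keep the sketch short; a real conversion would attach datatypes and units and would map elements to terms from an agreed ontology.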
7. Why XML?
Human readable (plain text - UTF-8)
Platform neutral
Archivable
Validatable
Why not use CML?
Inability to represent complex structures e.g. residues
No standard way to add CC results
Intermediate XML File
8. A CSX file is a text-based file written in XML
It is a structured data container designed to hold CC
result data and additional metadata
Version 0.x was developed by Neil Ostlund
Version 1.0 is the current stable release developed as
part of Phase 1 of the SBIR grant (limited scope)
Version 2.0 is currently under development as part of
Phase 2 of the SBIR grant
Common Standard for eXchange (CSX)
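To make the idea of a structured container concrete, the sketch below writes a small XML document with three sections that follow the outline of this talk (publishing, molecular system, calculated results). The element and attribute names, the function build_document, and the numeric values are hypothetical placeholders and are not taken from the CSX schema.

```python
import xml.etree.ElementTree as ET


def build_document(title, atoms, energy_hartree):
    """Build a CSX-like container; all element names here are illustrative only."""
    root = ET.Element("csx", version="1.0")

    # Publishing information: who or what the record is about.
    publication = ET.SubElement(root, "publication")
    ET.SubElement(publication, "title").text = title

    # Molecular system: one element per atom with Cartesian coordinates.
    system = ET.SubElement(root, "molecularSystem")
    for symbol, x, y, z in atoms:
        ET.SubElement(system, "atom", element=symbol, x=str(x), y=str(y), z=str(z))

    # Calculated results: the value and its units are stored explicitly.
    results = ET.SubElement(root, "calculatedResults")
    energy = ET.SubElement(results, "energy", units="hartree")
    energy.text = str(energy_hartree)
    return root


# Placeholder geometry and energy, not results from a real calculation.
doc = build_document(
    "Example H2 calculation",
    [("H", 0.0, 0.0, 0.0), ("H", 0.0, 0.0, 0.74)],
    -1.0,
)
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

Because units and section names are explicit in the markup, a reader or a program does not need to know which software package produced the numbers in order to interpret them.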
9. It is well known that the formats in which data are
reported in CC output files are:
Highly variable (software specific)
Sometimes difficult to interpret
Standardization would:
Allow data from different packages to be more easily
compared
Open up opportunities for software development to
display and reuse data for different applications
This mirrors a movement in the CC community toward a
common driver base for CC software packages
CSX as a Standard Data Format
10. In order to describe the layout and allowed names of
elements and attributes, and values for both, a schema
document is available for the CSX specification
This can be used to help new users write valid CSX files
(using XML editing applications such as XML Spy and
oxygenXML) and…
… validate existing CSX files using any of a number of
XML validators (e.g. Xerces) …
… and understand the structure of the data especially
for less frequently calculated results
The CSX Schema
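As a rough stand-in for schema validation, the sketch below checks that a document is well-formed XML and contains a set of required top-level elements. This is not XML Schema validation: the Python standard library has no XSD validator, so a real workflow would use a validating parser such as Xerces with the published CSX schema. The required element names and the function check_structure are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical required sections; the real names come from the CSX schema.
REQUIRED_SECTIONS = ("publication", "molecularSystem", "calculatedResults")


def check_structure(xml_text):
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Not well-formed XML, so no further checks are possible.
        return [f"not well-formed: {exc}"]
    problems = []
    for name in REQUIRED_SECTIONS:
        if root.find(name) is None:
            problems.append(f"missing required element <{name}>")
    return problems


good = "<csx><publication/><molecularSystem/><calculatedResults/></csx>"
incomplete = "<csx><publication/></csx>"
broken = "<csx><publication></csx>"

for label, text in (("good", good), ("incomplete", incomplete), ("broken", broken)):
    print(label, check_structure(text))
```

A schema goes further than this check: it also constrains attribute names, allowed values, and element order, which is what lets editors guide new users toward valid files.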
18. Work on CSX 2.0 is ongoing – expand to multiple systems
and sets of calculated results
Develop CSX focused website with converter
functionality, libraries, and documentation
Engage CC software users/programmers to get involved
with the project
Organize a community developer workshop over
summer 2015
Publish version 2.0 of CSX in Fall 2015
Future Plans
19. CSX started out as a stepping stone to transfer
information to the CS portal
Having a data standard for CC is an important
development in and of itself
The CC community can do more with their data
Leverage XML tools to visualize, process, etc.
Compare results across CC packages
Validate results
Reference basis sets (https://bse.pnl.gov/)
Conclusion