Kobeworkshop pubchemqc project

The PubChemQC Project
A big data construction by first-
principles calculations of molecules
中田真秀 (NAKATA Maho)
ACCC RIKEN
2016/2/17 15:50-16:40
Kobe workshop for material design on
strongly correlated electrons in molecules
and materials
http://www.aics.riken.jp/labs/cms/workshop/201602/index.html

Background
• All matter is composed of atoms and molecules.
• A dream of theoretical chemists: do chemistry without
experiment!
• On computers
• Chemical space is really huge!
– The number of candidates for drugs:
10^60 (http://onlinelibrary.wiley.com/doi/10.1002/wcms.1104/abstract)
• Cf. Exa: 10^18
– Combinatorics problem
– Adding chemical reactions: 10^120
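To get a feel for these numbers, the following Python sketch estimates how long a brute-force scan of 10^60 candidates would take on an exa-scale machine. The cost of one operation per candidate is an illustrative assumption, far cheaper than any real quantum chemistry calculation.

```python
# Rough scale estimate: brute-force enumeration of drug-like chemical space.
CANDIDATES = 10**60          # estimated number of drug candidates (from the slide)
EXA_OPS_PER_SEC = 10**18     # "Exa" = 10^18 operations per second
SECONDS_PER_YEAR = 365 * 24 * 3600

# Assumption: a single operation per candidate (wildly optimistic).
seconds = CANDIDATES // EXA_OPS_PER_SEC
years = seconds / SECONDS_PER_YEAR

print(f"{float(seconds):.1e} seconds, about {years:.1e} years")
```

Even under this optimistic assumption the scan is hopeless, which is why the combinatorics has to be tamed by selecting important molecules.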

Why has 2-RDM theory been
suspended?
• Is there a shortcut for solving the Schrödinger equation?
– Density functional theory, reduced density matrix theory
• Using 2-particle reduced density matrices, we can
reduce the number of variables drastically.
– Journal of Chemical Physics, 114, 8282-8292 (2001).
Introduction of semidefinite programming
– Computational and Theoretical Chemistry Volume 1003, 1
January 2013, Pages 22-7 Application to 2D Hubbard model
– Journal of Chemical Physics 128 (16), 164113 (2008). Various
molecules
• However it is not size-consistent, nor size-extensive.
– Phys. Chem. Chem. Phys., 2009,11, 5558-5560
– AIP Advances 2, 032125 (2012)
– Physical Review A 80, 042109 (2009)

Fundamental question about solving SCE…
• Can this problem be solved efficiently?
– Very likely NO!
– Example: a spin-glass Hamiltonian is very hard to
solve: this is as hard as solving the Traveling
Salesperson Problem
– Algorithms that make no
assumptions about the 2-particle
interaction are never efficient.

Results from computational complexity theory
• The N-representability problem is QMA-hard
– Liu, Y.-K., Christandl, M. & Verstraete, F. Quantum computational complexity of the N-representability problem: QMA complete. Phys. Rev.
Lett. 98, 110503 (2007).
• Solving a 2-local Hamiltonian is also QMA-hard
– Kempe, J., Kitaev, A. & Regev, O. The Complexity of the Local Hamiltonian Problem.
SIAM J. Comput., 35(5), 1070–1097. http://epubs.siam.org/doi/abs/10.1137/S0097539704445226
• Finding the ground-state energy of the Hubbard model
in an external magnetic field is still QMA-hard
– http://www.nature.com/nphys/journal/v5/n10/abs/nphys1370.html
• Good review: Computational Complexity in Electronic
Structure
– http://arxiv.org/abs/1208.3334

• What I have learned
– There is no algorithm that solves a general 2-particle Hamiltonian
efficiently.
– There is (maybe) no algorithm that solves the electronic Hamiltonian
efficiently.
– Introducing additional conditions on the 2-particle interaction
is mandatory.
Heuristics are much more important than
thinking about subtle shortcuts.

Current status of computational
chemistry
• Relatively good agreements with experiments.
• Can explain chemical phenomena
– Many good quantum chemistry programs are
available!
– A “DFT B3LYP 6-31G*” calculation is the gold
standard!
• We want to lead chemistry
– We usually explain what happened.
– We rarely predict something very exciting!

Difference between experiment and
calculation/theory
• Finding interesting phenomena or problem
– How do we convert CO2 to O2? N2 + H2 to NH3?
– How do we synthesize a compound from known ingredients?
• Design a key chemical reaction.
• Calculations or experiments
• Analysis of results
• Propose new experiments
Only one difference: calculations or experiments

Difference between experiment and
calculation/theory
• No difference as science
• Most important thing is chemical intuition!
• Can we implement chemical intuition on
computers?
– Yes, but apparently there is a long way to go.
– Basic strategy: collect data, feed it to computers,
and process it.

Can we implement chemical intuition
on computers?
• Collect facts by computer calculations.
– Many good implementations are available.
– Huge computer resources are required but
– They are still growing exponentially
• Feed them to computers.

Can we implement chemical intuition on computers?
• Feed them to computers.
• Machine Learning (ML)
– Very successful on
Image /sound recognition,
natural language processing.
Organic chemistry is somewhat similar to language…
Cadeddu, A., Wylie, E. K., Jurczak, J., Wampler-Doty, M. and Grzybowski, B. A. (2014), Organic Chemistry as a Language and the
Implications of Chemical Linguistics for Structural and Retrosynthetic Analyses. Angew. Chem. Int. Ed., 53: 8108–8112.
doi:10.1002/anie.201403708
Recently, some research papers using ML have been published:
Big Data meets Quantum Chemistry Approximations: The Δ-Machine Learning
Approach Raghunathan Ramakrishnan, Pavlo O. Dral, Matthias Rupp, O. Anatole von
Lilienfeld http://arxiv.org/abs/1503.04987 etc..
For better results from ML, we need a huge dataset

Can we implement chemical intuition on computers?
The first step might be:
• Build a huge dataset by quantum chemistry
program packages!
– Results should agree with experiments.
– Improving the dataset is the task of QC researchers.
• Faster calculations for larger systems
• Better or sufficient treatment of electron correlation
• And build a search engine database from the
results.

Googling molecules
Gives you what you need

What is needed for Googling molecules?
1. Types, kinds, variety of molecules
– The number of molecules is infinite, but we can cover the important ones
2. Required properties of molecules
– Molecular structure, energy, UV excitation energy, dipole
moment
3. Getting properties of molecules by calculation?
– Accuracy of calculation, and computer resources…
4. Coding or encoding molecules
– IUPAC nomenclature is not suitable
– Do not think about graph theory
5. Fast calculation (with deep learning(?))
– 10^8 molecules/sec, as chemical space is huge.

Databases for lists of molecules
• PubChem: 50,000,000 molecules listed, made by NIH,
public domain, no curation (imported from catalogs,
etc.), available via FTP.
• ChemSpider: 28,000,000 entries, better curation, no
FTP. Redistribution and download are restricted.
• Web-GDB13: 900,000,000 entries, just generated by
combinatorics. No
• ZINC, ChEMBL, DrugBank …
• CAS: 70,000,000 molecules, proprietary
• Nikkaji: 6,000,000, proprietary
We use PubChem as the source of molecules

Ex. A molecule listed in PubChem

Database for molecular properties by
experiments
• We must do experiments to obtain
molecular properties.
– No free comprehensive database is known so far.
– Pharmaceutical companies do O(1,000,000)
experiments for high throughput screening.
• Experiments are hugely expensive!
– Time consuming, large facilities, costly, hazardous
We do not do experiments!

Database for molecular properties by computer
calculation
• Gold standard method: “Density functional
theory (B3LYP functional) + 6-31G(d) basis set”
– Accuracy is quite satisfactory (1-10 kcal/mol) for
biological systems and organic chemistry.
– Good implementations are available.
– Costs less (fast, needs only a supercomputer, no hazards)
– Calculation times keep getting shorter
• Intel Core i7 (esp. SandyBridge) is very fast.
• Still we need huge resources, though.
We calculate by computer instead!

What is a molecule?
[Figure: representations of propionaldehyde (wavefunction, 3D coordinates, structure, IUPAC nomenclature, common name), ranging from "hard to understand but rigorous" to "easy to understand but many corner cases". Image from Wikipedia.]
No rigorous definition for a molecule

What is a molecule?
• No rigorous definition for “what is a molecule”
• Nomenclature
– 3D coordinates of the nuclei
– Structural formula
– IUPAC nomenclature
– Higher abstraction or less abstraction?
• Better molecular encoding method?
– Easy to understand for humans
– Easy to understand for computers as well
– Can describe most cases, with fewer corner cases.
– Compromise between dream and reality

Encoding molecules: SMILES
SMILES is a good encoding method for molecules
IUPAC nomenclature
tert-butyl N-[(2S,3S,5S)-5-[[4-[(1-benzyltetrazol-5-yl)
methoxy]phenyl]methyl]-3-hydroxy-6-[[(1S,2R)-
2-hydroxy-2,3-dihydro-1H-inden-1-yl]amino]-
6-oxo-1-phenylhexan-2-yl]carbamate
We can encode molecules
• SMILES
CN(C)CCOC12CCC(C3C1CCCC3)C4=CC=CC=C24
• InChI (made by IUPAC)
InChI=1S/C20H29NO/c1-21(2)13-14-22-20-12-11-15(16-7-3-5-9-18(16)20)17-8-4-6-10-19(17)20/h3,5,7,9,15,17,19H,4,6,8,10-14H2,1-2H3
…
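As a small illustration of why such line notations are machine readable, the sketch below pulls the chemical-formula layer out of the InChI string from this slide and counts the atoms in it. The helper names are hypothetical, and the parser only handles simple formulas without charges or multiple components.

```python
import re

# InChI string from the slide (line breaks removed).
INCHI = ("InChI=1S/C20H29NO/c1-21(2)13-14-22-20-12-11-15(16-7-3-5-9-18(16)20)"
         "17-8-4-6-10-19(17)20/h3,5,7,9,15,17,19H,4,6,8,10-14H2,1-2H3")

def inchi_formula(inchi: str) -> str:
    """Return the chemical-formula layer (second '/'-separated field)."""
    return inchi.split("/")[1]

def element_counts(formula: str) -> dict:
    """Count atoms in a simple formula such as 'C20H29NO'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts

formula = inchi_formula(INCHI)
counts = element_counts(formula)
print(formula, counts)
```

The formula layer agrees with the SMILES string on the same slide, which also contains 20 carbon atoms, one nitrogen and one oxygen.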

What is SMILES?
• Simplified Molecular Input Line Entry System
– A linear representation of molecule using ASCII.
– Stereochemistry (configuration) can also be encoded
– Human readable, and also machine readable.
– Almost one-to-one mapping between a molecule and
SMILES via universal SMILES
• David Weininger at USEPA Mid-Continent Ecology Division Laboratory invented SMILES
• InChI by IUPAC
– International Chemical Identifier : open standard (non proprietary)
– NM O’Boyle invented “Universal SMILES” via InChI

Example by SMILES
http://en.wikipedia.org/wiki/SMILES
Molecule           Structure      SMILES
Nitrogen molecule  N≡N            N#N
Copper sulfate     Cu^2+ SO4^2-   [Cu+2].[O-]S(=O)(=O)[O-]
Oenanthotoxin                     CCC[C@@H](O)CCC=CC=CC#CC#CC=CCO
Vitamin B1                        OCCc1c(C)[n+](=cs1)Cc2cnc(C)nc(N)2
Aflatoxin B1                      O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5

Some corner cases
Two different SMILES for Ferrocene
• C12C3C4C5C1[Fe]23451234C5C1C2C3C45
• [CH-]1C=CC=C1.[CH-]1C=CC=C1.[Fe+2]
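The SMILES strings shown so far can be split into tokens with a single regular expression. The sketch below is a simplified tokenizer covering only the subset of SMILES used in these slides (it is not a full parser, and the function names are hypothetical). It counts heavy atoms and checks that every ring-closure label appears an even number of times.

```python
import re
from collections import Counter

# Simplified token pattern: bracket atoms, two-letter halogens, organic-subset
# atoms (aliphatic and aromatic), ring closures (%nn or one digit), bonds,
# branches, and the '.' separator between disconnected parts.
TOKEN = re.compile(
    r"(\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[=#/\\\-:]|[().])"
)

def tokenize(smiles: str) -> list:
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens

def is_atom(token: str) -> bool:
    # Bracket atoms and bare element symbols; [H] would also be counted here.
    return token.startswith("[") or token[0].isalpha()

def heavy_atom_count(smiles: str) -> int:
    return sum(is_atom(t) for t in tokenize(smiles))

def ring_closures_balanced(smiles: str) -> bool:
    labels = Counter(t for t in tokenize(smiles) if t[0].isdigit() or t[0] == "%")
    return all(n % 2 == 0 for n in labels.values())

aflatoxin = "O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5"
ferrocene_a = "C12C3C4C5C1[Fe]23451234C5C1C2C3C45"
ferrocene_b = "[CH-]1C=CC=C1.[CH-]1C=CC=C1.[Fe+2]"
for s in ("N#N", aflatoxin, ferrocene_a, ferrocene_b):
    print(heavy_atom_count(s), ring_closures_balanced(s), s)
```

Both ferrocene strings contain the same atoms but describe the bonding differently, which is exactly the kind of corner case a canonical encoding has to resolve.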

Construction of ab initio chemical
database
• Molecular information is from PubChem
• Properties are calculated from first principles by
computer
– Many program packages are available
– DFT (B3LYP)
– 6-31G(d) basis set and geometry optimization
– Excited-state calculation by TD-DFT with 6-31+G(d)
– Best for organic molecules or biomolecules
• Molecular encoding: SMILES / InChI
• Huge computer resources
• Dream come true
– Google-like search engine for chemistry

The PubChemQC Project
• http://pubchemqc.riken.jp/
• AIP Conf. Proc. 1702, 090058 (2015);
http://dx.doi.org/10.1063/1.4938866
• A public domain database for molecules
• Ab initio (first-principles) calculation of molecular
properties of PubChem compounds
• 2014/1/15: 13,000 molecules
• 2014/7/29 : 155,792 molecules
• 2014/10/30 : 906,798 molecules
• 2014/12/3 : 1,137,286 molecules
• 2015/3/25 : 1,673,532 molecules
• 2015/5/27: 2,122,146 molecules
• 2016/2/10: 3,046,948 (2,660,218 with excited states)
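The snapshots above imply an average throughput, which the following sketch computes from the listed dates and counts. Only the numbers on the slide are used; the function name is hypothetical.

```python
from datetime import date

# (snapshot date, number of molecules) as listed on the slide
snapshots = [
    (date(2014, 1, 15), 13_000),
    (date(2014, 7, 29), 155_792),
    (date(2014, 10, 30), 906_798),
    (date(2014, 12, 3), 1_137_286),
    (date(2015, 3, 25), 1_673_532),
    (date(2015, 5, 27), 2_122_146),
    (date(2016, 2, 10), 3_046_948),
]

def average_rate(first, last):
    """Average number of molecules added per day between two snapshots."""
    days = (last[0] - first[0]).days
    return (last[1] - first[1]) / days

overall = average_rate(snapshots[0], snapshots[-1])
print(f"overall: {overall:.0f} molecules/day")
for a, b in zip(snapshots, snapshots[1:]):
    print(a[0], "->", b[0], f"{average_rate(a, b):.0f} molecules/day")
```

The per-interval rates vary considerably, which is consistent with the dependence on shared computer resources described later.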

The PubChemQC project
http://pubchemqc.riken.jp/
WIP: no search engine, just data

PubChemQC

Related works
• Related works
– Raghunathan Ramakrishnan, Pavlo Dral, Matthias Rupp, O.
Anatole von Lilienfeld: Quantum Chemistry Structures and
Properties of 134 kilo Molecules, Scientific Data, 1: 140022,
Nature Publishing Group, 2014.
– NIST Web Book
• http://webbook.nist.gov/chemistry/
• Small numbers of molecules. Comparing many methods
– Harvard Clean Energy Project
• http://cleanenergy.molecularspace.org/
• 25,000,000 (?) molecules for photo devices, made by combinatorics
– Sugimoto et al.: 2013 CBI symposium poster
• Almost the same as our database; currently not open to the
public (now??)
Our contribution: 20 times larger

How do we do it?
• Generate initial 3D conformation by OpenBABEL
– The SDF contains a 3D conformation but we don’t use it.
– OpenBABEL -h (add hydrogens) --gen3d (generate 3D
coordinates)
• Ab initio calculation by GAMESS+Firefly
– Using Gaussian can lead to a political problem(?)
– PM3 optimization
– Hartree-Fock/STO-6G geometry optimization
– Firefly+GAMESS geometry optimization in B3LYP/6-31G*
– Ten excitation energies by TDDFT/6-31+G* (no geometry
optimization)
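The workflow above can be sketched as a list of stages plus the Open Babel command for the first one. This is a minimal sketch: the exact command line and file names used in the project are not given in the slides, so the `obabel` invocation and the output path below are assumptions, and the command is only built, not executed.

```python
import shlex

# Stages in the order listed on the slide: (program, task).
STAGES = [
    ("OpenBABEL", "add hydrogens and generate initial 3D coordinates"),
    ("GAMESS/Firefly", "PM3 geometry optimization"),
    ("GAMESS/Firefly", "Hartree-Fock/STO-6G geometry optimization"),
    ("GAMESS/Firefly", "B3LYP/6-31G* geometry optimization"),
    ("GAMESS/Firefly", "TDDFT/6-31+G* excitation energies (10 states)"),
]

def obabel_command(smiles: str, out_path: str) -> list:
    """Build an Open Babel command: SMILES in, 3D structure file out.

    '-:' passes the SMILES string directly, '-O' names the output file,
    '-h' adds hydrogens and '--gen3d' generates 3D coordinates.
    """
    return ["obabel", f"-:{smiles}", "-O", out_path, "-h", "--gen3d"]

cmd = obabel_command("CCO", "ethanol.mol")  # hypothetical example molecule
print(shlex.join(cmd))
for number, (program, task) in enumerate(STAGES, start=1):
    print(number, program, task)
```

Running cheap methods first (PM3, then HF/STO-6G) gives the expensive B3LYP optimization a better starting geometry.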

How do we do it?
• Heavily using OpenBABEL
• Extracting molecular information
– Sort PubChem compounds by molecular weight
– OpenBABEL
• Encoded by SMILES
– Isomeric SMILES: stereochemistry retained
– OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1
– CCC[C@@H](O)CCC=CC=CC#CC#CC=CCO
– CC(=O)OCCC(/C)=CC[C@H](C(C)=C)CCC=C
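Sorting by molecular weight means the cheapest molecules are calculated first. A minimal sketch of that ordering follows, using a hypothetical mini-batch of well-known compounds and an assumed cutoff of MW 500; the real project reads these values from PubChem.

```python
# Hypothetical mini-batch: (name, SMILES, molecular weight in g/mol).
compounds = [
    ("aspirin", "CC(=O)Oc1ccccc1C(=O)O", 180.16),
    ("water", "O", 18.02),
    ("benzene", "c1ccccc1", 78.11),
    ("ethanol", "CCO", 46.07),
]

MW_LIMIT = 500.0  # assumed cutoff for this sketch

# Keep compounds below the cutoff and process the lightest ones first.
queue = sorted((c for c in compounds if c[2] < MW_LIMIT), key=lambda c: c[2])

for name, smiles, mw in queue:
    print(f"{mw:7.2f}  {name:8s} {smiles}")
```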

How to convert a PubChem compound
to a quantum chemistry calculation
aflatoxin
→ SMILES: O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5
→ OpenBABEL
→ Ab initio calculation

Final results will be
• Uploaded to http://pubchemqc.riken.jp/
• Currently we upload
– input file (ground / excited state)
– Output file (ground / excited state)
– Final geometry in Mol file

Scaling of computation
• Embarrassingly parallel for each molecule
• Very roughly speaking, the time required for a
calculation scales like N^4
– N: molecular weight
• The problems are very hard (complexity theory)
– Hartree-Fock calculation
– DFT (B3LYP) calculation
– Geometry optimization
• In practice, many molecules can be solved
efficiently
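The N^4 rule of thumb can be turned into a quick relative-cost estimate. The sketch below takes MW 160 as the reference point (an assumption chosen to match the throughput figures quoted for the clusters) and ignores everything else that affects run time.

```python
def relative_cost(mw: float, reference_mw: float = 160.0, exponent: int = 4) -> float:
    """Cost of one molecule relative to a reference molecular weight,
    assuming time scales like (molecular weight) ** exponent."""
    return (mw / reference_mw) ** exponent

for mw in (160, 200, 300, 500):
    print(f"MW {mw}: about {relative_cost(mw):.1f}x the cost at MW 160")
```

Doubling the molecular weight multiplies the estimated cost by 16, so the heavier part of PubChem dominates the total computing time.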

Computer Resources
• RICC: Intel Xeon 5570 Westmere (2.93GHz, 8
cores/node) x 1000
– 1000-10000 molecules/day (MW 160)
– Heavily depends on the load from other users
– Time limit: 8 hours
• Quest: Intel Core2 Duo (1.6GHz/node) x 700
– 3000-8000 molecules / day (MW 160)
– 100-1000 molecules / day (MW 200-300)
– Time limit: 20 hours
• Compounds that fail to calculate are ignored for
now.

Computer Resources
• Storage
– Approx. 500GB for 1,000,000 molecules (xz
compressed)
– Approx. 20 TB for 40,000,000 molecules (xz
compressed)
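The two storage figures are a linear extrapolation of each other, as this short sketch shows. Decimal units (1 TB = 1000 GB) are assumed.

```python
GB_PER_MILLION_MOLECULES = 500  # xz-compressed, from the slide

def storage_tb(n_molecules: int) -> float:
    """Estimated compressed storage in TB, assuming linear growth."""
    return n_molecules / 1_000_000 * GB_PER_MILLION_MOLECULES / 1000

# Average compressed size per molecule in kB (1 GB = 1e6 kB).
kb_per_molecule = GB_PER_MILLION_MOLECULES * 1e6 / 1_000_000

print(storage_tb(1_000_000), "TB for 1,000,000 molecules")
print(storage_tb(40_000_000), "TB for 40,000,000 molecules")
print(kb_per_molecule, "kB per molecule on average")
```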

Molecular weight and Lipinski Rule
• Lipinski’s rule of five (Pfizer's rule of five): a rule of
thumb for drug discovery
• No more than 5 hydrogen bond donors
• No more than 10 hydrogen bond acceptors
• A molecular mass less than 500 daltons
• An octanol-water partition coefficient log P not greater than 5
• A molecular weight below 500 is also
very convenient for computational chemistry
– For routine calculations without experimental data
other than the molecular formula
– If larger than 500, secondary or higher-order structure
becomes important, e.g., in proteins
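The four criteria listed above translate directly into a small check. In this sketch the descriptor values are supplied by hand (the aspirin numbers are approximate and the second molecule is hypothetical); in practice they would come from a cheminformatics toolkit, and the class and function names here are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class Descriptors:
    h_bond_donors: int
    h_bond_acceptors: int
    molecular_mass: float  # daltons
    log_p: float           # octanol-water partition coefficient

def lipinski_violations(d: Descriptors) -> list:
    """Return the list of rule-of-five criteria that the molecule violates."""
    violations = []
    if d.h_bond_donors > 5:
        violations.append("more than 5 hydrogen bond donors")
    if d.h_bond_acceptors > 10:
        violations.append("more than 10 hydrogen bond acceptors")
    if d.molecular_mass >= 500:
        violations.append("molecular mass not less than 500 daltons")
    if d.log_p > 5:
        violations.append("log P greater than 5")
    return violations

aspirin = Descriptors(1, 4, 180.16, 1.2)       # approximate values
large = Descriptors(7, 12, 812.0, 6.1)         # hypothetical molecule
print(lipinski_violations(aspirin))
print(lipinski_violations(large))
```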

Molecular weight distribution at
PubChem
[Figure: molecular weight distribution of PubChem compounds. Annotations: "We are still here"; "Lipinski limit MW=500"; "30,000,000 molecules (excluding mixtures)".]

How long will it take to finish?
• For drug design, we need to calculate all
molecules of MW < 500
• Total 30,000,000 molecules
– This number may increase in the future
• Current (2014/12/4): 1,100,000 molecules
– Only about 3.7%
• 10,000 molecules/day -> 8.2 years
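The estimate above follows from simple division, reproduced in the sketch below. A constant rate of 10,000 molecules per day is the slide's assumption; the remaining-time figure is an additional illustration derived from the same numbers.

```python
TOTAL_MOLECULES = 30_000_000   # molecules with MW < 500
DONE = 1_100_000               # calculated as of 2014/12/4
RATE_PER_DAY = 10_000          # assumed constant throughput

fraction_done = DONE / TOTAL_MOLECULES
years_total = TOTAL_MOLECULES / RATE_PER_DAY / 365
years_remaining = (TOTAL_MOLECULES - DONE) / RATE_PER_DAY / 365

print(f"done: {fraction_done:.1%}")
print(f"all molecules: {years_total:.1f} years")
print(f"remaining molecules: {years_remaining:.1f} years")
```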

How long will it take to finish?
• 10+ years? No, maybe far less.
• 25 years ago (1990) computers were so slow
– Even ab initio calculations were very difficult on a
486DX@25MHz or
68000@10MHz

Outlook, prospect, hope…
• Far better in silico screening
– Fewer or no experiments are necessary
• Even faster calculation using machine learning
– 10,000 molecules / second ?
– Requires a huge data set to learn from.
– Bio or organic molecules are easy to calculate.
– Already available: Raghunathan Ramakrishnan
https://scholar.google.co.jp/citations?user=jSCGozoAAAAJ&hl=ja&oi=sra
• Database for chemical reactions
– Precise calculation is required
– GRRM method + machine learning (?)
• Geometry optimization for proteins (PDB)
– Only X-ray crystal structures are available

Difficulties in this project
• Parameters needed for calculations vary from molecule
to molecule
• Properties can differ depending on the initial guess
• Computer resources
– Raspberry Pi? NVIDIA Jetson? BOINC?
• Molecular encoding never ends
– Neither SMILES nor InChI is complete
– Some corner cases may be chemically interesting.
