1. The PubChemQC Project
A big data construction by first-
principles calculations of molecules
中田真秀 (NAKATA Maho)
ACCC RIKEN
2016/2/17 15:50-16:40
Kobe workshop for material design on
strongly correlated electrons in molecules
and materials
http://www.aics.riken.jp/labs/cms/workshop/201602/index.html
2. Background
• All matter is composed of atoms and molecules.
• A dream of theoretical chemists: do chemistry without
experiments!
• On computers
• Chemical space is really huge!
– The number of candidates for drugs: 10^60
(http://onlinelibrary.wiley.com/doi/10.1002/wcms.1104/abstract)
• Cf. Exa: 10^18
– Combinatorics problem
– Adding chemical reactions: 10^120
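To get a feeling for these magnitudes, here is a rough back-of-the-envelope sketch. The assumption that one candidate is examined per operation on an exa-scale machine is purely illustrative and not from the slides.

```python
# Back-of-the-envelope: how long would brute-force enumeration take?
# Assumption (hypothetical): one candidate molecule is examined per
# operation on a machine performing 10^18 operations per second.
DRUG_CANDIDATES = 10**60        # size of drug-like chemical space (slide value)
EXA_OPS_PER_SEC = 10**18        # "Exa" rate (slide value)
SECONDS_PER_YEAR = 365 * 24 * 3600

seconds_needed = DRUG_CANDIDATES // EXA_OPS_PER_SEC
years_needed = seconds_needed / SECONDS_PER_YEAR

print(f"seconds needed: 10^{len(str(seconds_needed)) - 1}")
print(f"years needed:   {years_needed:.1e}")
```

Even under this generous assumption, exhaustive enumeration is out of reach, which is why the talk focuses on important, already known molecules.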
3. Why has 2-RDM theory been
suspended?
• Is there a shortcut for solving the Schrödinger equation?
– Density functional theory, reduced density matrix theory
• Using 2-particle reduced density matrices, we can
reduce the number of variables drastically.
– Journal of Chemical Physics 114, 8282-8292 (2001).
Introduction of semidefinite programming
– Computational and Theoretical Chemistry, Volume 1003, 1
January 2013, Pages 22-27. Application to the 2D Hubbard model
– Journal of Chemical Physics 128 (16), 164113 (2008). Various
molecules
• However, it is neither size-consistent nor size-extensive.
– Phys. Chem. Chem. Phys., 2009,11, 5558-5560
– AIP Advances 2, 032125 (2012)
– Physical Review A 80, 042109 (2009)
4. Fundamental question on solving SCE…
• Can this problem be solved efficiently?
– Very likely NO!
– Example: a spin-glass Hamiltonian is very hard to
solve; it is as hard as solving the Traveling
Salesperson Problem.
– Algorithms that make no
assumption on the 2-particle
interaction are never efficient.
5. Fundamental question on solving SCE…
Results from computational complexity theory
• The N-representability problem is QMA-hard
– Liu, Y.-K., Christandl, M. & Verstraete, F. Quantum computational complexity of the N-representability problem: QMA complete. Phys. Rev. Lett. 98, 110503 (2007).
• Solving a 2-local Hamiltonian is also QMA-hard
– The Complexity of the Local Hamiltonian Problem,
SIAM J. Comput., 35(5), 1070–1097. http://epubs.siam.org/doi/abs/10.1137/S0097539704445226
• Finding the ground-state energy of the Hubbard model
in an external magnetic field is still QMA-hard
– http://www.nature.com/nphys/journal/v5/n10/abs/nphys1370.html
• Good review: Computational Complexity in Electronic
Structure
– http://arxiv.org/abs/1208.3334
6. Fundamental question on solving SCE…
• What I have learned
– There is no algorithm that solves a general 2-particle Hamiltonian
efficiently.
– There is (maybe) no algorithm that solves the electronic Hamiltonian
efficiently.
– Introducing other conditions on the 2-particle interaction
is mandatory.
Heuristics are much more important than
thinking about subtle shortcuts.
7. Current status of computational
chemistry
• Relatively good agreements with experiments.
• Can explain chemical phenomena
– Many good quantum chemistry programs are
available!
– “DFT B3LYP 6-31G*” calculation is the gold
standard!
• We want to lead chemistry
– We usually explain what happened.
– We rarely predict something very exciting!
8. Difference between experiment and
calculation/theory
• Finding interesting phenomena or problem
– How do we convert CO2 to O2? N2 + H2 to NH3?
– How do we synthesize a compound from known ingredients?
• Design a key chemical reaction.
• Calculations or experiments (the only difference)
• Analysis of results
• Propose new experiments
9. Difference between experiment and
calculation/theory
• No difference as science
• The most important thing is chemical intuition!
• Can we implement chemical intuition on
computers?
– Yes, but there is apparently a long way to go.
– The basic strategy: collect data, feed it to computers,
and process it.
10. Can we implement chemical intuition
on computers?
• Collect facts by computer calculations.
– Many good implementations are available.
– Huge computer resources are required, but
they are still growing exponentially.
• Feed them to computers.
11. Can we implement chemical intuition
on computers?
• Feed them to computers.
• Machine Learning (ML)
– Very successful in
image/sound recognition and
natural language processing.
Organic chemistry is somewhat similar to language…
Cadeddu, A., Wylie, E. K., Jurczak, J., Wampler-Doty, M. and Grzybowski, B. A. (2014), Organic Chemistry as a Language and the
Implications of Chemical Linguistics for Structural and Retrosynthetic Analyses. Angew. Chem. Int. Ed., 53: 8108–8112.
doi:10.1002/anie.201403708
Recently, some research papers using ML have been published:
Big Data meets Quantum Chemistry Approximations: The Δ-Machine Learning
Approach. Raghunathan Ramakrishnan, Pavlo O. Dral, Matthias Rupp, O. Anatole von
Lilienfeld. http://arxiv.org/abs/1503.04987 etc.
For better results with ML, we need a huge dataset
12. Can we implement chemical intuition
on computers?
The first step might be:
• Build a huge dataset by quantum chemistry
program packages!
– Results should agree with experiments.
– Improving the dataset is the task of QC researchers.
• Faster calculations for larger systems
• Better or sufficient treatment for electron correlations
• And build a search engine database using the
results.
14. What is needed for Googling molecules?
1. Types, kinds, variety of molecules
– The number of molecules is infinite, but we should cover the important ones
2. Required properties of molecules
– Molecular structure, energy, UV excitation energy, dipole
moment
3. Getting properties of molecules by calculation?
– Accuracy of calculation, and computer resources…
4. Coding or encoding molecules
– IUPAC nomenclature is not suitable
– Do not think about graph theory
5. Fast calculation (with deep learning(?))
– 10^8 molecules/sec, as chemical space is huge.
15. Databases for lists of molecules
• PubChem: 50,000,000 molecules listed, made by NIH,
public domain, not curated (imported from catalogs,
etc.), obtainable via FTP.
• ChemSpider: 28,000,000 entries, better curated, no
FTP. Redistribution and download are restricted.
• Web-GDB13: 900,000,000 entries, just generated by
combinatorics. No
• ZINC, ChEMBL, DrugBank …
• CAS: 70,000,000 molecules, proprietary
• Nikkaji: 6,000,000, proprietary
We use PubChem as the source of molecules
18. Database for molecular properties by
experiments
• We must do experiments to obtain
molecular properties.
– No free comprehensive database is known so far.
– Pharmaceutical companies do O(1,000,000)
experiments for high-throughput screening.
• Experiments are hugely expensive!
– Time-consuming, large facilities, costs, hazards
We do not do experiments!
19. Database for molecular properties by computer
calculation
• Gold-standard method: “Density functional
theory (B3LYP functional) + 6-31G(d) basis set”
– Accuracy is quite satisfactory (1-10 kcal/mol) for
biological systems and organic chemistry.
– Good implementations are available.
– Costs less (fast, needs just a supercomputer, no hazards)
– Calculation times keep getting shorter
• Intel Core i7 (esp. Sandy Bridge) is very fast.
• Still, we need huge resources.
We calculate by computer instead!
20. What is a molecule?
[Figure: representations of propionaldehyde (wavefunction, 3D coordinates, structure, IUPAC nomenclature, common name), arranged between "hard to understand but rigorous" and "easy to understand but many corner cases". Image from Wikipedia.]
No rigorous definition for a molecule
21. What is a molecule?
• No rigorous definition for “what is a molecule”
• Nomenclature
– 3D coordinates of the nuclei
– Structural formula
– IUPAC nomenclature
– Higher abstraction or less abstraction?
• Better molecular encoding method?
– Easy to understand for humans
– Easy to understand for computers as well
– Can describe most cases, with fewer corner cases.
– Compromise between dream and reality
22. Encoding molecules: SMILES
SMILES is a good encoding method for molecules
IUPAC nomenclature
tert-butyl N-[(2S,3S,5S)-5-[[4-[(1-benzyltetrazol-5-yl)methoxy]phenyl]methyl]-3-hydroxy-6-[[(1S,2R)-2-hydroxy-2,3-dihydro-1H-inden-1-yl]amino]-6-oxo-1-phenylhexan-2-yl]carbamate
We can encode molecules
• SMILES
CN(C)CCOC12CCC(C3C1CCCC3)C4=CC=CC=C24
• InChI (made by IUPAC)
InChI=1S/C20H29NO/c1-21(2)13-14-22-20-12-11-15(16-7-3-5-9-18(16)20)17-8-4-6-10-19(17)20/h3,5,7,9,15,17,19H,4,6,8,10-14H2,1-2H3
…
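As a small illustration of why a line notation is easy for computers, the sketch below reads the chemical-formula layer out of the InChI string above and sums atomic weights. The helper names and the tiny atomic-weight table are our own illustrative choices, not part of any InChI toolkit, and the parser only handles simple formulas without charges or multiple components.

```python
import re

# Standard atomic weights (g/mol), rounded; only the elements needed here.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

def inchi_formula(inchi: str) -> str:
    """Return the chemical-formula layer (second '/'-separated field)."""
    return inchi.split("/")[1]

def molecular_weight(formula: str) -> float:
    """Sum atomic weights for a simple formula such as 'C20H29NO'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHT[symbol] * (int(count) if count else 1)
    return total

inchi = ("InChI=1S/C20H29NO/c1-21(2)13-14-22-20-12-11-15(16-7-3-5-9-18(16)20)"
         "17-8-4-6-10-19(17)20/h3,5,7,9,15,17,19H,4,6,8,10-14H2,1-2H3")
formula = inchi_formula(inchi)
mw = molecular_weight(formula)
print(formula, round(mw, 2))
```

For real work a cheminformatics toolkit should be used instead; this only shows that the string itself already carries machine-readable composition.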
23. What is SMILES?
• Simplified Molecular Input Line Entry System
– A linear representation of a molecule using ASCII.
– Stereochemistry can also be encoded (a SMILES string does not store 3D coordinates)
– Human-readable and also machine-readable.
– Almost one-to-one mapping between a molecule and
SMILES via Universal SMILES
• David Weininger at the USEPA Mid-Continent Ecology Division Laboratory invented SMILES
• InChI by IUPAC
– International Chemical Identifier: open standard (non-proprietary)
– N. M. O’Boyle invented “Universal SMILES” via InChI
27. Construction of ab initio chemical
database
• Molecular information is from PubChem
• Properties are calculated from first principles by
computer
– Many program packages are available
– DFT (B3LYP)
– 6-31G(d) basis set and geometry optimization
– Excited-state calculation by TD-DFT with 6-31+G(d)
– Best for organic molecules or biomolecules
• Molecular encoding: SMILES / InChI
• Huge computer resources
• A dream come true
– Google-like search engine for chemistry
28. The PubChemQC Project
• http://pubchemqc.riken.jp/
• AIP Conf. Proc. 1702, 090058 (2015);
http://dx.doi.org/10.1063/1.4938866
• A public domain database for molecules
• Ab initio (first-principles) calculation of molecular
properties of PubChem compounds
• 2014/1/15: 13,000 molecules
• 2014/7/29 : 155,792 molecules
• 2014/10/30 : 906,798 molecules
• 2014/12/3 : 1,137,286 molecules
• 2015/3/25 : 1,673,532 molecules
• 2015/5/27: 2,122,146 molecules
• 2016/2/10: 3,046,948 (2,660,218 with excited states)
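The milestones above imply an average throughput, which can be estimated with a short sketch like the following. The dates and counts are copied from the slide; the averaging itself is our own simplification and ignores downtime and changes in molecule size.

```python
from datetime import date

# (date, number of molecules) milestones copied from the slide.
milestones = [
    (date(2014, 1, 15), 13_000),
    (date(2014, 7, 29), 155_792),
    (date(2014, 10, 30), 906_798),
    (date(2014, 12, 3), 1_137_286),
    (date(2015, 3, 25), 1_673_532),
    (date(2015, 5, 27), 2_122_146),
    (date(2016, 2, 10), 3_046_948),
]

def average_rate(start, end):
    """Average number of molecules added per day between two milestones."""
    (d0, n0), (d1, n1) = start, end
    return (n1 - n0) / (d1 - d0).days

overall = average_rate(milestones[0], milestones[-1])
print(f"overall: {overall:.0f} molecules/day")
for prev, cur in zip(milestones, milestones[1:]):
    print(cur[0], f"{average_rate(prev, cur):.0f} molecules/day")
```

The per-interval rates vary a lot, which is consistent with throughput depending on shared machines and on the molecular weight of the compounds being processed.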
32. Related works
• Related works
– Raghunathan Ramakrishnan, Pavlo Dral, Matthias Rupp, O.
Anatole von Lilienfeld: Quantum Chemistry Structures and
Properties of 134 kilo Molecules, Scientific Data, 1: 140022,
Nature Publishing Group, 2014.
– NIST WebBook
• http://webbook.nist.gov/chemistry/
• Small number of molecules. Compares many methods
– Harvard Clean Energy Project
• http://cleanenergy.molecularspace.org/
• 25,000,000 (?) molecules for photo devices, made by combinatorics
– Sugimoto et al.: 2013 CBI symposium poster
• Almost the same as our database, currently not open to the
public (now??)
Our contribution: 20 times larger
33. How do we do it?
• Generate the initial 3D conformation with OpenBABEL
– The SDF contains a 3D conformation, but we don’t use it.
– OpenBABEL: -h (add hydrogens), --gen3d (generate 3D
coordinates)
• Ab initio calculation by GAMESS + Firefly
– Using Gaussian can lead to a political problem(?)
– PM3 optimization
– Hartree-Fock/STO-6G geometry optimization
– Firefly + GAMESS geometry optimization in B3LYP/6-31G*
– Ten excitation energies by TDDFT/6-31+G* (no geometry
optimization)
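The first step can be sketched as a command-line builder. This assumes the `obabel` executable from Open Babel is installed; the function name, the output file name, and the stage labels are hypothetical, and the project's actual scripts are not shown in the slides. The snippet only builds and prints the command; it does not run it.

```python
import shlex
from typing import List

def obabel_gen3d_command(smiles: str, out_path: str) -> List[str]:
    """Build an Open Babel command line: SMILES in, 3D Mol file out.

    -:SMILES  reads the SMILES string given on the command line
    -O        names the output file (format taken from the extension)
    -h        adds hydrogens
    --gen3d   generates 3D coordinates
    """
    return ["obabel", f"-:{smiles}", "-O", out_path, "-h", "--gen3d"]

# Later stages as listed on the slide, in order (labels only, nothing is run).
STAGES = [
    "PM3 optimization",
    "Hartree-Fock/STO-6G geometry optimization",
    "B3LYP/6-31G* geometry optimization",
    "TDDFT: ten excitation energies, no geometry optimization",
]

# SMILES taken from the slides; the output file name is made up.
cmd = obabel_gen3d_command("CCC[C@@H](O)CCC=CC=CC#CC#CC=CCO", "initial.mol")
print(shlex.join(cmd))
for number, stage in enumerate(STAGES, start=1):
    print(number, stage)
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it; starting from cheap methods and refining step by step keeps the expensive B3LYP optimization close to convergence.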
34. How do we do it?
• Heavily using OpenBABEL
• Extracting molecular information
– Sort PubChem compounds by molecular weight
– OpenBABEL
• Encoded by SMILES
– Isomeric SMILES: stereochemistry retained
– OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1
– CCC[C@@H](O)CCC=CC=CC#CC#CC=CCO
– CC(=O)OCCC(/C)=CC[C@H](C(C)=C)CCC=C
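The stereo information in the three isomeric SMILES above is visible directly in the text: `@` / `@@` inside a bracket atom marks a tetrahedral centre, and `/` or `\` marks double-bond geometry. A minimal sketch that counts these markers with a regular expression follows; the function names are ours, and this is plain string inspection, not a SMILES parser.

```python
import re

# A bracket atom carrying a chirality tag, e.g. [C@H] or [C@@H].
CHIRAL_ATOM = re.compile(r"\[[^\]]*@[^\]]*\]")

def count_tetrahedral_tags(smiles: str) -> int:
    """Count bracket atoms that carry an @ or @@ chirality tag."""
    return len(CHIRAL_ATOM.findall(smiles))

def has_double_bond_stereo(smiles: str) -> bool:
    """True if the string contains a directional bond symbol (/ or \\)."""
    return "/" in smiles or "\\" in smiles

examples = [
    "OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1",
    "CCC[C@@H](O)CCC=CC=CC#CC#CC=CCO",
    "CC(=O)OCCC(/C)=CC[C@H](C(C)=C)CCC=C",
]
for smi in examples:
    print(count_tetrahedral_tags(smi), has_double_bond_stereo(smi), smi)
```

Dropping these markers gives a non-isomeric SMILES, so different stereoisomers would collapse onto the same string.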
35. How to convert a PubChem compound
into a quantum chemistry calculation
aflatoxin
O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5
Ab initio calculation by
OpenBABEL
36. Final results will be
• Uploaded to http://pubchemqc.riken.jp/
• Currently we upload
– input file (ground / excited state)
– Output file (ground / excited state)
– Final geometry in Mol file
37. Scaling of computation
• Embarrassingly parallel for each molecule
• Very roughly speaking, the time required for a
calculation scales like N^4
– N: molecular weight
• The problems are very hard (complexity theory)
– Hartree-Fock calculation
– DFT (B3LYP) calculation
– Geometry optimization
• In practice, many molecules can be solved
efficiently
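To see what an N^4 rule of thumb implies, the sketch below compares the relative cost at a few molecular weights. The exponent is the rough estimate from the slide; the reference weight of 160 and the sample weights are illustrative choices on our part.

```python
def relative_cost(mw: float, mw_ref: float, exponent: float = 4.0) -> float:
    """Cost of a molecule of weight `mw` relative to one of weight `mw_ref`,
    assuming time ~ (molecular weight) ** exponent."""
    return (mw / mw_ref) ** exponent

REFERENCE_MW = 160  # illustrative reference weight
costs = {mw: relative_cost(mw, REFERENCE_MW) for mw in (160, 200, 300, 500)}
for mw, cost in costs.items():
    print(f"MW {mw}: about {cost:.1f}x the cost of MW {REFERENCE_MW}")
```

Under this assumption a molecule near MW 500 costs roughly two orders of magnitude more than one near MW 160, so sorting the work by molecular weight front-loads the cheap calculations.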
38. Computer Resources
• RICC: (Intel Xeon X5570 (Nehalem), 2.93 GHz, 8
cores/node) x 1000
– 1000-10000 molecules/day (MW 160)
– Heavily depends on the load from other users
– Time limit: 8 hours
• Quest: (Intel Core 2 Duo, 1.6 GHz/node) x 700
– 3000-8000 molecules / day (MW 160)
– 100-1000 molecules / day (MW 200-300)
– Time limit: 20 hours
• Compounds that fail to calculate are ignored for
now.
40. Molecular weight and Lipinski Rule
• Lipinski’s rule of five (Pfizer's rule of five): a rule of
thumb for drug discovery
• No more than 5 hydrogen bond donors
• No more than 10 hydrogen bond acceptors
• A molecular mass less than 500 daltons
• An octanol-water partition coefficient log P not greater than 5
• Molecules with a molecular weight below 500 are
well suited to computational chemistry
– For routine calculations without experimental data
other than the molecular formula
– If larger than 500, secondary or higher-order structure
becomes important, e.g., in proteins
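The four criteria listed above translate directly into a filter. The sketch below counts violations for a set of precomputed descriptors; the class, the function name, and the example values are hypothetical, and computing the descriptors themselves would need a cheminformatics toolkit.

```python
from dataclasses import dataclass

@dataclass
class Descriptors:
    h_bond_donors: int
    h_bond_acceptors: int
    molecular_weight: float  # daltons
    log_p: float             # octanol-water partition coefficient

def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four criteria from the slide are violated."""
    checks = [
        d.h_bond_donors > 5,          # no more than 5 donors
        d.h_bond_acceptors > 10,      # no more than 10 acceptors
        d.molecular_weight >= 500.0,  # mass less than 500 daltons
        d.log_p > 5.0,                # log P not greater than 5
    ]
    return sum(checks)

# Made-up descriptor values, for illustration only.
small = Descriptors(h_bond_donors=2, h_bond_acceptors=4,
                    molecular_weight=299.5, log_p=3.0)
large = Descriptors(h_bond_donors=6, h_bond_acceptors=12,
                    molecular_weight=650.0, log_p=6.1)
print(lipinski_violations(small), lipinski_violations(large))
```

For this project the molecular-weight criterion is the one that matters most, since it also bounds the cost of each calculation.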
41. Molecular Weight distribution at
PubChem
[Figure: molecular weight distribution of PubChem compounds. Labels: "We are still here"; "Lipinski limit MW=500"; "30,000,000 molecules (excluding mixtures)".]
42. How long will it take to finish?
• For drug design, we need to calculate all
molecules with MW < 500
• Total: 30,000,000 molecules
– This number may increase in the future
• Current (2014/12/4): 1,100,000 molecules
– Only about 3.7%
• 10,000 molecules/day -> 8.2 years
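The estimate above is simple arithmetic and can be reproduced as follows. All three inputs are the slide's figures; treating the rate as constant is a simplification.

```python
TOTAL_MOLECULES = 30_000_000   # molecules with MW < 500 (slide value)
DONE_MOLECULES = 1_100_000     # finished as of 2014/12/4 (slide value)
RATE_PER_DAY = 10_000          # assumed constant throughput (slide value)
DAYS_PER_YEAR = 365

fraction_done = DONE_MOLECULES / TOTAL_MOLECULES
years_total = TOTAL_MOLECULES / RATE_PER_DAY / DAYS_PER_YEAR
years_remaining = (TOTAL_MOLECULES - DONE_MOLECULES) / RATE_PER_DAY / DAYS_PER_YEAR

print(f"done:            {fraction_done:.1%}")
print(f"total time:      {years_total:.1f} years")
print(f"remaining time:  {years_remaining:.1f} years")
```

The 8.2 years on the slide corresponds to the whole set at this rate; subtracting the molecules already finished shortens it only slightly.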
43. How long will it take to finish?
• 10+ years? No, maybe far less.
• 25 years ago (1990), computers were so slow
– Even ab initio calculations were very difficult on a
486DX@25MHz or
68000@10MHz
44. Outlook, prospect, hope…
• Far better in silico screening
– Fewer or no experiments are necessary
• Even faster calculation using machine learning
– 10,000 molecules / second ?
– Requires a huge dataset to learn from.
– Bio or organic molecules are easy to calculate.
– Already available: Raghunathan Ramakrishnan
https://scholar.google.co.jp/citations?user=jSCGozoAAAAJ&hl=ja&oi=sra
• Database for chemical reactions
– Precise calculation is required
– GRRM method + machine learning (?)
• Geometry optimization for proteins (PDB)
– Only X-ray crystal structures are available
http://pubchemqc.riken.jp/
45. Difficulties in this project
• Parameters needed for calculations vary by
molecule
• Properties can differ depending on the initial guess
• Computer resources
– Raspberry Pi? NVIDIA Jetson? BOINC?
• Molecular encoding never ends
– SMILES and InChI are not complete
– Some corner cases may be chemically interesting.