This document summarizes challenges and opportunities in applying machine learning to materials science. Simply applying off-the-shelf machine learning tools is insufficient for materials discovery because of unique challenges such as scarce labeled data. Successful approaches require domain expertise to customize models, using techniques like transfer learning that exploit relationships between different material properties. Sequential experiment design informed by machine learning can accelerate discovery by selecting the next most informative experiments. While perfect prediction of new materials may not be possible, data-driven modeling can still deliver faster discovery.
2. Max Hutchinson,
Scientific Software Eng.
ONE DOES NOT SIMPLY...
APPLY OFF THE SHELF ML
TOOLS TO MATERIALS
DISCOVERY
ARTIFICIAL INTELLIGENCE FOR MAT. SCI.
8 AUGUST 2018, NIST
Bryce Meredig,
Chief Science Officer
3. What is materials informatics?
What makes it particularly challenging?
Can we do anything about it?
4. LET'S TRY TO MACHINE
LEARN A NOBEL PRIZE
5. CASE STUDY: HIGH-Tc SUPERCONDUCTORS
Pia Jensen Ray. Figure 2.4 in Master's thesis, "Structural investigation of La(2-x)Sr(x)CuO(4+y) - Following staging as a function of temperature". Niels Bohr Institute, Faculty of Science,
University of Copenhagen. Copenhagen, Denmark, November 2015. DOI:10.6084/m9.figshare.2075680.v2
7. CASE STUDY: HIGH-Tc SUPERCONDUCTORS
Cross-validated RMSE for Tc ≈ 10 K
8. CAN WE PREDICT HIGH-Tc
SUPERCONDUCTIVITY?!?
(spoiler alert:) no
9. LEAVE ONE CLUSTER OUT (LOCO) CV
Nominal k-fold cross-validation assumes samples are drawn independently from the input space.
This is almost never true in materials informatics: individual data sources have
selection biases, and different data sources draw from different distributions.
LOCO CV groups the data before computing train/test splits
The groups are inferred via clustering rather than being dictated by a domain expert
"Can machine learning identify the next high-temperature superconductor? Examining
extrapolation performance for materials discovery."
B. Meredig, ..., M. Hutchinson, ..., B. Gibbons, J. Hattrick-Simpers, A. Mehta, L. Ward
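The grouping idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `loco_splits` is a hypothetical helper, and the hand-assigned cluster labels stand in for clusters that would in practice be inferred by, e.g., k-means on composition features.

```python
import numpy as np

def loco_splits(cluster_labels):
    """Yield (train, test) index arrays, holding out one cluster per split."""
    labels = np.asarray(cluster_labels)
    for cluster in np.unique(labels):
        yield np.flatnonzero(labels != cluster), np.flatnonzero(labels == cluster)

# Toy example: 6 samples assigned to 3 material-class clusters.
clusters = [0, 0, 1, 1, 2, 2]
for train, test in loco_splits(clusters):
    print(train, test)
```

Unlike random k-fold splits, every test set here is an entire material class the model has never seen, which is what makes the score a measure of extrapolation rather than interpolation.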
10. CASE STUDY: HIGH-Tc SUPERCONDUCTORS
The model can't "extrapolate" across material classes (clusters).
15. DESIGNING THE NEXT EXPERIMENT
Maximum Expected Improvement (MEI): ∫_{−∞}^{∞} x p(x; θ) dx
Maximum Likelihood of Improvement (MLI): ∫_{α}^{∞} p(x; θ) dx
Maximum Uncertainty (MU): ∫_{−∞}^{∞} (x − x̄)² p(x; θ) dx
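Assuming a Gaussian predictive distribution N(μ, σ²) for each candidate, the three integrals above reduce to closed forms: the mean, the tail probability beyond the current best α, and the variance. The `scores` helper and the two candidate values below are illustrative, not from the talk:

```python
import math

def scores(mu, sigma, alpha):
    """Acquisition scores for a Gaussian predictive distribution N(mu, sigma^2).

    MEI: expected value of the predicted property, int x p(x) dx = mu
    MLI: probability of beating the current best alpha, int_alpha^inf p(x) dx
    MU:  predictive variance, int (x - mu)^2 p(x) dx = sigma^2
    """
    mei = mu
    mli = 0.5 * math.erfc((alpha - mu) / (sigma * math.sqrt(2)))
    return mei, mli, sigma ** 2

# Two hypothetical candidates against a current best of alpha = 1.0:
# a "safe" one (high mean, low spread) and a "risky" one (low mean, high spread).
safe = scores(mu=1.2, sigma=0.1, alpha=1.0)
risky = scores(mu=0.9, sigma=1.0, alpha=1.0)
```

The three criteria trade off differently: MLI favors the safe candidate, MU favors the risky one, so the choice of acquisition function encodes how much exploration you want.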
17. REAL WORLD EXAMPLE: ADAPT @ MINES
https://www.additivemanufacturing.media/articles/how-machine-learning-is-moving-am-beyond-trial-and-error/
20. “Simply downloading and ‘applying’
open-source software to your data
won’t work. AI needs to be customized
to your business context and data.”
Andrew Ng in Harvard Business Review
(Stanford, Google Brain, Coursera, Baidu)
21. MATERIALS INFORMATICS CONTEXT
Labels are scarce and expensive
Typical dataset sizes are 100-1000 labels
Preparing a sample is often more difficult than measuring it
Different labels have low marginal costs
We've been doing physics, chemistry, and materials science for hundreds of years
There are (not always accurate) sources of computational data
We have some priors for which labels are related
We have some priors for what some relationships look like
24. GRAPHICAL MODELS: DOMAIN-AWARE MODELING
Inputs & Features
Featurization
Empirical Relation
Computational Data
Machine Learning
Quantity of Interest
25. GRAPHICAL MODELS: TRANSFER LEARNING
M. Hutchinson, E. Antono, B. Gibbons, S. Paradiso, J. Ling, B. Meredig
Overcoming data scarcity with transfer learning, https://arxiv.org/pdf/1711.05099.pdf
"B" is a plentiful latent variable
DFT band gap
Hydrogen splitting react. rate
Indentation hardness
"A" is a scarce or expensive label
Color
NO splitting reaction rate
Ultimate tensile strength
26. GRAPHICAL MODELS: TRANSFER LEARNING
Simple example: adding yield strength information to a fatigue strength design increases experimental efficiency.
M. Hutchinson, E. Antono, B. Gibbons, S. Paradiso, J. Ling, B. Meredig
Overcoming data scarcity with transfer learning, https://arxiv.org/pdf/1711.05099.pdf
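The wiring of this kind of transfer learning can be sketched with plain least squares: learn the plentiful label B from features on the big dataset, then feed the predicted B in as an extra feature when learning the scarce label A. All data, names, and the linear relations below are synthetic stand-ins, not the paper's models:

```python
import numpy as np

rng = np.random.default_rng(0)

# "B" is plentiful (e.g. a computed property); "A" is scarce (e.g. experimental).
X_big = rng.normal(size=(200, 3))             # many samples labeled with B
B_big = X_big @ np.array([1.0, -2.0, 0.5])    # hypothetical relation to features

X_small = rng.normal(size=(20, 3))            # few samples labeled with A
B_small = X_small @ np.array([1.0, -2.0, 0.5])
A_small = 3.0 * B_small + X_small[:, 0]       # A depends strongly on B

# Step 1: learn B from features on the plentiful dataset.
w_b, *_ = np.linalg.lstsq(X_big, B_big, rcond=None)

# Step 2: append predicted B as an extra feature when learning A.
X_aug = np.column_stack([X_small, X_small @ w_b])
w_a, *_ = np.linalg.lstsq(X_aug, A_small, rcond=None)
```

With linear models on clean synthetic data this is only illustrative of the graph structure; the practical gains come when B carries nonlinear structure that the small A dataset could not reveal on its own.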
27. WHERE DOES THE UNCERTAINTY COME FROM?
Jackknife methods capture uncertainty with respect to finite sample size.
Computational cost is independent of the size of the feature space.
We add an explicit bias term trained on the out-of-bag errors
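A minimal sketch of the ensemble-plus-out-of-bag idea above, with bootstrapped linear fits standing in for a real bagged materials model (the jackknife machinery is simplified to the ensemble spread; the data and names are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.3, size=80)

n_models = 30
weights = []
oob_residuals = []
for _ in range(n_models):
    idx = rng.integers(0, len(y), len(y))              # bootstrap resample
    w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    weights.append(w)
    oob = np.setdiff1d(np.arange(len(y)), idx)         # out-of-bag rows
    oob_residuals.extend(X[oob] @ w - y[oob])

# The spread of the bagged models estimates finite-sample uncertainty,
# and the mean out-of-bag residual supplies the explicit bias term.
ensemble = np.array(weights) @ np.array([1.0, 1.0])    # predict at a new point
prediction = ensemble.mean() - np.mean(oob_residuals)  # bias-corrected
uncertainty = ensemble.std()
```

Because the uncertainty comes from the ensemble rather than the feature space, its cost scales with the number of models, matching the point above that it is independent of feature-space size.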
28. WHERE DOES THE UNCERTAINTY COME FROM?
29. WHERE DOES THE UNCERTAINTY COME FROM?