Accelerated Materials Discovery Using Theory, Optimization, and Natural Language Processing
1. Accelerated Materials Discovery Using Theory,
Optimization, and Natural Language Processing
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
MRS Fall Meeting 2019
Slides (already) posted to hackingmaterials.lbl.gov
4.
Materials theory is like CAD for materials –
but some of the software tools may need upgrades
Think of solution
Manually run some calculations
5.
We’ve been building a comprehensive software pipeline for
virtual materials design
Pipeline diagram: researcher ideas, ML-based ideas, visual interface for exploring all results, experiments
6.
What are the different components of the pipeline?
7.
Given a search domain, the goal of our “rocketsled” software
is to find the best solutions in as few calculations as possible
https://github.com/hackingmaterials/rocketsled
8.
There exist many packages for optimization already,
but rocketsled can offload expensive calculations to HPC
BayesOpt
Scikit-optimize
9.
Rocketsled also allows you to insert your own
descriptors into the optimization
At each point, you can add a vector of
physical descriptors to help the
optimizer
Search space
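A minimal sketch of the idea on slides 7 to 9, assuming a toy problem: an optimization loop that spends a fixed budget of calculations and appends a vector of extra descriptors to each search-space point before handing it to the optimizer. This is not rocketsled's API. The names objective, featurize, predict, and optimize are hypothetical, and the surrogate is a deliberately simple 1-nearest-neighbor model with a greedy selection rule.

```python
import math
import random


def objective(x):
    """Stand-in for an expensive calculation (hypothetical toy function)."""
    a, b = x
    return (a - 7) ** 2 + (b - 3) ** 2


def featurize(x):
    """Append illustrative descriptors to the raw search-space point."""
    a, b = x
    return (a, b, a * b, abs(a - b))


def predict(features, history):
    """1-nearest-neighbor surrogate: reuse the score of the closest evaluated point."""
    nearest = min(history, key=lambda h: math.dist(h[0], features))
    return nearest[1]


def optimize(search_space, n_calcs, n_init=5, seed=0):
    """Spend n_calcs evaluations: random start, then greedy picks from the surrogate."""
    rng = random.Random(seed)
    remaining = list(search_space)
    rng.shuffle(remaining)
    history = []  # (feature vector, score) for every evaluated point
    best = None
    for i in range(n_calcs):
        if i < n_init:
            x = remaining.pop()  # random initial samples
        else:
            # Greedy: evaluate the candidate the surrogate predicts to be lowest.
            x = min(remaining, key=lambda c: predict(featurize(c), history))
            remaining.remove(x)
        y = objective(x)
        history.append((featurize(x), y))
        if best is None or y < best[1]:
            best = (x, y)
    return best


space = [(a, b) for a in range(10) for b in range(10)]
best_x, best_y = optimize(space, n_calcs=20)
```

In a real workflow the objective call would be a submitted calculation and the surrogate would be one of the models listed on the next slide; the point here is only where the descriptor vector enters the loop.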
10. • Rocketsled uses scikit-optimize as the default
backend, which implements:
– Gaussian Process
– Random Forest
– Gradient Boosted trees
• You can choose your acquisition function
– Expected improvement
– Probability of Improvement
– Greedy algorithm
– etc…
• You can write your own custom optimizer in Python and
use it – so anything is allowed!
What optimizers are available in rocketsled?
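As a rough illustration of two of the acquisition functions named above, the sketch below writes out the standard closed forms of expected improvement and probability of improvement for a minimization problem, given a surrogate's predicted mean and standard deviation at a candidate point. The function names and the xi exploration parameter are assumptions made for this example; this is not code from rocketsled or scikit-optimize.

```python
import math


def norm_pdf(z):
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best, xi=0.0):
    """Expected amount by which a candidate beats the best score so far (minimization)."""
    if sigma <= 0.0:
        return max(best - mu - xi, 0.0)  # no uncertainty: improvement is deterministic
    z = (best - mu - xi) / sigma
    return (best - mu - xi) * norm_cdf(z) + sigma * norm_pdf(z)


def probability_of_improvement(mu, sigma, best, xi=0.0):
    """Probability that a candidate beats the best score so far (minimization)."""
    if sigma <= 0.0:
        return 1.0 if mu + xi < best else 0.0
    return norm_cdf((best - mu - xi) / sigma)
```

A greedy strategy, by contrast, would ignore sigma and simply pick the candidate with the lowest predicted mean.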
11.
We’ve tested rocketsled on a “mock” problem in which
answers were pre-computed with density functional theory
Can rocketsled find the good solutions with
fewer calculations than a benchmark?
18,928 cubic perovskites: ABX3
A: 1 of 52 metal cations
B: 1 of 52 metal cations
X3: One of 7 anions
solarchoice.net.au/blog/news/perovskites-the-next-solar-pv-revolution-240714
*Either direct or indirect band gap can be used.
Search space ordered according to atomic no. rank.
Scores of compounds are represented by color.
Solutions: 20 possible one-photon solar water
splitters, based on:
1. Enthalpy of formation < 0.2 eV
2. Band gap* 1.5-3.0 eV
3. Band* edges straddle H+/H2 and H2O/O2 energy levels
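The search-space size and the three screening criteria can be sketched as below. The count 52 x 52 x 7 = 18,928 follows directly from the slide. Everything else is an assumption for illustration: the Candidate fields are hypothetical names, and the redox levels are the commonly quoted values on the absolute (vacuum) scale at pH 0, which may differ from the reference used in the original study.

```python
from dataclasses import dataclass

N_A_CATIONS = 52
N_B_CATIONS = 52
N_ANIONS = 7


def search_space_size():
    """Number of ABX3 combinations: one A cation, one B cation, one anion group."""
    return N_A_CATIONS * N_B_CATIONS * N_ANIONS


# Assumed reference levels in eV relative to vacuum (not taken from the slides).
H2_LEVEL_EV = -4.44  # H+/H2
O2_LEVEL_EV = -5.67  # H2O/O2, 1.23 eV below H+/H2


@dataclass
class Candidate:
    formula: str
    formation_enthalpy_ev: float
    band_gap_ev: float
    vbm_ev: float  # valence band maximum
    cbm_ev: float  # conduction band minimum


def is_water_splitter(c):
    """Apply the three criteria from the slide to one candidate."""
    stable = c.formation_enthalpy_ev < 0.2
    gap_ok = 1.5 <= c.band_gap_ev <= 3.0
    straddles = c.cbm_ev > H2_LEVEL_EV and c.vbm_ev < O2_LEVEL_EV
    return stable and gap_ok and straddles


example = Candidate("ABX3", 0.1, 2.0, -6.0, -4.0)  # made-up numbers
```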
12. • Random
– Obvious, but too easy to beat
– Let’s also try harder …
• Prior genetic algorithm study on the same problem
• Chemical rules
– Compound must (i) be charge balanced and (ii) have even
number of e- (for gap)
• This eliminates 60% of the search space outright!!
– Rank remaining compounds by distance of Goldschmidt
tolerance factor to the ideal value of 1.
What are some good benchmarks to compare against?
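The chemical-rules benchmark can be sketched as follows, assuming a single anion species X for simplicity. The Goldschmidt tolerance factor is t = (r_A + r_X) / (sqrt(2) (r_B + r_X)); candidates passing the charge and electron-count checks are ranked by |t - 1|. The function names and dictionary keys are hypothetical, and the SrTiO3 radii in the example are approximate Shannon ionic radii quoted from memory, so treat them as illustrative.

```python
import math


def goldschmidt_tolerance(r_a, r_b, r_x):
    """Tolerance factor for an ABX3 perovskite from ionic radii (same units)."""
    return (r_a + r_x) / (math.sqrt(2.0) * (r_b + r_x))


def is_charge_balanced(q_a, q_b, q_x):
    """Rule (i): formal charges of A + B + 3 X sum to zero."""
    return q_a + q_b + 3 * q_x == 0


def has_even_electron_count(z_a, z_b, z_x):
    """Rule (ii): even total electron count, using atomic numbers of a neutral formula unit."""
    return (z_a + z_b + 3 * z_x) % 2 == 0


def rank_by_tolerance(candidates):
    """Sort candidates by distance of the tolerance factor from the ideal value of 1."""
    return sorted(
        candidates,
        key=lambda c: abs(goldschmidt_tolerance(c["r_a"], c["r_b"], c["r_x"]) - 1.0),
    )


# Illustrative check with SrTiO3 (radii in angstroms, approximate values).
t_srtio3 = goldschmidt_tolerance(1.44, 0.605, 1.40)
```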
15.
Visualization of search space sampled with and without
optimization on a “superhard” materials design problem
7,394 mats. with elastic tensors calculated
Search space:
Common name                  K (GPa)    G (GPa)
Lonsdaleite                  435.661    522.922
Diamond                      435.686    520.267
β-C3N4                       408.925    312.428
Rhenium Nitride              379.804    253.458
Tungsten carbide             385.194    278.96
Osmium                       401.328    258.697
w-BN                         373.241    383.285
Diamondlike-Boron Carbide    378        347
16. • Do more with less computational budget
– e.g., confidently find the best solutions when you have
far fewer calculations to spend than possibilities
• Get good results faster
– Even if you plan to compute everything, why not get the
best answers in week 1 instead of week 30?
• The main downside is added complexity
– If you are using our automation tools (FireWorks, atomate,
etc.) then rocketsled removes the complexity of
incorporating optimization
Potential benefits and downsides of optimization in
high-throughput computational searches
17.
More information on Rocketsled
Paper: Dunn, A., Brenneck, J. & Jain, A. Rocketsled: a software library for optimizing high-throughput computational searches. J. Phys. Mater. 2, 034002 (2019).
Docs: hackingmaterials.github.io/rocketsled
Support: https://discuss.matsci.org (use FireWorks forum)
18.
What are the different components of the pipeline?
19. • I was at first interested in the potential of NLP to
save us from the tedious task of figuring out
which of our “predictions” were already studied
• For example, we would manually go through a
list of 100 predictions, doing a literature review
for every single one and searching for similar
compounds as well, etc.
– Mainly for our search for novel thermoelectrics
How might natural language processing help us in
computational screening?
20.
“Solution v1”: manually make a list of all the thermoelectrics
I could find and write an algorithm for similarity
There had to be a better way!!
23. Extracted ~2 million
abstracts of relevant
scientific articles
Use natural language
processing algorithms
to try to extract
knowledge from all this
data
Instead – use computers to compile the lists on our behalf
24.
Developed algorithms to automatically tag keywords in the
abstracts based on word2vec and LSTM networks
Weston, L. et al. Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
27.
Application: a revised materials search engine
Auto-generated summaries of materials based on text mining
28.
Could these techniques also be used to predict which
materials we might want to screen for an application?
papers to read “someday”
NLP algorithms
29. • We use the word2vec
algorithm (Google) to turn
each unique word in our
corpus into a 200-dimensional
vector
• These vectors encode the
meaning of each word
based on trying to
predict context words
around the target
Key concept 1: the word2vec algorithm
“You shall know a word by
the company it keeps”
- John Rupert Firth (1957)
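To make "predicting context words around the target" concrete, here is a small sketch of how (target, context) training pairs can be generated from a tokenized sentence, in the style of the skip-gram variant of word2vec. The function name, the window size, and the example sentence are assumptions for illustration; the actual training (learning 200-dimensional vectors from these pairs) is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word and its neighbors within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["PbTe", "is", "a", "thermoelectric"]
pairs = skipgram_pairs(sentence, window=1)
```

Words that keep appearing in the same contexts end up with similar vectors, which is the sense in which a word is known "by the company it keeps".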
31. • Dot product of a composition word
with the word “thermoelectric”
essentially predicts how likely that
word is to appear in an abstract with
the word thermoelectric
• Compositions with high dot products
are typically known thermoelectrics
• Sometimes, compositions have a high
dot product with “thermoelectric” but
have never been studied as a
thermoelectric
• These compositions usually have high
computed power factors! (BoltzTraP)
Key concept 2: vector dot products measure similarity
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from
materials science literature. Nature 571, 95–98 (2019).
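A toy sketch of the ranking step described above: score each composition by the dot product of its embedding with the embedding of the word "thermoelectric", then sort. The three-dimensional vectors and the material_A/B/C labels are made up purely for illustration (the real embeddings are 200-dimensional and learned from the corpus), and the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_similarity(query_vec, embeddings, exclude=()):
    """Return (word, score) pairs sorted by descending dot product with the query."""
    scored = [
        (word, dot(vec, query_vec))
        for word, vec in embeddings.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up vectors; not real embedding values.
toy_embeddings = {
    "material_A": [0.9, 0.1, 0.0],
    "material_B": [0.1, 0.8, 0.2],
    "material_C": [0.7, 0.3, 0.1],
}
toy_query = [1.0, 0.0, 0.0]  # stands in for the "thermoelectric" vector
ranking = rank_by_similarity(toy_query, toy_embeddings)
```

Passing already-studied compositions through exclude leaves only the candidates that score highly but have not yet been reported for the application.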
32. “Go back in time”
approach:
– For every year since
2001, see which
compounds we would
have predicted using only
literature data until that
point in time
– Make predictions of what
materials are the most
promising thermoelectrics
for data until that year
– See if those materials
were actually studied as
thermoelectrics in
subsequent years
Can we predict future thermoelectrics discoveries with this
method?
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
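The "go back in time" test can be sketched as a simple backtest loop. This is an assumed structure, not the procedure from the paper: first_study_year maps each material to the year it was first studied for the application, and predict_top is a hypothetical callback that must only use literature available up to the cutoff year. The toy data below is invented.

```python
def backtest(first_study_year, predict_top, cutoff_years, top_k=50):
    """For each cutoff year, return the fraction of predictions studied in later years."""
    hit_rate = {}
    for cutoff in cutoff_years:
        # Materials already studied by the cutoff are not valid "predictions".
        known = {m for m, y in first_study_year.items() if y <= cutoff}
        predictions = predict_top(cutoff, known, top_k)
        # A hit is a prediction first studied after the cutoff year.
        hits = [m for m in predictions if first_study_year.get(m, cutoff) > cutoff]
        hit_rate[cutoff] = len(hits) / len(predictions) if predictions else 0.0
    return hit_rate


# Toy data and a trivial stand-in predictor (both invented for illustration).
toy_first_study_year = {"material_A": 2003, "material_B": 2010, "material_C": 2015}
toy_pool = ["material_A", "material_B", "material_C", "material_D"]


def toy_predict_top(cutoff, known, top_k):
    """Return the first top_k pool entries not yet studied; a real model would rank by score."""
    return [m for m in toy_pool if m not in known][:top_k]


toy_rates = backtest(toy_first_study_year, toy_predict_top, [2005, 2012], top_k=2)
```

In the real setting, predict_top would retrain the embeddings on abstracts published up to the cutoff and return the top-ranked unstudied compositions.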
33. • Thus far, 2 of our top 20 predictions made in
~August 2018 have already been reported in the
literature for the first time as thermoelectrics
– Li3Sb was the subject of a computational study
(predicted zT=2.42) in Oct 2018
– SnTe2 was experimentally found to be a moderately
good thermoelectric (expt zT=0.71) in Dec 2018
• We are working with an experimentalist on one
of the predictions (as a “spare time” project)
How about “forward” predictions?
[1] Yang et al. "Low lattice thermal conductivity and
excellent thermoelectric behavior in Li3Sb and Li3Bi."
Journal of Physics: Condensed Matter 30.42 (2018):
425401
[2] Wang et al. "Ultralow lattice thermal conductivity and
electronic properties of monolayer 1T phase semimetal
SiTe2 and SnTe2." Physica E: Low-dimensional Systems and
Nanostructures 108 (2019): 53-59
34. • We’ve been building many software tools for
better computer-aided materials design
• Optimization algorithms and NLP will play roles
in these next-generation tools
• Hopefully, these will further improve the
applicability of materials theory to real materials
design
Conclusions
35.
Acknowledgements
• Rocketsled
– Alex Dunn
– U.S. Department of Energy, Materials Science Division
• Matscholar
– Vahe Tshitoyan, Leigh Weston, John Dagdelen, Amalie
Trewartha, Alex Dunn
– Gerbrand Ceder & Kristin Persson
– Toyota Research Institute