Accelerated Materials Discovery Using Theory,
Optimization, and Natural Language Processing
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
MRS Fall Meeting 2019
Slides (already) posted to hackingmaterials.lbl.gov
2
Today, computer aided design of products is ubiquitous
3
The software for CAD has progressed by leaps and bounds
over the years
4
Materials theory is like CAD for materials –
but some of the software tools may need upgrades
Think of a solution, then manually run some calculations
5
We’ve been building a comprehensive software pipeline for
virtual materials design
[Diagram: researcher ideas; ML-based ideas; experiments; visual interface for exploring all results]
6
What are the different components of the pipeline?
[Diagram: researcher ideas; ML-based ideas; experiments; visual interface for exploring all results]
7
Given a search domain, the goal of our “rocketsled” software
is to find the best solutions in as few calculations as possible
https://github.com/hackingmaterials/rocketsled
8
Many optimization packages already exist,
but rocketsled can offload expensive calculations to HPC
BayesOpt
Scikit-optimize
9
Rocketsled also allows you to insert your own
descriptors into the optimization
At each point, you can add a vector of
physical descriptors to help the
optimizer
Search space
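One way to picture this, as a minimal sketch rather than rocketsled's actual interface: a plain-Python helper appends a vector of physical descriptors to each search-space point before the point is handed to an optimizer. The function name `with_descriptors` and the small table of approximate Pauling electronegativities are assumptions made for illustration.

```python
# Illustrative only: this is NOT the rocketsled API.
# Approximate Pauling electronegativities, entered by hand for the example.
ELECTRONEGATIVITY = {"Sr": 0.95, "Ba": 0.89, "Ti": 1.54, "O": 3.44, "N": 3.04}


def with_descriptors(point):
    """Return the categorical point (A, B, X) followed by numeric descriptors."""
    a, b, x = point
    chi_a, chi_b, chi_x = (ELECTRONEGATIVITY[el] for el in (a, b, x))
    # Descriptors: the three electronegativities plus the X-A and X-B differences.
    descriptors = [chi_a, chi_b, chi_x, chi_x - chi_a, chi_x - chi_b]
    return list(point) + descriptors


features = with_descriptors(("Sr", "Ti", "O"))
```

The optimizer then sees numbers that carry chemical meaning instead of element labels alone, which is the point of supplying descriptors.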
10
What optimizers are available in rocketsled?
• Rocketsled uses scikit-optimize as the default
backend, which implements:
– Gaussian Process
– Random Forest
– Gradient Boosted trees
• You can choose the acquisition function
– Expected improvement
– Probability of Improvement
– Greedy algorithm
– etc.
• You can write your own custom optimizer in Python and
use it – so anything is allowed!
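As a hedged illustration of how the acquisition functions listed on this slide differ, the sketch below evaluates the textbook formulas for expected improvement and probability of improvement in a minimization problem, given a surrogate model's predicted mean and standard deviation at a candidate point. It uses only the Python standard library. The function names and the two toy candidates are assumptions, and this is not how rocketsled or scikit-optimize expose these quantities.

```python
import math


def _normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_of_improvement(mu, sigma, best):
    """P(f < best) for a Gaussian prediction N(mu, sigma^2), minimization."""
    if sigma == 0.0:
        return 1.0 if mu < best else 0.0
    return _normal_cdf((best - mu) / sigma)


def expected_improvement(mu, sigma, best):
    """E[max(best - f, 0)] for a Gaussian prediction N(mu, sigma^2)."""
    if sigma == 0.0:
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    return (best - mu) * _normal_cdf(z) + sigma * _normal_pdf(z)


# Toy candidates (hypothetical numbers): name -> (predicted mean, predicted std).
candidates = {"a": (1.0, 0.1), "b": (1.2, 1.0)}
best_so_far = 1.0

pick_greedy = min(candidates, key=lambda c: candidates[c][0])
pick_pi = max(candidates, key=lambda c: probability_of_improvement(*candidates[c], best_so_far))
pick_ei = max(candidates, key=lambda c: expected_improvement(*candidates[c], best_so_far))
```

With these toy numbers the greedy rule and probability of improvement both pick the low-uncertainty candidate "a", while expected improvement picks the more uncertain candidate "b". That is the exploration versus exploitation trade-off that the choice of acquisition function controls.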
11
We’ve tested rocketsled on a “mock” problem in which
answers were pre-computed with density functional theory
Can rocketsled find the good solutions with
fewer calculations than a benchmark?
18,928 cubic perovskites: ABX3
A: 1 of 52 metal cations
B: 1 of 52 metal cations
X3: 1 of 7 anions
Search space ordered according to atomic no. rank.
Scores of compounds are represented by color.
Solutions: 20 possible one-photon solar water
splitters, based on:
1. Enthalpy of formation < 0.2 eV
2. Band gap* 1.5-3.0 eV
3. Band edges* straddle H+/H2 and H2O/O2 energy levels
*Either direct or indirect band gap can be used.
solarchoice.net.au/blog/news/perovskites-the-next-solar-pv-revolution-240714
12
What are some good benchmarks to compare against?
• Random
– Obvious, but too easy to beat
– Let’s also try harder …
• Prior genetic algorithm study on the same problem
• Chemical rules
– Compound must (i) be charge balanced and (ii) have an even
number of electrons (for a gap)
• This eliminates 60% of the search space outright!!
– Rank remaining compounds by distance of Goldschmidt
tolerance factor from the ideal value of 1.
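The chemical-rules benchmark can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: each element is given a single fixed oxidation state, the ionic radii are approximate literature values entered by hand, and the helper names are hypothetical. The tolerance factor is the standard Goldschmidt expression t = (r_A + r_X) / (sqrt(2) (r_B + r_X)).

```python
import math

# Hand-entered, approximate values for illustration only:
# element -> (atomic number, assumed oxidation state, ionic radius in angstrom)
IONS = {
    "Ca": (20, +2, 1.34),
    "Sr": (38, +2, 1.44),
    "Ba": (56, +2, 1.61),
    "K": (19, +1, 1.64),
    "Ti": (22, +4, 0.605),
    "O": (8, -2, 1.40),
}


def passes_chemical_rules(a, b, x):
    """Rule (i): ABX3 is charge balanced. Rule (ii): even electron count."""
    (za, qa, _), (zb, qb, _), (zx, qx, _) = IONS[a], IONS[b], IONS[x]
    charge_balanced = qa + qb + 3 * qx == 0
    even_electrons = (za + zb + 3 * zx) % 2 == 0
    return charge_balanced and even_electrons


def tolerance_factor(a, b, x):
    """Goldschmidt tolerance factor for an ABX3 perovskite."""
    r_a, r_b, r_x = IONS[a][2], IONS[b][2], IONS[x][2]
    return (r_a + r_x) / (math.sqrt(2) * (r_b + r_x))


candidates = [("Ba", "Ti", "O"), ("K", "Ti", "O"), ("Ca", "Ti", "O"), ("Sr", "Ti", "O")]
allowed = [c for c in candidates if passes_chemical_rules(*c)]
# Rank the survivors by how far the tolerance factor is from the ideal value of 1.
ranked = sorted(allowed, key=lambda c: abs(tolerance_factor(*c) - 1.0))
```

Here KTiO3 is filtered out by both rules, and the remaining titanates are ordered SrTiO3, CaTiO3, BaTiO3. A real screen would need to consider multiple oxidation states per element.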
13
Rocketsled can find solutions much faster than other
methods
14
The “speedup” relative to random search can be 15-30X
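For a sense of scale, here is a back-of-the-envelope random baseline, assuming the speedup is measured as a ratio of calculation counts (the slide does not spell out the definition). When candidates are drawn uniformly at random without replacement from N compounds that contain K solutions, the expected number of draws needed to reach the k-th solution is k(N + 1)/(K + 1). The function name is illustrative, and N and K are taken from the perovskite test problem.

```python
def expected_random_draws(n_total, n_solutions, k):
    """Expected draws (without replacement) until the k-th solution is found."""
    return k * (n_total + 1) / (n_solutions + 1)


N, K = 18928, 20  # perovskite search space size and number of solutions
draws_for_first = expected_random_draws(N, K, 1)   # roughly 900 calculations
draws_for_half = expected_random_draws(N, K, 10)   # roughly 9,000 calculations
```

Under that assumed definition, a 15-30X speedup would mean reaching the same number of solutions in a correspondingly small fraction of these calculation counts.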
15
Visualization of search space sampled with and without
optimization on a “superhard” materials design problem
Search space: 7,394 materials with calculated elastic tensors
Common name                  K (GPa)    G (GPa)
Lonsdaleite                  435.661    522.922
Diamond                      435.686    520.267
ß-C3N4                       408.925    312.428
Rhenium Nitride              379.804    253.458
Tungsten carbide             385.194    278.96
Osmium                       401.328    258.697
w-BN                         373.241    383.285
Diamondlike-Boron Carbide    378        347
16
Potential benefits and downsides of optimization in
high-throughput computational searches
• Do more with less computational budget
– e.g., confidently find the best solutions when you have
far fewer calculations to spend than possibilities
• Get good results faster
– Even if you plan to compute everything, why not get the
best answers in week 1 instead of week 30?
• The main downside is added complexity
– If you are using our automation tools (FireWorks, atomate,
etc.) then rocketsled removes the complexity of
incorporating optimization
17
More information on Rocketsled
Paper: Dunn, A., Brenneck, J. & Jain, A. Rocketsled: a software library for optimizing high-throughput computational searches. J. Phys. Mater. 2, 034002 (2019).
Docs: hackingmaterials.github.io/rocketsled
Support: https://discuss.matsci.org (use FireWorks forum)
18
What are the different components of the pipeline?
[Diagram: researcher ideas; ML-based ideas; experiments; visual interface for exploring all results]
19
How might natural language processing help us in
computational screening?
• I was at first interested in the potential of NLP to
save us from the tedious task of figuring out
which of our “predictions” were already studied
• For example, we would manually go through a
list of 100 predictions, doing a literature review
for every single one, finding similar
compounds as well, etc.
– Mainly for our search for novel thermoelectrics
20
“Solution v1”: manually make a list of all the thermoelectrics
I could find and write an algorithm for similarity
There had to be a better way!!
23
Instead – use computers to compile the lists on our behalf
Extracted ~2 million abstracts of relevant scientific articles
Use natural language processing algorithms to try to extract
knowledge from all this data
24
Developed algorithms to automatically tag keywords in the
abstracts based on word2vec and LSTM networks
Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
25
Now we can search!
Live on www.matscholar.com
26
Application: materials compositions of interest …
A search for thermoelectrics that do not have Pb or Bi
27
Application: a revised materials search engine
Auto-generated summaries of materials based on text mining
28
Could these techniques also be used to predict which
materials we might want to screen for an application?
[Diagram labels: papers to read “someday”; NLP algorithms]
29
Key concept 1: the word2vec algorithm
• We use the word2vec algorithm (Google) to turn
each unique word in our corpus into a
200-dimensional vector
• These vectors encode the meaning of each word,
learned by trying to predict the context words
around a target word
“You shall know a word by
the company it keeps”
- John Rupert Firth (1957)
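To make the idea of predicting context words around a target concrete, the sketch below builds the (target, context) training pairs that a skip-gram style word2vec model learns from. It is a simplified illustration in plain Python: the window size, the example sentence, and the function name are assumptions, and the actual training of 200-dimensional vectors is left to a word2vec implementation.

```python
def skipgram_pairs(tokens, window=2):
    """List every (target, context) pair within `window` words of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical example sentence, already split into tokens.
sentence = "Bi2Te3 is a well known thermoelectric material".split()
pairs = skipgram_pairs(sentence, window=2)
```

Words that keep appearing in each other's windows across millions of abstracts end up with similar vectors, which is the sense in which a word is known by the company it keeps.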
31
Key concept 2: vector dot products measure similarity
• The dot product of a composition's word vector
with the vector for the word “thermoelectric”
essentially predicts how likely that
composition is to appear in an abstract with
the word “thermoelectric”
• Compositions with high dot products
are typically known thermoelectrics
• Sometimes, compositions have a high
dot product with “thermoelectric” but
have never been studied as a
thermoelectric
• These compositions usually have high
computed power factors! (BoltzTraP)
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
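The ranking step can be sketched in a few lines. The 3-dimensional vectors below are made-up numbers chosen for illustration (the real embeddings are 200-dimensional and learned from the corpus), "CompoundX" is a placeholder name, and the set of already-studied materials is likewise hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy embeddings (hypothetical values, not taken from any trained model).
embeddings = {
    "thermoelectric": [0.9, 0.1, 0.3],
    "Bi2Te3": [0.8, 0.2, 0.4],
    "PbTe": [0.7, 0.0, 0.5],
    "CompoundX": [0.6, 0.1, 0.6],
    "NaCl": [0.1, 0.9, 0.0],
}
known_thermoelectrics = {"Bi2Te3", "PbTe"}

query = embeddings["thermoelectric"]
compositions = [w for w in embeddings if w != "thermoelectric"]

# Rank compositions by similarity to the application word.
ranked = sorted(compositions, key=lambda w: dot(embeddings[w], query), reverse=True)

# Candidate predictions: highly ranked compositions not yet studied for the application.
unstudied = [w for w in ranked if w not in known_thermoelectrics]
```

In this toy setup the two known thermoelectrics rank first, and the top unstudied entry is the kind of composition the method would flag for screening.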
32
Can we predict future thermoelectrics discoveries with this
method?
“Go back in time” approach:
– For every year since 2001, see which compounds we would
have predicted using only literature data until that point in time
– Make predictions of what materials are the most promising
thermoelectrics for data until that year
– See if those materials were actually studied as
thermoelectrics in subsequent years
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
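The structure of this retrospective test can be sketched as a small loop. Everything concrete here is a toy assumption: the material names, the years, the fixed ranking, and the `toy_predict_top` stand-in. In the real study the predictions for each cutoff year would come from embeddings retrained on literature up to that year.

```python
def backtest(cutoff_years, predict_top, first_studied, k=2):
    """For each cutoff year, return the fraction of top-k predictions
    that were first studied for the application after that year."""
    hit_rates = {}
    for year in cutoff_years:
        predictions = predict_top(year, k)
        hits = sum(
            1 for m in predictions
            if m in first_studied and first_studied[m] > year
        )
        hit_rates[year] = hits / len(predictions)
    return hit_rates


# Toy data: year each material was first reported for the application.
# Material "D" is never reported.
FIRST_STUDIED = {"A": 2005, "B": 2010, "C": 2003}
RANKING = ["C", "A", "B", "D"]  # stand-in for a model's similarity ranking


def toy_predict_top(year, k):
    """Top-k ranked materials not yet studied as of `year` (toy stand-in)."""
    unstudied = [m for m in RANKING if FIRST_STUDIED.get(m, float("inf")) > year]
    return unstudied[:k]


hit_rates = backtest([2001, 2004, 2008], toy_predict_top, FIRST_STUDIED)
```

A useful comparison, not shown here, is the hit rate of materials picked at random from the same pool of unstudied compositions.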
33
How about “forward” predictions?
• Thus far, 2 of our top 20 predictions made in
~August 2018 have already been reported in the
literature for the first time as thermoelectrics
– Li3Sb was the subject of a computational study
(predicted zT=2.42) in Oct 2018
– SnTe2 was experimentally found to be a moderately
good thermoelectric (expt zT=0.71) in Dec 2018
• We are working with an experimentalist on one
of the predictions (but as a “spare time” project)
[1] Yang et al. "Low lattice thermal conductivity and excellent thermoelectric behavior in Li3Sb and Li3Bi." Journal of Physics: Condensed Matter 30.42 (2018): 425401
[2] Wang et al. "Ultralow lattice thermal conductivity and electronic properties of monolayer 1T phase semimetal SiTe2 and SnTe2." Physica E: Low-dimensional Systems and Nanostructures 108 (2019): 53-59
34
Conclusions
• We’ve been building many software tools for
better computer-aided materials design
• Optimization algorithms and NLP will play roles
in these next-generation tools
• Hopefully, these will further improve the
applicability of materials theory to real materials
design
[Diagram: researcher ideas; ML-based ideas; experiments; visual interface for exploring all results]
35
Acknowledgements
• Rocketsled
– Alex Dunn
– U.S. Department of Energy, Materials Science Division
• Matscholar
– Vahe Tshitoyan, Leigh Weston, John Dagdelen, Amalie
Trewartha, Alex Dunn
– Gerbrand Ceder & Kristin Persson
– Toyota Research Institute

Accelerated Materials Discovery Using Theory, Optimization, and Natural Language Processing

  • 1. Accelerated Materials Discovery Using Theory, Optimization, and Natural Language Processing. Anubhav Jain, Energy Technologies Area, Lawrence Berkeley National Laboratory, Berkeley, CA. MRS Fall Meeting 2019. Slides (already) posted to hackingmaterials.lbl.gov
  • 2. Today, computer-aided design of products is ubiquitous
  • 3. The software for CAD has progressed by leaps and bounds over the years
  • 4. Materials theory is like CAD for materials, but some of the software tools may need upgrades. (Diagram labels: think of solution; manually run some calculations.)
  • 5. We’ve been building a comprehensive software pipeline for virtual materials design. (Diagram labels: researcher ideas; ML-based ideas; visual interface for exploring all results; experiments.)
  • 6. What are the different components of the pipeline? (Diagram labels: researcher ideas; ML-based ideas; visual interface for exploring all results; experiments.)
  • 7. Given a search domain, the goal of our “rocketsled” software is to find the best solutions in as few calculations as possible. https://github.com/hackingmaterials/rocketsled
  • 8. There exist many packages for optimization already, but rocketsled can offload expensive calculations to HPC. (Packages shown: BayesOpt, Scikit-optimize.)
  • 9. Rocketsled also allows you to insert your own descriptors into the optimization. At each point, you can add a vector of physical descriptors to help the optimizer. (Figure label: search space.)
  • 10. What optimizers are available in rocketsled?
      • Rocketsled uses scikit-optimize as the default backend, which implements:
        – Gaussian Process
        – Random Forest
        – Gradient Boosted trees
      • You can choose your acquisition function
        – Expected improvement
        – Probability of Improvement
        – Greedy algorithm
        – etc…
      • You can write your own custom optimizer in Python and use it, so anything is allowed!
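
The last point on slide 10, that any Python function can serve as the optimizer, can be illustrated with a generic sequential-search loop. This is a minimal sketch and not rocketsled's API: `greedy_search`, `nearest_neighbor_predictor`, and `toy_objective` are hypothetical names invented here, and the surrogate is a deliberately crude nearest-neighbor guess rather than one of the Gaussian process or tree-based models listed on the slide.

```python
import random


def nearest_neighbor_predictor(x, observed):
    """Guess the score of x as the score of the closest evaluated point."""
    nearest = min(observed, key=lambda seen: abs(seen - x))
    return observed[nearest]


def greedy_search(candidates, objective, predictor, n_init=3, budget=10, seed=0):
    """Evaluate about `budget` candidates, choosing each new one greedily."""
    rng = random.Random(seed)
    untested = list(candidates)
    rng.shuffle(untested)

    # Seed the search with a few random evaluations.
    observed = {x: objective(x) for x in untested[:n_init]}
    untested = untested[n_init:]

    # Greedy acquisition: evaluate the candidate with the best predicted score.
    while untested and len(observed) < budget:
        best = max(untested, key=lambda x: predictor(x, observed))
        untested.remove(best)
        observed[best] = objective(best)  # stands in for the expensive calculation
    return observed


def toy_objective(x):
    """Hypothetical stand-in for an expensive calculation; peaks at x = 70."""
    return -(x - 70) ** 2


results = greedy_search(range(100), toy_objective, nearest_neighbor_predictor)
best_x = max(results, key=results.get)
print(f"best of {len(results)} evaluations: x={best_x}, score={results[best_x]}")
```

Swapping `nearest_neighbor_predictor` for a function that also returns an uncertainty estimate would be the natural route to acquisition functions such as expected improvement.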
  • 11. We’ve tested rocketsled on a “mock” problem in which answers were pre-computed with density functional theory. Can rocketsled find the good solutions with fewer calculations than a benchmark?
      • Search space: 18,928 cubic perovskites, ABX3
        – A: 1 of 52 metal cations
        – B: 1 of 52 metal cations
        – X3: one of 7 anions
      • Search space ordered according to atomic no. rank. Scores of compounds are represented by color.
      • Solutions: 20 possible one-photon solar water splitters, based on:
        1. Enthalpy of formation < 0.2 eV
        2. Band gap* 1.5-3.0 eV
        3. Band* edges straddle H+/H2 and H2O/O2 E levels
      *Either direct or indirect band gap can be used.
      Image source: solarchoice.net.au/blog/news/perovskites-the-next-solar-pv-revolution-240714
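
The numbers on slide 11 can be checked with a short sketch: 52 A-site cations times 52 B-site cations times 7 anion choices gives the quoted 18,928 candidates, and the three screening criteria reduce to one boolean test. The function names, the default redox levels (about -4.44 eV for H+/H2 and -5.67 eV for H2O/O2 relative to vacuum, commonly quoted values at pH 0), and the inclusive band-gap bounds are assumptions made for illustration; the slide does not state them.

```python
def search_space_size(n_a_cations, n_b_cations, n_anion_choices):
    """Number of ABX3 combinations in the search space."""
    return n_a_cations * n_b_cations * n_anion_choices


def is_water_splitter(formation_enthalpy, band_gap, vbm, cbm,
                      h2_level=-4.44, o2_level=-5.67):
    """Apply the three criteria from the slide.

    All energies are in eV. Band edges (vbm, cbm) and the default redox
    levels are assumed to be on the vacuum scale (an assumption, see above).
    """
    stable = formation_enthalpy < 0.2          # criterion 1
    good_gap = 1.5 <= band_gap <= 3.0          # criterion 2
    straddles = cbm > h2_level and vbm < o2_level  # criterion 3
    return stable and good_gap and straddles


print(search_space_size(52, 52, 7))  # 18928
# Hypothetical compound with a 2.0 eV gap whose edges straddle both levels.
print(is_water_splitter(0.05, 2.0, vbm=-6.0, cbm=-4.0))  # True
```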
  • 12. What are some good benchmarks to compare against?
      • Random
        – Obvious, but too easy to beat
        – Let’s also try harder …
      • Prior genetic algorithm study on the same problem
      • Chemical rules
        – Compound must (i) be charge balanced and (ii) have even number of e- (for gap)
          • This eliminates 60% of the search space outright!!
        – Rank remaining compounds by distance of Goldschmidt tolerance factor to the ideal value of 1.
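
The tolerance-factor ranking in the chemical-rules benchmark can be sketched as follows. The Goldschmidt tolerance factor is t = (r_A + r_X) / (√2 (r_B + r_X)), and candidates are sorted by |t - 1|. The function names are hypothetical, and the ionic radii (in Å) are approximate values used only for illustration; the talk does not say which radii were used.

```python
import math


def tolerance_factor(r_a, r_b, r_x):
    """Goldschmidt tolerance factor for an ABX3 perovskite."""
    return (r_a + r_x) / (math.sqrt(2) * (r_b + r_x))


def rank_by_tolerance(radii_by_compound):
    """Order compounds by how close their tolerance factor is to 1."""
    return sorted(
        radii_by_compound,
        key=lambda name: abs(tolerance_factor(*radii_by_compound[name]) - 1.0),
    )


# Approximate (r_A, r_B, r_X) radii in angstroms, for illustration only.
radii = {
    "BaTiO3": (1.61, 0.605, 1.40),
    "SrTiO3": (1.44, 0.605, 1.40),
    "CaTiO3": (1.34, 0.605, 1.40),
}

for name in rank_by_tolerance(radii):
    print(name, round(tolerance_factor(*radii[name]), 3))
```

With these radii, SrTiO3 lands closest to the ideal value, followed by CaTiO3 and then BaTiO3.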
  • 13. Rocketsled can find solutions much faster than other methods
  • 14. The “speedup” over random search can be 15-30X
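
One plausible way to quantify such a speedup (the slide does not give its exact definition, so this is an assumption) is to divide the expected number of random draws needed to find a given number of solutions by the number of calculations the optimizer needed. For random sampling without replacement, the expected position of the k-th of n solutions among N candidates is k(N + 1)/(n + 1). The optimizer count of 900 below is a hypothetical placeholder, not a result from the talk.

```python
def expected_random_draws(n_candidates, n_solutions, n_found):
    """Expected draws (without replacement) to find n_found of n_solutions."""
    return n_found * (n_candidates + 1) / (n_solutions + 1)


def speedup(n_candidates, n_solutions, n_found, optimizer_calcs):
    """Ratio of expected random-search cost to the optimizer's cost."""
    return expected_random_draws(n_candidates, n_solutions, n_found) / optimizer_calcs


# Perovskite problem from the talk: 18,928 candidates, 20 solutions.
print(round(expected_random_draws(18928, 20, 20)))  # 18028

# Hypothetical optimizer cost of 900 calculations (placeholder value).
print(round(speedup(18928, 20, 20, optimizer_calcs=900), 1))  # 20.0
```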
  • 15. Visualization of search space sampled with and without optimization on a “superhard” materials design problem
      Search space: 7,394 materials with elastic tensors calculated

      Common name                  K (GPa)    G (GPa)
      Lonsdaleite                  435.661    522.922
      Diamond                      435.686    520.267
      β-C3N4                       408.925    312.428
      Rhenium Nitride              379.804    253.458
      Tungsten carbide             385.194    278.96
      Osmium                       401.328    258.697
      w-BN                         373.241    383.285
      Diamondlike-Boron Carbide    378        347
  • 16. Potential benefits and downsides of optimization in high-throughput computational searches
      • Do more with less computational budget
        – e.g., confidently find the best solutions when you have far fewer calculations to spend than possibilities
      • Get good results faster
        – Even if you plan to compute everything, why not get the best answers in week 1 instead of week 30?
      • The main downside is added complexity
        – If you are using our automation tools (FireWorks, atomate, etc.) then rocketsled removes the complexity of incorporating optimization
  • 17. More information on Rocketsled
      • Paper: Dunn, A., Brenneck, J. & Jain, A. Rocketsled: a software library for optimizing high-throughput computational searches. J. Phys. Mater. 2, 034002 (2019).
      • Docs: hackingmaterials.github.io/rocketsled
      • Support: https://discuss.matsci.org (use FireWorks forum)
  • 18. What are the different components of the pipeline? (Diagram labels: researcher ideas; ML-based ideas; visual interface for exploring all results; experiments.)
  • 19. How might natural language processing help us in computational screening?
      • I was at first interested in the potential of NLP to save us from the tedious task of figuring out which of our “predictions” were already studied
      • For example, we would manually go through a list of 100 predictions, doing a literature review for every single one, need to find similar compounds as well, etc.
        – Mainly for our search for novel thermoelectrics
  • 20. “Solution v1”: manually make a list of all the thermoelectrics I could find and write an algorithm for similarity
  • 21. “Solution v1”: manually make a list of all the thermoelectrics I could find and write an algorithm for similarity
  • 22. “Solution v1”: manually make a list of all the thermoelectrics I could find and write an algorithm for similarity. There had to be a better way!!
  • 23. Instead, use computers to compile the lists on our behalf
      • Extracted ~2 million abstracts of relevant scientific articles
      • Use natural language processing algorithms to try to extract knowledge from all this data
  • 24. Developed algorithms to automatically tag keywords in the abstracts based on word2vec and LSTM networks. Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  • 25. Now we can search! Live on www.matscholar.com
  • 26. Application: materials compositions of interest … A search for thermoelectrics that do not have Pb or Bi
  • 27. Application: a revised materials search engine. Auto-generated summaries of materials based on text mining
  • 28. Could these techniques also be used to predict which materials we might want to screen for an application? (Diagram labels: papers to read “someday”; NLP algorithms.)
  • 29. Key concept 1: the word2vec algorithm
      • We use the word2vec algorithm (Google) to turn each unique word in our corpus into a 200-dimensional vector
      • These vectors encode the meaning of each word, learned by trying to predict the context words around the target
  • 30. Key concept 1: the word2vec algorithm
      • We use the word2vec algorithm (Google) to turn each unique word in our corpus into a 200-dimensional vector
      • These vectors encode the meaning of each word, learned by trying to predict the context words around the target
      “You shall know a word by the company it keeps” - John Rupert Firth (1957)
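
The phrase "predict context words around the target" corresponds to building (target, context) training pairs from a sliding window over the text. A minimal sketch of that pair-generation step is shown below; `skipgram_pairs`, the window size, and the example sentence are illustrative assumptions, and the actual training of the 200-dimensional vectors is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbors."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical tokenized sentence from an abstract.
sentence = ["Bi2Te3", "is", "a", "thermoelectric", "material"]
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```

Words that repeatedly share context words, such as compositions that appear near "thermoelectric", end up with similar vectors once a model is trained on these pairs.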
  • 31. Key concept 2: vector dot products measure similarity
      • Dot product of a composition word with the word “thermoelectric” essentially predicts how likely that word is to appear in an abstract with the word thermoelectric
      • Compositions with high dot products are typically known thermoelectrics
      • Sometimes, compositions have a high dot product with “thermoelectric” but have never been studied as a thermoelectric
      • These compositions usually have high computed power factors! (BoltzTraP)
      Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
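
The ranking described on slide 31 can be sketched with plain dot products. The three-dimensional vectors and the material labels below are made-up toy values (the talk uses 200-dimensional vectors learned from abstracts), and `rank_by_similarity` is a hypothetical helper name.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_similarity(query_vector, embeddings):
    """Sort words by decreasing dot product with the query vector."""
    return sorted(
        embeddings,
        key=lambda word: dot(embeddings[word], query_vector),
        reverse=True,
    )


# Toy embeddings, invented for illustration only.
thermoelectric = (0.9, 0.1, 0.2)
compositions = {
    "material_A": (0.8, 0.2, 0.1),
    "material_B": (0.1, 0.9, 0.3),
    "material_C": (0.5, 0.4, 0.6),
}

print(rank_by_similarity(thermoelectric, compositions))
```

Candidates near the top of such a ranking that have never been reported as thermoelectrics are the ones the slide singles out as interesting predictions.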
  • 32. Can we predict future thermoelectrics discoveries with this method?
      “Go back in time” approach:
        – For every year since 2001, see which compounds we would have predicted using only literature data until that point in time
        – Make predictions of what materials are the most promising thermoelectrics for data until that year
        – See if those materials were actually studied as thermoelectrics in subsequent years
      Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
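
The bookkeeping behind the "go back in time" test can be sketched as below: for each cutoff year, check which predictions were first reported as thermoelectrics only after that year. The material labels, years, and the `later_confirmed` helper are hypothetical placeholders; producing the predictions themselves (retraining the embeddings on literature up to each cutoff) is not shown.

```python
def later_confirmed(predictions, first_reported, cutoff_year):
    """Predictions first reported as thermoelectrics after cutoff_year."""
    return [
        material
        for material in predictions
        if material in first_reported and first_reported[material] > cutoff_year
    ]


# Hypothetical predictions made with literature up to each cutoff year.
predictions_by_year = {
    2005: ["X1", "X2", "X3"],
    2010: ["X2", "X4"],
}
# Hypothetical year each material was first reported as a thermoelectric.
first_reported = {"X1": 2008, "X2": 2015, "X4": 2009}

for year, predictions in predictions_by_year.items():
    confirmed = later_confirmed(predictions, first_reported, year)
    print(year, confirmed, round(len(confirmed) / len(predictions), 2))
```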
  • 33. How about “forward” predictions?
      • Thus far, 2 of our top 20 predictions made in ~August 2018 have already been reported in the literature for the first time as thermoelectrics
        – Li3Sb was the subject of a computational study (predicted zT=2.42) in Oct 2018
        – SnTe2 was experimentally found to be a moderately good thermoelectric (expt zT=0.71) in Dec 2018
      • We are working with an experimentalist on one of the predictions (but “spare time” project)
      [1] Yang et al. "Low lattice thermal conductivity and excellent thermoelectric behavior in Li3Sb and Li3Bi." Journal of Physics: Condensed Matter 30.42 (2018): 425401
      [2] Wang et al. "Ultralow lattice thermal conductivity and electronic properties of monolayer 1T phase semimetal SiTe2 and SnTe2." Physica E: Low-dimensional Systems and Nanostructures 108 (2019): 53-59
  • 34. Conclusions
      • We’ve been building many software tools for better computer-aided materials design
      • Optimization algorithms and NLP will play roles in these next-generation tools
      • Hopefully, these will further improve the applicability of materials theory to real materials design
      (Diagram labels: researcher ideas; ML-based ideas; visual interface for exploring all results; experiments.)
  • 35. Acknowledgements
      • Rocketsled
        – Alex Dunn
        – U.S. Department of Energy, Materials Science Division
      • Matscholar
        – Vahe Tshitoyan, Leigh Weston, John Dagdelen, Amalie Trewartha, Alex Dunn
        – Gerbrand Ceder & Kristin Persson
        – Toyota Research Institutes
      Slides (already) posted to hackingmaterials.lbl.gov