Data Mining to Discovery for Inorganic Solids:
Software Tools and Applications
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
Artificial Intelligence for Materials Science
August 7, 2018
Slides (already) posted to hackingmaterials.lbl.gov
•  Three projects available now
–  Interpretable descriptors of crystal structure
–  matminer
–  atomate / Rocketsled
•  One project in progress
–  A text mining materials database
2
Overview of talk
3
I. Interpretable descriptors of
crystal structure
Machine learning: the big problem in my view is connecting
data to ML algorithms through features
4
Lots of data on
complex objects that
you want to interrelate
Clustering, Regression, Feature
extraction, Model-building, etc.
Well developed
data-mining routines that work
only on numbers (ideally ones
with high relevance to your
problem)
Need to transform materials science objects into a set of
physically relevant numerical data (“features” or “descriptors”)
5
The crystal structure is a core entity that
machine learning algorithms should know about
Step 1: Describe each site as a
fingerprint telling you how close it is
to each of 22 known local
environments (e.g., tetrahedral,
octahedral, etc.)
Step 2: Describe each structure
as the average of its site
fingerprints*
tetrahedron
octahedron
distorted 8-coordinated cube
*(plus additional statistics like standard deviation, min, max,
etc. if desired – or split into separate cation/anion vectors)
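A minimal sketch of Step 2, assuming each site fingerprint is already available as a vector of order-parameter values. The function name `structure_fingerprint` and the three-environment toy vectors are hypothetical (real fingerprints would have 22 entries per site); this is not the pymatgen or matminer implementation.

```python
import numpy as np

def structure_fingerprint(site_fingerprints, stats=("mean",)):
    """Reduce per-site fingerprints (n_sites x n_environments) to one vector.

    Each requested statistic is taken over the sites and the results are
    concatenated, so the output has len(stats) * n_environments entries.
    """
    sites = np.asarray(site_fingerprints, dtype=float)
    reducers = {"mean": np.mean, "std": np.std, "min": np.min, "max": np.max}
    return np.concatenate([reducers[name](sites, axis=0) for name in stats])

# Hypothetical fingerprints for a two-site structure; the columns stand for
# tetrahedral, octahedral, and cubic similarity.
sites = [[0.9, 0.1, 0.0],
         [0.1, 0.8, 0.2]]

fp_mean = structure_fingerprint(sites)
fp_full = structure_fingerprint(sites, stats=("mean", "std", "min", "max"))
```

Adding the optional statistics lengthens the vector but keeps information about how much the sites differ from one another, which a plain average discards.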
Defining local order parameters for various environments
6
Use a given local order parameter
with a threshold
for motif recognition:

If qtet > qthresh,
    then motif is tetrahedron.

Else
    not (too much) a tetrahedron.
Tetrahedral order parameter, qtet, [1]:
[1] Zimmermann et al., J. Am. Chem. Soc., 2017, 10.1021/jacs.5b08098
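The threshold rule can be sketched as follows, assuming the order parameters for a site have already been computed. The function name `recognize_motifs`, the dictionary layout, and the threshold value 0.5 are illustrative assumptions, not values taken from the slides.

```python
def recognize_motifs(order_params, q_thresh=0.5):
    """Return the motifs whose order parameter exceeds the threshold.

    order_params maps a motif name to its order parameter value q.
    An empty result means the site is "not (too much)" like any tested motif.
    """
    return sorted(name for name, q in order_params.items() if q > q_thresh)

# Hypothetical order parameters for two sites.
clear_tetrahedron = {"tet": 0.93, "oct": 0.12}
distorted_site = {"tet": 0.41, "oct": 0.30}

motifs_a = recognize_motifs(clear_tetrahedron)
motifs_b = recognize_motifs(distorted_site)
```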
We have now developed mathematical order parameters for
22 different local environments
7
How well do these work?
8
1. Order parameters clearly
distinguish different environments
even after thermal distortion
2. Work well in applications (defect site
finding, diffusion characterization)
[1] Zimmermann et al., Frontiers in Materials, 2017, doi: 10.3389/fmats.2017.00034
9
Structure fingerprints: can they distinguish crystal
structures?
BaAl2O4, BaZnF4, CaFe2O4, CrVO4, K2NiF4,
CaB2O4-I, MgUO4, Pb3O4, SbNbO4, Sr2PbO4,
Tetragonal BaTiO3, Th3P4, TlAlF4, ZnSO4, α-MnMoO4,
BCC, Aragonite, Barite, β-K2SO4, Calcite,
Half-Heusler, FCC, Garnet, HCP, Rocksalt, Diamond,
High-cristobalite, Ilmenite, Low-cristobalite, Low-quartz,
Monazite, Olivine, Perovskites, Rutile, Phenacite,
Scheelite, Spinel, Thenardite, Wolframite, Zircon
•  40 diverse crystal structure prototypes
•  Many complex examples (e.g., multi-cation, multi-anion) from each class
•  Thousands of crystal structures in the test set
•  Create structure fingerprints based on averages of local environments
•  The Euclidean distance between structure fingerprints
is small for structures of the same prototype
and larger for structures of different prototypes
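A minimal sketch of the distance comparison, assuming structure fingerprints are plain numeric vectors. The three-component vectors below are invented for illustration and are not real fingerprints of these prototypes.

```python
import numpy as np

def fingerprint_distance(fp_a, fp_b):
    """Euclidean distance between two structure fingerprint vectors."""
    diff = np.asarray(fp_a, dtype=float) - np.asarray(fp_b, dtype=float)
    return float(np.linalg.norm(diff))

# Hypothetical fingerprints: two rocksalt-like structures and one diamond-like.
rocksalt_a = [0.0, 1.0, 0.0]
rocksalt_b = [0.05, 0.95, 0.0]
diamond = [1.0, 0.0, 0.0]

same = fingerprint_distance(rocksalt_a, rocksalt_b)
different = fingerprint_distance(rocksalt_a, diamond)
```

In this toy case `same` is much smaller than `different`, which is the behavior the bullet describes for real prototypes.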
10
Local environments do distinguish prototypes!
Overlapping coefficient: OVC = 1.7%
(Figure: distribution of the distance between structure fingerprint vectors, for same-prototype and different-prototype pairs)
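The slides do not say how the overlapping coefficient was computed. A sketch under the common definition (the area shared by two normalized distributions), estimated from histograms on a shared set of bins, might look like this. The function name and the bin count are assumptions.

```python
import numpy as np

def overlapping_coefficient(a, b, bins=50):
    """Estimate the shared area of two sample distributions.

    Both samples are histogrammed on the same bin edges, each histogram is
    normalized to sum to 1, and the bin-wise minimum is summed.
    Returns 1.0 for identical histograms and 0.0 for disjoint ones.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    edges = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), bins + 1)
    p, _ = np.histogram(a, bins=edges)
    q, _ = np.histogram(b, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.minimum(p, q).sum())

# Hypothetical distance samples: same-prototype pairs sit near 0,
# different-prototype pairs sit near 1.
same_prototype = [0.0, 0.1, 0.2]
different_prototype = [0.8, 0.9, 1.0]
ovc = overlapping_coefficient(same_prototype, different_prototype, bins=10)
```

A small value, such as the 1.7% quoted on the slide, means the two distance distributions barely overlap.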
11
Can cluster
crystal structures
by “local
environment
similarity”
Results on MP web site, e.g. for BCC-like structures
12
https://www.materialsproject.org/materials/mp-91/
Target: W
similar structures
(distance near 0)
Cs3Sb
TiGaFeCo
CeMg2Cu
•  Test to see if machine learning problems give
better performance using structure descriptors
•  Compare performance against other site /
structure descriptors in the field for various
problems
13
Structure descriptors – next steps
Implemented in:
•  pymatgen - www.pymatgen.org
•  matminer – https://hackingmaterials.github.io/matminer
More info: talk to Nils
Zimmermann at the poster
session!
14
II. matminer
15
MATMINER
How can we make
this transformation?
Test different ideas?
Where do we get
the data?
Goal of matminer: connect materials data with data mining
algorithms and data visualization libraries
16
Ward, L. et al. Matminer: An open source toolkit for materials data mining. Comput. Mater. Sci. 152, 60–69 (2018).
>40 featurizer classes can
generate thousands of
potential descriptors
17
Matminer contains a library of descriptors for various
materials science entities
feat = EwaldEnergy([options])
y = feat.featurize([input_data])
•  compatible with
scikit-learn
pipelining
•  automatically deploy
multiprocessing to
parallelize over data
•  include citations to
methodology papers
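To illustrate the call pattern above without depending on an installed matminer, here is a hypothetical stand-in featurizer. `MeanAtomicNumber`, its small lookup table, and the dict-based composition format are assumptions for illustration and are not matminer classes. The method names `featurize`, `feature_labels`, and `citations` follow the interface matminer featurizers expose.

```python
class MeanAtomicNumber:
    """Hypothetical featurizer: composition-weighted mean atomic number."""

    # Tiny lookup table, enough for the examples below.
    ATOMIC_NUMBERS = {"H": 1, "O": 8, "Na": 11, "Cl": 17, "Ti": 22}

    def featurize(self, composition):
        """Return a list of feature values for one composition dict."""
        total = sum(composition.values())
        weighted = sum(self.ATOMIC_NUMBERS[el] * n for el, n in composition.items())
        return [weighted / total]

    def feature_labels(self):
        """Names of the values returned by featurize, in the same order."""
        return ["mean atomic number"]

    def citations(self):
        """References for the methodology (none for this toy feature)."""
        return []

    def featurize_many(self, compositions):
        """Apply featurize to every entry; a real library may parallelize this."""
        return [self.featurize(c) for c in compositions]

feat = MeanAtomicNumber()
y = feat.featurize({"Na": 1, "Cl": 1})
rows = feat.featurize_many([{"Ti": 1, "O": 2}, {"H": 2, "O": 1}])
```

Keeping labels and citations on the featurizer object means a table of generated descriptors can always be traced back to column names and to the paper that defined them.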
18
Interactive Jupyter notebooks demonstrate use cases
https://github.com/hackingmaterials/matminer_examples
Many examples available:
•  Retrieving data from various databases
•  Predicting bulk / shear modulus
•  Predicting formation energies:
–  from composition alone
–  with Voronoi-based structure features included
–  with Coulomb matrix and Orbital Field matrix descriptors (reproducing previous studies in the literature)
•  Making interactive visualizations
•  Creating an ML pipeline
•  Further increase coverage and scope of feature
extraction methods available in the literature
•  Increase the number of “standard” data sets that
can be used to benchmark different ML
approaches
•  Apply to materials problems (in progress)
19
matminer – next steps
Implemented in:
•  matminer – https://hackingmaterials.github.io/matminer
20
III. atomate / Rocketsled
Generalizable
forward solver
Supercomputing
Power
Statistical
optimization
FireWorks NERSC Various optimization libraries
(Figure: J. Mueller)
With HT-DFT, we can generate data rapidly – what to do next?
21
M. de Jong, W. Chen, H. Geerlings, M. Asta, and K. A. Persson, Sci. Data, 2015, 2, 150053.
M. de Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, M. Sluiter, C. K. Ande, S. van der Zwaag, J. J. Plata, C. Toher, S. Curtarolo, G. Ceder, K. A. Persson, and M. Asta, Sci. Data, 2015, 2, 150009.
>4500 elastic tensors
>900 piezoelectric tensors
>48000 Seebeck coefficients + cRTA transport
Ricci, Chen, Aydemir, Snyder, Rignanese, Jain, & Hautier (in submission)
Atomate is our software to easily run millions of such
calculations at supercomputing centers
22
Results
researcher
Start with all binary oxides, replace O->S, run several different properties
Workflows to run
✓  band structure
✓  surface energies
✓  elastic tensor
☐  Raman spectrum
☐  QH thermal expansion
☐  spin-orbit coupling
Can we build a general computational optimizer?
23
Generalizable
forward solver
Supercomputing
Power
Statistical
optimization
FireWorks
/ atomate
NERSC Various optimization libraries
(Figure: J. Mueller)
Rocketsled: Automatic materials screening that selects
materials to compute AND submits them to supercomputer
24
screening space of ~20,000 potential
ABX3 perovskite combinations as
water splitting materials –
precomputed in DFT by different group
if a machine learning algorithm was in
charge of picking the next compound
based on past data, how efficient
would it be?
•  Built off the scikit-optimization package, with 10
different regressors available
•  Bootstrapped uncertainty estimates for balancing
exploration and exploitation
•  Next step: deployment for thermoelectrics search
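A minimal sketch of bootstrapped uncertainty used to pick the next candidate, assuming a minimization objective. It uses a plain linear least-squares model in place of the regressors mentioned above, and the names `bootstrap_predict`, `pick_next`, and `kappa` are hypothetical; none of this is the Rocketsled API.

```python
import numpy as np

def bootstrap_predict(X_train, y_train, X_cand, n_boot=200, seed=0):
    """Fit a linear model on bootstrap resamples; return mean and std of predictions."""
    rng = np.random.default_rng(seed)
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_cand = np.asarray(X_cand, dtype=float)
    # Prepend a column of ones so the model has an intercept.
    A = np.column_stack([np.ones(len(X_train)), X_train])
    C = np.column_stack([np.ones(len(X_cand)), X_cand])
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X_train), len(X_train))  # sample with replacement
        coef, *_ = np.linalg.lstsq(A[idx], y_train[idx], rcond=None)
        preds.append(C @ coef)
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)

def pick_next(X_train, y_train, X_cand, kappa=1.0):
    """Index of the candidate with the lowest 'mean - kappa * std' score.

    kappa = 0 is pure exploitation; larger kappa favors uncertain candidates.
    """
    mean, std = bootstrap_predict(X_train, y_train, X_cand)
    return int(np.argmin(mean - kappa * std))

# Toy data: one feature, already-computed points, and three unexplored candidates.
X_seen = np.array([[0.0], [1.0], [2.0], [3.0]])
y_seen = np.array([0.0, 1.0, 2.0, 3.0])
X_candidates = np.array([[-1.0], [5.0], [10.0]])
next_index = pick_next(X_seen, y_seen, X_candidates)
```

In a screening loop, the chosen candidate would be computed, appended to the training data, and the selection repeated.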
25
Further details and next steps
Implemented in:
•  rocketsled – https://github.com/hackingmaterials/rocketsled
26
IV. A text mining materials database
Some questions that current search tools don’t answer:
these questions require materials-specific search tools
“I’d like a list of all the chemical compositions that have been studied as
thermoelectrics, ideally weighted by research interest in them. Ok, now
filter to thermoelectric materials known to have layered structures. Now
show me some materials that aren’t in that list but are similar in terms
of structure and electronic properties in the Materials Project database.”
“What are all the known applications and unique properties of
NaCoO2? What techniques (computational, experimental) have
been used to study this compound in the past?”
“I just predicted a new composition as a battery cathode. A lit search
shows no hits at all for that composition. Has anyone ever made
anything similar to that composition? I’d like to know for synthesis
ideas and also want to check against similarity to known battery
materials.”
28
An engine to label the content of scientific abstracts
Matstract
corpus
Unlabeled
data
Data
labels
Feature engineering
Text cleaning
Tokenization
POS tag
labels
Word embeddings
(word2vec)
Text processing
Hand crafted features
Supervised learning
Neural network
(LSTM)
Logistic regression
Train/test
sets
Named
Entities
Named
Entities
“Learning” what a
scientific study is about
from >2 million
materials science
abstracts
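As a rough sketch of the tokenization and hand-crafted feature steps in the pipeline above, using only the standard library. The regular expressions, the feature names, and the formula heuristic are assumptions made for illustration; the actual pipeline is not described at this level of detail in the slides.

```python
import re

# Words (allowing internal '.' or '-') or single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[.\-][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")
# Heuristic: two or more element-like symbols, each with an optional amount.
FORMULA_RE = re.compile(r"(?:[A-Z][a-z]?\d*(?:\.\d+)?){2,}")

def tokenize(text):
    """Split an abstract into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

def token_features(token):
    """Hand-crafted features that a downstream classifier could consume."""
    return {
        "lower": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "looks_like_formula": FORMULA_RE.fullmatch(token) is not None,
    }

tokens = tokenize("NaCoO2 is a layered thermoelectric.")
features = [token_features(t) for t in tokens]
```

Features like these would be combined with word embeddings and fed to the supervised models named on the slide. The formula heuristic will also flag some ordinary capitalized abbreviations, so it is only a starting point.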
29
Learn relationships over many abstracts
30
Application: a revised materials search engine
Auto-generated summaries of materials based on text mining
31
Application: materials compositions of interest …
A search for thermoelectrics that do not have Pb or Bi
•  Further testing
•  Similarity metrics, e.g. if a target compound
doesn’t exist, retrieve information for “similar”
compounds instead
•  Integration with Materials Project
32
Materials abstracts – next steps
Interested in being a beta tester?
Contact me
•  Our group has been working on methods and
software for various applications
–  Interpretable descriptors of crystal structure
–  matminer
–  atomate / Rocketsled
–  A text mining materials database
•  We encourage you to try the software and let us
know what you think!
–  Help lists are available for all software
33
Conclusions
•  Structure descriptors
–  N. Zimmermann (project lead)
•  Atomate / Rocketsled
–  K. Mathew (project lead, atomate)
–  A. Dunn (project lead, rocketsled)
•  Matminer
–  L. Ward (project lead, U. Chicago)
•  Text mining
–  V. Tshitoyan, J. Dagdelen, L. Weston
•  All that provided feedback & contributed code to open-source software efforts!
•  Funding: DOE-BES (Early Career + Materials Project Center)
•  Computing: NERSC
34
Thank you!

  • 2. Overview of talk. Three projects available now: interpretable descriptors of crystal structure; matminer; atomate / Rocketsled. One project in progress: a text mining materials database.
  • 3. I. Interpretable descriptors of crystal structure
  • 4. Machine learning: the big problem, in my view, is connecting data to ML algorithms through features. On one side there is lots of data on complex objects that you want to interrelate. On the other side there are well-developed data-mining routines (clustering, regression, feature extraction, model-building, etc.) that work only on numbers, ideally ones with high relevance to your problem. We need to transform materials science objects into a set of physically relevant numerical data (“features” or “descriptors”).
  • 5. The crystal structure is a core entity that machine learning algorithms should know about. Step 1: Describe each site as a fingerprint telling you how close it is to each of 22 known local environments (e.g., tetrahedral, octahedral, etc.). Step 2: Describe each structure as the average of its site fingerprints (plus additional statistics like standard deviation, min, max, etc. if desired, or split into separate cation/anion vectors). Example environments shown: tetrahedron, octahedron, distorted 8-coordinated cube.
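
The two steps above can be sketched in plain Python. This is a minimal illustration, not the pymatgen or matminer implementation: the function name `structure_fingerprint` and the three-component toy fingerprints (tetrahedron, octahedron, cube) are assumptions made for the example, whereas the fingerprints described on the slide have one entry per local environment.

```python
from statistics import mean, pstdev


def structure_fingerprint(site_fingerprints, stats=("mean",)):
    """Combine per-site fingerprints into one structure-level vector.

    site_fingerprints: list of equal-length lists, one list per site.
    stats: statistics to compute for each fingerprint component.
    """
    reducers = {"mean": mean, "std": pstdev, "min": min, "max": max}
    # Transpose so that each column holds one environment type across all sites.
    columns = list(zip(*site_fingerprints))
    vector = []
    for name in stats:
        vector.extend(reducers[name](col) for col in columns)
    return vector


# Hypothetical site fingerprints: (tetrahedron, octahedron, cube) similarity.
sites = [[0.9, 0.1, 0.0], [0.1, 0.7, 0.0]]
fp = structure_fingerprint(sites, stats=("mean", "max"))
```

Concatenating several statistics, as done here with the mean and the max, corresponds to the "additional statistics" option mentioned on the slide.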
  • 6. Defining local order parameters for various environments. Use a given local order parameter with a threshold for motif recognition: if q_tet > q_thresh, then the motif is a tetrahedron; else it is not (too much) a tetrahedron. The tetrahedral order parameter, q_tet, is taken from [1]. [1] Zimmermann et al., J. Am. Chem. Soc., 2017, 10.1021/jacs.5b08098
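
The threshold rule can be written as a short filter. This is only a sketch: the function name `recognize_motifs`, the threshold value 0.5, and the order-parameter values are hypothetical, and the order parameters themselves are assumed to be computed elsewhere.

```python
def recognize_motifs(order_parameters, q_thresh=0.5):
    """Return the motifs whose order parameter exceeds the threshold."""
    return [motif for motif, q in order_parameters.items() if q > q_thresh]


# Hypothetical order-parameter values for a single site.
site = {"tetrahedron": 0.93, "octahedron": 0.12}
motifs = recognize_motifs(site, q_thresh=0.5)
```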
  • 7. We have now developed mathematical order parameters for 22 different local environments
  • 8. How well do these work? 1. Order parameters clearly distinguish different environments, even after thermal distortion. 2. They work well in applications (defect site finding, diffusion characterization). [1] Zimmermann et al., Frontiers in Materials, 2017, doi: 10.3389/fmats.2017.00034
  • 9. Structure fingerprints: can they distinguish crystal structures? 40 diverse crystal structure prototypes; many complex examples (e.g., multi-cation, multi-anion) from each class; thousands of crystal structures in the test set; structure fingerprints created from averages of local environments. Prototypes: BaAl2O4, BaZnF4, CaFe2O4, CrVO4, K2NiF4, CaB2O4-I, MgUO4, Pb3O4, SbNbO4, Sr2PbO4, Tetragonal BaTiO3, Th3P4, TlAlF4, ZnSO4, α-MnMoO4, BCC, Aragonite, Barite, β-K2SO4, Calcite, Half-Heusler, FCC, Garnet, HCP, Rocksalt, Diamond, High-cristobalite, Ilmenite, Low-cristobalite, Low-quartz, Monazite, Olivine, Perovskites, Rutile, Scheelite, Spinel, Thenardite, Wolframite, Zircon, Phenacite
  • 10. Local environments do distinguish prototypes! The Euclidean distance between structure fingerprints is small for structures of the same prototype and larger for structures of different prototypes. Overlapping coefficient: OVC = 1.7%. [Figure: distributions of the distance between structure fingerprint vectors, for same-prototype and different-prototype pairs]
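
A hedged sketch of the two quantities on this slide follows. The slide does not say how the OVC was computed; the histogram-based definition below (the summed bin-wise minimum of two normalized histograms) is one common choice and is an assumption, as are the function names and the toy distance samples.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two fingerprint vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def overlap_coefficient(sample_a, sample_b, bins=20):
    """Overlap of two distance distributions, from normalized histograms.

    Returns 1.0 for identical distributions and 0.0 for disjoint ones.
    """
    lo = min(min(sample_a), min(sample_b))
    hi = max(max(sample_a), max(sample_b))
    width = (hi - lo) / bins or 1.0  # guard against all values being equal

    def histogram(sample):
        counts = [0] * bins
        for value in sample:
            index = min(int((value - lo) / width), bins - 1)
            counts[index] += 1
        return [c / len(sample) for c in counts]

    hist_a, hist_b = histogram(sample_a), histogram(sample_b)
    return sum(min(x, y) for x, y in zip(hist_a, hist_b))


# Toy data: distances for same-prototype pairs vs. different-prototype pairs.
same = [0.0, 0.1, 0.2]
different = [10.0, 10.1, 10.2]
ovc = overlap_coefficient(same, different)
```

A small OVC, such as the 1.7% quoted on the slide, means the two distance distributions barely overlap, so a distance cutoff separates same-prototype from different-prototype pairs well.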
  • 11. Can cluster crystal structures by “local environment similarity”
  • 12. Results on MP web site, e.g. for BCC-like structures: https://www.materialsproject.org/materials/mp-91/ Target: W. Similar structures (distance near 0): Cs3Sb, TiGaFeCo, CeMg2Cu
  • 13. Structure descriptors: next steps. Test to see whether machine learning problems give better performance using structure descriptors. Compare performance against other site / structure descriptors in the field for various problems. Implemented in: pymatgen (www.pymatgen.org) and matminer (https://hackingmaterials.github.io/matminer). More info: talk to Nils Zimmermann at the poster session!
  • 15. MATMINER. How can we make this transformation? Test different ideas? Where do we get the data?
  • 16. Goal of matminer: connect materials data with data mining algorithms and data visualization libraries. Ward, L. et al. Matminer: An open source toolkit for materials data mining. Comput. Mater. Sci. 152, 60–69 (2018).
  • 17. Matminer contains a library of descriptors for various materials science entities. >40 featurizer classes can generate thousands of potential descriptors. Usage pattern: feat = EwaldEnergy([options]); y = feat.featurize([input_data]). Featurizers are compatible with scikit-learn pipelining, automatically deploy multiprocessing to parallelize over data, and include citations to methodology papers.
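
The construct-then-`featurize` pattern on this slide can be mimicked with a toy class. This is not matminer code: the class `MeanAtomicMass`, its `feature_labels` method, and the small table of approximate atomic masses are all assumptions, written only to show how a materials object (here a composition) becomes a list of numbers.

```python
class MeanAtomicMass:
    """Toy featurizer: composition (element -> amount) to one number."""

    # Approximate standard atomic masses in u, for this example only.
    MASSES = {"H": 1.008, "O": 15.999, "Na": 22.990, "Cl": 35.45}

    def feature_labels(self):
        """Names of the features, in the order featurize() returns them."""
        return ["mean atomic mass"]

    def featurize(self, composition):
        """Return the amount-weighted mean atomic mass as a one-item list."""
        total_atoms = sum(composition.values())
        total_mass = sum(self.MASSES[el] * n for el, n in composition.items())
        return [total_mass / total_atoms]


feat = MeanAtomicMass()
y = feat.featurize({"Na": 1, "Cl": 1})  # composition of NaCl
```

Returning a list alongside matching labels keeps the output easy to place into a table of descriptors, one column per label.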
  • 18. Interactive Jupyter notebooks demonstrate use cases: https://github.com/hackingmaterials/matminer_examples Many examples available: retrieving data from various databases; predicting bulk / shear modulus; predicting formation energies (from composition alone, with Voronoi-based structure features included, or with Coulomb matrix and Orbital Field matrix descriptors, reproducing previous studies in the literature); making interactive visualizations; creating an ML pipeline.
  • 19. matminer: next steps. Further increase coverage and scope of feature extraction methods available in the literature. Increase the number of “standard” data sets that can be used to benchmark different ML approaches. Apply to materials problems (in progress). Implemented in: matminer (https://hackingmaterials.github.io/matminer).
  • 20. III. atomate / Rocketsled. Generalizable forward solver (FireWorks), supercomputing power (NERSC), statistical optimization (various optimization libraries). (Figure: J. Mueller)
  • 21. With HT-DFT, we can generate data rapidly. What to do next? Data generated so far: >4500 elastic tensors; >900 piezoelectric tensors; >48000 Seebeck coefficients + cRTA transport. References: M. de Jong, W. Chen, H. Geerlings, M. Asta, and K. A. Persson, Sci. Data, 2015, 2, 150053. M. de Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, M. Sluiter, C. K. Ande, S. Van Der Zwaag, J. J. Plata, C. Toher, S. Curtarolo, G. Ceder, K. A. Persson, and M. Asta, Sci. Data, 2015, 2, 150009. Ricci, Chen, Aydemir, Snyder, Rignanese, Jain, & Hautier (in submission).
  • 22. Atomate is our software to easily run millions of such calculations at supercomputing centers. Example: a researcher starts with all binary oxides, replaces O->S, runs several different properties, and gets results back. Workflows to run: [x] band structure, [x] surface energies, [x] elastic tensor, [ ] Raman spectrum, [ ] QH thermal expansion, [ ] spin-orbit coupling.
  • 23. Can we build a general computational optimizer? Generalizable forward solver (FireWorks / atomate), supercomputing power (NERSC), statistical optimization (various optimization libraries). (Figure: J. Mueller)
  • 24. Rocketsled: automatic materials screening that selects materials to compute AND submits them to a supercomputer. Test case: a screening space of ~20,000 potential ABX3 perovskite combinations as water splitting materials, precomputed in DFT by a different group. If a machine learning algorithm were in charge of picking the next compound based on past data, how efficient would it be?
  • 25. Further details and next steps. Built off the scikit-optimization package, with 10 different regressors available. Bootstrapped uncertainty estimates for balancing exploration and exploitation. Next step: deployment for thermoelectrics search. Implemented in: rocketsled (https://github.com/hackingmaterials/rocketsled).
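
The idea of bootstrapped uncertainty estimates can be illustrated with a small standard-library sketch. This is not rocketsled's API or algorithm: the one-dimensional line fit, the function names, and the "mean plus kappa times standard deviation" selection rule are assumptions chosen to keep the example short. The model is refit on resampled data many times, and the spread of its predictions serves as the uncertainty.

```python
import random
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Least-squares line fit; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample: every x identical
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def bootstrap_predict(xs, ys, candidates, n_boot=200, seed=0):
    """Return (mean, std) of bootstrapped predictions for each candidate."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = [[] for _ in candidates]
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        slope, intercept = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        for k, c in enumerate(candidates):
            predictions[k].append(slope * c + intercept)
    return [(mean(p), pstdev(p)) for p in predictions]


def pick_next(xs, ys, candidates, kappa=1.0, **kwargs):
    """Pick the candidate maximizing predicted value plus kappa * uncertainty."""
    stats = bootstrap_predict(xs, ys, candidates, **kwargs)
    scores = [m + kappa * s for m, s in stats]
    return candidates[scores.index(max(scores))]


# Toy data: past evaluations (xs, ys) and unexplored candidates.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.1, 2.9]
candidates = [1.5, 4.0, 10.0]
next_x = pick_next(xs, ys, candidates, kappa=1.0)
```

A larger `kappa` favors exploration (uncertain candidates), while `kappa=0` is pure exploitation of the predicted mean.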
  • 26. IV. A text mining materials database
  • 27. Some questions that current search tools don’t answer; these questions require materials-specific search tools! “I’d like a list of all the chemical compositions that have been studied as thermoelectrics, ideally weighted by research interest in them. Ok, now filter to thermoelectric materials known to have layered structures. Now show me some materials that aren’t in that list but are similar in terms of structure and electronic properties in the Materials Project database.” “What are all the known applications and unique properties of NaCoO2? What techniques (computational, experimental) have been used to study this compound in the past?” “I just predicted a new composition as a battery cathode. A lit search shows no hits at all for that composition. Has anyone ever made anything similar to that composition? I’d like to know for synthesis ideas and also want to check against similarity to known battery materials.”
  • 28. An engine to label the content of scientific abstracts: “learning” what a scientific study is about from >2 million materials science abstracts. [Diagram labels: Matstract corpus; unlabeled data; data labels; text processing (text cleaning, tokenization); feature engineering (POS tag labels, word embeddings (word2vec), hand crafted features); train/test sets; supervised learning (neural network (LSTM), logistic regression); named entities]
  • 29. Learn relationships over many abstracts
  • 30. Application: a revised materials search engine. Auto-generated summaries of materials based on text mining
  • 31. Application: materials compositions of interest. A search for thermoelectrics that do not have Pb or Bi
  • 32. Materials abstracts: next steps. Further testing. Similarity metrics, e.g. if a target compound doesn’t exist, retrieve information for “similar” compounds instead. Integration with Materials Project. Interested in being a beta tester? Contact me.
  • 33. Conclusions. Our group has been working on methods and software for various applications: interpretable descriptors of crystal structure; matminer; atomate / Rocketsled; a text mining materials database. We encourage you to try the software and let us know what you think! Help lists are available for all software.
  • 34. Thank you! Structure descriptors: N. Zimmermann (project lead). Atomate / Rocketsled: K. Mathew (project lead, atomate), A. Dunn (project lead, rocketsled). Matminer: L. Ward (project lead, U. Chicago). Text mining: V. Tshitoyan, J. Dagdelen, L. Weston. All who provided feedback and contributed code to open-source software efforts! Funding: DOE-BES (Early Career + Materials Project Center). Computing: NERSC.