Evaluating Machine Learning Algorithms for Materials Science using the Matbench Protocol
Anubhav Jain
Staff Scientist, Lawrence Berkeley National Laboratory
Virtual presentation given at the FAIR-DI workshop held in Louvain-la-Neuve, Belgium, Sept 30, 2021

  1. 1. Evaluating Machine Learning Algorithms for Materials Science using the Matbench Protocol Anubhav Jain Staff Scientist, Lawrence Berkeley National Laboratory Deputy Director, Materials Project materialsproject.org The Materials Project Slides (already) uploaded to https://hackingmaterials.lbl.gov
  2. 2. Outline of talk 1. A quick introduction to the Materials Project 2. Engaging the community: The MPContribs data platform 3. Benchmarking machine learning algorithms using the Matbench protocol
  3. 3. A quick introduction to the Materials Project
  4. 4. The core of Materials Project is a free database of calculated materials properties and crystal structures Free, public resource • www.materialsproject.org Data on ~150,000 materials, including information on: • electronic structure • phonon and thermal properties • elastic / mechanical properties • magnetic properties • ferroelectric properties • piezoelectric properties • dielectric properties Powered by hundreds of millions of CPU-hours invested into high-quality calculations 4
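[Editor's note: the slide describes the database only; as a hedged sketch of how these calculated properties can be pulled down programmatically, the snippet below uses pymatgen's MPRester client. The API key placeholder, the material ID mp-149 (silicon), and the printed fields are illustrative assumptions, not content from the slides.]

from pymatgen.ext.matproj import MPRester

# Hedged sketch: fetch a calculated crystal structure from Materials Project.
# Requires a free API key from materialsproject.org; "mp-149" is silicon.
with MPRester("YOUR_API_KEY") as mpr:
    structure = mpr.get_structure_by_material_id("mp-149")
    print(structure.composition.reduced_formula)
    print(structure.get_space_group_info())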
  5. 5. The core data set keeps growing with time … 5
  6. 6. Apps give insight into data Materials Explorer Phase Stability Diagrams Pourbaix Diagrams (Aqueous Stability) Battery Explorer 6
  7. 7. The code powering the Materials Project is available open source (BSD/MIT licenses) just-in-time error correction, fixing your calculations so you don’t have to ‘recipes’ for common materials science simulation tasks making materials science web apps easy workflow management software for high-throughput computing materials science analysis code: make, transform and analyze crystals, phase diagrams and more & more … MP team members also contribute to several other non-MP codes, e.g. matminer for machine learning featurization 7
  8. 8. Example: calculation workflows implemented in atomate by dozens of collaborators Phonons Elasticity Defects Magnetism Band Structures Stability Grain Boundaries Equations of State X-ray Absorption Spectra Piezoelectric Dielectric Surfaces & more … Requirements: VASP license and a big computer. ABINIT planned in future w/ G.-M. Rignanese 8
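[Editor's note: the slide only lists the preset workflows. As a rough, hedged illustration of how such a preset is typically built and queued, the sketch below uses atomate's documented band-structure preset together with FireWorks; it assumes a configured LaunchPad database and a working VASP installation, and is not taken from the talk itself.]

from fireworks import LaunchPad
from pymatgen.ext.matproj import MPRester
from atomate.vasp.workflows.presets.core import wf_bandstructure

# Hedged sketch: build a preset band-structure workflow and add it to the
# FireWorks queue (assumes my_launchpad.yaml and VASP are set up).
with MPRester("YOUR_API_KEY") as mpr:
    structure = mpr.get_structure_by_material_id("mp-149")  # silicon

wf = wf_bandstructure(structure)  # preset: optimization -> static -> band structure
lp = LaunchPad.auto_load()
lp.add_wf(wf)  # workers then pull and run the jobs (e.g., via rlaunch/qlaunch)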
  9. 9. Example 2: matminer allows researchers to generate diverse feature sets for machine learning 9 >60 featurizer classes can generate thousands of potential descriptors that are described in the literature feat = EwaldEnergy([options]) y = feat.featurize([input_data]) • compatible with scikit-learn pipelining • automatically deploy multiprocessing to parallelize over data • include citations to methodology papers
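[Editor's note: to make the slide's pseudocode concrete, here is a small hedged example of the matminer featurizer API; the choice of the Magpie ElementProperty preset and the two toy compositions are illustrative assumptions.]

import pandas as pd
from pymatgen.core import Composition
from matminer.featurizers.composition import ElementProperty

# Hedged sketch: generate Magpie composition descriptors for two compounds.
df = pd.DataFrame({"composition": [Composition("Fe2O3"), Composition("SiO2")]})

feat = ElementProperty.from_preset("magpie")              # statistical element-property features
df = feat.featurize_dataframe(df, col_id="composition")   # scikit-learn-style featurizer API

print(df.shape)
print(feat.citations())  # featurizers carry citations to the methodology papers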
  10. 10. The Materials Project is used heavily by the research community > 180,000 registered users > 40,000 new users last year ~100 new registrations/day ~5,000-10,000 users log on every day > 2M+ records downloaded through API each day; 1.8 TB of data served per month 10
  11. 11. A large fraction of users are from industry Student 44% Academia 36% Industry 10% Government 5% Other 5% Schrodinger: “Many of our customers are active users of the Materials Project and use MP databases for their projects. Enabling direct access to MP databases from within Schrödinger software is a powerful addition that will be appreciated by our users.” Toyota: “Materials Project is a wonderful project. Please accept my appreciation to you to release it free and easy to access.” Hazen Research: “Amazing and well done data base. I still remember searching Landolt-Börnstein series during my PhD for similar things.” 11
  12. 12. Engaging the community: the MPContribs data platform
  13. 13. How can we use Materials Project to build a community of materials researchers? Materials Project now has high visibility (e.g., by search engines) How can we use this platform to help add value to the community of materials researchers? 13
  14. 14. Beyond calculations: MPContribs allows the research community to contribute their own data A “materials detail page,” containing all the information MP has calculated about a specific material Experimental data on a material (either specific phase, composition, or chemical system) “MPContribs” bridges the gap 14
  15. 15. From Google search to your data and your research, via MP: 1. Google links to Materials Project page 2. Materials Project links to your contribution 3. Your data set and paper are linked 15
  16. 16. MPContribs is open for contributions You can now apply to contribute your data set and we will work with you to disseminate via MP Designed for: • smaller data sets (e.g., MBs to GBs); for large data files see NOMAD or other repos • Linking to MP compositions Available via mpcontribs.org 16
  17. 17. Benchmarking machine learning methods using the Matbench protocol
  18. 18. MP is now involved in an effort to benchmark various machine learning algorithms 18
  19. 19. Without standardized benchmarks, ML models can be difficult to compare. [Figure: three models, each trained on a different dataset (e.g., 4k samples with no structures and no AB2C3 compositions vs. 100k samples with structures available and E_above_hull < 0.050 eV) and each reporting a different metric (test-set RMSE = 0.05 eV, 5-fold CV MAE = 0.021 eV, validation loss = 0.005), cannot be directly compared.]
  20. 20. What’s needed – an “ImageNet” for materials science https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/ 20
  21. 21. Can we make the same advancements in materials as in computer vision? One of the reasons computer science / machine learning seems to advance so quickly is that they decouple data generation from algorithm development This allows groups to focus on algorithm development without all the data generation, data cleaning, etc. that often is the majority of an end-to-end data science project Clear comparisons also move the field forward and measure progress 21
  22. 22. The ingredients of the Matbench benchmark ☐ Standard data sets ☐ Standard test splits according to nested cross-validation procedure ☐ An online leaderboard that encourages reproducible results 22
  23. 23. Matbench includes 13 different ML tasks 23 Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm. npj Comput Mater 2020, 6 (1), 138. https://doi.org/10.1038/s41524-020-00406-3.
  24. 24. The tasks encompass a variety of problems 13 Ready-to-use ML tasks ranging in training size, target property, inputs, task type. • Pre-cleaned datasets from literature and online repositories (such as Materials Project) • Wide range of practical solid state ML tasks • Experimental and computed properties • Standardized error evaluation (nested CV)
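[Editor's note: the raw data behind any of these tasks can also be pulled down with matminer's dataset loader; the sketch below is an illustrative example using the dielectric task mentioned later in the talk.]

from matminer.datasets import load_dataset

# Hedged sketch: load one Matbench dataset as a pandas DataFrame
# (pymatgen Structure objects plus the target property column).
df = load_dataset("matbench_dielectric")
print(len(df))              # number of samples
print(df.columns.tolist())  # input and target columns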
  25. 25. Browse datasets and tasks with Materials Project MPContribsML https://ml.materialsproject.org
  26. 26. The ingredients of the Matbench benchmark ✓ Standard data sets ☐ Standard test splits according to nested cross-validation procedure ☐ An online leaderboard that encourages reproducible results 26
  27. 27. Most commonly used test split procedure • Training/validation is used for model selection • Test / hold-out is used only for error estimation (Test set should not inform model selection, i.e. “final answer”) 27
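[Editor's note: a minimal sketch of this split discipline, using scikit-learn on synthetic data; the estimator, grid, and split sizes are arbitrary illustrative choices.]

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model selection happens only on the training/validation portion ...
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      {"n_estimators": [100, 300]}, cv=5)
search.fit(X_trainval, y_trainval)

# ... and the hold-out set is touched once, for the final error estimate.
print(mean_absolute_error(y_test, search.predict(X_test)))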
  28. 28. Nested CV – like hold-out, but varies the hold-out set. Think of it as N different “universes” – we have a different training of the model in each universe and a different hold-out. 28
  29. 29. Nested CV – like hold-out, but varies the hold-out set. Think of it as N different “universes” – we have a different training of the model in each universe and a different hold-out. “A nested CV procedure provides an almost unbiased estimate of the true error.” Varma and Simon, Bias in error estimation when using cross-validation for model selection (2006) 29
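[Editor's note: sketched below is what nested CV looks like in code, again with scikit-learn on synthetic data. The outer folds play the role of the different “universes” (hold-out sets), and model selection happens only inside each universe; the estimator and grid are illustrative.]

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in outer_cv.split(X):              # outer loop: error estimation
    inner = GridSearchCV(RandomForestRegressor(random_state=0),
                         {"n_estimators": [100, 300]}, cv=5)  # inner loop: model selection
    inner.fit(X[train_idx], y[train_idx])
    scores.append(mean_absolute_error(y[test_idx], inner.predict(X[test_idx])))

print(np.mean(scores), np.std(scores))  # aggregate across the N "universes"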
  30. 30. The ingredients of the Matbench benchmark ✓ Standard data sets ✓ Standard test splits according to nested cross-validation procedure ☐ An online leaderboard that encourages reproducible results 30
  31. 31. Matbench has an online leaderboard – matbench.materialsproject.org
  32. 32. Complete and reproducible results on standardized ML tasks Sample-by-sample predictions of all algorithms on all tasks, notebooks and scripts for reproduction (.json, .ipynb, .py) Aggregate scores across nested CV folds Complete model metadata, hyperparameters, required compute, academic references
  33. 33. Algorithm comparison across individual tasks OR complete benchmark Example: matbench_dielectric Compare both specialized and general-purpose algorithms across multiple error metrics
  34. 34. Evaluation of ML paradigms drives research and development Traditional paradigms: • Traditional Models (e.g., RF + MagPie [1] features) • AutoML inside “traditional ML” space (Automatminer) Advancements in deep neural networks: • Attention Networks (e.g., CRABNet [2]) • Optimal Descriptor Networks (e.g., MODNet [3]) • Crystal Graph Networks (e.g., CGCNN, MEGNet [4]) References: 1. doi.org/10.1038/npjcompumats.2016.28 2. doi.org/10.1038/s41524-021-00545-1 3. doi.org/10.1038/s41524-021-00552-2 4. doi.org/10.1021/acs.chemmater.9b01294
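[Editor's note: as a hedged sketch of the first "traditional" paradigm (random forest on Magpie composition features), roughly as one might run it against a Matbench dataset. The dataset name, target column name, and hyperparameters below are assumptions for illustration, not the paper's reference implementation.]

from matminer.datasets import load_dataset
from matminer.featurizers.composition import ElementProperty
from pymatgen.core import Composition
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Hedged sketch: RF + Magpie features on a composition-only Matbench dataset.
df = load_dataset("matbench_expt_gap")        # composition strings -> experimental band gaps
comps = [Composition(c) for c in df["composition"]]

X = ElementProperty.from_preset("magpie").featurize_many(comps, ignore_errors=True)
y = df["gap expt"]                            # target column name assumed from the Matbench docs

rf = RandomForestRegressor(n_estimators=100, random_state=0)
print(-cross_val_score(rf, X, y, scoring="neg_mean_absolute_error", cv=5).mean())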
  35. 35. Matbench compares these ML model paradigms Traditional paradigms: • Traditional Models (e.g., RF + MagPie [1] features) • AutoML inside “traditional ML” space (Automatminer) Advancements in deep neural networks: • Attention Networks (e.g., CRABNet [2]) • Optimal Descriptor Networks (e.g., MODNet [3]) • Crystal Graph Networks (e.g., CGCNN, MEGNet [4]) Status annotations: ✓ in Matbench ✓ in Matbench ✓ in Matbench ✓ CGCNN in Matbench ✓ MEGNet in progress ✓ PR in review References: 1. doi.org/10.1038/npjcompumats.2016.28 2. doi.org/10.1038/s41524-021-00545-1 3. doi.org/10.1038/s41524-021-00552-2 4. doi.org/10.1021/acs.chemmater.9b01294
  36. 36. Contribute your model to the body of knowledge
  Matbench Python package: evaluate an entire benchmark with ~10 lines of code.

$ pip install matbench

from matbench.bench import MatbenchBenchmark

mb = MatbenchBenchmark(autoload=False)

for task in mb.tasks:
    task.load()
    for fold in task.folds:
        train_inputs, train_outputs = task.get_train_and_val_data(fold)
        my_model.train_and_validate(train_inputs, train_outputs)

        test_inputs = task.get_test_data(fold, include_target=False)
        predictions = my_model.predict(test_inputs)

        task.record(fold, predictions)

mb.to_file("my_models_benchmark.json.gz")

  Your model needs to have:
  • a function that trains it based on training data
  • a function that makes a prediction based on the trained model
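[Editor's note: the loop above assumes a `my_model` object, and the slide only states that it needs a training call and a prediction call. Below is a purely hypothetical example of such a wrapper, a scikit-learn random forest on Magpie composition features, usable for composition-based tasks only; it is not an official Matbench reference model.]

from matminer.featurizers.composition import ElementProperty
from pymatgen.core import Composition
from sklearn.ensemble import RandomForestRegressor

class MyModel:
    """Hypothetical wrapper exposing the two calls the benchmark loop expects."""

    def __init__(self):
        self.featurizer = ElementProperty.from_preset("magpie")
        self.model = RandomForestRegressor(n_estimators=100, random_state=0)

    def _featurize(self, inputs):
        # Accept either composition strings or pymatgen Composition objects.
        comps = [Composition(c) if isinstance(c, str) else c for c in inputs]
        return self.featurizer.featurize_many(comps, ignore_errors=True)

    def train_and_validate(self, train_inputs, train_outputs):
        self.model.fit(self._featurize(train_inputs), train_outputs)

    def predict(self, test_inputs):
        return self.model.predict(self._featurize(test_inputs))

my_model = MyModel()  # plug into the benchmark loop shown above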
  37. 37. Contribute your model to the body of knowledge Matbench Python package: evaluate an entire benchmark with ~10 lines of code (same code as on the previous slide). Submit your model file along with your desired model metadata via a GitHub PR.
  38. 38. The ingredients of the Matbench benchmark ✓ Standard data sets ✓ Standard test splits according to nested cross-validation procedure ✓ An online leaderboard that encourages reproducible results 38
  39. 39. Results so far: graph NN for large data sets, conventional ML for small Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm. npj Comput Mater 2020, 6 (1), 138. https://doi.org/10.1038/s41524-020-00406-3. 39
  40. 40. Overall and upcoming goals for Matbench • We have introduced a method that allows researchers to evaluate their machine learning models on a standard benchmark, if they so choose • The “Matbench” resource also provides metadata and code examples that allow others to reproduce and use community ML models more easily, as well as discover new ML models • In the future, we hope to expand the types of tasks, perform meta-analyses on what kinds of algorithms work best for certain problems, and plot progress on these tasks over time 40
  41. 41. Concluding thoughts The Materials Project is a free resource providing data and tools to help perform research and development of new materials Even more can be accomplished as a unified community to push forward data dissemination as well as the capabilities of machine learning 41 We encourage you to give Matbench a try, and look forward to seeing your algorithm on the leaderboard!
  42. 42. The team: Kristin Persson, MP Director; Patrick Huck, Staff Scientist (MPContribs); Alex Dunn, Grad Student (Matbench / matminer). Thank you! Slides (already) uploaded to https://hackingmaterials.lbl.gov
