Accelerating materials design through natural language processing

Virtual Seminar given at Kansas State University, Feb 23 2021

1. Accelerating materials design through natural language processing
Anubhav Jain, Energy Technologies Area, Lawrence Berkeley National Laboratory, Berkeley, CA
KSU Virtual Seminar, Feb 23 2021
Slides (already) posted to hackingmaterials.lbl.gov

2. Typically, both new materials discovery and optimization take decades
• Often, materials are known for several decades before their functional applications are known
– MgB2 sat on lab shelves for 50 years before its identification as a superconductor in 2001
– LiFePO4 was known since 1938, but only identified as a Li-ion battery cathode in 1997
• Even after discovery, optimization and commercialization still take decades
• How is this typically done?

3. What constrains traditional approaches to materials design?
“[The Chevrel] discovery resulted from a lot of unsuccessful experiments of Mg ions insertion into well-known hosts for Li+ ions insertion, as well as from the thorough literature analysis concerning the possibility of divalent ions intercalation into inorganic materials.”
– Aurbach group, on the discovery of the Chevrel cathode for multivalent (e.g., Mg2+) batteries
Levi, Levi, Chasid & Aurbach, J. Electroceramics (2009)

4. Researchers are starting to fundamentally re-think how we invent the materials that make up our devices
Next-generation materials design:
• Computer-aided materials design
• Natural language processing
• “Self-driving laboratories”

5. Outline
① Natural language processing – where are we right now?
② What’s next for the NLP work?

6. Can ML help us work through our backlog of information we need to assimilate from text sources?
(figure: a stack of papers to read “someday” feeding into NLP algorithms)

7. Traditional search doesn’t answer the questions we want
• It is difficult to look up all information on any given material due to the many different ways chemical compositions are written
– a search for “TiNiSn” will give different results than “NiTiSn”
– a search for “GaSb” won’t match text that reads “Ga0.5Sb0.5”
– a search for “SnBi4Te7” won’t match text that reads “we studied SnBi4X7 (X = S, Se, Te)”
– a search for “AgCrSe2”, if it doesn’t have any hits, won’t suggest “CuCrSe2” as a similar result
• It is difficult to ask questions or compile summaries, e.g.:
– What is the band gap of Si?
– What are all the known dopants into GaAs?
– What are all materials studied as thermoelectrics?
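One way around the formula-variant problem is to normalize every composition string to a canonical form before indexing. Below is a minimal sketch using pymatgen's Composition class, as an illustration of the idea rather than the normalizer matscholar actually uses:

```python
from pymatgen.core import Composition

# Map each formula string to a canonical reduced formula, so that element
# reorderings and rescaled stoichiometries all index to the same key.
for query in ["TiNiSn", "NiTiSn", "GaSb", "Ga0.5Sb0.5"]:
    print(query, "->", Composition(query).reduced_formula)

# "TiNiSn" and "NiTiSn" yield the same key, as do "GaSb" and "Ga0.5Sb0.5".
```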
8. What is matscholar?
• Matscholar is an attempt to organize the world’s information on materials science, connecting together topics of study, synthesis and characterization methods, and specific materials compositions
• It is also an effort to use state-of-the-art natural language processing to make collective use of the information in millions of articles

9. One of our main projects concerns named entity recognition, or automatically labeling text
This allows for search and is crucial to downstream tasks

10. The corpus so far:
• > 4 million papers collected
• 31 million properties
• 19 million materials mentions
• 8.8 million characterization methods
• 7.5 million applications
• 5 million synthesis methods
Data collection: over 4 million papers collected from more than 2,100 journals. Note – entities are currently extracted only from the abstracts of the papers.

11. Now we can search! Live on www.matscholar.com

12. Another example …

13. Limitations (it is not perfect)
• The publication data set is not complete
• Currently analyzing abstracts only
• The algorithms are not perfect
• The search interface could be improved further
• We would like to hear from you if you try this!

14. How does this work? High-level view
Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).

15. Step 1 – data collection
Extracted 4 million abstracts of relevant scientific articles using various APIs from journal publishers. Some are more difficult than others to obtain, and data cleaning is often needed (e.g., stray HTML tags, copyright statements). Abstract collection continues …

16. Ok, so how does this work? High-level view
Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).

17. Step 2 – tokenization
• First split the text into sentences
– Seems simple, but remember edge cases: “et al.” or “etc.” does not necessarily signify the end of a sentence despite the period
• Then split the sentences into words
– Tricky parts include detecting and normalizing chemical formulas, selective lowercasing (“Battery” vs. “battery”, or “BaS” vs. “BAs”), homogenizing numbers, etc.
• Done largely with ChemDataExtractor* with some custom improvements
– We may move to a fully custom tokenizer soon
*http://chemdataextractor.org
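To see why the “et al.” case matters, here is a toy, hedged sketch of abbreviation-aware sentence splitting (a simple regex approach for illustration, not ChemDataExtractor's actual implementation):

```python
import re

# Periods after these abbreviations should not end a sentence.
ABBREVIATIONS = ("et al.", "etc.", "e.g.", "i.e.")

def split_sentences(text):
    # Protect abbreviation periods, split at sentence-ending punctuation,
    # then restore the protected periods.
    for abbr in ABBREVIATIONS:
        text = text.replace(abbr, abbr.replace(".", "<DOT>"))
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.replace("<DOT>", ".") for p in parts if p]

print(split_sentences("As shown by Weston et al. in 2019, NER works. It scales."))
# -> ['As shown by Weston et al. in 2019, NER works.', 'It scales.']
```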
18. Ok, so how does this work? High-level view
Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).

19. Step 3 – hand-label abstracts
• Part A is marking abstracts as relevant / non-relevant to inorganic materials science
• Part B is tediously labeling ~600 abstracts
– Largely done by one person
– A spot-check of 25 abstracts by a second person gave 87.4% agreement

20. Ok, so how does this work? High-level view
Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).

21. Step 4a: the word2vec algorithm is used to “featurize” words
• We use the word2vec algorithm (Google) to turn each unique word in our corpus into a 200-dimensional vector
• These vectors encode the meaning of each word by learning to predict the context words around the target
Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai (2017).

22. Step 4a: the word2vec algorithm is used to “featurize” words
• We use the word2vec algorithm (Google) to turn each unique word in our corpus into a 200-dimensional vector
• These vectors encode the meaning of each word by learning to predict the context words around the target
“You shall know a word by the company it keeps” – John Rupert Firth (1957)
Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai (2017).
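In code, this step is a standard skip-gram training run. A minimal sketch with gensim on a toy corpus (the hyperparameters are illustrative, not the exact mat2vec settings):

```python
from gensim.models import Word2Vec

# `corpus` is a list of tokenized sentences from the abstracts; a toy
# stand-in is shown here.
corpus = [["the", "band", "gap", "of", "TiO2", "is", "3.2", "eV"]] * 100

model = Word2Vec(
    corpus,
    vector_size=200,  # 200-dimensional embeddings, as on the slide
    window=8,         # context window (illustrative choice)
    min_count=1,      # keep rare tokens in this toy corpus
    sg=1,             # skip-gram: predict context words from the target word
)
print(model.wv["TiO2"].shape)  # -> (200,)
```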
23. Word embeddings trained on “normal” text learn relationships between words
• The classic example is:
– “king” - “man” + “woman” = ? → “queen”
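With a trained model, this analogy is a one-line query in gensim (assuming a `model` trained on a large general-text corpus; the toy corpus above is far too small for this to work):

```python
# vector("king") - vector("man") + vector("woman") is closest to vector("queen")
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# -> [('queen', ...)] for embeddings trained on general text
```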
24. Ok, so how does this work? High-level view
Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).

25. Step 4b: how do we train a model to recognize context?
• If you read this sentence:
“The band gap of ___ is 4.5 eV”
it is clear that the blank should be filled in with a material word (not a synthesis method, characterization method, etc.)
• How do we get a neural network to take context into account (as well as properties of the word itself)?

26. Step 4b: an LSTM neural net classifies words by reading word sequences
Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
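A minimal Keras sketch of this kind of sequence labeler (a generic bidirectional LSTM tagger; the published model's exact architecture, features, and hyperparameters differ):

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 50_000  # number of unique tokens (hypothetical)
NUM_TAGS = 8         # entity labels, e.g. material, property, application, ..., plus "other"
MAX_LEN = 60         # tokens per padded sentence

model = models.Sequential([
    # 200-d word vectors; in practice these can be initialized from the word2vec step
    layers.Embedding(VOCAB_SIZE, 200, input_length=MAX_LEN),
    # Read each sentence in both directions so every word sees its full context
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    # Predict an entity tag for every token in the sequence
    layers.TimeDistributed(layers.Dense(NUM_TAGS, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```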
27. Ok, so how does this work? High-level view
Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).

28. Step 5: sit back and let the model label things for you!
Named Entity Recognition
• Custom machine learning models to extract the most valuable materials-related information.
• Utilizes a long short-term memory (LSTM) network trained on ~1000 hand-annotated abstracts.
• f1 scores of ~0.9; the f1 score for inorganic materials extraction is >0.9.
Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).

29. Could these techniques also be used to predict which materials we might want to screen for an application?
(figure: a stack of papers to read “someday” feeding into NLP algorithms)

30. Remember that word embeddings seem to learn relationships in text
• The classic example is:
– “king” - “man” + “woman” = ? → “queen”

31. For scientific text, the embeddings learn scientific concepts as well
(figure: crystal structures of the elements)
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).

32. There seems to be materials knowledge encoded in the word vectors
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).

33. Note that more data is not always better! We want relevance
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).

34. Word embeddings also have the periodic table encoded in them, with no prior knowledge
(figure: a “word embedding” periodic table)
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).

35. Making predictions: dot products measure the likelihood that words co-occur
• The dot product of a composition word with the word “thermoelectric” essentially predicts how likely that word is to appear in an abstract with the word “thermoelectric”
• Compositions with high dot products are typically known thermoelectrics
• Sometimes, compositions have a high dot product with “thermoelectric” but have never been studied as thermoelectrics
• These compositions usually have high computed power factors! (DFT + BoltzTraP)
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
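A sketch of that ranking in numpy (placeholder random vectors stand in for the trained embeddings, and the normalization details used in the paper are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder 200-d embeddings; in practice these come from the trained model.
vecs = {w: rng.normal(size=200) for w in ["thermoelectric", "Bi2Te3", "SiO2"]}

def score(composition):
    # Higher dot product ~ more likely to co-occur with "thermoelectric"
    return float(np.dot(vecs[composition], vecs["thermoelectric"]))

for c in sorted(["Bi2Te3", "SiO2"], key=score, reverse=True):
    print(c, round(score(c), 3))
```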
36. Try “going back in time” and ranking materials, then follow what happens in later years
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).

37. A more comprehensive “back in time” test
– For every year since 2001, see which compounds we would have predicted using only literature data up to that point in time
– Make predictions of the most promising thermoelectrics using data up to that year
– See if those materials were actually studied as thermoelectrics in subsequent years
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
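In code, each step of the test amounts to retraining embeddings on a year-limited corpus and re-ranking the candidates. A hedged sketch (data structures such as `sentences_by_year` are hypothetical):

```python
from gensim.models import Word2Vec

def rank_as_of(sentences_by_year, candidates, cutoff, top_k=50):
    """Train embeddings only on abstracts published up to `cutoff`, then
    rank candidate formulas by similarity to 'thermoelectric'."""
    corpus = [sent for year, sents in sentences_by_year.items()
              if year <= cutoff for sent in sents]
    model = Word2Vec(corpus, vector_size=200, window=8, min_count=5, sg=1)
    known = [c for c in candidates if c in model.wv]
    return sorted(known,
                  key=lambda c: model.wv.similarity(c, "thermoelectric"),
                  reverse=True)[:top_k]

# For each cutoff year, one would then check whether the top-ranked compounds
# appear in thermoelectric studies published after that year.
```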
38. We also published a list of potential new thermoelectrics
It is one thing to retroactively test, but perhaps another to see how things go after publication
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).

39. Two were studied between submission and publication of the manuscript
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).

40. More were studied since then (mainly computationally)
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).

41. More were studied since then (mainly computationally)
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).

42. More were studied since then (mainly computationally)
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019). https://arxiv.org/abs/2010.08461

43. Our collaborators also synthesized a prediction, finding a moderate zT of 0.14
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).

44. How is this working?
“Context words” link together information from different sources

45. Outline
① Natural language processing – where are we right now?
② What’s next for the NLP work?

46. 1. Automatic creation of structured materials databases from the literature, e.g. a doping database

Sentence | Base material | Dopant | Doping concentration
“…the influence of yttrium doping (0-10 mol%) on BSCF…” | BSCF | Yttrium | 0-10 mol%
“undoped, anion-doped (Sb, Bi) and cation-doped (Ca, Zn) solid solutions of Mg10Si2Sn3…” | Mg10Si2Sn3 | Sb, Bi, Ca, Zn | –
“The zT of As2Cd3 with electron doping is found to be ~ with n=10^20 cm-3” | As2Cd3 | electron | n=10^20 cm-3
“This leads to zT=0.5 obtained at 500 K (p=10^20 cm-3) in p-type As2Cd3” | As2Cd3 | p-type | p=10^20 cm-3
“The undoped and 0.25 wt% La doped CdO films show 111… however … for doping concentrations greater than 0.50 wt%.” | CdO | La | 0.25 wt%, >0.5 wt%

This will allow you to answer questions like “what are all the materials known to be doped with Eu3+?”
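Each row of such a database can be represented as a simple structured record. A sketch of one possible schema (field names are illustrative, not the project's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DopingRecord:
    """One doping relation extracted from a sentence (illustrative schema)."""
    sentence: str
    base_material: str
    dopant: str
    concentration: Optional[str] = None  # kept as text, since units vary widely

rec = DopingRecord(
    sentence="...the influence of yttrium doping (0-10 mol%) on BSCF...",
    base_material="BSCF",
    dopant="Y",
    concentration="0-10 mol%",
)
print(rec)
```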
47. 2. Learning representations of materials
● Mat2vec suggested that embeddings contain chemical information
● Can we make embeddings for arbitrary materials as material descriptors?
● i.e., word embeddings for materials not in the literature
● Descriptors could be used for direct classification for application (link prediction), or quantitative property prediction (regression features)

48. AnyMat2Vec expands word embeddings beyond materials seen explicitly by the algorithm

49. Initial results – predicting experimental band gap from composition (~3000 data points)
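As a sketch of how such a benchmark could be set up: material embeddings as features, measured gaps as targets, and a standard regressor (the random placeholder data below stands in for the real embeddings and measurements):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 200))      # one 200-d material embedding per compound (placeholder)
y = rng.uniform(0.0, 6.0, size=3000)  # experimental band gaps in eV (placeholder)

reg = RandomForestRegressor(n_estimators=200, random_state=0)
scores = cross_val_score(reg, X, y, scoring="neg_mean_absolute_error", cv=5)
print("MAE (eV):", -scores.mean())
```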
50. 3. Creating a comprehensive software library for materials science NLP research (multiple LBNL research groups)
https://github.com/lbnlp

51. 4. D2S2 – data-driven synthesis science (in progress, larger LBNL collaboration)
Can we combine natural language processing with theory and experiments to control synthesis?

52. … and also some fun things, like automatic title generation

Title auto-generated from abstract | Published title
Dynamics of molecular hydrogen confined in narrow nanopores | Restricted dynamics of molecular hydrogen confined in activated carbon nanopores
Microfluidic Generation of Polydisperse Solid Foams | Generation of Solid Foams with Controlled Polydispersity Using Microfluidics
Minimum variance unbiased estimator of product performance | Assessing the lifetime performance index of gamma lifetime products in the manufacturing industry
Angle resolved ultraviolet photoemission study of fluorescein films on Ag 110 | The growth of thin fluorescein films on Ag 110

We also have results on suggesting journals to submit a new article to, etc.

53. The Matscholar team
Kristin Persson, Anubhav Jain, Gerbrand Ceder, John Dagdelen, Leigh Weston (now at Medium), Vahe Tshitoyan (now at Google), Amalie Trewartha, Alex Dunn, Viktoriia Baibakova
Funding from (sponsor logos in original slide)
Slides (already) posted to hackingmaterials.lbl.gov
