Discovering advanced materials for energy
applications by mining the scientific literature
Anubhav Jain

Presentation given at Air Force Research Lab, Jan 2020


  1. Discovering advanced materials for energy applications by mining the scientific literature
     Anubhav Jain, Energy Technologies Area, Lawrence Berkeley National Laboratory, Berkeley, CA
     AFRL meeting, Jan 2020
     Slides (already) posted to hackingmaterials.lbl.gov
  2. Typically, both new materials discovery and optimization take decades
     • Often, materials are known for several decades before their functional applications are known
       – MgB2 sat on lab shelves for 50 years before its identification as a superconductor in 2001
       – LiFePO4 has been known since 1938, but was only identified as a Li-ion battery cathode in 1997
     • Even after discovery, optimization and commercialization still take decades
     • To get a sense for why this is so hard, let’s look at the problem in more detail …
  3. What constrains traditional approaches to materials design?
     “[The Chevrel] discovery resulted from a lot of unsuccessful experiments of Mg ions insertion into well-known hosts for Li+ ions insertion, as well as from the thorough literature analysis concerning the possibility of divalent ions intercalation into inorganic materials.”
     - Aurbach group, on discovery of Chevrel cathode for multivalent (e.g., Mg2+) batteries
     Levi, Levi, Chasid, Aurbach J. Electroceramics (2009)
  4. Researchers are starting to fundamentally re-think how we invent the materials that make up our devices
     Next-generation materials design:
     • Computer-aided materials design
     • Natural language processing
     • “Self-driving laboratories”
  5. Outline
     ① Natural language processing - where are we right now?
     ② What’s next for the NLP work?
  6. Can ML help us work through our backlog of information we need to assimilate from text sources?
     [Figure labels: papers to read “someday”; NLP algorithms]
  7. Traditional search doesn’t answer the questions we want
     • It is difficult to look up all information on any given material due to the many different ways chemical compositions are written
       – a search for “TiNiSn” will give different results than “NiTiSn”
       – a search for “GaSb” won’t match text that reads “Ga0.5Sb0.5”
       – a search for “SnBi4Te7” won’t match text that reads “we studied SnBi4X7 (X=S, Se, Te)”
       – a search for “AgCrSe2”, if it doesn’t have any hits, won’t suggest “CuCrSe2” as a similar result
     • It is difficult to compile summaries, e.g.:
       – A list of all materials studied for an application
       – A list of all synthesis methods for a material
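The first two search failures listed on slide 7 (element order and fractional versus integer amounts) can be illustrated with a small normalization routine. This is a minimal sketch using only the Python standard library, not the method Matscholar actually uses; the function name `normalize_formula` is hypothetical, and the sketch deliberately does not handle parentheses or placeholder notation such as “SnBi4X7 (X=S, Se, Te)”.

```python
import re
from fractions import Fraction
from functools import reduce
from math import gcd

# An element symbol followed by an optional amount, e.g. "Ga0.5" or "Te7".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def normalize_formula(formula: str) -> str:
    """Return a canonical key: elements sorted alphabetically, amounts
    reduced to the smallest whole-number ratio (hypothetical helper)."""
    amounts = {}
    for symbol, amount in _TOKEN.findall(formula):
        value = Fraction(amount) if amount else Fraction(1)
        amounts[symbol] = amounts.get(symbol, Fraction(0)) + value
    if not amounts:
        raise ValueError(f"no element symbols found in {formula!r}")

    # Scale all amounts by the least common multiple of their denominators.
    denominators = [v.denominator for v in amounts.values()]
    scale = reduce(lambda a, b: a * b // gcd(a, b), denominators, 1)
    counts = {s: int(v * scale) for s, v in amounts.items()}

    # Divide out the greatest common divisor so "Ga2Sb2" and "GaSb" agree.
    divisor = reduce(gcd, counts.values())
    parts = []
    for symbol in sorted(counts):
        n = counts[symbol] // divisor
        parts.append(symbol if n == 1 else f"{symbol}{n}")
    return "".join(parts)


print(normalize_formula("TiNiSn"), normalize_formula("NiTiSn"))
print(normalize_formula("Ga0.5Sb0.5"), normalize_formula("GaSb"))
```

Indexing documents by such a key would let “TiNiSn” and “NiTiSn” retrieve the same records. The remaining cases on the slide (placeholder elements, suggesting chemically similar compositions) need more than string normalization.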
  8. What is Matscholar?
     • Matscholar is an attempt to organize the world’s information on materials science, connecting together topics of study, synthesis and characterization methods, and specific materials compositions
     • It is also an effort to use state-of-the-art natural language processing to make collective use of the information in millions of articles
  9. One of our main projects concerns named entity recognition, or automatically labeling text
  10. • > 4 million papers collected
      • 31 million properties
      • 19 million materials mentions
      • 8.8 million characterization methods
      • 7.5 million applications
      • 5 million synthesis methods
      • Data collection: over 4 million full papers* collected from more than 2100 journals
      * Entities only extracted from abstracts deemed relevant to inorganic materials science (~2M) so far.
  11. Now we can search! Live on www.matscholar.com
  12. Another example …
  13. Ok, so how does this work? High-level view
      Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  14. Step 1 – data collection
      • Extracted 4 million abstracts of relevant scientific articles using various APIs from journal publishers
      • Some are more difficult than others to obtain. Abstract collection continues …
  15. Ok, so how does this work? High-level view
      Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  16. Step 2 - tokenization
      • First split the text into sentences
        – Seems simple, but remember edge cases: the period in “et al.” or “etc.” does not necessarily signify the end of a sentence
      • Then split the sentences into words
        – Tricky things are detecting and normalizing chemical formulas, selective lowercasing (“Battery” vs “battery” or “BaS” vs “BAs”), homogenizing numbers, etc.
      • Done largely with ChemDataExtractor* with some custom improvements
        – We may move to a fully custom tokenizer soon
      * http://chemdataextractor.org
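The two tokenization pitfalls on slide 16 can be made concrete with a short standard-library sketch. It does not use or reproduce the ChemDataExtractor API; the abbreviation list, the function names `split_sentences` and `normalize_token`, and the lowercasing rule are all assumptions chosen for illustration.

```python
import re

# Hypothetical, deliberately short list of abbreviations whose trailing
# period should not be treated as a sentence boundary.
ABBREVIATIONS = ("et al.", "etc.", "e.g.", "i.e.", "Fig.")


def split_sentences(text):
    """Split on a period followed by whitespace and a capital letter,
    unless the text before the boundary ends with a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+(?=[A-Z])", text):
        candidate = text[start:match.start() + 1]  # include the period
        if candidate.endswith(ABBREVIATIONS):
            continue  # e.g. "... et al. Li3Sb ..." is not a boundary
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def normalize_token(token):
    """Lowercase ordinary capitalized words but leave formula-like tokens
    untouched, so "Battery" becomes "battery" while "BaS" and "BAs"
    stay distinct."""
    looks_like_word = (
        len(token) > 2
        and token.isalpha()
        and token[0].isupper()
        and token[1:].islower()
    )
    return token.lower() if looks_like_word else token


text = "Following Tshitoyan et al. Li3Sb was ranked highly. It was later studied."
print(split_sentences(text))
print([normalize_token(t) for t in ["Battery", "BaS", "BAs"]])
```

The heuristic is intentionally conservative: it never splits after a listed abbreviation, so it misses the cases where “et al.” really does end a sentence. That ambiguity is part of why the slide calls tokenization less simple than it seems.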
  17. Ok, so how does this work? High-level view
      Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  18. Step 3 – hand label abstracts
      • Part A is marking abstracts as relevant / non-relevant to inorganic materials science
      • Part B is tediously labeling ~600 abstracts
        – Largely done by one person
        – Spot-check of 25 abstracts by a second person gave 87.4% agreement
  19. Ok, so how does this work? High-level view
      Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  20. Step 4a: the word2vec algorithm is used to “featurize” words
      • We use the word2vec algorithm (Google) to turn each unique word in our corpus into a 200-dimensional vector
      • These vectors encode the meaning of each word, learned by trying to predict the context words around the target word
      Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai. 2017
  21. Step 4a: the word2vec algorithm is used to “featurize” words
      • We use the word2vec algorithm (Google) to turn each unique word in our corpus into a 200-dimensional vector
      • These vectors encode the meaning of each word, learned by trying to predict the context words around the target word
      Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai. 2017
      “You shall know a word by the company it keeps” - John Rupert Firth (1957)
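The phrase “predict context words around the target” refers to the skip-gram training setup named in the cited Barazza article. As a minimal sketch of that idea, the function below only generates the (target, context) training pairs and does not train any vectors; the function name `skipgram_pairs`, the example sentence, and the window size are assumptions, not values from the talk.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and each neighbor
    within `window` positions on either side."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


tokens = ["the", "band", "gap", "of", "GaSb"]
for target, context in skipgram_pairs(tokens, window=1):
    print(target, "->", context)
```

A skip-gram model is then trained so that each target word’s vector is good at predicting its context words, which is how words used in similar contexts end up with similar vectors.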
  22. Word embeddings trained on “normal” text learn relationships between words
      • The classic example is:
        – “king” - “man” + “woman” = ? → “queen”
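The analogy arithmetic can be sketched with hand-made vectors. The two-dimensional values below are invented purely for illustration (real word2vec vectors are learned, and the talk uses 200 dimensions); the helper names `cosine` and `analogy` are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(vectors, a, b, c):
    """Return the word whose vector is closest to vectors[a] - vectors[b]
    + vectors[c], excluding the three input words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors, made up for this example: axis 0 ~ "royalty", axis 1 ~ "gender".
TOY_VECTORS = {
    "king": (1.0, 1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "queen": (1.0, -1.0),
    "prince": (0.9, 0.8),
}

print(analogy(TOY_VECTORS, "king", "man", "woman"))  # prints "queen"
```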
  23. Ok, so how does this work? High-level view
      Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  24. Step 4b: How do we train a model to recognize context?
      • If you read this sentence:
        “The band gap of ___ is 4.5 eV”
        it is clear that the blank should be filled in with a material word (not a synthesis method, characterization method, etc.)
      • How do we get a neural network to take into account context (as well as properties of the word itself)?
  25. Step 4b. An LSTM neural net classifies words by reading word sequences
      Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  26. Ok, so how does this work? High-level view
      Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  27. Step 5. Sit back and let the model label things for you!
      Named Entity Recognition
      • Custom machine learning models to extract the most valuable materials-related information.
      • Utilizes a long short-term memory (LSTM) network trained on ~1000 hand-annotated abstracts.
      • f1 scores of ~0.9. f1 score for inorganic materials extraction is >0.9.
      Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
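For readers unfamiliar with the f1 score quoted on slide 27, the sketch below computes precision, recall, and f1 by exact match between hand-labeled and predicted entities. The talk does not state how matches were counted, so the exact-match rule, the `(start, end, label)` span format, and the label names `MAT`, `PRO`, `APL` are all assumptions for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Compare two collections of (start, end, label) entity spans.
    A prediction counts as correct only if it matches a gold span exactly."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up spans: token offsets plus a hypothetical entity label.
gold = [(0, 1, "MAT"), (3, 4, "PRO"), (6, 7, "APL")]
predicted = [(0, 1, "MAT"), (3, 4, "PRO"), (8, 9, "MAT")]

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because f1 is the harmonic mean of precision and recall, a score of ~0.9 requires both to be high: a model cannot reach it by labeling everything or by labeling almost nothing.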
  28. Live online …
  29. Could these techniques also be used to predict which materials we might want to screen for an application?
      [Figure labels: papers to read “someday”; NLP algorithms]
  30. Remember that word embeddings seem to learn relationships in text
      • The classic example is:
        – “king” - “man” + “woman” = ? → “queen”
  31. For scientific text, it learns scientific concepts as well
      [Figure label: crystal structures of the elements]
      Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  32. There seems to be materials knowledge encoded in the word vectors
      Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  33. Note that more data is not always better! We want relevance
      Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  34. Word embeddings also have the periodic table encoded in them with no prior knowledge
      [Figure panels: “word embedding” periodic table; periodic table]
  35. Making predictions: dot products measure likelihood for words to co-occur
      • Dot product of a composition word with the word “thermoelectric” essentially predicts how likely that word is to appear in an abstract with the word thermoelectric
      • Compositions with high dot products are typically known thermoelectrics
      • Sometimes, compositions have a high dot product with “thermoelectric” but have never been studied as a thermoelectric
      • These compositions usually have high computed power factors! (DFT+BoltzTraP)
      Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
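The ranking idea on slide 35 can be sketched as follows: score every composition by its dot product with the vector for “thermoelectric”, then drop compositions already studied for that application. The embedding values and the placeholder names `known_A`, `unstudied_B`, `unrelated_C` are invented for illustration and are not data from the talk or the paper.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_dot_product(embeddings, query, candidates):
    """Return (name, score) pairs sorted from highest to lowest dot product
    with the query word's vector."""
    query_vector = embeddings[query]
    scored = [(name, dot(embeddings[name], query_vector)) for name in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy 2-D embeddings; real vectors in the talk are 200-dimensional.
TOY_EMBEDDINGS = {
    "thermoelectric": (1.0, 0.2),
    "known_A": (0.9, 0.1),
    "unstudied_B": (0.7, 0.3),
    "unrelated_C": (-0.2, 0.9),
}

ranking = rank_by_dot_product(
    TOY_EMBEDDINGS, "thermoelectric", ["unrelated_C", "unstudied_B", "known_A"]
)

# Hypothetical record of which compositions already co-occur with the query.
already_studied = {"known_A"}
suggestions = [name for name, _ in ranking if name not in already_studied]
print(ranking)
print(suggestions)
```

The first entry of `suggestions` plays the role of the slide’s third bullet: a composition that scores highly but has not yet been studied as a thermoelectric.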
  36. Try “going back in time” and ranking materials, and follow what happens in later years
      Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  37. A more comprehensive “back in time” test
      – For every year since 2001, see which compounds we would have predicted using only literature data until that point in time
      – Make predictions of what materials are the most promising thermoelectrics for data until that year
      – See if those materials were actually studied as thermoelectrics in subsequent years
      Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
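The scoring step of the “back in time” test on slide 37 can be sketched as a simple hit-rate calculation. This assumes the per-year rankings have already been produced by a model trained only on literature up to the cutoff year; the function name `later_studied_fraction`, the material names, and the years are all made up for illustration.

```python
def later_studied_fraction(ranked_predictions, first_reported_year, cutoff_year, top_k):
    """Fraction of the top_k predictions made with data up to cutoff_year
    that were first reported for the application in a later year."""
    top = ranked_predictions[:top_k]
    if not top:
        return 0.0
    hits = [
        material
        for material in top
        if material in first_reported_year
        and first_reported_year[material] > cutoff_year
    ]
    return len(hits) / len(top)


# Toy inputs: a ranking produced with data up to 2003, and the (made-up)
# year each material was first reported for the target application.
ranked_predictions = ["mat_A", "mat_B", "mat_C", "mat_D"]
first_reported_year = {"mat_A": 2005, "mat_C": 2010}

print(later_studied_fraction(ranked_predictions, first_reported_year, 2003, top_k=4))
```

Repeating this for every cutoff year and comparing against the hit rate of randomly chosen candidates would indicate whether the ranking carries real predictive signal.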
  38. How about “forward” predictions?
      • Thus far, 2 of our top 20 predictions made in ~August 2018 have already been reported in the literature for the first time as thermoelectrics
        – Li3Sb was the subject of a computational study (predicted zT=2.42) in Oct 2018 [1]
        – SnTe2 was experimentally found to be a moderately good thermoelectric (expt zT=0.71) in Dec 2018 [2]
      • We are working with an experimentalist on one of the predictions (but “spare time” project)
      [1] Yang et al. "Low lattice thermal conductivity and excellent thermoelectric behavior in Li3Sb and Li3Bi." Journal of Physics: Condensed Matter 30.42 (2018): 425401
      [2] Wang et al. "Ultralow lattice thermal conductivity and electronic properties of monolayer 1T phase semimetal SiTe2 and SnTe2." Physica E: Low-dimensional Systems and Nanostructures 108 (2019): 53-59
  39. How is this working?
      “Context words” link together information from different sources
  40. Outline
      ① Natural language processing - where are we right now?
      ② What’s next for the NLP work?
  41. Making predictions for entirely new compositions
      • Currently, we only have word vectors for compositions that explicitly appear in abstracts
      • We can rank known materials for an application, but for materials with zero or little mention in the scientific literature, we are stuck!
      • How do we get word embeddings for compositions that do not exist in the text?
  42. “Hidden representation learning”
  43. Initial results – predicting experimental band gap from composition (~3000 data points)
  44. Going beyond entity recognition towards relationship extraction
  45. Current approach is not good enough
  46. Once the accuracy improves, we can start to make much more powerful searches
      • E.g., automatically generate databases from the literature
        – Materials and their numerical band gaps (or thermal conductivities, or bulk modulus, or superconducting temperature, etc.)
        – If materials can be made n-type, p-type, or both
        – Which synthesis techniques led to various sample descriptors
      • Will likely require more powerful techniques, e.g., attention-based algorithms (BERT, Google XLNet …)
        – To be investigated …
  47. D2S2 - data driven synthesis science (just starting)
      Can we combine natural language processing with theory and experiments to control synthesis?
  48. ... and also some fun things, like automatic title generation
      • Title auto-generated from abstract: Dynamics of molecular hydrogen confined in narrow nanopores
        Published title: Restricted dynamics of molecular hydrogen confined in activated carbon nanopores
      • Title auto-generated from abstract: Microfluidic Generation of Polydisperse Solid Foams
        Published title: Generation of Solid Foams with Controlled Polydispersity Using Microfluidics
      • Title auto-generated from abstract: Minimum variance unbiased estimator of product performance
        Published title: Assessing the lifetime performance index of gamma lifetime products in the manufacturing industry
      • Title auto-generated from abstract: Angle resolved ultraviolet photoemission study of fluorescein films on Ag 110
        Published title: The growth of thin fluorescein films on Ag 110
  49. Acknowledgements
      Slides (already) posted to hackingmaterials.lbl.gov
      • High-throughput DFT
        – Gerbrand Ceder and “BURP” team
        – Funding: Bosch / Umicore
      • Natural language processing
        – Gerbrand Ceder, Kristin Persson, and “Matscholar” team
        – Funding: Toyota Research Institutes
      • Overall work funded by US Department of Energy
  50. The Matscholar team
      Kristin Persson, Anubhav Jain, Gerbrand Ceder, John Dagdelen, Leigh Weston, Vahe Tshitoyan, Amalie Trewartha, Alex Dunn, Viktoriia Baibakova
      Funding from (now at Google) (now at Medium)
