Reviews recent results from the Materials Data Facility. Thanks in particular to Ben Blaiszik, Jonathon Goff, and Logan Ward, and the Globus data search team. Some features shown here are still in beta. We are grateful for NIST for their support.
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Materials Data Facility: Streamlined and automated data sharing, discovery, access, and analysis
1. Ian Foster (foster@uchicago.edu)1,2,
Ben Blaiszik1,2 (blaiszik@uchicago.edu),
Jonathon Gaff1, Logan Ward1, Kyle Chard1, Jim Pruyne1,
Rachana Ananthakrishnan1, Steven Tuecke1
Michael Ondrejcek3, Kenton McHenry3, John Towns3
University of Chicago1, Argonne National Laboratory2, University of Illinois at Urbana-Champaign3
materialsdatafacility.org
globus.org
Materials Data Facility
Streamlined and automated data sharing,
discovery, access, and analysis
2. Team (unordered)
2
UC/Argonne
Ian Foster (PI) Ben Blaiszik Steve Tuecke
Kyle ChardJim Pruyne
Logan Ward Jonathon Gaff
Illinois (Urbana-Champaign)
Rachana
Ananthakrishnan
John Towns
(PI)
Kenton McHenry
Michal Ondrejcek
Roselyne Tchoua
Stephen Rosen
3. NIST and CHiMaD collaborators
3
• Carelyn Campbell (MDF – MML Repository)
• Chandler Becker (NIST Materials Resource Schema)
• Zach Trautt (NIST, WholeTale Materials Science Working Group)
• Trautt, Lucas Hale, Becker (MDF - NIST Interatomic Potentials
Repository)
• Alden Dima, Sharief Youssef (NIST Materials Resource Registry)
• Juan de Pablo, Deborah Aldus (PPPDB)
• Laura Bartolo (IMaD, Schema development, Polymer Group)
• Mark Hersam (data publication early adopter)
• Peter Voorhees (IMaD, data publication early adopter)
• Ankit Agrawal (Machine learning and data mining using MDF)
• Cate Brinson, Richard Zhao (MDF – NanoMine)
• Mike Bedzyk (Use case data publication of Co data - soon)
• Ken Kroenlein, Boris Wilthan (NIST TRC Alloy Database - soon)
4. To accelerate data-driven discovery,
streamline &
automate
4
Access
Integrate
Analyze
Share
Discover
MRS-TMS big data survey, responses to
question: 'Do you consider the following
items to be impediments or motivation
for you to share your data?’
[Ward, Warren, Hanisch, in Integrating Materials
and Manufacturing Innovation, 3:22, 2014.]
5. • Simplify data publication, regardless of size,
type, and location
• Automate data and metadata ingest, to enable
capture of many valuable materials datasets
• Enable unified search of disparate materials
data sources
• Deploy APIs to foster community development,
data creation, and data consumption
Streamline and automate: Four key steps
6. Publication REST APIs Discovery
• Identify datasets with
persistent identifiers (e.g.,
DOI or Handle)
• Describe datasets with
appropriate metadata and
provenance
• Verify dataset contents over
time
• Handle big (and small) data:
We have already ingested
datasets with > 1.5 M files
and > 1.5 TB in size
• Search, query, and access
datasets in modern ways
• Automatically index flexible
metadata and harvest file
contents
• Provide simple user
interfaces (c.f., Google and
Amazon)
Materialsdatafacility.
org
PPPDB
• Extract key polymer properties from literature via
natural language processing and crowdsourcing
• Build interfaces to explore curated χ and other property values
• Includes 375 polymer-polymer values and 1,014 polymer-solvent
values
Streamline and automate
7. MDF data services and tools
7
EP
EP
EP
EP
Automatic sync
EP
Deep indexing
Query
Browse
Aggregate
Publish
Databases
Datasets
APIs
etc.
Distributed
data
publication
Data
publication
service
Data
discovery
service
8. Streamline & automate data publication
7.2 TB
5.3 TB out
Data
Volumes
Utilization
Authors
94
Institutions
14
Accesses
>1000
Total
datasets
31
CHiMaD
datasets
15
Pipeline CHiMaD
datasets
+14
Total
datasets
+30
9. Published data highlights
9
~ 30 datasets
~ 6.5 TB
X-ray scattering image classification
using deep learning
Electron backscattering and
diffraction datasets for Ni, Mg, Fe, Si
Yager et al. http://dx.doi.org/10.18126/M2Z30Z
Marc De Graef et al.
http://dx.doi.org/doi:10.18126/M2D593 and others
Phase field benchmark I dataset
Jokisaari et al. http://dx.doi.org/doi:10.18126/M2101X
Grain structure,
grain-averaged
lattice strains, and
macro-scale strain
data for
superelastic
Nickel-Titanium
shape memory
alloy polycrystal
loaded in tension
Paranjape et al. http://dx.doi.org/doi:10.18126/M2NK5W
Largest dataset to date (>1.5 TB). Makes a unique
dataset discoverable for code development, analysis,
and benchmarking
10. User perspective
1 Setup endpoint at data origin
and create config (one time)
2 Collect data
3 Describe data
4 Publish! (i.e., invoke script)
Automate data publication
New Capability
Publish dataset2
Configure1
11. User perspective
Describe data
4
1 Setup endpoint at data origin
and create config (one time)
2 Collect data
3
Publish! (i.e., invoke script)
Behind the scenes
• Publication created in MDF Data Publication service
• Directory created on publication endpoint
• ACLs set on endpoint as defined by collection
• Submitted metadata added to the database and verified against
schema
• Transfer automated between origin and publication endpoint
• (optional) Curation flow started
• DOI minted
• Metadata registered in search metadata pushed to MRR
New Capability
Automate data publication
12. Enable inline viewing of data
• Continue to support big data solutions
• Leverage Globus HTTPS endpoint to simplify access to smaller files
• Checks access control lists
Preview in browser
MDF Dataset:
K. Yager, J. Lhermitte, D. Yu, B. Wang, Z. Guan, "Dataset of Synthetic X-ray Scattering Images for Classification Using
Deep Learning," 2017, http://dx.doi.org/doi:10.18126/M2Z30Z
New Capability
13. MDF data ingest and indexing
Data
index Data sources
indexed
16
Records
~1M
Metadata
index
Repositories harvested
• MDF
• NIST MML Repo
• MATIN
• Materials
Commons
• CXIDB
• NDS Materials
Resource
Registry
6
Deeply
indexed
10
~200 Datasets
>2.75 PB Identified for future indexing
~260 TBMade
discoverable
15. Indexed data highlights
15
~ 30 datasets
~ 6.5 TB
Harvard Organic Photovoltaic DB
• Mix of experimental and simulation data;
should combine nicely with other OPV datasets
• Data was available on Figshare as an oddly
formatted text file
• Now available via simple API queries in MDF
NanoMine (Brinson et al.) Ab initio diffusion database
Ingesting full metadata and also investigating
how to refine the metadata into a more easily
queryable form while preserving the rigor of
the current schema
• Nature Datasets paper had only summary values
available (~o(10kb)); MDF has entire dataset and
calculation (o(100 GB)).
• Most popular MDF dataset to date
• Building closer ties to this group through IMaD
• Indexed metadata and links to the various
potential files.
• Data are small and easy to access
• MDF index provides another avenue to
discover this data and to automate access
NIST
16. • Software as a service
• REST APIs & clients
• Custom faceting &
query boosting
• Partial matches,
range queries,
fuzzy matches
• Index data collections
(many topics)
• Deep indexing of specific datasets (e.g., atomistic modeling, image
metadata, etc.)
• Investigating multi-scale search (e.g., find high level resources,
datasets, dataset records, or even file contents) 16
Unified data search
Custom boosting
Facets
17. Data search with access control
Multiple sets of access-controlled metadata may be
attached to a subject
Subject
Metadata3
ACL: [User2]
User1:
Sees union of
Metadata1, Metadata2
User2:
Sees union of
Metadata2, Metadata3
Metadata2
ACL:
[public]
Metadata1
ACL: [User1]
Users View:
19. 19
MATIN (GT)
~ 10 datasets
Used in
education
In this example: NanoMine, PPPDB, MML Repository (NIST), MATIN, Materials Commons
(publications), CXIDB, linking with MRR…
Problem: Many materials data repositories, databases, and datasets
exist but interfaces and access to them are fragmented
Data
Discovery
Service
NIST MML
Repository
NanoMine
query
Materials CommonsPPPDB
harvest
CXIDB MATIN
Contribute a harvester (coming soon)
https://github.com/materials-data-facility/
Unified search across repositories, datasets, and DBs
20. In this example: NanoMine, PPPDB, MML Repository (NIST), MATIN, Materials Commons
(publications), CXIDB, linking with MRR…
Problem: Many materials data repositories, databases, and datasets
exist but interfaces and access to them are fragmented
Results from
PPPDB
Results from MML Repository and MDF
Unified search across repositories, datasets, and DBs
21. Finding and using interatomic potentials
21
~ 30 datasets
~ 6.5 TB
Problem: Materials data exist in disparate locations, but finding and
importing them into analysis scripts is time consuming
In this example: NIST Interatomic Potentials Repository, NIST Classic Potentials
MATIN (GT)
~ 10 datasets
Used in
education
Data
Discovery
Service
NIST Interatomic
Potentials Repository
NIST Classical Force
Fields
Harvest
…
Query
Use
22. Finding and using interatomic potentials
Problem: Materials data exist in disparate locations, but finding and
importing them into analysis scripts is tricky
22
Search for potentials in NIST Interatomic Potentials Repo
Format the discovered potentials
Load potentials into ASE and plot
In this example: NIST Interatomic Potentials Repository, NIST Classic Potentials
23. Integrating analytics tools with MDF
23
MATIN (GT)
~ 10 datasets
Used in
education
Result: Scientists connected with data, analytics tools,
and compute capability
MDF Data
Publication
MATIN (GT)
MML
Repository
(NIST)
Materials
Commons
(UM PRISMS)
Coherent X-Ray
Tomography
Database (LNL)
To End UsersTo End UsersTo Compute ResourcesFrom Data Repositories
Jetstream is a self-provisioned, scalable science and engineering cloud environment
operated by Indiana University for the National Science Foundation: jetstream-cloud.org
24. Building a machine learning model using MDF
A simple web service to simplify AGNI model building
25. Example: Building force-field potentials from different datasets
Data resources: 3 DFT datasets with Al data
1 dataset from khazana.uconn.edu, 2 datasets from materialsdata.nist.gov
Result: Improved performance by integrating data sources
Method: Botu et al. JPCC. (2017)
25
Only1Dataset1DatasetBuilding a machine learning model using MDF
26. Example: Building force-field potentials from different datasets
Data resources: 3 DFT datasets with Al data
1 dataset from khazana.uconn.edu, 2 datasets from materialsdata.nist.gov
Result: Improved performance by integrating data sources
Method: Botu et al. JPCC. (2017)
26
Improved fit on all three datasets by combining data
Only1Dataset2DatasetsBuilding a machine learning model using MDF
28. χDB: Database of Flory-Huggins
parameters
• Semi-automated information extraction
Crawl journals for potentially relevant publications
Pre-process the downloaded data
• Crowdsourcing
Student and expert
review/curate the
publications
29. PPPDB (χ)
• Search and browse through extracted and curated χ values at
pppdb.uchicago.edu
• Includes link to
• original publication
• Includes polymer-
solvent values
• Largest online
collection
of χ values!
• 375 polymer-polymer
and 1,014 polymer-
solvent values
• Built dictionary of polymers, solvents and other compounds
includin InChI keys & PubChem Ids…
30. Automatic extraction
of properties
The four-armed compd. (ANTH-OXA6t-OC12)
with the dodecyloxy surface group is a high
glass transition temp. (Tg: 211°) material and
exhibits good soly.
ANTH-OXA6t-OC12 211°
has_tg_of
• Goal
Automatically extract
properties from text
• Method
Use the ChemDataExtractor
(from Cambridge) with
additional rule to
automatically extract Glass
Transition Temperatures
ChemDataExtractor uses
NLP, rule-based approache
and machine learning
31. PPPDB (Tg)
Search and browse through
extracted Tg values at
pppdb.uchicago.edu
• Includes link to original
publication
• Includes option to Flag
ambiguous results (e.g.:
author’s acronyms)
• Looking at more than
1000 more results
33. Community engagement highlights
• ElectroCat
• DOE Energy Materials Network
• Using MDF to make data available internally and externally (3Q FY17)
• Integrative Materials and Design
• A Midwest Big Data Spoke affiliated with the Hub at UIUC
• Linking Materials Commons, MAST, 4CeeD and others with MDF and Globus tools. Build
data pipelines from MDF to Citrine. Community building and outreach to industry (e.g., LIFT
consortium and DMDII)
• Center for Predictive Simulation of Functional Materials
• DOE BES Computational Materials Center
• Will use MDF to publish and make data discoverable internally and externally (4Q FY17)
• Co-sponsored DOE-lab hackathon on data discoverability
and authentication at UC
• ~25 attendees (domain scientists, information systems leads, developers, postdocs, etc.)
• Built prototype interfaces and shared data schemas to make data from user facilities at
Brookhaven NSLSII, Oak Ridge SNS, Argonne APS, and LBL ALS
• Work continues…expect to share in (4Q FY17)
34. Community engagement highlights
• Blaiszik and Chard chair WholeTale Working Group in
Materials Science
• Working with various materials science researchers (representing experiment and
computation) to build tools to support reuse of data artifacts
• Leveraging this effort to link our data services, encourage adoption of MDF capabilities,
and define new platform use cases
• Advanced Photon Source at Argonne National Lab
• Working with the APS software group to make their data portal contents discoverable
• Working with researchers at beamlines to make their datasets discoverable and published:
• HEDM (along with users from AFRL and elsewhere)
• XPCS
• Tomography
Elsevier, Inc.
Working to mine
10,000s of articles
to expand PPPDB
and thus MDF Working to communicate MDF
APIs via hands-on tutorials
Globus
35. Integrative Materials and Design (IMaD)
35
MAST
Materials Commons (MC)
T2C2 (4CeeD)
• Build connections to international materials
efforts and registries (e.g., NIMS, RDA, NIST,
EUDAT, NDS)
• Promote IMaD data services, tools, and
accomplishments to the community
• Develop video tutorials, webinars, and shared
code repositories
• Interface with the Materials Accelerator
Network (MAN)
• Engage with colleges, industry, and
consortiums
• (Wisconsin) Regional Materials and
Manufacturing Network (RM2N)
• (Illinois) Digital Manufacturing and
Design Innovation Institute DMDII
• (Michigan) LIFT consortium
Engagement
Linking Software and Services
PIs: I. Foster1,2, J. Allison3, D. Morgan4, D. Trinkle5, P. Voorhees6
1 University of Chicago 2 Argonne National Laboratory 3 University of Michigan 4 University of Wisconsin-Madison 5 University
of Illinois at Urbana-Champaign, 6 Northwestern University
Overview
• NSF Midwest Big Data Spoke
36. Materials Resource Registry
36
• MDF is running an MRR instance as of ~Feb 2017,
but it is not yet populated
• MDF instance will harvest from other MRR instances
(setting this up with Chandler in the coming 2-3
weeks)
• We are also setting up the interfaces to make the
data available in our discovery service (i.e., OAI-
PMH harvesting)
• Long term this is interesting for hierarchical search
Find data resources or dig deeper into them all from the
same location
• Build reverse pipelines to be good community
citizens
37. DOE User Facility Interactions
37
• Data and experimental records generated
ORNL (SNS and HFIR)
Brookhaven (NSLS-II) via DataBroker project
Argonne (APS, various sectors and existing
data portals) – ranging from fuel spray data to
catalysis experiments and nanofabrication data
(small pending NSF proposal in nano
Argonne/MIT/UC)
LBL (ALS) CAMERA project
38. Recap and future work
To streamline materials data sharing, discovery, access,
and analysis, we have:
Simplified data publication, enabling ingest of 7 TB in 31
datasets from 11 institutions and 94 authors
Automated data and metadata capture for deep indexing of
large data collections, reaching 1M records from 16 sources
Unified search across all of these sites and collections
Deployed APIs enabling programmatic access
• We have demonstrated automated end-to-end
discovery, access, and analysis pipelines
Next steps: More data, more indexing, richer search,
more analyses
39. Thanks to our sponsors!
39
U . S . D E P A R T M E N T O F
ENERGY
Hinweis der Redaktion
Simplify interfaces for data publication regardless of data size, type, and location
Provide automation capabilities to capture data from pipelines
Deploy APIs to foster community development and integration
Encourage data re-use
Incentivize data sharing and
Support Open Science in materials research
PPPDB = Polymer Property Predictor and Database
*"Enable inline viewing of data”* : Ben will have a browser open to a dataset (probably the same dataset, and show the navigation and opening of the preview) [very quick]
Unified search across repositories, datasets, and DBs”*: Ben will demo search showing how the facets work and how the free text querying works [30s
Important as it allows for publication and search of access-controlled data.
Simplify interfaces for data publication regardless of data size, type, and location
Provide automation capabilities to capture data from pipelines
Deploy APIs to foster community development and integration
Encourage data re-use
Incentivize data sharing and
Support Open Science in materials research
*”Finding and using interatomic potentials”*: Ben will show how to use search to find and pull results from NIST Interatomic Potential Repo into Jupyter to complete an analysis using ASE software package (30s
General notes on presenting this:
“We’re currently integrating data analytics tools into the MDF. The goal is to have a system that will automate collecting data from different repositories, sending the data to compute resources (e.g., Jetstream),
We’re currently integrating data analytics tools into the MDF. The goal is to have a system that will automate collecting data from different repositories, sending the data to compute resources (e.g., Jetstream), and collecting the key results to send to users personal computers – with the whole process mediated using Globus transfer and Data Search.”
“This system will give scientists ready access to a rich data ecosystem, state-of-the-art tools, and computing capability all under the same platform.”
There is also a demonstration you can give for this:
“For example, we have recently implemented a tool to build machine-learning-based forcefields using the MDF.”
[ go to http://agni.demo.materialsdatafacility.org:6543/, highlight a few datasets and click “Go”]
“This will start a task on Jetstream to train a potential using the AGNI method, developed at UConn, and then send a report about the training accuracy and the model itself to another system.”
We replicated a machine learning force field from a recent publication using data available from a database at Uconn
By combining that Uconn dataset with other open datasets from the NIST MML Repository, we found that we can improve the accuracy of this model.
Animation shows the effect on the accuracy of the model after adding in one of the two dataset
Adding these new datasets even increases the accuracy of the model on the original training data
It is also important to note that the NIST datasets were not created for training potentials.
But, because they were openly available, we could utilize them for a brand new application
We replicated a machine learning force field from a recent publication using data available from a database at Uconn
By combining that Uconn dataset with other open datasets from the NIST MML Repository, we found that we can improve the accuracy of this model.
Animation shows the effect on the accuracy of the model after adding in one of the two dataset
Adding these new datasets even increases the accuracy of the model on the original training data
It is also important to note that the NIST datasets were not created for training potentials.
But, because they were openly available, we could utilize them for a brand new application
Simplify interfaces for data publication regardless of data size, type, and location
Provide automation capabilities to capture data from pipelines
Deploy APIs to foster community development and integration
Encourage data re-use
Incentivize data sharing and
Support Open Science in materials research
Recap slide. We went over the results, we annotated them with some expert’s notes and had another round of reviews.
Results are public and have been reviewed by graduate students and our expert at NIST: Debbie Audus
We added TG using rule-based but will be looking at ML approaches for the properties as well
Preliminary results. Currently looking at improving the ~1000 results.
Simplify interfaces for data publication regardless of data size, type, and location
Provide automation capabilities to capture data from pipelines
Deploy APIs to foster community development and integration
Encourage data re-use
Incentivize data sharing and
Support Open Science in materials research
Key highlights are paying customer and users training users
Key highlights are paying customer and users training users