Materials Data Facility: Streamlined and automated data sharing, discovery, access, and analysis

Ian Foster (foster@uchicago.edu)1,2,
Ben Blaiszik1,2 (blaiszik@uchicago.edu),
Jonathon Gaff1, Logan Ward1, Kyle Chard1, Jim Pruyne1,
Rachana Ananthakrishnan1, Steven Tuecke1
Michael Ondrejcek3, Kenton McHenry3, John Towns3
University of Chicago1, Argonne National Laboratory2, University of Illinois at Urbana-Champaign3
materialsdatafacility.org
globus.org
Materials Data Facility
Streamlined and automated data sharing,
discovery, access, and analysis

Team (unordered)
2
UC/Argonne
Ian Foster (PI) Ben Blaiszik Steve Tuecke
Kyle ChardJim Pruyne
Logan Ward Jonathon Gaff
Illinois (Urbana-Champaign)
Rachana
Ananthakrishnan
John Towns
(PI)
Kenton McHenry
Michal Ondrejcek
Roselyne Tchoua
Stephen Rosen

NIST and CHiMaD collaborators
3
• Carelyn Campbell (MDF – MML Repository)
• Chandler Becker (NIST Materials Resource Schema)
• Zach Trautt (NIST, WholeTale Materials Science Working Group)
• Trautt, Lucas Hale, Becker (MDF - NIST Interatomic Potentials
Repository)
• Alden Dima, Sharief Youssef (NIST Materials Resource Registry)
• Juan de Pablo, Deborah Aldus (PPPDB)
• Laura Bartolo (IMaD, Schema development, Polymer Group)
• Mark Hersam (data publication early adopter)
• Peter Voorhees (IMaD, data publication early adopter)
• Ankit Agrawal (Machine learning and data mining using MDF)
• Cate Brinson, Richard Zhao (MDF – NanoMine)
• Mike Bedzyk (Use case data publication of Co data - soon)
• Ken Kroenlein, Boris Wilthan (NIST TRC Alloy Database - soon)

To accelerate data-driven discovery,
streamline &
automate
4
Access
Integrate
Analyze
Share
Discover
MRS-TMS big data survey, responses to
question: 'Do you consider the following
items to be impediments or motivation
for you to share your data?’
[Ward, Warren, Hanisch, in Integrating Materials
and Manufacturing Innovation, 3:22, 2014.]

• Simplify data publication, regardless of size,
type, and location
• Automate data and metadata ingest, to enable
capture of many valuable materials datasets
• Enable unified search of disparate materials
data sources
• Deploy APIs to foster community development,
data creation, and data consumption
Streamline and automate: Four key steps

Publication REST APIs Discovery
• Identify datasets with
persistent identifiers (e.g.,
DOI or Handle)
• Describe datasets with
appropriate metadata and
provenance
• Verify dataset contents over
time
• Handle big (and small) data:
We have already ingested
datasets with > 1.5 M files
and > 1.5 TB in size
• Search, query, and access
datasets in modern ways
• Automatically index flexible
metadata and harvest file
contents
• Provide simple user
interfaces (c.f., Google and
Amazon)
Materialsdatafacility.
org
PPPDB
• Extract key polymer properties from literature via
natural language processing and crowdsourcing
• Build interfaces to explore curated χ and other property values
• Includes 375 polymer-polymer values and 1,014 polymer-solvent
values
Streamline and automate

MDF data services and tools
7
EP
EP
EP
EP
Automatic sync
EP
Deep indexing
Query
Browse
Aggregate
Publish
Databases
Datasets
APIs
etc.
Distributed
data
publication
Data
publication
service
Data
discovery
service

Streamline & automate data publication
7.2 TB
5.3 TB out
Data
Volumes
Utilization
Authors
94
Institutions
14
Accesses
>1000
Total
datasets
31
CHiMaD
datasets
15
Pipeline CHiMaD
datasets
+14
Total
datasets
+30

Published data highlights
9
~ 30 datasets
~ 6.5 TB
X-ray scattering image classification
using deep learning
Electron backscattering and
diffraction datasets for Ni, Mg, Fe, Si
Yager et al. http://dx.doi.org/10.18126/M2Z30Z
Marc De Graef et al.
http://dx.doi.org/doi:10.18126/M2D593 and others
Phase field benchmark I dataset
Jokisaari et al. http://dx.doi.org/doi:10.18126/M2101X
Grain structure,
grain-averaged
lattice strains, and
macro-scale strain
data for
superelastic
Nickel-Titanium
shape memory
alloy polycrystal
loaded in tension
Paranjape et al. http://dx.doi.org/doi:10.18126/M2NK5W
Largest dataset to date (>1.5 TB). Makes a unique
dataset discoverable for code development, analysis,
and benchmarking

User perspective
1 Setup endpoint at data origin
and create config (one time)
2 Collect data
3 Describe data
4 Publish! (i.e., invoke script)
Automate data publication
New Capability
Publish dataset2
Configure1

User perspective
Describe data
4
1 Setup endpoint at data origin
and create config (one time)
2 Collect data
3
Publish! (i.e., invoke script)
Behind the scenes
• Publication created in MDF Data Publication service
• Directory created on publication endpoint
• ACLs set on endpoint as defined by collection
• Submitted metadata added to the database and verified against
schema
• Transfer automated between origin and publication endpoint
• (optional) Curation flow started
• DOI minted
• Metadata registered in search  metadata pushed to MRR
New Capability
Automate data publication

Enable inline viewing of data
• Continue to support big data solutions
• Leverage Globus HTTPS endpoint to simplify access to smaller files
• Checks access control lists
Preview in browser
MDF Dataset:
K. Yager, J. Lhermitte, D. Yu, B. Wang, Z. Guan, "Dataset of Synthetic X-ray Scattering Images for Classification Using
Deep Learning," 2017, http://dx.doi.org/doi:10.18126/M2Z30Z
New Capability

MDF data ingest and indexing
Data
index Data sources
indexed
16
Records
~1M
Metadata
index
Repositories harvested
• MDF
• NIST MML Repo
• MATIN
• Materials
Commons
• CXIDB
• NDS Materials
Resource
Registry
6
Deeply
indexed
10
~200 Datasets
>2.75 PB Identified for future indexing
~260 TBMade
discoverable

MDF data ingest and indexing
Datasets/Databases
• NanoMine (CHiMaD)
• PPPDB (CHiMaD)
• Khazana Polymers
• Khazana VASP
• Ab Initio Solute Solvent
Diffusion Dataset
• JANAF (NIST)
• Harvard Organic Photovoltaic
Database
• SLUCHI (VASP)
• Crystallography Open
Database
• Classical Interatomic Potentials
(NIST)
• Interatomic Potentials
Repository (NIST)
• XAFS.org
• Lytle XAFS Dataset
• OQMD (CHiMaD)
• NREL Organic Electronic DB
Repositories
• MDF (CHiMaD)
• MATIN
• Materials Commons
• CXIDB
• MML Repository (NIST)
Suggest a dataset to index:
http://bit.ly/3nktuby

Indexed data highlights
15
~ 30 datasets
~ 6.5 TB
Harvard Organic Photovoltaic DB
• Mix of experimental and simulation data;
should combine nicely with other OPV datasets
• Data was available on Figshare as an oddly
formatted text file
• Now available via simple API queries in MDF
NanoMine (Brinson et al.) Ab initio diffusion database
Ingesting full metadata and also investigating
how to refine the metadata into a more easily
queryable form while preserving the rigor of
the current schema
• Nature Datasets paper had only summary values
available (~o(10kb)); MDF has entire dataset and
calculation (o(100 GB)).
• Most popular MDF dataset to date
• Building closer ties to this group through IMaD
• Indexed metadata and links to the various
potential files.
• Data are small and easy to access
• MDF index provides another avenue to
discover this data and to automate access
NIST

• Software as a service
• REST APIs & clients
• Custom faceting &
query boosting
• Partial matches,
range queries,
fuzzy matches
• Index data collections
(many topics)
• Deep indexing of specific datasets (e.g., atomistic modeling, image
metadata, etc.)
• Investigating multi-scale search (e.g., find high level resources,
datasets, dataset records, or even file contents) 16
Unified data search
Custom boosting
Facets

Data search with access control
Multiple sets of access-controlled metadata may be
attached to a subject
Subject
Metadata3
ACL: [User2]
User1:
Sees union of
Metadata1, Metadata2
User2:
Sees union of
Metadata2, Metadata3
Metadata2
ACL:
[public]
Metadata1
ACL: [User1]
Users View:

19
MATIN (GT)
~ 10 datasets
Used in
education
In this example: NanoMine, PPPDB, MML Repository (NIST), MATIN, Materials Commons
(publications), CXIDB, linking with MRR…
Problem: Many materials data repositories, databases, and datasets
exist but interfaces and access to them are fragmented
Data
Discovery
Service
NIST MML
Repository
NanoMine
query
Materials CommonsPPPDB
harvest
CXIDB MATIN
Contribute a harvester (coming soon)
https://github.com/materials-data-facility/
Unified search across repositories, datasets, and DBs

In this example: NanoMine, PPPDB, MML Repository (NIST), MATIN, Materials Commons
(publications), CXIDB, linking with MRR…
Problem: Many materials data repositories, databases, and datasets
exist but interfaces and access to them are fragmented
Results from
PPPDB
Results from MML Repository and MDF
Unified search across repositories, datasets, and DBs

Finding and using interatomic potentials
21
~ 30 datasets
~ 6.5 TB
Problem: Materials data exist in disparate locations, but finding and
importing them into analysis scripts is time consuming
In this example: NIST Interatomic Potentials Repository, NIST Classic Potentials
MATIN (GT)
~ 10 datasets
Used in
education
Data
Discovery
Service
NIST Interatomic
Potentials Repository
NIST Classical Force
Fields
Harvest
…
Query
Use

Finding and using interatomic potentials
Problem: Materials data exist in disparate locations, but finding and
importing them into analysis scripts is tricky
22
Search for potentials in NIST Interatomic Potentials Repo
Format the discovered potentials
Load potentials into ASE and plot
In this example: NIST Interatomic Potentials Repository, NIST Classic Potentials

Integrating analytics tools with MDF
23
MATIN (GT)
~ 10 datasets
Used in
education
Result: Scientists connected with data, analytics tools,
and compute capability
MDF Data
Publication
MATIN (GT)
MML
Repository
(NIST)
Materials
Commons
(UM PRISMS)
Coherent X-Ray
Tomography
Database (LNL)
To End UsersTo End UsersTo Compute ResourcesFrom Data Repositories
Jetstream is a self-provisioned, scalable science and engineering cloud environment
operated by Indiana University for the National Science Foundation: jetstream-cloud.org

Building a machine learning model using MDF
A simple web service to simplify AGNI model building

Example: Building force-field potentials from different datasets
Data resources: 3 DFT datasets with Al data
1 dataset from khazana.uconn.edu, 2 datasets from materialsdata.nist.gov
Result: Improved performance by integrating data sources
Method: Botu et al. JPCC. (2017)
25
Only1Dataset1DatasetBuilding a machine learning model using MDF

Example: Building force-field potentials from different datasets
Data resources: 3 DFT datasets with Al data
1 dataset from khazana.uconn.edu, 2 datasets from materialsdata.nist.gov
Result: Improved performance by integrating data sources
Method: Botu et al. JPCC. (2017)
26
Improved fit on all three datasets by combining data
Only1Dataset2DatasetsBuilding a machine learning model using MDF

Building a
database of
polymer
properties

χDB: Database of Flory-Huggins
parameters
• Semi-automated information extraction
 Crawl journals for potentially relevant publications
 Pre-process the downloaded data
• Crowdsourcing
 Student and expert
review/curate the
publications

PPPDB (χ)
• Search and browse through extracted and curated χ values at
pppdb.uchicago.edu
• Includes link to
• original publication
• Includes polymer-
solvent values
• Largest online
collection
of χ values!
• 375 polymer-polymer
and 1,014 polymer-
solvent values
• Built dictionary of polymers, solvents and other compounds
includin InChI keys & PubChem Ids…

Automatic extraction
of properties
The four-armed compd. (ANTH-OXA6t-OC12)
with the dodecyloxy surface group is a high
glass transition temp. (Tg: 211°) material and
exhibits good soly.
ANTH-OXA6t-OC12 211°
has_tg_of
• Goal
 Automatically extract
properties from text
• Method
 Use the ChemDataExtractor
(from Cambridge) with
additional rule to
automatically extract Glass
Transition Temperatures
 ChemDataExtractor uses
NLP, rule-based approache
and machine learning

PPPDB (Tg)
Search and browse through
extracted Tg values at
pppdb.uchicago.edu
• Includes link to original
publication
• Includes option to Flag
ambiguous results (e.g.:
author’s acronyms)
• Looking at more than
1000 more results

Community engagement highlights
• ElectroCat
• DOE Energy Materials Network
• Using MDF to make data available internally and externally (3Q FY17)
• Integrative Materials and Design
• A Midwest Big Data Spoke affiliated with the Hub at UIUC
• Linking Materials Commons, MAST, 4CeeD and others with MDF and Globus tools. Build
data pipelines from MDF to Citrine. Community building and outreach to industry (e.g., LIFT
consortium and DMDII)
• Center for Predictive Simulation of Functional Materials
• DOE BES Computational Materials Center
• Will use MDF to publish and make data discoverable internally and externally (4Q FY17)
• Co-sponsored DOE-lab hackathon on data discoverability
and authentication at UC
• ~25 attendees (domain scientists, information systems leads, developers, postdocs, etc.)
• Built prototype interfaces and shared data schemas to make data from user facilities at
Brookhaven NSLSII, Oak Ridge SNS, Argonne APS, and LBL ALS
• Work continues…expect to share in (4Q FY17)

Community engagement highlights
• Blaiszik and Chard chair WholeTale Working Group in
Materials Science
• Working with various materials science researchers (representing experiment and
computation) to build tools to support reuse of data artifacts
• Leveraging this effort to link our data services, encourage adoption of MDF capabilities,
and define new platform use cases
• Advanced Photon Source at Argonne National Lab
• Working with the APS software group to make their data portal contents discoverable
• Working with researchers at beamlines to make their datasets discoverable and published:
• HEDM (along with users from AFRL and elsewhere)
• XPCS
• Tomography
Elsevier, Inc.
Working to mine
10,000s of articles
to expand PPPDB
and thus MDF Working to communicate MDF
APIs via hands-on tutorials
Globus

Integrative Materials and Design (IMaD)
35
MAST
Materials Commons (MC)
T2C2 (4CeeD)
• Build connections to international materials
efforts and registries (e.g., NIMS, RDA, NIST,
EUDAT, NDS)
• Promote IMaD data services, tools, and
accomplishments to the community
• Develop video tutorials, webinars, and shared
code repositories
• Interface with the Materials Accelerator
Network (MAN)
• Engage with colleges, industry, and
consortiums
• (Wisconsin) Regional Materials and
Manufacturing Network (RM2N)
• (Illinois) Digital Manufacturing and
Design Innovation Institute DMDII
• (Michigan) LIFT consortium
Engagement
Linking Software and Services
PIs: I. Foster1,2, J. Allison3, D. Morgan4, D. Trinkle5, P. Voorhees6
1 University of Chicago 2 Argonne National Laboratory 3 University of Michigan 4 University of Wisconsin-Madison 5 University
of Illinois at Urbana-Champaign, 6 Northwestern University
Overview
• NSF Midwest Big Data Spoke

Materials Resource Registry
36
• MDF is running an MRR instance as of ~Feb 2017,
but it is not yet populated
• MDF instance will harvest from other MRR instances
(setting this up with Chandler in the coming 2-3
weeks)
• We are also setting up the interfaces to make the
data available in our discovery service (i.e., OAI-
PMH harvesting)
• Long term this is interesting for hierarchical search
 Find data resources or dig deeper into them all from the
same location
• Build reverse pipelines to be good community
citizens

DOE User Facility Interactions
37
• Data and experimental records generated
 ORNL (SNS and HFIR)
 Brookhaven (NSLS-II) via DataBroker project
 Argonne (APS, various sectors and existing
data portals) – ranging from fuel spray data to
catalysis experiments and nanofabrication data
(small pending NSF proposal in nano
Argonne/MIT/UC)
 LBL (ALS) CAMERA project

Recap and future work
To streamline materials data sharing, discovery, access,
and analysis, we have:
 Simplified data publication, enabling ingest of 7 TB in 31
datasets from 11 institutions and 94 authors
 Automated data and metadata capture for deep indexing of
large data collections, reaching 1M records from 16 sources
 Unified search across all of these sites and collections
 Deployed APIs enabling programmatic access
• We have demonstrated automated end-to-end
discovery, access, and analysis pipelines
Next steps: More data, more indexing, richer search,
more analyses

Thanks to our sponsors!
39
U . S . D E P A R T M E N T O F
ENERGY

Materials Data Facility: Streamlined and automated data sharing, discovery, access, and analysis

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (17)

Ähnlich wie Materials Data Facility: Streamlined and automated data sharing, discovery, access, and analysis

Ähnlich wie Materials Data Facility: Streamlined and automated data sharing, discovery, access, and analysis (20)

Mehr von Ian Foster

Mehr von Ian Foster (19)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Materials Data Facility: Streamlined and automated data sharing, discovery, access, and analysis

Hinweis der Redaktion