Ian Foster (foster@uchicago.edu)1,2,
Ben Blaiszik1,2 (blaiszik@uchicago.edu),
Jonathon Gaff1, Logan Ward1, Kyle Chard1, Jim Pruyne1,
Rachana Ananthakrishnan1, Steven Tuecke1
Michal Ondrejcek3, Kenton McHenry3, John Towns3
University of Chicago1, Argonne National Laboratory2, University of Illinois at Urbana-Champaign3
materialsdatafacility.org
globus.org
Materials Data Facility
Streamlined and automated data sharing,
discovery, access, and analysis
Team (unordered)
UC/Argonne
Ian Foster (PI), Ben Blaiszik, Steve Tuecke
Kyle Chard, Jim Pruyne
Logan Ward, Jonathon Gaff
Illinois (Urbana-Champaign)
Rachana Ananthakrishnan
John Towns (PI)
Kenton McHenry
Michal Ondrejcek
Roselyne Tchoua
Stephen Rosen
NIST and CHiMaD collaborators
• Carelyn Campbell (MDF – MML Repository)
• Chandler Becker (NIST Materials Resource Schema)
• Zach Trautt (NIST, WholeTale Materials Science Working Group)
• Trautt, Lucas Hale, Becker (MDF – NIST Interatomic Potentials Repository)
• Alden Dima, Sharief Youssef (NIST Materials Resource Registry)
• Juan de Pablo, Debra Audus (PPPDB)
• Laura Bartolo (IMaD, Schema development, Polymer Group)
• Mark Hersam (data publication early adopter)
• Peter Voorhees (IMaD, data publication early adopter)
• Ankit Agrawal (Machine learning and data mining using MDF)
• Cate Brinson, Richard Zhao (MDF – NanoMine)
• Mike Bedzyk (Use case data publication of Co data - soon)
• Ken Kroenlein, Boris Wilthan (NIST TRC Alloy Database - soon)
To accelerate data-driven discovery, streamline & automate:
Access, Integrate, Analyze, Share, Discover
MRS-TMS big data survey, responses to the question: 'Do you consider the following items to be impediments or motivation for you to share your data?'
[Ward, Warren, Hanisch, in Integrating Materials and Manufacturing Innovation, 3:22, 2014.]
• Simplify data publication, regardless of size,
type, and location
• Automate data and metadata ingest, to enable
capture of many valuable materials datasets
• Enable unified search of disparate materials
data sources
• Deploy APIs to foster community development,
data creation, and data consumption
Streamline and automate: Four key steps
Publication
• Identify datasets with persistent identifiers (e.g., DOI or Handle)
• Describe datasets with appropriate metadata and provenance
• Verify dataset contents over time
• Handle big (and small) data: We have already ingested datasets with > 1.5 M files and > 1.5 TB in size
REST APIs
Discovery
• Search, query, and access datasets in modern ways
• Automatically index flexible metadata and harvest file contents
• Provide simple user interfaces (cf. Google and Amazon)
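One way to picture the "verify dataset contents over time" step is a checksum manifest that is recorded at publication time and recomputed later. The sketch below is an illustration only, not the MDF implementation: the function names and the choice of SHA-256 are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Return paths that were added, removed, or modified between two manifests."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))


if __name__ == "__main__":
    # Hypothetical dataset directory, created only for this demonstration.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "scan_001.dat").write_text("original contents")
        at_publication = build_manifest(root)
        (root / "scan_001.dat").write_text("silently modified")
        print(changed_files(at_publication, build_manifest(root)))
```

Storing the manifest alongside the dataset's persistent identifier would let a later check report exactly which files drifted.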
materialsdatafacility.org
PPPDB
• Extract key polymer properties from literature via
natural language processing and crowdsourcing
• Build interfaces to explore curated χ and other property values
• Includes 375 polymer-polymer values and 1,014 polymer-solvent
values
Streamline and automate
MDF data services and tools
[Diagram: distributed data publication endpoints (EP) with automatic sync; a data publication service (publish); a data discovery service (deep indexing; query, browse, aggregate) covering databases, datasets, APIs, etc.]
Streamline & automate data publication
Data volumes: 7.2 TB; 5.3 TB out
Utilization
• Authors: 94
• Institutions: 14
• Accesses: >1000
Datasets
• Total datasets: 31
• CHiMaD datasets: 15
• Pipeline CHiMaD datasets: +14
• Total datasets: +30
Published data highlights
~ 30 datasets
~ 6.5 TB
X-ray scattering image classification using deep learning
Yager et al. http://dx.doi.org/10.18126/M2Z30Z
Electron backscattering and diffraction datasets for Ni, Mg, Fe, Si
Marc De Graef et al. http://dx.doi.org/doi:10.18126/M2D593 and others
Phase field benchmark I dataset
Jokisaari et al. http://dx.doi.org/doi:10.18126/M2101X
Grain structure, grain-averaged lattice strains, and macro-scale strain data for superelastic nickel-titanium shape memory alloy polycrystal loaded in tension
Paranjape et al. http://dx.doi.org/doi:10.18126/M2NK5W
Largest dataset to date (>1.5 TB). Makes a unique
dataset discoverable for code development, analysis,
and benchmarking
User perspective
1 Set up endpoint at data origin and create config (one time)
2 Collect data
3 Describe data
4 Publish! (i.e., invoke script)
Automate data publication
New Capability
1 Configure
2 Publish dataset
Behind the scenes
• Publication created in MDF Data Publication service
• Directory created on publication endpoint
• ACLs set on endpoint as defined by collection
• Submitted metadata added to the database and verified against
schema
• Transfer automated between origin and publication endpoint
• (optional) Curation flow started
• DOI minted
• Metadata registered in search, then pushed to MRR
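The "verified against schema" step in the list above can be sketched as a small validator. This is an assumption-laden illustration: the field names and the schema layout below are hypothetical and are not the MDF publication schema.

```python
# Hypothetical schema: field name -> (expected type, required?).
# These fields are invented for illustration only.
SCHEMA = {
    "title": (str, True),
    "authors": (list, True),
    "collection": (str, True),
    "description": (str, False),
}


def validate(metadata: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field, (expected_type, required) in SCHEMA.items():
        if field not in metadata:
            if required:
                errors.append(f"missing required field: {field}")
        elif not isinstance(metadata[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    # Reject fields the schema does not know about, in a stable order.
    for field in sorted(metadata.keys() - SCHEMA.keys()):
        errors.append(f"unknown field: {field}")
    return errors


if __name__ == "__main__":
    submission = {"title": "Example dataset", "authors": ["A. Researcher"]}
    print(validate(submission))
```

A publication script could run such a check before starting the transfer, so that malformed submissions fail early rather than after data has moved.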
New Capability
Automate data publication
Enable inline viewing of data
• Continue to support big data solutions
• Leverage Globus HTTPS endpoint to simplify access to smaller files
• Checks access control lists
Preview in browser
MDF Dataset:
K. Yager, J. Lhermitte, D. Yu, B. Wang, Z. Guan, "Dataset of Synthetic X-ray Scattering Images for Classification Using Deep Learning," 2017, http://dx.doi.org/doi:10.18126/M2Z30Z
New Capability
MDF data ingest and indexing
Data index
• Data sources indexed: 16
• Records: ~1M
Metadata index
• Repositories harvested: 6 (MDF, NIST MML Repo, MATIN, Materials Commons, CXIDB, NDS Materials Resource Registry)
• Deeply indexed: 10
• ~200 datasets
• >2.75 PB identified for future indexing
• ~260 TB made discoverable
MDF data ingest and indexing
Datasets/Databases
• NanoMine (CHiMaD)
• PPPDB (CHiMaD)
• Khazana Polymers
• Khazana VASP
• Ab Initio Solute Solvent Diffusion Dataset
• JANAF (NIST)
• Harvard Organic Photovoltaic Database
• SLUCHI (VASP)
• Crystallography Open Database
• Classical Interatomic Potentials (NIST)
• Interatomic Potentials Repository (NIST)
• XAFS.org
• Lytle XAFS Dataset
• OQMD (CHiMaD)
• NREL Organic Electronic DB
Repositories
• MDF (CHiMaD)
• MATIN
• Materials Commons
• CXIDB
• MML Repository (NIST)
Suggest a dataset to index:
http://bit.ly/3nktuby
Indexed data highlights
Harvard Organic Photovoltaic DB
• Mix of experimental and simulation data; should combine nicely with other OPV datasets
• Data was available on Figshare as an oddly formatted text file
• Now available via simple API queries in MDF
NanoMine (Brinson et al.)
• Ingesting full metadata and also investigating how to refine the metadata into a more easily queryable form while preserving the rigor of the current schema
Ab initio diffusion database
• Nature Datasets paper had only summary values available (~O(10 kB)); MDF has the entire dataset and calculation (~O(100 GB))
• Most popular MDF dataset to date
• Building closer ties to this group through IMaD
NIST
• Indexed metadata and links to the various potential files
• Data are small and easy to access
• MDF index provides another avenue to discover this data and to automate access
• Software as a service
• REST APIs & clients
• Custom faceting & query boosting
• Partial matches, range queries, fuzzy matches
• Index data collections (many topics)
• Deep indexing of specific datasets (e.g., atomistic modeling, image metadata, etc.)
• Investigating multi-scale search (e.g., find high level resources, datasets, dataset records, or even file contents)
Unified data search
Custom boosting
Facets
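The query styles named on this slide (range queries, fuzzy matches, facets) can be illustrated over a handful of in-memory records. This is a toy sketch using only the Python standard library; it does not call the MDF search API, and the records and field names are invented for illustration.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical index records; sources, materials, and years are made up.
RECORDS = [
    {"source": "repo_a", "material": "aluminum", "year": 2015},
    {"source": "repo_a", "material": "nickel", "year": 2016},
    {"source": "repo_b", "material": "aluminium", "year": 2017},
    {"source": "repo_b", "material": "silicon", "year": 2017},
]


def range_query(records, field, low, high):
    """Keep records whose field value lies in the inclusive range [low, high]."""
    return [r for r in records if low <= r[field] <= high]


def fuzzy_query(records, field, term, cutoff=0.8):
    """Keep records whose field value is approximately equal to term."""
    values = sorted({r[field] for r in records})
    close = set(get_close_matches(term, values, n=max(1, len(values)), cutoff=cutoff))
    return [r for r in records if r[field] in close]


def facet_counts(records, field):
    """Count how many records fall under each distinct value of field."""
    return Counter(r[field] for r in records)


if __name__ == "__main__":
    hits = fuzzy_query(RECORDS, "material", "aluminum")
    print([r["source"] for r in hits])
    print(facet_counts(range_query(RECORDS, "year", 2016, 2017), "source"))
```

The fuzzy match lets the two spellings of aluminum resolve to the same query, which is the kind of tolerance a unified search over independently curated sources needs.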
Data search with access control
Multiple sets of access-controlled metadata may be attached to a subject
Subject
• Metadata1, ACL: [User1]
• Metadata2, ACL: [public]
• Metadata3, ACL: [User2]
Users' view:
• User1: sees union of Metadata1, Metadata2
• User2: sees union of Metadata2, Metadata3
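The visibility rule on this slide (each user sees the union of the metadata entries whose ACL admits them) can be sketched in a few lines. The class, the function name, and the metadata contents below are hypothetical; only the ACL layout mirrors the slide.

```python
from dataclasses import dataclass

PUBLIC = "public"


@dataclass(frozen=True)
class MetadataEntry:
    name: str
    acl: frozenset  # principals allowed to read this entry
    content: dict


# ACLs follow the slide; the content values are invented for illustration.
SUBJECT_ENTRIES = [
    MetadataEntry("Metadata1", frozenset({"User1"}), {"lab_notes": "draft"}),
    MetadataEntry("Metadata2", frozenset({PUBLIC}), {"title": "Example subject"}),
    MetadataEntry("Metadata3", frozenset({"User2"}), {"review_status": "pending"}),
]


def visible_metadata(entries, user):
    """Merge every entry that is public or explicitly lists the user."""
    merged = {}
    for entry in entries:
        if PUBLIC in entry.acl or user in entry.acl:
            merged.update(entry.content)
    return merged


if __name__ == "__main__":
    for user in ("User1", "User2", "anonymous"):
        print(user, sorted(visible_metadata(SUBJECT_ENTRIES, user)))
```

Filtering at merge time means a single subject can carry private and public annotations without duplicating the subject record per audience.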
Data discovery in practice
MATIN (GT)
~ 10 datasets
Used in
education
In this example: NanoMine, PPPDB, MML Repository (NIST), MATIN, Materials Commons
(publications), CXIDB, linking with MRR…
Problem: Many materials data repositories, databases, and datasets
exist but interfaces and access to them are fragmented
[Diagram: the Data Discovery Service harvests from NIST MML Repository, NanoMine, Materials Commons, PPPDB, CXIDB, and MATIN, and answers queries]
Contribute a harvester (coming soon)
https://github.com/materials-data-facility/
Unified search across repositories, datasets, and DBs
Results from
PPPDB
Results from MML Repository and MDF
Unified search across repositories, datasets, and DBs
Finding and using interatomic potentials
Problem: Materials data exist in disparate locations, but finding and
importing them into analysis scripts is time consuming
In this example: NIST Interatomic Potentials Repository, NIST Classic Potentials
[Diagram: the Data Discovery Service harvests from the NIST Interatomic Potentials Repository, NIST Classical Force Fields, and other sources; users query the service and use the results]
Finding and using interatomic potentials
Problem: Materials data exist in disparate locations, but finding and
importing them into analysis scripts is tricky
Search for potentials in NIST Interatomic Potentials Repo
Format the discovered potentials
Load potentials into ASE and plot
Integrating analytics tools with MDF
Result: Scientists connected with data, analytics tools,
and compute capability
[Diagram: data flows from data repositories (MDF Data Publication, MATIN (GT), MML Repository (NIST), Materials Commons (UM PRISMS), Coherent X-Ray Tomography Database (LNL)) to compute resources and to end users]
Jetstream is a self-provisioned, scalable science and engineering cloud environment
operated by Indiana University for the National Science Foundation: jetstream-cloud.org
Building a machine learning model using MDF
A simple web service to simplify AGNI model building
Example: Building force-field potentials from different datasets
Data resources: 3 DFT datasets with Al data
1 dataset from khazana.uconn.edu, 2 datasets from materialsdata.nist.gov
Result: Improved performance by integrating data sources
Method: Botu et al. JPCC. (2017)
Improved fit on all three datasets by combining data
[Figure panels: only 1 dataset; 2 datasets]
Building a database of polymer properties
χDB: Database of Flory-Huggins parameters
• Semi-automated information extraction
  - Crawl journals for potentially relevant publications
  - Pre-process the downloaded data
• Crowdsourcing
  - Students and experts review and curate the publications
PPPDB (χ)
• Search and browse through extracted and curated χ values at pppdb.uchicago.edu
• Includes link to original publication
• Includes polymer-solvent values
• Largest online collection of χ values!
• 375 polymer-polymer and 1,014 polymer-solvent values
• Built dictionary of polymers, solvents, and other compounds including InChI keys & PubChem IDs…
Automatic extraction of properties
Example sentence: "The four-armed compd. (ANTH-OXA6t-OC12) with the dodecyloxy surface group is a high glass transition temp. (Tg: 211°) material and exhibits good soly."
Extracted relation: ANTH-OXA6t-OC12 has_tg_of 211°
• Goal
  - Automatically extract properties from text
• Method
  - Use ChemDataExtractor (from Cambridge) with an additional rule to automatically extract glass transition temperatures
  - ChemDataExtractor uses NLP, rule-based approaches, and machine learning
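To make the idea of a rule-based extraction concrete, here is a minimal regular-expression stand-in applied to the example sentence from the slide. It is not ChemDataExtractor and does not use its API; the two patterns are simplistic assumptions that only handle text shaped like this one sentence.

```python
import re

# Assumed pattern: a parenthesized, hyphenated identifier such as "(ANTH-OXA6t-OC12)".
COMPOUND = re.compile(r"\(([A-Za-z0-9]+(?:-[A-Za-z0-9]+)+)\)")
# Assumed pattern: a parenthesized glass transition value such as "(Tg: 211°)".
TG = re.compile(r"\(Tg:\s*(\d+(?:\.\d+)?)\s*°\)")


def extract_tg(sentence):
    """Return a (compound, relation, value) triple, or None if either part is missing."""
    compound = COMPOUND.search(sentence)
    tg = TG.search(sentence)
    if compound is None or tg is None:
        return None
    return (compound.group(1), "has_tg_of", tg.group(1) + "°")


if __name__ == "__main__":
    sentence = (
        "The four-armed compd. (ANTH-OXA6t-OC12) with the dodecyloxy surface "
        "group is a high glass transition temp. (Tg: 211°) material and "
        "exhibits good soly."
    )
    print(extract_tg(sentence))
```

A real pipeline needs far more than this (chemical named-entity recognition, unit handling, abbreviation resolution), which is why the slide pairs hand-written rules with NLP and machine learning.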
PPPDB (Tg)
• Search and browse through extracted Tg values at pppdb.uchicago.edu
• Includes link to original publication
• Includes option to flag ambiguous results (e.g., authors' acronyms)
• Looking at more than 1000 more results
Building community
Community engagement highlights
• ElectroCat
• DOE Energy Materials Network
• Using MDF to make data available internally and externally (3Q FY17)
• Integrative Materials and Design
• A Midwest Big Data Spoke affiliated with the Hub at UIUC
• Linking Materials Commons, MAST, 4CeeD and others with MDF and Globus tools. Build data pipelines from MDF to Citrine. Community building and outreach to industry (e.g., LIFT consortium and DMDII)
• Center for Predictive Simulation of Functional Materials
• DOE BES Computational Materials Center
• Will use MDF to publish and make data discoverable internally and externally (4Q FY17)
• Co-sponsored DOE-lab hackathon on data discoverability and authentication at UC
• ~25 attendees (domain scientists, information systems leads, developers, postdocs, etc.)
• Built prototype interfaces and shared data schemas to make data from user facilities at Brookhaven NSLS-II, Oak Ridge SNS, Argonne APS, and LBL ALS discoverable
• Work continues; expect to share in 4Q FY17
Community engagement highlights
• Blaiszik and Chard chair WholeTale Working Group in
Materials Science
• Working with various materials science researchers (representing experiment and
computation) to build tools to support reuse of data artifacts
• Leveraging this effort to link our data services, encourage adoption of MDF capabilities,
and define new platform use cases
• Advanced Photon Source at Argonne National Lab
• Working with the APS software group to make their data portal contents discoverable
• Working with researchers at beamlines to make their datasets discoverable and published:
• HEDM (along with users from AFRL and elsewhere)
• XPCS
• Tomography
Elsevier, Inc.
Working to mine 10,000s of articles to expand PPPDB and thus MDF
Working to communicate MDF APIs via hands-on tutorials
Globus
Integrative Materials and Design (IMaD)
MAST
Materials Commons (MC)
T2C2 (4CeeD)
• Build connections to international materials efforts and registries (e.g., NIMS, RDA, NIST, EUDAT, NDS)
• Promote IMaD data services, tools, and accomplishments to the community
• Develop video tutorials, webinars, and shared code repositories
• Interface with the Materials Accelerator Network (MAN)
• Engage with colleges, industry, and consortiums
• (Wisconsin) Regional Materials and Manufacturing Network (RM2N)
• (Illinois) Digital Manufacturing and Design Innovation Institute (DMDII)
• (Michigan) LIFT consortium
Engagement
Linking Software and Services
PIs: I. Foster1,2, J. Allison3, D. Morgan4, D. Trinkle5, P. Voorhees6
1 University of Chicago 2 Argonne National Laboratory 3 University of Michigan 4 University of Wisconsin-Madison 5 University
of Illinois at Urbana-Champaign, 6 Northwestern University
Overview
• NSF Midwest Big Data Spoke
Materials Resource Registry
• MDF is running an MRR instance as of ~Feb 2017, but it is not yet populated
• MDF instance will harvest from other MRR instances (setting this up with Chandler in the coming 2-3 weeks)
• We are also setting up the interfaces to make the data available in our discovery service (i.e., OAI-PMH harvesting)
• Long term this is interesting for hierarchical search
  - Find data resources or dig deeper into them all from the same location
• Build reverse pipelines to be good community citizens
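The OAI-PMH harvesting mentioned above follows a simple request and paging pattern defined by the protocol: a `ListRecords` request, then repeated requests carrying the `resumptionToken` from the previous page. The sketch below builds the request URLs and parses one response page. The base URL is a placeholder, not a real MRR endpoint, and no network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords URL; a resumption token replaces all other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    identifiers = [
        element.text
        for element in root.iterfind(".//oai:header/oai:identifier", OAI_NS)
    ]
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    return identifiers, (token.strip() or None)


if __name__ == "__main__":
    # Placeholder endpoint, used only to show the URL shape.
    base = "https://registry.example.org/oai"
    print(list_records_url(base))
    print(list_records_url(base, resumption_token="abc123"))
```

A harvester would loop, fetching each URL and calling `parse_page`, until the returned token is `None`.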
DOE User Facility Interactions
• Data and experimental records generated
  - ORNL (SNS and HFIR)
  - Brookhaven (NSLS-II) via DataBroker project
  - Argonne (APS, various sectors and existing data portals), ranging from fuel spray data to catalysis experiments and nanofabrication data (small pending NSF proposal in nano Argonne/MIT/UC)
  - LBL (ALS) CAMERA project
Recap and future work
To streamline materials data sharing, discovery, access, and analysis, we have:
• Simplified data publication, enabling ingest of 7 TB in 31 datasets from 11 institutions and 94 authors
• Automated data and metadata capture for deep indexing of large data collections, reaching 1M records from 16 sources
• Unified search across all of these sites and collections
• Deployed APIs enabling programmatic access
• We have demonstrated automated end-to-end discovery, access, and analysis pipelines
Next steps: More data, more indexing, richer search, more analyses
Thanks to our sponsors!
U.S. Department of Energy

Weitere ähnliche Inhalte

Was ist angesagt?

Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013
Ian Foster
 

Was ist angesagt? (20)

The Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and AutomationThe Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
 
The Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of ScienceThe Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of Science
 
What Are Science Clouds?
What Are Science Clouds?What Are Science Clouds?
What Are Science Clouds?
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataThe Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
 
07 data structures_and_representations
07 data structures_and_representations07 data structures_and_representations
07 data structures_and_representations
 
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy SciencesDiscovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Data Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud AutomationData Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud Automation
 
Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013
 
DIET_BLAST
DIET_BLASTDIET_BLAST
DIET_BLAST
 
An Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning TechniquesAn Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning Techniques
 
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV DataThe DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
 
Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)
 
Scaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterScaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and Jupyter
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
 
Big data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at KitwareBig data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at Kitware
 
The Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource ProvisioningThe Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource Provisioning
 
Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...
 
Big Process for Big Data @ NASA
Big Process for Big Data @ NASABig Process for Big Data @ NASA
Big Process for Big Data @ NASA
 

Andere mochten auch

Globus Auth: A Research Identity and Access Management Platform
Globus Auth: A Research Identity and Access Management PlatformGlobus Auth: A Research Identity and Access Management Platform
Globus Auth: A Research Identity and Access Management Platform
Ian Foster
 
FINAL LHTFamily Brochure
FINAL LHTFamily BrochureFINAL LHTFamily Brochure
FINAL LHTFamily Brochure
Cynthia Besson
 

Andere mochten auch (17)

Globus Auth: A Research Identity and Access Management Platform
Globus Auth: A Research Identity and Access Management PlatformGlobus Auth: A Research Identity and Access Management Platform
Globus Auth: A Research Identity and Access Management Platform
 
Bent philipson
Bent philipson Bent philipson
Bent philipson
 
APMag_Issue_49
APMag_Issue_49APMag_Issue_49
APMag_Issue_49
 
50 ugly truths about starting your own business, and why you should do it anyway
50 ugly truths about starting your own business, and why you should do it anyway50 ugly truths about starting your own business, and why you should do it anyway
50 ugly truths about starting your own business, and why you should do it anyway
 
Decoding the Quran using Prime Numbers
Decoding the Quranusing Prime NumbersDecoding the Quranusing Prime Numbers
Decoding the Quran using Prime Numbers
 
Quebec Budget 2017
Quebec Budget 2017Quebec Budget 2017
Quebec Budget 2017
 
Pricing Model Infographic
Pricing Model InfographicPricing Model Infographic
Pricing Model Infographic
 
1.2.2 Шкафы из фибергласа Conchiglia
1.2.2 Шкафы из фибергласа Conchiglia1.2.2 Шкафы из фибергласа Conchiglia
1.2.2 Шкафы из фибергласа Conchiglia
 
Paslanmaz Çelik İmalatı
Paslanmaz Çelik İmalatıPaslanmaz Çelik İmalatı
Paslanmaz Çelik İmalatı
 
Alternative Transformation Framework
Alternative Transformation Framework Alternative Transformation Framework
Alternative Transformation Framework
 
FINAL LHTFamily Brochure
FINAL LHTFamily BrochureFINAL LHTFamily Brochure
FINAL LHTFamily Brochure
 
1.2.1 Система навесных шкафов и щитков
1.2.1 Система навесных шкафов и щитков1.2.1 Система навесных шкафов и щитков
1.2.1 Система навесных шкафов и щитков
 
7 Technologies that will affect Trucking
7 Technologies that will affect Trucking7 Technologies that will affect Trucking
7 Technologies that will affect Trucking
 
Heli Valkeinen: Sosiaalialan mittarikeräyksen tulokset
Heli Valkeinen: Sosiaalialan mittarikeräyksen tuloksetHeli Valkeinen: Sosiaalialan mittarikeräyksen tulokset
Heli Valkeinen: Sosiaalialan mittarikeräyksen tulokset
 
R. Villano - The photos (appendix 1 en)
R. Villano - The photos (appendix 1 en)R. Villano - The photos (appendix 1 en)
R. Villano - The photos (appendix 1 en)
 
Paula Saikkonen: Arviointitieto ja vaikuttavuus - johdatus työpajatyöskentelyyn
Paula Saikkonen: Arviointitieto ja vaikuttavuus - johdatus työpajatyöskentelyynPaula Saikkonen: Arviointitieto ja vaikuttavuus - johdatus työpajatyöskentelyyn
Paula Saikkonen: Arviointitieto ja vaikuttavuus - johdatus työpajatyöskentelyyn
 
Adam's Important Irish Art 29 March 2017
Adam's Important Irish Art 29 March 2017Adam's Important Irish Art 29 March 2017
Adam's Important Irish Art 29 March 2017
 

Ähnlich wie Materials Data Facility: Streamlined and automated data sharing, discovery, access, and analysis

Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
Artificial Intelligence Institute at UofSC
 
Semantics-enhanced Cyberinfrastructure for ICMSE : Interoperability, Analyti...
Semantics-enhanced Cyberinfrastructure for ICMSE :  Interoperability, Analyti...Semantics-enhanced Cyberinfrastructure for ICMSE :  Interoperability, Analyti...
Semantics-enhanced Cyberinfrastructure for ICMSE : Interoperability, Analyti...
Artificial Intelligence Institute at UofSC
 

Ähnlich wie Materials Data Facility: Streamlined and automated data sharing, discovery, access, and analysis (20)

A Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials ScienceA Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials Science
 
The Materials Data Facility: A Distributed Model for the Materials Data Commu...
The Materials Data Facility: A Distributed Model for the Materials Data Commu...The Materials Data Facility: A Distributed Model for the Materials Data Commu...
The Materials Data Facility: A Distributed Model for the Materials Data Commu...
 
20160922 Materials Data Facility TMS Webinar
20160922 Materials Data Facility TMS Webinar20160922 Materials Data Facility TMS Webinar
20160922 Materials Data Facility TMS Webinar
 
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
 
Hattrick-Simpers MRS Webinar on AI in Materials
Hattrick-Simpers MRS Webinar on AI in MaterialsHattrick-Simpers MRS Webinar on AI in Materials
Hattrick-Simpers MRS Webinar on AI in Materials
 
Data Archiving and Sharing
Data Archiving and SharingData Archiving and Sharing
Data Archiving and Sharing
 
e-Science, Research Data and Libaries
e-Science, Research Data and Libariese-Science, Research Data and Libaries
e-Science, Research Data and Libaries
 
RDA-WDS Publishing Data Interest Group
RDA-WDS Publishing Data Interest GroupRDA-WDS Publishing Data Interest Group
RDA-WDS Publishing Data Interest Group
 
WoSC19: Serverless Workflows for Indexing Large Scientific Data
WoSC19: Serverless Workflows for Indexing Large Scientific DataWoSC19: Serverless Workflows for Indexing Large Scientific Data
WoSC19: Serverless Workflows for Indexing Large Scientific Data
 
Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
 
Researh data management
Researh data managementResearh data management
Researh data management
 
Building genomic data cyberinfrastructure with the online database software T...
Building genomic data cyberinfrastructure with the online database software T...Building genomic data cyberinfrastructure with the online database software T...
Building genomic data cyberinfrastructure with the online database software T...
 
Love Your Data Locally
Love Your Data LocallyLove Your Data Locally
Love Your Data Locally
 
RDAP 15: Beyond Metadata: Leveraging the “README” to support disciplinary Doc...
RDAP 15: Beyond Metadata: Leveraging the “README” to support disciplinary Doc...RDAP 15: Beyond Metadata: Leveraging the “README” to support disciplinary Doc...
RDAP 15: Beyond Metadata: Leveraging the “README” to support disciplinary Doc...
 
Semantics-enhanced Cyberinfrastructure for ICMSE : Interoperability, Analyti...
Semantics-enhanced Cyberinfrastructure for ICMSE :  Interoperability, Analyti...Semantics-enhanced Cyberinfrastructure for ICMSE :  Interoperability, Analyti...
Semantics-enhanced Cyberinfrastructure for ICMSE : Interoperability, Analyti...
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
eROSA Stakeholder WS1: Data discovery through federated dataset catalogues
eROSA Stakeholder WS1: Data discovery through federated dataset catalogueseROSA Stakeholder WS1: Data discovery through federated dataset catalogues
eROSA Stakeholder WS1: Data discovery through federated dataset catalogues
 
Sharing data
Sharing dataSharing data
Sharing data
 
How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?
 
Impact of Covid-19 on Learning and Education
Impact of Covid-19 on Learning and EducationImpact of Covid-19 on Learning and Education
Impact of Covid-19 on Learning and Education
 

Mehr von Ian Foster

Foster CRA March 2022.pptx
Foster CRA March 2022.pptxFoster CRA March 2022.pptx
Foster CRA March 2022.pptx
Ian Foster
 
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Ian Foster
 

Mehr von Ian Foster (19)

Global Services for Global Science March 2023.pptx
Materials Data Facility: Streamlined and automated data sharing, discovery, access, and analysis

  • 1. Ian Foster (foster@uchicago.edu)1,2, Ben Blaiszik1,2 (blaiszik@uchicago.edu), Jonathon Gaff1, Logan Ward1, Kyle Chard1, Jim Pruyne1, Rachana Ananthakrishnan1, Steven Tuecke1 Michael Ondrejcek3, Kenton McHenry3, John Towns3 University of Chicago1, Argonne National Laboratory2, University of Illinois at Urbana-Champaign3 materialsdatafacility.org globus.org Materials Data Facility Streamlined and automated data sharing, discovery, access, and analysis
  • 2. Team (unordered) UC/Argonne Ian Foster (PI) Ben Blaiszik Steve Tuecke Kyle Chard Jim Pruyne Logan Ward Jonathon Gaff Illinois (Urbana-Champaign) Rachana Ananthakrishnan John Towns (PI) Kenton McHenry Michal Ondrejcek Roselyne Tchoua Stephen Rosen
  • 3. NIST and CHiMaD collaborators 3 • Carelyn Campbell (MDF – MML Repository) • Chandler Becker (NIST Materials Resource Schema) • Zach Trautt (NIST, WholeTale Materials Science Working Group) • Trautt, Lucas Hale, Becker (MDF - NIST Interatomic Potentials Repository) • Alden Dima, Sharief Youssef (NIST Materials Resource Registry) • Juan de Pablo, Deborah Aldus (PPPDB) • Laura Bartolo (IMaD, Schema development, Polymer Group) • Mark Hersam (data publication early adopter) • Peter Voorhees (IMaD, data publication early adopter) • Ankit Agrawal (Machine learning and data mining using MDF) • Cate Brinson, Richard Zhao (MDF – NanoMine) • Mike Bedzyk (Use case data publication of Co data - soon) • Ken Kroenlein, Boris Wilthan (NIST TRC Alloy Database - soon)
  • 4. To accelerate data-driven discovery, streamline & automate 4 Access Integrate Analyze Share Discover MRS-TMS big data survey, responses to question: 'Do you consider the following items to be impediments or motivation for you to share your data?’ [Ward, Warren, Hanisch, in Integrating Materials and Manufacturing Innovation, 3:22, 2014.]
  • 5. • Simplify data publication, regardless of size, type, and location • Automate data and metadata ingest, to enable capture of many valuable materials datasets • Enable unified search of disparate materials data sources • Deploy APIs to foster community development, data creation, and data consumption Streamline and automate: Four key steps
  • 6. Streamline and automate (materialsdatafacility.org). Publication: • Identify datasets with persistent identifiers (e.g., DOI or Handle) • Describe datasets with appropriate metadata and provenance • Verify dataset contents over time • Handle big (and small) data: We have already ingested datasets with > 1.5 M files and > 1.5 TB in size. Discovery: • Search, query, and access datasets in modern ways • Automatically index flexible metadata and harvest file contents • Provide simple user interfaces (cf. Google and Amazon). REST APIs. PPPDB: • Extract key polymer properties from literature via natural language processing and crowdsourcing • Build interfaces to explore curated χ and other property values • Includes 375 polymer-polymer values and 1,014 polymer-solvent values
  • 7. MDF data services and tools 7 EP EP EP EP Automatic sync EP Deep indexing Query Browse Aggregate Publish Databases Datasets APIs etc. Distributed data publication Data publication service Data discovery service
  • 8. Streamline & automate data publication 7.2 TB 5.3 TB out Data Volumes Utilization Authors 94 Institutions 14 Accesses >1000 Total datasets 31 CHiMaD datasets 15 Pipeline CHiMaD datasets +14 Total datasets +30
  • 9. Published data highlights 9 ~ 30 datasets ~ 6.5 TB X-ray scattering image classification using deep learning Electron backscattering and diffraction datasets for Ni, Mg, Fe, Si Yager et al. http://dx.doi.org/10.18126/M2Z30Z Marc De Graef et al. http://dx.doi.org/doi:10.18126/M2D593 and others Phase field benchmark I dataset Jokisaari et al. http://dx.doi.org/doi:10.18126/M2101X Grain structure, grain-averaged lattice strains, and macro-scale strain data for superelastic Nickel-Titanium shape memory alloy polycrystal loaded in tension Paranjape et al. http://dx.doi.org/doi:10.18126/M2NK5W Largest dataset to date (>1.5 TB). Makes a unique dataset discoverable for code development, analysis, and benchmarking
  • 10. Automate data publication (new capability). User perspective: 1 Set up endpoint at data origin and create config (one time) 2 Collect data 3 Describe data 4 Publish! (i.e., invoke script). Diagram labels: 1 Configure, 2 Publish dataset
  • 11. Automate data publication (new capability). User perspective: 1 Set up endpoint at data origin and create config (one time) 2 Collect data 3 Describe data 4 Publish! (i.e., invoke script). Behind the scenes: • Publication created in MDF Data Publication service • Directory created on publication endpoint • ACLs set on endpoint as defined by collection • Submitted metadata added to the database and verified against schema • Transfer automated between origin and publication endpoint • (optional) Curation flow started • DOI minted • Metadata registered in search, then metadata pushed to MRR
  • 12. Enable inline viewing of data • Continue to support big data solutions • Leverage Globus HTTPS endpoint to simplify access to smaller files • Checks access control lists Preview in browser MDF Dataset: K. Yager, J. Lhermitte, D. Yu, B. Wang, Z. Guan, "Dataset of Synthetic X-ray Scattering Images for Classification Using Deep Learning," 2017, http://dx.doi.org/doi:10.18126/M2Z30Z New Capability
  • 13. MDF data ingest and indexing Data index Data sources indexed 16 Records ~1M Metadata index Repositories harvested • MDF • NIST MML Repo • MATIN • Materials Commons • CXIDB • NDS Materials Resource Registry 6 Deeply indexed 10 ~200 Datasets >2.75 PB Identified for future indexing ~260 TB Made discoverable
  • 14. MDF data ingest and indexing Datasets/Databases • NanoMine (CHiMaD) • PPPDB (CHiMaD) • Khazana Polymers • Khazana VASP • Ab Initio Solute Solvent Diffusion Dataset • JANAF (NIST) • Harvard Organic Photovoltaic Database • SLUCHI (VASP) • Crystallography Open Database • Classical Interatomic Potentials (NIST) • Interatomic Potentials Repository (NIST) • XAFS.org • Lytle XAFS Dataset • OQMD (CHiMaD) • NREL Organic Electronic DB Repositories • MDF (CHiMaD) • MATIN • Materials Commons • CXIDB • MML Repository (NIST) Suggest a dataset to index: http://bit.ly/3nktuby
  • 15. Indexed data highlights 15 ~ 30 datasets ~ 6.5 TB Harvard Organic Photovoltaic DB • Mix of experimental and simulation data; should combine nicely with other OPV datasets • Data was available on Figshare as an oddly formatted text file • Now available via simple API queries in MDF NanoMine (Brinson et al.) Ab initio diffusion database Ingesting full metadata and also investigating how to refine the metadata into a more easily queryable form while preserving the rigor of the current schema • Nature Datasets paper had only summary values available (~o(10kb)); MDF has entire dataset and calculation (o(100 GB)). • Most popular MDF dataset to date • Building closer ties to this group through IMaD • Indexed metadata and links to the various potential files. • Data are small and easy to access • MDF index provides another avenue to discover this data and to automate access NIST
  • 16. • Software as a service • REST APIs & clients • Custom faceting & query boosting • Partial matches, range queries, fuzzy matches • Index data collections (many topics) • Deep indexing of specific datasets (e.g., atomistic modeling, image metadata, etc.) • Investigating multi-scale search (e.g., find high level resources, datasets, dataset records, or even file contents) 16 Unified data search Custom boosting Facets
  • 17. Data search with access control Multiple sets of access-controlled metadata may be attached to a subject Subject Metadata3 ACL: [User2] User1: Sees union of Metadata1, Metadata2 User2: Sees union of Metadata2, Metadata3 Metadata2 ACL: [public] Metadata1 ACL: [User1] Users View:
  • 19. Unified search across repositories, datasets, and DBs. Problem: Many materials data repositories, databases, and datasets exist but interfaces and access to them are fragmented. In this example: NanoMine, PPPDB, MML Repository (NIST), MATIN, Materials Commons (publications), CXIDB, linking with MRR… Diagram: Data Discovery Service harvests from NIST MML Repository, NanoMine, Materials Commons, PPPDB, CXIDB, and MATIN, and answers queries. Contribute a harvester (coming soon) https://github.com/materials-data-facility/
  • 20. In this example: NanoMine, PPPDB, MML Repository (NIST), MATIN, Materials Commons (publications), CXIDB, linking with MRR… Problem: Many materials data repositories, databases, and datasets exist but interfaces and access to them are fragmented Results from PPPDB Results from MML Repository and MDF Unified search across repositories, datasets, and DBs
  • 21. Finding and using interatomic potentials 21 ~ 30 datasets ~ 6.5 TB Problem: Materials data exist in disparate locations, but finding and importing them into analysis scripts is time consuming In this example: NIST Interatomic Potentials Repository, NIST Classic Potentials MATIN (GT) ~ 10 datasets Used in education Data Discovery Service NIST Interatomic Potentials Repository NIST Classical Force Fields Harvest … Query Use
  • 22. Finding and using interatomic potentials Problem: Materials data exist in disparate locations, but finding and importing them into analysis scripts is tricky 22 Search for potentials in NIST Interatomic Potentials Repo Format the discovered potentials Load potentials into ASE and plot In this example: NIST Interatomic Potentials Repository, NIST Classic Potentials
  • 23. Integrating analytics tools with MDF. Result: Scientists connected with data, analytics tools, and compute capability. From data repositories: MDF Data Publication, MATIN (GT), MML Repository (NIST), Materials Commons (UM PRISMS), Coherent X-Ray Tomography Database (LNL). To compute resources. To end users. Jetstream is a self-provisioned, scalable science and engineering cloud environment operated by Indiana University for the National Science Foundation: jetstream-cloud.org
  • 24. Building a machine learning model using MDF A simple web service to simplify AGNI model building
  • 25. Building a machine learning model using MDF. Example: Building force-field potentials from different datasets. Data resources: 3 DFT datasets with Al data (1 dataset from khazana.uconn.edu, 2 datasets from materialsdata.nist.gov). Result: Improved performance by integrating data sources. Method: Botu et al. JPCC. (2017). Figure labels: Only 1 dataset; 1 dataset
  • 26. Building a machine learning model using MDF. Example: Building force-field potentials from different datasets. Data resources: 3 DFT datasets with Al data (1 dataset from khazana.uconn.edu, 2 datasets from materialsdata.nist.gov). Result: Improved performance by integrating data sources; improved fit on all three datasets by combining data. Method: Botu et al. JPCC. (2017). Figure labels: Only 1 dataset; 2 datasets
  • 28. χDB: Database of Flory-Huggins parameters • Semi-automated information extraction  Crawl journals for potentially relevant publications  Pre-process the downloaded data • Crowdsourcing  Student and expert review/curate the publications
  • 29. PPPDB (χ) • Search and browse through extracted and curated χ values at pppdb.uchicago.edu • Includes link to original publication • Includes polymer-solvent values • Largest online collection of χ values! • 375 polymer-polymer and 1,014 polymer-solvent values • Built dictionary of polymers, solvents, and other compounds, including InChI keys & PubChem IDs…
  • 30. Automatic extraction of properties The four-armed compd. (ANTH-OXA6t-OC12) with the dodecyloxy surface group is a high glass transition temp. (Tg: 211°) material and exhibits good soly. ANTH-OXA6t-OC12 211° has_tg_of • Goal  Automatically extract properties from text • Method  Use the ChemDataExtractor (from Cambridge) with an additional rule to automatically extract glass transition temperatures  ChemDataExtractor uses NLP, rule-based approaches, and machine learning
  • 31. PPPDB (Tg) Search and browse through extracted Tg values at pppdb.uchicago.edu • Includes link to original publication • Includes option to Flag ambiguous results (e.g.: author’s acronyms) • Looking at more than 1000 more results
  • 33. Community engagement highlights • ElectroCat • DOE Energy Materials Network • Using MDF to make data available internally and externally (3Q FY17) • Integrative Materials and Design • A Midwest Big Data Spoke affiliated with the Hub at UIUC • Linking Materials Commons, MAST, 4CeeD and others with MDF and Globus tools. Build data pipelines from MDF to Citrine. Community building and outreach to industry (e.g., LIFT consortium and DMDII) • Center for Predictive Simulation of Functional Materials • DOE BES Computational Materials Center • Will use MDF to publish and make data discoverable internally and externally (4Q FY17) • Co-sponsored DOE-lab hackathon on data discoverability and authentication at UC • ~25 attendees (domain scientists, information systems leads, developers, postdocs, etc.) • Built prototype interfaces and shared data schemas to make data from user facilities at Brookhaven NSLSII, Oak Ridge SNS, Argonne APS, and LBL ALS • Work continues…expect to share in (4Q FY17)
  • 34. Community engagement highlights • Blaiszik and Chard chair WholeTale Working Group in Materials Science • Working with various materials science researchers (representing experiment and computation) to build tools to support reuse of data artifacts • Leveraging this effort to link our data services, encourage adoption of MDF capabilities, and define new platform use cases • Advanced Photon Source at Argonne National Lab • Working with the APS software group to make their data portal contents discoverable • Working with researchers at beamlines to make their datasets discoverable and published: • HEDM (along with users from AFRL and elsewhere) • XPCS • Tomography Elsevier, Inc. Working to mine 10,000s of articles to expand PPPDB and thus MDF Working to communicate MDF APIs via hands-on tutorials Globus
  • 35. Integrative Materials and Design (IMaD) 35 MAST Materials Commons (MC) T2C2 (4CeeD) • Build connections to international materials efforts and registries (e.g., NIMS, RDA, NIST, EUDAT, NDS) • Promote IMaD data services, tools, and accomplishments to the community • Develop video tutorials, webinars, and shared code repositories • Interface with the Materials Accelerator Network (MAN) • Engage with colleges, industry, and consortiums • (Wisconsin) Regional Materials and Manufacturing Network (RM2N) • (Illinois) Digital Manufacturing and Design Innovation Institute DMDII • (Michigan) LIFT consortium Engagement Linking Software and Services PIs: I. Foster1,2, J. Allison3, D. Morgan4, D. Trinkle5, P. Voorhees6 1 University of Chicago 2 Argonne National Laboratory 3 University of Michigan 4 University of Wisconsin-Madison 5 University of Illinois at Urbana-Champaign, 6 Northwestern University Overview • NSF Midwest Big Data Spoke
  • 36. Materials Resource Registry 36 • MDF is running an MRR instance as of ~Feb 2017, but it is not yet populated • MDF instance will harvest from other MRR instances (setting this up with Chandler in the coming 2-3 weeks) • We are also setting up the interfaces to make the data available in our discovery service (i.e., OAI- PMH harvesting) • Long term this is interesting for hierarchical search  Find data resources or dig deeper into them all from the same location • Build reverse pipelines to be good community citizens
  • 37. DOE User Facility Interactions 37 • Data and experimental records generated  ORNL (SNS and HFIR)  Brookhaven (NSLS-II) via DataBroker project  Argonne (APS, various sectors and existing data portals) – ranging from fuel spray data to catalysis experiments and nanofabrication data (small pending NSF proposal in nano Argonne/MIT/UC)  LBL (ALS) CAMERA project
  • 38. Recap and future work To streamline materials data sharing, discovery, access, and analysis, we have:  Simplified data publication, enabling ingest of 7 TB in 31 datasets from 11 institutions and 94 authors  Automated data and metadata capture for deep indexing of large data collections, reaching 1M records from 16 sources  Unified search across all of these sites and collections  Deployed APIs enabling programmatic access • We have demonstrated automated end-to-end discovery, access, and analysis pipelines Next steps: More data, more indexing, richer search, more analyses
  • 39. Thanks to our sponsors! U.S. Department of Energy
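
The access-control rule on slide 17 (each user sees the union of the metadata entries whose ACL admits them) can be illustrated with a small sketch. This is not the MDF or Globus Search implementation; the class name `MetadataEntry`, the function `visible_metadata`, and the example field values are all hypothetical and chosen only to mirror the User1/User2 example on the slide.

```python
from dataclasses import dataclass

PUBLIC = "public"  # hypothetical marker for entries anyone may read


@dataclass
class MetadataEntry:
    """One access-controlled block of metadata attached to a subject."""
    name: str
    content: dict
    visible_to: set  # principals allowed to read this entry


def visible_metadata(entries, user):
    """Return the union of all entries the given user is allowed to see."""
    merged = {}
    for entry in entries:
        if PUBLIC in entry.visible_to or user in entry.visible_to:
            merged.update(entry.content)
    return merged


# Mirrors the slide: Metadata1 -> User1, Metadata2 -> public, Metadata3 -> User2.
subject_entries = [
    MetadataEntry("Metadata1", {"composition": "Al"}, {"User1"}),
    MetadataEntry("Metadata2", {"title": "Example dataset"}, {PUBLIC}),
    MetadataEntry("Metadata3", {"notes": "embargoed"}, {"User2"}),
]

for user in ("User1", "User2", "anonymous"):
    print(user, visible_metadata(subject_entries, user))
```

In this sketch a later entry overwrites an earlier one when both define the same key; a real service would need an explicit policy for such conflicts.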

Editor's notes

  1. Simplify interfaces for data publication regardless of data size, type, and location; provide automation capabilities to capture data from pipelines; deploy APIs to foster community development and integration; encourage data re-use; incentivize data sharing; and support Open Science in materials research.
  2. PPPDB = Polymer Property Predictor and Database
  3. *"Enable inline viewing of data”* : Ben will have a browser open to a dataset (probably the same dataset, and show the navigation and opening of the preview) [very quick]
  4. *“Unified search across repositories, datasets, and DBs”*: Ben will demo search, showing how the facets work and how the free-text querying works [30 s]
  5. Important as it allows for publication and search of access-controlled data.
  7. *“Finding and using interatomic potentials”*: Ben will show how to use search to find and pull results from the NIST Interatomic Potentials Repo into Jupyter to complete an analysis using the ASE software package (30 s)
  8. General notes on presenting this: “We’re currently integrating data analytics tools into the MDF. The goal is to have a system that will automate collecting data from different repositories, sending the data to compute resources (e.g., Jetstream), and collecting the key results to send to users’ personal computers, with the whole process mediated using Globus transfer and Data Search.” “This system will give scientists ready access to a rich data ecosystem, state-of-the-art tools, and computing capability all under the same platform.” There is also a demonstration you can give for this: “For example, we have recently implemented a tool to build machine-learning-based force fields using the MDF.” [go to http://agni.demo.materialsdatafacility.org:6543/, highlight a few datasets and click “Go”] “This will start a task on Jetstream to train a potential using the AGNI method, developed at UConn, and then send a report about the training accuracy and the model itself to another system.”
  10. We replicated a machine learning force field from a recent publication using data available from a database at UConn. By combining that UConn dataset with other open datasets from the NIST MML Repository, we found that we can improve the accuracy of this model. The animation shows the effect on the accuracy of the model after adding in one of the two datasets. Adding these new datasets even increases the accuracy of the model on the original training data. It is also important to note that the NIST datasets were not created for training potentials, but because they were openly available, we could use them for a brand new application.
  13. Recap slide. We went over the results, we annotated them with some expert’s notes and had another round of reviews.
  14. Results are public and have been reviewed by graduate students and our expert at NIST: Debbie Audus
  15. We added Tg extraction using a rule-based approach, but will be looking at ML approaches for the properties as well.
  16. Preliminary results. Currently looking at improving the ~1000 results.
  18. Key highlights are paying customers and users training other users.
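
The rule-based Tg extraction mentioned in the notes can be illustrated with a deliberately simple sketch. It does not use ChemDataExtractor and is not the rule the MDF team wrote; it is a plain regular-expression stand-in, and the pattern names and the function `extract_tg` are hypothetical. It handles only sentences shaped like the example on the "Automatic extraction of properties" slide.

```python
import re

# Assumed pattern: a value written as "Tg: 211°" or "Tg = 211°".
TG_PATTERN = re.compile(r"Tg\s*[:=]\s*(-?\d+(?:\.\d+)?)\s*°")
# Assumed pattern: a compound label given as a single parenthesized token.
NAME_PATTERN = re.compile(r"\(([A-Za-z0-9][A-Za-z0-9\-]*)\)")


def extract_tg(sentence):
    """Return (compound, 'has_tg_of', value) or None if no rule matches."""
    tg = TG_PATTERN.search(sentence)
    if tg is None:
        return None
    # Use the closest parenthesized label that appears before the Tg value.
    names = [m for m in NAME_PATTERN.finditer(sentence) if m.end() <= tg.start()]
    if not names:
        return None
    return (names[-1].group(1), "has_tg_of", float(tg.group(1)))


sentence = (
    "The four-armed compd. (ANTH-OXA6t-OC12) with the dodecyloxy surface "
    "group is a high glass transition temp. (Tg: 211°) material and "
    "exhibits good soly."
)
print(extract_tg(sentence))
```

A single regular expression like this is brittle, for example with author-defined acronyms or values given in other units, which is consistent with the notes' plan to review flagged results and to look at ML approaches.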