Summary of goals, progress, and next steps for three aspects of the Materials Project (materialsproject.org) infrastructure:
* Validation: constantly guard against bugs in core data and imported data
* Provenance: know how data came to be
* Sandboxes: combine public and non-public data; "good fences make good neighbors"
Presenter: Dan Gunter, LBNL
2. Goals
• Validation
  - constantly guard against bugs in core data and imported data
• Provenance
  - know how data came to be
• Sandboxes
  - combine public and non-public data; "good fences make good neighbors"
4. Validation runs all the time
• Rules with "constraints" for every database (and sandbox)
• Test constraints against the entire DB every night → email reports
• Validation engine, etc. are all open-source software in pymatgen-db
[Diagram labels: Remote server; Validation engine; Rules; MP Databases; Reports (email, web pages, ...)]
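The nightly loop described on this slide (rules with constraints, tested against every document, summarized in a report) can be sketched in a few lines. This is a minimal illustration only: the `Constraint` class, the field names, and the report format are hypothetical and do not reproduce the pymatgen-db rule syntax or API.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical constraint model, NOT the pymatgen-db rule format.
OPS: dict[str, Callable[[Any, Any], bool]] = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
}


@dataclass(frozen=True)
class Constraint:
    field: str
    op: str
    value: Any

    def holds(self, doc: dict) -> bool:
        # In this sketch a missing field counts as a violation.
        if self.field not in doc:
            return False
        return OPS[self.op](doc[self.field], self.value)


def validate(docs, constraints):
    """Return (document id, violated constraint) pairs for all documents."""
    violations = []
    for doc in docs:
        for c in constraints:
            if not c.holds(doc):
                violations.append((doc.get("task_id"), c))
    return violations


def format_report(violations) -> str:
    """Build the plain-text body that a nightly email could carry."""
    if not violations:
        return "All constraints satisfied."
    lines = [f"{len(violations)} violation(s):"]
    for doc_id, c in violations:
        lines.append(f"  {doc_id}: expected {c.field} {c.op} {c.value!r}")
    return "\n".join(lines)


# Illustrative documents and rules (made-up values).
docs = [
    {"task_id": "t-1", "energy_per_atom": -3.2, "nsites": 4},
    {"task_id": "t-2", "energy_per_atom": 12.0, "nsites": 0},
]
rules = [Constraint("energy_per_atom", "<", 0), Constraint("nsites", ">", 0)]
print(format_report(validate(docs, rules)))
```

Keeping the rules as data rather than code is what would let the same engine run against every database and sandbox with a different rule set.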
6. Validation summary
Easy-to-use, integrated, efficient tools to report errors
Next steps
  - Record all check results in the DB
  - More sophisticated checks (Map/Reduce)
  - Make it easier to add new checks internally
  - Make it easier for anyone to add new checks
    • per-sandbox or even per-user ("MP Alerts")
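Two of the next steps, Map/Reduce-style checks and recording every check result, can be illustrated together. The sketch below assumes a hypothetical cross-document rule (each `material_id` appears exactly once) and uses a plain list as a stand-in for a results collection; none of these names come from the Materials Project code.

```python
from collections import Counter
from datetime import datetime, timezone


def map_ids(docs):
    """Map step: emit (material_id, 1) for every document."""
    for doc in docs:
        yield doc["material_id"], 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per key."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts


def check_unique_ids(docs) -> dict:
    """Cross-document check that a per-document constraint cannot express."""
    counts = reduce_counts(map_ids(docs))
    duplicates = sorted(k for k, n in counts.items() if n > 1)
    return {
        "check": "unique_material_id",  # hypothetical check name
        "passed": not duplicates,
        "duplicates": duplicates,
        "checked_at": datetime.now(timezone.utc).isoformat(),
    }


# A plain list stands in for the database collection of check results.
results_log = []
results_log.append(check_unique_ids([
    {"material_id": "m-1"},
    {"material_id": "m-2"},
    {"material_id": "m-1"},
]))
print(results_log[-1]["passed"], results_log[-1]["duplicates"])
```

Storing passing results as well as failures gives a history that shows when a check last ran, not only when it last failed.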
8. Types of provenance in the system
1) Calculation workflows
  - FireWorks records calculation inputs, ... results in great detail
2) External datasets
  - Structure Notation Language standardizes the naming of data sources and publications
3) Post-calculation data transformations
  - New "builders" provide a framework for tracking the creation of final database products
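The third kind of provenance, tracking how a final database product was created, can be sketched as a transformation that attaches a record of its own inputs to its output. The `Product` class, `build_with_provenance`, and the field names are hypothetical; they illustrate the idea and are not the Materials Project builder framework.

```python
from dataclasses import dataclass


@dataclass
class Product:
    data: dict        # the derived database entry
    provenance: dict  # how the entry was created


def build_with_provenance(builder_name, version, sources, transform) -> Product:
    """Run a transformation and record which inputs and code produced it."""
    return Product(
        data=transform(sources),
        provenance={
            "builder": builder_name,
            "version": version,
            "source_ids": [s["task_id"] for s in sources],
        },
    )


def lowest_energy(sources) -> dict:
    """Example transformation: keep the lowest-energy calculation."""
    best = min(sources, key=lambda s: s["energy_per_atom"])
    return {"formula": best["formula"], "energy_per_atom": best["energy_per_atom"]}


# Illustrative inputs (made-up ids and values).
tasks = [
    {"task_id": "t-1", "formula": "Fe2O3", "energy_per_atom": -6.5},
    {"task_id": "t-2", "formula": "Fe2O3", "energy_per_atom": -6.7},
]
product = build_with_provenance("lowest_energy", "0.1", tasks, lowest_energy)
print(product.data, product.provenance)
```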
11. Future work: unified view of provenance
[Diagram labels: ICSD; VASP result (x3); Post-processing; Material properties; Computation; Data import; processing; e.g., Defects]
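One way to picture a unified view of provenance is as a graph in which every data product points back to the things it was derived from, so that a single traversal answers "how did this value come to be?". The sketch below makes that assumption; the node names echo the diagram labels, but the edges between them are hypothetical and not taken from the slide.

```python
def lineage(node, parents):
    """Return every ancestor of `node`, each listed once (depth-first)."""
    seen = []
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        stack.extend(parents.get(current, []))
    return seen


# Hypothetical provenance edges: child -> list of direct sources.
parents = {
    "material_properties": ["post_processing"],
    "post_processing": ["vasp_result_1", "vasp_result_2"],
    "vasp_result_1": ["icsd_entry"],
    "vasp_result_2": ["icsd_entry"],
}

for ancestor in lineage("material_properties", parents):
    print(ancestor)
```

With one graph, calculation workflows, external datasets, and post-calculation transformations are just different node types rather than three separate record-keeping systems.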
14. Sandboxes = Database + Apps
[Diagram: Non-JCESR users see core data; JCESR users see core data + multivalent materials]
15. Technical challenges
• Pre-process data for real-time search
• Interfaces for per-user access control
  - https://materialsproject.org/materials/1234?sandbox=jcesr
  - Web UI elements
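The URL on this slide selects a sandbox through a `sandbox` query parameter. A minimal sketch of how a server might combine that parameter with per-user access control follows. The membership table, the user names, and the function names are hypothetical, and the real Materials Project site may handle this differently.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical membership table: sandbox name -> users allowed to see it.
SANDBOX_MEMBERS = {"jcesr": {"alice"}}


def requested_sandbox(url: str):
    """Return the value of the `sandbox` query parameter, or None."""
    values = parse_qs(urlparse(url).query).get("sandbox")
    return values[0] if values else None


def visible_databases(url: str, user: str):
    """List the databases a request may read: core data plus one sandbox."""
    sandbox = requested_sandbox(url)
    if sandbox is None:
        return ["core"]
    if user not in SANDBOX_MEMBERS.get(sandbox, set()):
        raise PermissionError(f"{user} may not access sandbox {sandbox!r}")
    return ["core", sandbox]


url = "https://materialsproject.org/materials/1234?sandbox=jcesr"
print(visible_databases(url, "alice"))
```

Checking membership on every request, rather than trusting the URL, is what keeps the fence between public and non-public data.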
16. Future: dynamic sandbox creation
Current:
  - Large & significant additional data / apps
    • e.g., JCESR
  - Longer-term connections to MP data
    • e.g., porous materials
  - Companies
    • e.g., VW/Stanford
Future:
  - small collab.
  - per-user?
  - CoD?
17. Summary
• Validation
  - guard against bugs by checking all data daily and at data import/creation time
• Provenance
  - universal standard for annotating data provenance
• Sandboxes
  - unified view of distinct databases
  - onramp for new collaborations and data
Editor's notes
Picture of 1915 Heinrich Campendonk painting, "Landscape with horses". Steve Martin paid $850K for a forged version of the painting, from a reputable art house in Paris, in 2004. He sold it at a loss of $250K before discovering it was a forgery. The forgery was performed by Wolfgang Beltracchi.
Sandboxes are a way to share preliminary data in the context of MP data and tools.