4. Provenance in DataONE
A DataONE search (here: “grass”) yields different packages with Data Provenance
(not covered: Semantic Search)
Ludäscher: KR&R ... HoH
5. Exploring Provenance in DataONE
• Let’s go there → Mark Carls. 2017. Analysis of hydrocarbons following
the Exxon Valdez oil spill, Gulf of Alaska, 1989 - 2014. Gulf of Alaska
Data Portal. urn:uuid:3249ada0-afe3-4dd6-875e-0f7928a4c171.
8. Adding YesWorkflow to DataONE
• Yaxing’s script with inputs & output products
• Christopher’s YesWorkflow model
• Christopher using Yaxing’s outputs as inputs for his script
• Christopher’s results can be traced back all the way to Yaxing’s input
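The tracing idea on this slide can be sketched as a walk over a "derived from" graph. This is only an illustration, not how DataONE or YesWorkflow store provenance: the function `trace_back`, the dictionary layout, and all artifact names are hypothetical.

```python
def trace_back(derived_from, artifact):
    """Return every upstream artifact that `artifact` depends on.

    `derived_from` maps an artifact to the artifacts it was computed from
    (a hypothetical, hand-written structure for illustration only).
    """
    seen = set()
    stack = [artifact]
    while stack:
        current = stack.pop()
        for parent in derived_from.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical artifact names standing in for the two scripts on the slide.
derived_from = {
    "yaxing_output": ["yaxing_input"],
    "christopher_result": ["yaxing_output", "christopher_parameters"],
}

lineage = trace_back(derived_from, "christopher_result")
```

In this toy graph, `lineage` contains Yaxing's input even though Christopher's script never read it directly, which is the point the slide makes.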
11. Computational Provenance …
• Origin, processing history of artifacts
– data products, figures, ...
– also: underlying workflow
→ understand methods, dataflow, and dependencies
→ think about the role of provenance in HoH!
[Figure: report cover, "Climate Change Impacts in the United States", U.S. National Climate Assessment, U.S. Global Change Research Program]
12. Related: Reproducibility Crisis
Watch out for links to HoH!
• Successful reproducibility study:
– increases trust in prior study ☺
– … but no surprises ☹
• Failed reproducibility study:
– decreases trust in (or falsifies) prior study ☹
– … but surprising failure yields new info/knowledge ☺
• Learning from failures!
– Not really a new, revolutionary idea ...
– What is a positive vs negative result anyways?
– ... fail early, fail often ...
17. SKOPE: Synthesized Knowledge Of Past Environments
Bocinsky, Kohler et al. study rain-fed maize of the Anasazi
– Four Corners; AD 600–1500. Climate change influenced the Mesa Verde migrations in the late 13th century AD.
– Uses a network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000.
– The algorithm estimates the joint information in tree rings and a climate signal to identify the “best” tree-ring chronologies for climate reconstruction.
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed
maize agricultural niche in the US Southwest. Nature
Communications. doi:10.1038/ncomms6618
… implemented as an R Script …
20. YW Demo Use Cases (IDCC’17)
Domain | Use case | Programming language | Provenance methods
Climate science | C3C4 | MATLAB | YW + MATLAB RunManager
Astrophysics | LIGO | Python | YW + NW (code-level)
Protein crystal samples | Simulate data collection | Python | YW + NW (code-level)
Biodiversity data curation | kurator-SPNHC | Python | YW-recon + YW-logging
Social network analysis | Twitter | Python | YW + NW (file-level)
Oceanography | OHIBC Howe Sound (multi-run multi-script) | R | YW + R RunManager
40. Non-unitary syntheses of systematic knowledge
Nico Franz
School of Life Sciences, Arizona State University
CIRSS Seminar – Center for Informatics Research in Science and Scholarship
February 17, 2017 – iSchool, University of Illinois Urbana-Champaign
@ http://www.slideshare.net/taxonbytes/franz-2017-uiuc-cirss-non-unitary-syntheses-of-systematic-knowledge
Tracing taxonomic names (concepts!) over time …
41. "Controlling the taxonomic variable"
[Figure labels: The 'consensus'; The 'bible'; The (formerly) federal 'standard'; The 'best', latest regional flora; "Just bad"]
• Expert views are in conflict
Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610
42. "Controlling the taxonomic variable"
[Figure labels: The 'consensus'; The 'bible'; The (formerly) federal 'standard'; The 'best', latest regional flora; "Just bad"]
• Impact: Name-based aggregation has created a novel synthesis that nobody believes in
Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610
43. "Controlling the taxonomic variable"
[Figure labels: The 'consensus'; The 'bible'; The (formerly) federal 'standard'; The 'best', latest regional flora; "Just bad"]
• Expert views are reconciled
• Solution: Instead of aggregating an artificial 'consensus', build translation services
Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610
44. Taxonomic concept alignment, Andropogon glomeratus-virginicus complex, spanning across 11 classifications authored 1889–2015
• 36 unique taxonomic names
• 88 taxonomic concept labels
→ name sec. author strings
• Alignment by A.S. Weakley
→ row position = congruence
• 1/36 names with unique 1 : 1 name : meaning cardinality across all classifications
– Andropogon virginicus
• Source: Franz et al. 2016 [1]
[1] Franz et al. 2016. Names are not good enough: reasoning over taxonomic change in the Andropogon complex. Semantic Web Journal (IOS). doi:10.3233/SW-160220
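One way to read the "1 : 1 name : meaning cardinality" check is sketched below. It is an assumption-laden illustration: the triples `(name, source, meaning)`, the function `one_to_one_names`, and the placeholder names are all hypothetical, and the sketch does not test whether a name occurs in every classification.

```python
from collections import defaultdict


def one_to_one_names(labels):
    """Return names that map to exactly one meaning which no other name shares.

    `labels` is an iterable of (name, source, meaning) triples, where
    (name, source) plays the role of a "name sec. author" concept label and
    `meaning` is a hypothetical identifier for a row of congruent concepts.
    """
    meanings_of = defaultdict(set)
    names_of = defaultdict(set)
    for name, _source, meaning in labels:
        meanings_of[name].add(meaning)
        names_of[meaning].add(name)
    return {
        name
        for name, meanings in meanings_of.items()
        if len(meanings) == 1 and len(names_of[next(iter(meanings))]) == 1
    }


# Placeholder data, not taken from the Andropogon alignment.
labels = [
    ("Name A", "C1", "m1"), ("Name A", "C2", "m1"),  # one name, one meaning
    ("Name B", "C1", "m2"), ("Name B", "C2", "m3"),  # one name, two meanings
    ("Name C", "C1", "m4"), ("Name D", "C2", "m4"),  # two names, one meaning
]
stable = one_to_one_names(labels)
```

Only "Name A" survives in this toy input; the other names either change meaning between sources or share a meaning with another name.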
46. Use case 1.a. Aligning Microcebus + Mirza sec. MSW3 (2005)
• "Taxonomic concept labels" identify input concept regions
• RCC–5 articulations provided for each species-level concept
• Input visualization: MSW3 (2005) versus MSW2 (1993)
Source: Franz et al. 2016. Two influential primate classifications logically aligned. doi:10.1093/sysbio/syw023
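The five RCC–5 relations can be illustrated by treating each concept region as a finite, non-empty set of members. This is a minimal sketch under that assumption; it is not Euler/X code, the function name `rcc5` and the relation names it returns are chosen here for illustration, and the member identifiers are hypothetical.

```python
def rcc5(a, b):
    """Classify two non-empty regions (given as sets) into one RCC-5 relation."""
    a, b = set(a), set(b)
    if not a or not b:
        raise ValueError("RCC-5 regions are assumed to be non-empty")
    if a == b:
        return "congruent"
    if a < b:
        return "proper part"          # a lies strictly inside b
    if a > b:
        return "inverse proper part"  # a strictly contains b
    if a & b:
        return "overlap"              # shared members, neither contains the other
    return "disjoint"


# Hypothetical member sets for two concepts from different classifications.
relation = rcc5({"s1", "s2"}, {"s2", "s3"})
```

Exactly one of the five relations holds for any pair of non-empty sets, which is why an expert can state an articulation as a single relation or, when uncertain, as a disjunction of several.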
47. Use case 1.a. Aligning Microcebus + Mirza sec. MSW3 (2005)
• Alignment visualization: "grey means taxonomically congruent"
48. Use case 1.a. Aligning Microcebus + Mirza sec. MSW3 (2005)
• Alignment visualization: "grey means taxonomically congruent"
[Figure labels: One name & congruent region; Many names & congruent region; One name & non-congruent regions; Many names & non-congruent regions; New names & exclusive regions]
• Application of coverage constraint: parent-to-parent articulations (><) are fully defined by alignment signal propagated from their respective children.
→ Sensible when complete sampling of children is intended.
51. Two Taxonomies: NDC vs CEN
“…in the face of incompatible information or data structures among users or among those
specifying the system, attempts to create unitary knowledge categories are futile. Rather, parallel
or multiple representational forms are required” [Bowker & Star, 2000, p.159]
National Diversity Council map (NDC): West, Southwest, Southeast, Midwest, Northeast
US Census Bureau map (CEN): West, South, Midwest, Northeast
Source: Yi-Yun (Jessica) Cheng (PhD student, iSchool @ Illinois)
57. How we align two taxonomies T1 and T2
• Step 1. Supply input taxonomies T1 and T2
• Step 2. Describe the relationships between T1 and T2
• Step 3. Iteratively edit articulations in Euler/X
[Figure: workflow diagram. Labels: T1, T2; Articulations A; Euler/X; N Possible Worlds; N=1: T3; N=0 (inconsistent) or N>1 (ambiguous): Add/Edit Articulations]
• … but where do the articulations come from??
– expert opinion
– automatically derived from data
69. EulerX: Some Implications
• Logic-based taxonomy alignment approach
– Disambiguate name-based taxonomy alignment over time
• 40% of the concepts in biological taxonomies undergo
name changes over time (Franz et al., 2016)
– May mitigate problems in equivalent crosswalking
• Membership condition problem that was often criticized in
crosswalking
– Preserves the original taxonomies while providing an
alignment view
• Solve data integration problems that happen in the more
coarse-grained relative crosswalking
https://github.com/EulerProject/ASIST17
72. Inconsistent Alignment Example
• Here: N = 10 taxa in T1, T2
• Euler/X finds: inconsistent!
→ diagnostic lattice of 2^10 = 1024 nodes
→ find minimal inconsistent subset (MIS)
→ maximal consistent subset (MCS) ...
→ show to user!
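The diagnosis step can be sketched as a brute-force walk over the subset lattice. This is a rough illustration, not the Euler/X algorithm: it assumes the lattice is the powerset of the input constraints, and `is_consistent` is a toy stand-in for a real reasoner (a set of `(key, value)` pairs counts as consistent when no key receives two different values). All names are hypothetical.

```python
from itertools import combinations


def is_consistent(constraints):
    """Toy oracle: consistent iff no key is assigned two different values."""
    seen = {}
    for key, value in constraints:
        if seen.setdefault(key, value) != value:
            return False
    return True


def minimal_inconsistent_subsets(constraints, check=is_consistent):
    """Walk the lattice bottom-up; keep inconsistent sets with no smaller MIS inside."""
    found = []
    for size in range(1, len(constraints) + 1):
        for subset in combinations(constraints, size):
            candidate = frozenset(subset)
            if any(mis <= candidate for mis in found):
                continue  # contains a known MIS, so it cannot be minimal
            if not check(subset):
                found.append(candidate)
    return found


def maximal_consistent_subsets(constraints, check=is_consistent):
    """Walk the lattice top-down; keep consistent sets not inside a larger MCS."""
    found = []
    for size in range(len(constraints), -1, -1):
        for subset in combinations(constraints, size):
            candidate = frozenset(subset)
            if any(candidate <= mcs for mcs in found):
                continue  # already covered by a larger consistent set
            if check(subset):
                found.append(candidate)
    return found


# Three distinct toy constraints give a lattice of 2**3 = 8 nodes.
constraints = [("x", 1), ("x", 2), ("y", 3)]
mis = minimal_inconsistent_subsets(constraints)
mcs = maximal_consistent_subsets(constraints)
```

With 10 constraints the same walk visits at most 2**10 = 1024 nodes, matching the lattice size on the slide; a practical tool would prune far more aggressively than this sketch does.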
76. All-in-One (Summary)
• R&D “teaser” on Workflows, Provenance, Reproducibility, KR&R
• EulerX Tools
– Reconciling multiple taxonomic perspectives (hypotheses)
– Logic-based Diagnosis
– Reasoning with Incomplete Knowledge, Possible Worlds
• One size doesn’t fit all:
– … formalize & specialize HoH approach
– … may employ some of (or combination of) …
• Workflow & Provenance Modeling
• PRIMAD model
• Answer Set Programming
• Argumentation Frameworks / Games
• …
• Invitation to collaborate!
– DataONE, SKOPE, Kurator
– New Whole Tale Biodiversity Informatics Working Group
• Transparency, Provenance, Reproducibility
– Reasoning with multiple taxonomies
77. Some EulerX (Pre-)History
• … Aristotle …
• … Euler …
• …
• … Greg Whitbread …
• [BPB93] J. H. Beach, S. Pramanik, and J. H. Beaman. Hierarchic taxonomic databases. Advances in Computer Methods for Systematic Biology: Artificial Intelligence, Databases, Computer Vision, 1993.
• [Ber95] Walter G. Berendsohn. The concept of “potential taxa” in databases. Taxon, 44:207–212, 1995.
• [Ber03] Walter G. Berendsohn. MoReTax – Handling Factual Information Linked to Taxonomic Concepts in Biology. No. 39 in Schriftenreihe für Vegetationskunde. Bundesamt für Naturschutz, 2003.
• [GG03] M. Geoffroy and A. Güntsch. Assembling and navigating the potential taxon graph. In [Ber03], pages 71–82, 2003.
• [TL07] Thau, D., & Ludäscher, B. (2007). Reasoning about taxonomies in first-order logic. Ecological Informatics, 2(3), 195-209.
• [FP09] Franz, N. M., & Peet, R. K. (2009). Perspectives: towards a language for mapping relationships among taxonomic concepts. Systematics and Biodiversity, 7(1), 5-20.
• …
78. Project Vignettes
• SKOPE: system and tools to discover, access, analyze, and visualize paleoenvironmental data
– unprecedented ability to explore provenance (detailed, comprehensible record of computational derivation of results)
– for researchers, tinkerers, and modelers
• Whole Tale:
– leverage & contribute to existing CI to support the whole tale (“living paper”), from workflow run to scholarly publication
– integrate tools & CI (DataONE, Globus, iRODS, NDS, ...) to simplify use and promote best practices
– driven by science WGs (Archaeology/SKOPE, materials science, astro, bio, ...)
79. HoH candidates: Argumentation Frameworks & Game Provenance
[Figure: game graph over positions a to n, with labels 1, 2, 3, and ∞]
• Query evaluation and logic-based argumentation can be understood as a game!
• One logic rule to rule them all …
win(X) :- move(X,Y), not win(Y)
• node color => edge color
– good vs bad moves
• good moves = natural, new notion of provenance!
• Implement, e.g., using Answer Set Programming (~ EulerX)
Aside: Games ~ Argumentation Frameworks
win(X) :- move(X,Y), not win(Y)
def(X) :- attacks(Y,X), not def(Y)
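The rule `win(X) :- move(X,Y), not win(Y)` can be evaluated on a finite move graph by backward induction, which yields three outcomes: won, lost, or drawn. The sketch below is a plain Python illustration rather than an Answer Set Programming encoding; the function names, the status strings, and the reading of "good moves" as moves from a won position to a lost one are assumptions made here.

```python
from collections import defaultdict


def solve_game(moves):
    """Label each position 'won', 'lost', or 'drawn' for the player to move."""
    successors = defaultdict(set)
    positions = set()
    for x, y in moves:
        successors[x].add(y)
        positions.update((x, y))

    status = {}
    changed = True
    while changed:
        changed = False
        for x in positions:
            if x in status:
                continue
            targets = successors.get(x, ())
            if any(status.get(y) == "lost" for y in targets):
                status[x] = "won"   # some move leaves the opponent lost
                changed = True
            elif all(status.get(y) == "won" for y in targets):
                status[x] = "lost"  # no moves, or every move helps the opponent
                changed = True
    # Positions never decided sit on cycles and count as drawn.
    return {x: status.get(x, "drawn") for x in positions}


def winning_moves(moves, status):
    """Moves from a won position into a lost one (assumed reading of 'good moves')."""
    return [(x, y) for x, y in moves if status[x] == "won" and status[y] == "lost"]


# Hypothetical move graph: a chain a -> b -> c and a two-cycle d <-> e.
moves = [("a", "b"), ("b", "c"), ("d", "e"), ("e", "d")]
status = solve_game(moves)
good = winning_moves(moves, status)
```

Because the aside's rule `def(X) :- attacks(Y,X), not def(Y)` has the same shape with the edge direction reversed, the same function could in principle be applied to an attack graph by passing each `attacks(Y,X)` fact as the pair `(X, Y)`; that mapping is an inference from the two rules, not something the slide spells out.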