From: https://doi.org/10.3897/tdwgproceedings.1.20380
The YesWorkflow toolkit (McPhillips et al. 2015b, McPhillips et al. 2015a) was designed to annotate data curation workflows in conventional scripts (e.g., Python, R, Java), but it can also be used to annotate YAML-based Kurator workflow configuration files. From an annotated file alone, YesWorkflow can render a top-level graphical view of the workflow structure (prospective provenance), including system inputs and outputs, actors, the connections among those actors, and the data expected to pass over those connections.
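To make the idea of an annotated script concrete, here is a minimal sketch in Python. The comment tags (@begin, @in, @out, @end) are YesWorkflow annotation keywords; the function, the data names, and the `list_annotations` helper are hypothetical and serve only as illustration. The helper is a toy scan for tagged comments, not the YesWorkflow extractor.

```python
import re

# A small script marked up with YesWorkflow-style comments.
# The cleaning step itself is a made-up example.
SCRIPT = '''
# @begin clean_names
# @in raw_names
# @out cleaned_names
def clean_names(raw_names):
    # @begin trim_whitespace
    # @in raw_names
    # @out trimmed_names
    trimmed_names = [n.strip() for n in raw_names]
    # @end trim_whitespace

    # @begin drop_empty
    # @in trimmed_names
    # @out cleaned_names
    cleaned_names = [n for n in trimmed_names if n]
    # @end drop_empty
    return cleaned_names
# @end clean_names
'''


def list_annotations(source):
    """Return (keyword, name) pairs for every '# @keyword name' comment line."""
    pattern = re.compile(r"#\s*@(\w+)\s+(\S+)")
    matches = (pattern.search(line) for line in source.splitlines())
    return [m.groups() for m in matches if m]


annotations = list_annotations(SCRIPT)

# Names of the annotated blocks, in the order they are opened.
blocks = [name for keyword, name in annotations if keyword == "begin"]

# The annotations are ordinary comments, so the script still runs unchanged.
namespace = {}
exec(SCRIPT, namespace)
```

Because the annotations live in comments, the structure they describe can be read without executing the script, which is what allows a prospective view to be drawn from the file alone.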
YesWorkflow also supports dynamic analysis and reporting on the results of the workflow (retrospective provenance) at various levels of granularity (e.g., at the actor level, script level, data level, record level, file level, function level), provided that it has been configured for each level. YesWorkflow includes an @Log annotation, which describes the semantic structure of a log message within some actor in the workflow. The annotation allows the log message to be linked to the actor within which it was created, and parts of that log message to be linked to the data passed between actors. After a run of the workflow, YesWorkflow can analyze the log messages and construct a store of facts, which can be queried and reasoned over to make statements about the evolving paths taken by particular data elements through the workflow and about the assertions made about those data elements within the workflow.
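The following sketch illustrates the general idea of turning structured log messages into queryable facts. It does not reproduce the @Log syntax: each actor's message structure is written here as a regular expression with named groups, and the actor names, log lines, and record values are all hypothetical.

```python
import re

# Hypothetical message structure per actor, as regexes with named groups.
LOG_PATTERNS = {
    "validate_name": re.compile(
        r"^Checked record (?P<record_id>\d+): name '(?P<value>.+)' is (?P<status>\w+)$"
    ),
    "validate_date": re.compile(
        r"^Checked record (?P<record_id>\d+): date '(?P<value>.+)' is (?P<status>\w+)$"
    ),
}

# Made-up log output from one workflow run.
LOG_LINES = [
    "Checked record 001: name 'Quercus alba' is valid",
    "Checked record 001: date '1987-13-02' is invalid",
    "Checked record 002: name 'Qercus alba' is invalid",
]


def extract_facts(lines, patterns):
    """Link each matching log line to its actor and to the fields it mentions."""
    facts = []
    for line in lines:
        for actor, pattern in patterns.items():
            match = pattern.match(line)
            if match:
                facts.append({"actor": actor, **match.groupdict()})
    return facts


def history(facts, record_id):
    """Query: which actors made which assertions about one record?"""
    return [(f["actor"], f["status"]) for f in facts if f["record_id"] == record_id]


facts = extract_facts(LOG_LINES, LOG_PATTERNS)
record_001 = history(facts, "001")
```

Here `record_001` lists the assertions made about record 001 in the order they were logged, which is the kind of per-record path the fact store is meant to answer.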
Provenance, like other metadata, appears to be rarely actionable or immediately useful to those who are expected to provide it. However, by refactoring runtime observables generated from retrospective provenance and integrating them with context information from prospective provenance analysis into hybrid queries, we show how both elements can yield hybrid visualizations that reveal “the plot” of the whole execution. In this way, a comprehensive workflow graph and a customizable data lineage report become actionable, meaningful provenance artifacts for a workflow run. Queries over the facts that YesWorkflow extracts from log messages after a workflow run, combined with the facts extracted from the annotated workflow itself, allow for powerful visualizations of the retrospective provenance of a workflow run and of particular data records within a branching workflow.
REFERENCES
- McPhillips T, Bowers S, Belhajjame K, Ludäscher B (2015a) Retrospective Provenance Without a Runtime Provenance Recorder. 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP'15). URL: https://www.usenix.org/conference/tapp15/workshop-program/presentation/mcphillips
- McPhillips T, Song T, Kolisnik T, Aulenbach S, Belhajjame K, Bocinsky RK, Cao Y, Cheney J, Chirigati F, Dey S, Freire J, Jones C, Hanken J, Kintigh K, Kohler T, Koop
3. Provenance is everywhere …
At the airport:
- “So where did you come from?”
- Where you came from determines where you go
YW provenance
3
Oxford English Dictionary: Provenance =
- the place of origin or earliest known history of something
- the beginning of something’s existence; its origin
- a record of ownership used as a guide to authenticity or quality
6. “The government are very keen on amassing statistics. They collect them, add them, raise them to the nth power, take the cube root and prepare wonderful diagrams. But you must never forget that every one of these figures comes in the first instance from the village watchman, who just puts down what he damn pleases.”
Why we need data lineage and computational provenance
7. Computational Provenance …
• Origin, processing history of artifacts
– data products, figures, ...
– also: workflow/script evolution …
→ understand methods, dataflow, and dependencies
Climate Change Impacts
in the United States
U.S. National Climate Assessment
U.S. Global Change Research Program
8. Rewind: Data Curation Workflows
(Filtered-Push … Kepler … Kurator projects)
15. Hybrid Provenance: YW Model + Runtime Observables (file level)
• The YW model can be connected with runtime observables
• → YW recon (for provenance reconstruction)
• Here:
• What specific files were read or written, and where do they occur in the workflow?
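One way to picture this kind of reconstruction is to match observed file names against the file URI templates declared in the workflow model. The sketch below uses the two input templates and three file names that appear in the LIGO example later in this deck. The matching logic and helper names are assumptions made for illustration and are not the YW recon implementation.

```python
import re

# File URI templates for two workflow inputs (from the LIGO example).
URI_TEMPLATES = {
    "FN_Detector": "{Detector}_LOSC_4_V1-1126259446-32.hdf5",
    "FN_Sampling_rate": "H-H1_LOSC_{DownSampling}_V1-1126259446-32.hdf5",
}

# File names observed on disk after a run (from the LIGO example).
OBSERVED_FILES = [
    "L-L1_LOSC_4_V1-1126259446-32.hdf5",
    "H-H1_LOSC_4_V1-1126259446-32.hdf5",
    "H-H1_LOSC_16_V1-1126259446-32.hdf5",
]


def uri_template_to_regex(template):
    """Turn 'text{var}text' into a regex with one named group per variable."""
    parts = re.split(r"\{(\w+)\}", template)
    pieces = []
    for index, part in enumerate(parts):
        # Even positions hold literal text, odd positions hold variable names.
        if index % 2 == 0:
            pieces.append(re.escape(part))
        else:
            pieces.append("(?P<%s>[^/]+?)" % part)
    return re.compile("^" + "".join(pieces) + "$")


def reconstruct(templates, files):
    """Map each data name to (file, variable bindings) for every matching file."""
    result = {}
    for data_name, template in templates.items():
        regex = uri_template_to_regex(template)
        matched = []
        for file_name in files:
            match = regex.match(file_name)
            if match:
                matched.append((file_name, match.groupdict()))
        result[data_name] = matched
    return result


recon = reconstruct(URI_TEMPLATES, OBSERVED_FILES)
```

Each entry in `recon` answers both halves of the question on the slide: which concrete files were involved, and which data element of the workflow model they belong to.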
28. LIGO example: What strain_L1_whitenbp depends on …
Figure panels: overall workflow; upstream of strain_L1_whitenbp (prospective).
Overall workflow GRAVITATIONAL_WAVE_DETECTION (each block with its description, then the data and file nodes listed next to it in the figure):
- LOAD_DATA (Load hdf5 data.): strain_H1, strain_L1, strain_16, strain_4
- AMPLITUDE_SPECTRAL_DENSITY (Amplitude spectral density.): ASDs, file:GW150914_ASDs.png; PSD_H1, PSD_L1
- WHITENING (suppress low frequencies noise.): strain_H1_whiten, strain_L1_whiten
- BANDPASSING (remove high frequency noise.): strain_H1_whitenbp, strain_L1_whitenbp
- STRAIN_WAVEFORM_FOR_WHITENED_DATA (plot whitened data.): WHITENED_strain_data, file:GW150914_strain_whitened.png
- SPECTROGRAMS_FOR_STRAIN_DATA (plot spectrogram for strain data.): spectrogram, file:GW150914_{detector}_spectrogram.png
- SPECTROGRAMS_FOR_WHITEND_DATA (plot spectrogram for whitened data.): spectrogram_whitened, file:GW150914_{detector}_spectrogram_whitened.png
- FILTER_COEFS (Filter signal in time domain (bandpassing).): COEFFICIENTS
- FILTER_DATA (filter data.): filtered_white_noise_data, file:GW150914_filter.png; strain_H1_filt, strain_L1_filt
- STRAIN_WAVEFORM_FOR_FILTERED_DATA (plot the filtered data.): H1_strain_filtered, file:GW150914_H1_strain_filtered.png; H1_strain_unfiltered, file:GW150914_H1_strain_unfiltered.png
- WAVE_FILE_GENERATOR_FOR_WHITENED_DATA (Make sound files for whitened data.): whitened_bandpass_wavefile, file:GW150914_{detector}_whitenbp.wav
- SHIFT_FREQUENCY_BANDPASSED (shift frequency of bandpassed signal.): strain_H1_shifted, strain_L1_shifted
- WAVE_FILE_GENERATOR_FOR_SHIFTED_DATA (Make sound files for shifted data.): shifted_wavefile, file:GW150914_{detector}_shifted.wav
- DOWNSAMPLING (Downsampling from 16384 Hz to 4096 Hz.): H1_ASD_SamplingRate, file:GW150914_H1_ASD_{SamplingRate}.png
Workflow inputs:
- FN_Detector, file:{Detector}_LOSC_4_V1-1126259446-32.hdf5
- FN_Sampling_rate, file:H-H1_LOSC_{DownSampling}_V1-1126259446-32.hdf5
- fs
upstream(strain_L1_whitenbp) [prospective]
- WHITENING: strain_H1_whiten, strain_L1_whiten
- AMPLITUDE_SPECTRAL_DENSITY: PSD_H1, PSD_L1
- LOAD_DATA: strain_H1, strain_L1
- BANDPASSING: strain_L1_whitenbp
- FN_Detector: file:{Detector}_LOSC_4_V1-...
- FN_Sampling_rate: file:H-H1_LOSC_{Rate}_V1-...
- fs
upstream(strain_L1_whitenbp) [URI-recon]
- WHITENING: strain_H1_whiten, strain_L1_whiten
- AMPLITUDE_SPECTRAL_DENSITY: PSD_H1, PSD_L1
- LOAD_DATA: strain_H1, strain_L1
- BANDPASSING: strain_L1_whitenbp
- FN_Detector: L-L1_LOSC_4_V1-1126259446-32.hdf5, H-H1_LOSC_4_V1-1126259446-32.hdf5
- FN_Sampling_rate: H-H1_LOSC_4_V1-1126259446-32.hdf5, H-H1_LOSC_16_V1-1126259446-32.hdf5
- fs
upstream(strain_L1_whitenbp) [NW-recon]
- WHITENING: strain_L1_whiten
  strain_L1_whiten = array([8.494, -1.672, ..., 72.156])
- AMPLITUDE_SPECTRAL_DENSITY: PSD_L1
  psd_L1 = scipy.interpolate.interpolate.interp1d object at 0x113969418
- LOAD_DATA: strain_L1
  strain_L1 = array([-1.779e-18, -1.765e-18, ..., -1.719e-18])
- BANDPASSING: strain_L1_whitenbp
  strain_L1_whitenbp = array([8.184, 19.935,..., -0.684])
- FN_Detector
  fn_d = L-L1_LOSC_4_V1-1126259446-32.hdf5
- fs
  fs = 4096
Upstream of strain_L1_whitenbp (hybrid YW-NW at the code-level)
Upstream of strain_L1_whitenbp (hybrid YW-NW at the file-level)
3 inputs spread across 5 (=2x2 + 1) files
Does intermediate data strain_L1_whitenbp depend on all 5 inputs?
• Intermediate data strain_L1_whitenbp depends only on 2 out of 5 inputs!
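The prospective upstream query on this slide can be pictured as a backward traversal of the dataflow graph. The sketch below uses block and data names from the slide, but the exact wiring of inputs to blocks is an assumption (a plausible reading of the figure, not taken from the LIGO script), and the `upstream` function is a hypothetical helper rather than a YesWorkflow query.

```python
from collections import deque

# block name -> (input data names, output data names); wiring is assumed.
BLOCKS = {
    "LOAD_DATA": (["FN_Detector", "FN_Sampling_rate"], ["strain_H1", "strain_L1"]),
    "AMPLITUDE_SPECTRAL_DENSITY": (["strain_H1", "strain_L1", "fs"], ["PSD_H1", "PSD_L1"]),
    "WHITENING": (
        ["strain_H1", "strain_L1", "PSD_H1", "PSD_L1"],
        ["strain_H1_whiten", "strain_L1_whiten"],
    ),
    "BANDPASSING": (
        ["strain_H1_whiten", "strain_L1_whiten", "fs"],
        ["strain_H1_whitenbp", "strain_L1_whitenbp"],
    ),
    "SHIFT_FREQUENCY_BANDPASSED": (
        ["strain_H1_whitenbp", "strain_L1_whitenbp"],
        ["strain_H1_shifted", "strain_L1_shifted"],
    ),
}


def upstream(target, blocks):
    """Return (blocks, data) that the target data element may depend on."""
    producer = {out: name for name, (_, outs) in blocks.items() for out in outs}
    seen_blocks, seen_data = set(), set()
    queue = deque([target])
    while queue:
        data = queue.popleft()
        block = producer.get(data)
        # Workflow inputs have no producer; visited blocks need no second pass.
        if block is None or block in seen_blocks:
            continue
        seen_blocks.add(block)
        for parent in blocks[block][0]:
            if parent not in seen_data:
                seen_data.add(parent)
                queue.append(parent)
    return seen_blocks, seen_data


up_blocks, up_data = upstream("strain_L1_whitenbp", BLOCKS)

# Upstream data with no producing block are the workflow inputs.
produced = {out for _, outs in BLOCKS.values() for out in outs}
up_inputs = up_data - produced
```

A model-level query like this can only say what the target might depend on, so it reports all three inputs. Narrowing that to the 2 out of 5 input files noted on the slide requires the runtime observations from the reconstruction panels.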
29. YW-IDCC’17 Demo Use Cases
Domain | Use case | Programming language | Provenance methods
Climate science | C3C4 | MATLAB | YW + MATLAB RunManager
Astrophysics | LIGO | Python | YW + NW (code-level)
Protein crystal samples | Simulate data collection | Python | YW + NW (code-level)
Biodiversity data curation | kurator-SPNHC | Python | YW-recon + YW-logging
Social network analysis | Twitter | Python | YW + NW (file-level)
Oceanography | OHIBC Howe Sound (multi-run multi-script) | R | YW + R RunManager
30. Adding YW to DataONE
Yaxing’s script with inputs & output products
Christopher’s YesWorkflow model
Christopher using Yaxing’s outputs as inputs for his script
Christopher’s results can be traced back all the way to Yaxing’s input