SlideShare ist ein Scribd-Unternehmen logo
1 von 57
Downloaden Sie, um offline zu lesen
Data-driven Hypothesis Management
Bernardo Gon¸calves
benunesg@umich.edu
@ IBM Research Brazil – August 27, 2015
Bernardo Gon¸calves Data-driven Hypothesis Management 1 / 54
Research Vision Hypothesis Management
Hypothesis Management
Scientific research is based on the central idea of a
hypothesis , meant to be established or refuted.
Bernardo Gon¸calves Data-driven Hypothesis Management 2 / 54
Research Vision Hypothesis Management
Hypothesis Management
Scientific research is based on the central idea of a
hypothesis , meant to be established or refuted.
Over time and from multiple sources, we may collect
evidence that support their (gray-shaded) truth or falsehood.
Bernardo Gon¸calves Data-driven Hypothesis Management 2 / 54
Research Vision Hypothesis Management
Hypothesis Management
Scientific research is based on the central idea of a
hypothesis , meant to be established or refuted.
Over time and from multiple sources, we may collect
evidence that support their (gray-shaded) truth or falsehood.
Hypothesis management is, therefore, closely related to the
management of probabilistic data.
Bernardo Gon¸calves Data-driven Hypothesis Management 2 / 54
Research Vision Hypothesis Management
Hypothesis Management
Scientific research is based on the central idea of a
hypothesis , meant to be established or refuted.
Over time and from multiple sources, we may collect
evidence that support their (gray-shaded) truth or falsehood.
Hypothesis management is, therefore, closely related to the
management of probabilistic data.
...And it should no more be restricted to the traditional
sciences.
Bernardo Gon¸calves Data-driven Hypothesis Management 2 / 54
Research Vision The Big Data Revolution
The Big Data Revolution
“You can’t manage what you don’t measure.”
Big data: the management revolution
McAfee and Brynjolfsson. Harvard Business Review 90(10):60–6, 2012.
Bernardo Gon¸calves Data-driven Hypothesis Management 3 / 54
Research Vision The Big Data Revolution
Data-Driven Decision Making
“ Decisions that previously were based on guesswork,
or on painstakingly handcrafted models of reality, can
now be made using data-driven mathematical models .”
Jagadish et al. Communications of the ACM 57(7):86–94, 2014.
But it’s not just the analytics...
(cf. ACM SIGMOD Blog – May 5, 2012)
Bernardo Gon¸calves Data-driven Hypothesis Management 4 / 54
Research Vision Data-driven Discovery
Engineering the Data-driven Discovery Cycle (1)
HYPOTHESIS
synthesis, deduction, prediction
DATA
learning, induction, fitting
Challenges for data-driven discovery:
How to remove friction and go back and forth more smoothly?
How to make systems more usable so that data is really accessible?
Bernardo Gon¸calves Data-driven Hypothesis Management 5 / 54
Research Vision Data-driven Discovery
Engineering the Data-driven Discovery Cycle (2)
Hypothesis management in the big data era is a broad field. Some specific
topics:
1 Data ETL: automatic synthesis of probabilistic DB’s; (Ph.D.)
Υ-DB: Managing scientific hypotheses as uncertain data.
PVLDB 7(11):959-62, 2014 .
Managing scientific hypotheses as data with support for predictive analytics.
IEEE Computing in Science & Eng. 17(5):35-43, 2015 .
(IF 1.248, special issue Open Simulation Labs).
Synthesis method for probabilistic database design. ACM Sigmod Record
(IF 0.955, under review).
A note on the complexity of the causal ordering problem. Artif. Intell.
(IF 3.371; under review, preprint at CoRR abs/1508.05804).
2 Data retrieval: disambiguation of imprecise DB queries; (postdoc)
Bernardo Gon¸calves Data-driven Hypothesis Management 6 / 54
Data ETL (Extract, Transform, Load) Use Case
1 Research Vision
Hypothesis Management
The Big Data Revolution
Data-driven Discovery
2 Data ETL
Use Case
Design-by-Synthesis Pipeline
Hypothesis Encoding
Probabilistic DB Synthesis
3 Data Retrieval (Imprecise DB Queries)
4 Future Directions
5 Appendix
Probabilistic DB Synthesis
Bernardo Gon¸calves Data-driven Hypothesis Management 7 / 54
Data ETL (Extract, Transform, Load) Use Case
From Hypotheses to Data
Law of free fall
“If a body falls from rest, then its
velocity at any point is proportional to
the time it has been falling.”
(i)
for k = 0:n;
t = k * dt;
v = -g*t + v_0;
s = -(g/2)*t^2 + v_0*t + s_0;
t_plot(k) = t;
v_plot(k) = v;
s_plot(k) = s;
end
(iii)
a = −g
v = −g t + v0
s = −(g/2) t2
+ v0 t + s0
(ii)
FALL t v s
0 0 5000
1 −32 4984
2 −64 4936
3 −96 4856
4 −128 4744
· · · · · · · · ·
(iv)
Bernardo Gon¸calves Data-driven Hypothesis Management 8 / 54
Data ETL (Extract, Transform, Load) Use Case
Competing Hypotheses: Probability Distribution
Alternative hypotheses supposed to explain the same phenomenon.
H1. Free fall law
a = −g
v = −g t + v0
s = −(g/2) t2
+ v0 t + s0
H2. Stokes’ law
a = 0
v =− gD/ 4.6×10−4
s =−t gD/4.6×10−4 + s0
H3. Velocity-squared law
a = 0
v =−gD2
/ 3.29×10−6
s =−(gD2
/ 3.29×10−6
) t + s0
Bernardo Gon¸calves Data-driven Hypothesis Management 9 / 54
Data ETL (Extract, Transform, Load) Use Case
Concrete Use Scenario in Computational Science
Example 1.
Bob is a computational scientist who is playing with a number of
models and different parameter settings to see which one gives a best fit
to his observation samples.
Each run constitutes a specific model instantiation that is associated with
a unique file (‘big table’). In the end of the day his resulting datasets are
spread over many files and folders.
N.B.: most pronounced V’s of Veracity and Variety, as opposed to Volume or Velocity.
Bernardo Gon¸calves Data-driven Hypothesis Management 10 / 54
Data ETL (Extract, Transform, Load) Use Case
Beyond Files: a Big Table Database
User’s default choice: struggle with the files to find relevant data.
“There is life beyond files.” (Jim Gray)
FALL tid t g v0 s0 a v s
1 0 32 0 5000 −32 0 5000
1 1 32 0 5000 −32 −32 4984
1 2 32 0 5000 −32 −64 4936
· · · · · · · · · · · · · · · · · · · · · · · ·
2 0 32.2 0 5000 −32.2 0 5000
2 1 32.2 0 5000 −32.2 −32.2 4983.9
2 2 32.2 0 5000 −32.2 −64.4 4935.6
· · · · · · · · · · · · · · · · · · · · · · · ·
Bernardo Gon¸calves Data-driven Hypothesis Management 11 / 54
Data ETL (Extract, Transform, Load) Use Case
Do Better: a Hypothesis Database
a = −g
v = −g t + v0
s = −(g/2) t2
+ v0 t + s0
FALL tid t g v0 s0 a v s
1 0 32 0 5000 −32 0 5000
1 1 32 0 5000 −32 −32 4984
1 2 32 0 5000 −32 −64 4936
· · · · · · · · · · · · · · · · · · · · · · · ·
2 0 32.2 0 5000 −32.2 0 5000
2 1 32.2 0 5000 −32.2 −32.2 4983.9
2 2 32.2 0 5000 −32.2 −64.4 4935.6
· · · · · · · · · · · · · · · · · · · · · · · ·
Predictive structure: strong correlations from the math models (!).
Project-level standardization: pick a favorite MathML editor to
report and manage model equations declaratively.
Bernardo Gon¸calves Data-driven Hypothesis Management 12 / 54
Data ETL (Extract, Transform, Load) Design-by-Synthesis Pipeline
Design-by-Synthesis Pipeline
files
Sk
D1
k D2
k
... Dp
k
h
n
k=1 Hk
y
n
k=1
m
=1 Yk
ETL
(big table)
p-ETL
(u-factors)
Conditioning
Technical challenges:
1 Encoding : math equations → structural eqs. → functional deps.;
2 Causal reasoning: inferring the causal ordering and u-factors;
3 Probabilistic DB synthesis: normalization based on the u-factors;
4 Conditioning: probability distribution update in face of evidence.
Bernardo Gon¸calves Data-driven Hypothesis Management 13 / 54
Data ETL (Extract, Transform, Load) Encoding
1 Research Vision
Hypothesis Management
The Big Data Revolution
Data-driven Discovery
2 Data ETL
Use Case
Design-by-Synthesis Pipeline
Hypothesis Encoding
Probabilistic DB Synthesis
3 Data Retrieval (Imprecise DB Queries)
4 Future Directions
5 Appendix
Probabilistic DB Synthesis
Bernardo Gon¸calves Data-driven Hypothesis Management 14 / 54
Data ETL (Extract, Transform, Load) Encoding
Given set E of equations over set V of variables... (1)
Law of free fall:
H = {
g = 32,
v0 = 0,
s0 = 5000,
a = −g,
v = −g t + v0,
s = −(g/2)t2
+ v0 t + s0 }.
h-encode
−−−−−→
Σ = {
φ → g,
φ → v0,
φ → s0,
g υ → a,
g v0 t υ → v,
g v0 s0 t υ → s }.
Bernardo Gon¸calves Data-driven Hypothesis Management 15 / 54
Data ETL (Extract, Transform, Load) Encoding
Given set E of equations over set V of variables... (2)
Law of free fall:
S = {
f1(g) = 0,
f2(v0) = 0,
f3(s0) = 0,
f4(a, g) = 0,
f5(v, g, t, v0) = 0,
f6(s, g, t, v0, s0) = 0 }.
h-encode
−−−−−→
Σ = {
φ → g,
φ → v0,
φ → s0,
g υ → a,
g v0 t υ → v,
g v0 s0 t υ → s }.
Bernardo Gon¸calves Data-driven Hypothesis Management 16 / 54
Data ETL (Extract, Transform, Load) Encoding
Causal Ordering Algorithm (AI Literature)
1 Identify ‘minimal substructures’ at step k;
2 Reduce the matrix by eliminating them;
3 Call step k+1 recursively.
x1 x2 x3 x4 x5 x6 x7
f1 1 0 0 0 0 0 0
f2 0 1 0 0 0 0 0
f3 0 0 1 0 0 0 0
f4 1 1 1 1 1 0 0
f5 1 0 1 1 1 0 0
f6 0 0 0 1 0 1 0
f7 0 0 0 0 1 0 1
→
x1 x2 x3 x4 x5 x6 x7
f1 1 0 0 0 0 0 0
f2 0 1 0 0 0 0 0
f3 0 0 1 0 0 0 0
f4 1 1 1 1 1 0 0
f5 1 0 1 1 1 0 0
f6 0 0 0 1 0 1 0
f7 0 0 0 0 1 0 1
Bernardo Gon¸calves Data-driven Hypothesis Management 17 / 54
Data ETL (Extract, Transform, Load) Encoding
Equivalent to Finding K-Balanced Pseudo-Bicliques
S = {
f1(x1) = 0,
f2(x2) = 0,
f3(x3) = 0,
f4(x1, x2, x3, x4, x5) = 0,
f5(x1, x3, x4, x5) = 0,
f6(x4, x6) = 0,
f7(x5, x7) = 0 }.
E V
f1 x1
f2 x2
f3 x3
f4 x4
f5 x5
f6 x6
f7 x7
Theorem 1.
Let S(E, V) be a complete structure. The extraction of its causal ordering
by COA(S) is NP-Hard.
Bernardo Gon¸calves Data-driven Hypothesis Management 18 / 54
Data ETL (Extract, Transform, Load) Encoding
Easier: Complete Matching in a Bipartite Graph
Hopcroft-Karp algorithm to
solve it in O( |E| |S|);
E V
f1 x1
f2 x2
f3 x3
f4 x4
f5 x5
f6 x6
f7 x7
Proposition 2.
Let S(E, V) be a structure. Then a total causal mapping ϕ: E → V over
S exists iff S is complete .
Bernardo Gon¸calves Data-driven Hypothesis Management 19 / 54
Data ETL (Extract, Transform, Load) Encoding
Provably Correct Approach to Hypothesis Encoding
Cϕ = { (xa, xb) | there exists f ∈ E such that ϕ(f ) = xb
and xa ∈ Vars(f ) }
Proposition 1.
Let S(E, V) be a structure, and ϕ1 : E → V and ϕ2 : E → V be any two
total causal mappings over S. Then C+
ϕ1
= C+
ϕ2
.
Bernardo Gon¸calves Data-driven Hypothesis Management 20 / 54
Data ETL (Extract, Transform, Load) Encoding
Experiments: logscale base 2 on the structure length |S|
27 28 29 210 211 212 213 214 215 216 217 218 219 220 221
25
27
29
211
213
215
structure length |S|
time[ms]
h-encode
Corollary 1.
Let S(E, V) be a complete structure. Then a total causal mapping ϕ: E → V over S can be
found by TCM(S) in time that is bounded by O( |E| |S|) .
Bernardo Gon¸calves Data-driven Hypothesis Management 21 / 54
Data ETL (Extract, Transform, Load) DB Synthesis
Coming Back from our Dive
files
Sk
D1
k D2
k
... Dp
k
h
n
k=1 Hk
y
n
k=1
m
=1 Yk
ETL
(big table)
p-ETL
(u-factors)
Conditioning
Technical challenges:
1 Encoding : math equations → structural eqs. → functional deps.;
2 Causal reasoning: inferring the causal ordering and u-factors;
3 Probabilistic DB synthesis: normalization based on the u-factors;
4 Conditioning: probability distribution update in face of evidence.
Bernardo Gon¸calves Data-driven Hypothesis Management 22 / 54
Data ETL (Extract, Transform, Load) DB Synthesis
Bob’s Big Tables (before our help)
H3 tid φ υ t x0 b p y0 d r x y
1 1 3 0 30 .5 .02 4 .75 .02 30 4
1 1 3 ... 30 .5 .02 4 .75 .02 ... ...
2 1 3 0 30 .5 .018 4 .75 .023 30 4
2 1 3 ... 30 .5 .018 4 .75 .023 ... ...
3 1 3 0 30 .4 .02 4 .8 .02 30 4
3 1 3 ... 30 .4 .02 4 .8 .02 ... ...
4 1 3 0 30 .4 .018 4 .8 .023 30 4
4 1 3 ... 30 .4 .018 4 .8 .023 ... ...
5 1 3 0 30 .397 .02 4 .786 .02 30 4
5 1 3 ... 30 .397 .02 4 .786 .02 ... ...
6 1 3 0 30 .397 .018 4 .786 .023 30 4
6 1 3 5 30 .397 .018 4 .786 .023 50.1 62.9
6 1 3 10 30 .397 .018 4 .786 .023 13.8 8.65
6 1 3 15 30 .397 .018 4 .786 .023 79.3 8.23
6 1 3 20 30 .397 .018 4 .786 .023 12.6 30.7
6 1 3 ... 30 .397 .018 4 .786 .023 ... ...
Bernardo Gon¸calves Data-driven Hypothesis Management 23 / 54
Data ETL (Extract, Transform, Load) DB Synthesis
Synthesized Tables: Ready for Predictive Analysis
Y0 V → D φ υ
x1 → 1 1 1
x1 → 2 1 2
x1 → 3 1 3
Y1
3 V →D φ x0
x2 →1 1 30
Y2
3 V →D φ b
x3 →1 1 .5
x3 →2 1 .4
x3 →3 1 .397
Y3
3 V →D φ p
x4 →1 1 .020
x4 →2 1 .018
Y4
3 V1 →D1 V2 →D2 V3 →D3 V4 →D4 φ υ t y x
x1 → 3 x2 → 1 x3 → 1 x4 → 1 1 3 1900 4 30
x1 → 3 x2 → 1 x3 → 1 x4 → 1 1 3 ... ... ...
... ... ... ... 1 3 ... ... ...
x1 → 3 x2 → 1 x3 → 3 x4 → 2 1 3 1900 4 30
x1 → 3 x2 → 1 x3 → 3 x4 → 2 1 3 1901 4.12 41.5
x1 → 3 x2 → 1 x3 → 3 x4 → 2 1 3 1902 5.78 56.7
x1 → 3 x2 → 1 x3 → 3 x4 → 2 1 3 1903 11.7 72.8
x1 → 3 x2 → 1 x3 → 3 x4 → 2 1 3 1904 31.1 75.9
x1 → 3 x2 → 1 x3 → 3 x4 → 2 1 3 .. .. ..
Bernardo Gon¸calves Data-driven Hypothesis Management 24 / 54
Data ETL (Extract, Transform, Load) DB Synthesis
Querying Alternative Predictions with Probabilities
Y4
3 V1 →D1 V2 →D2 V3 →D3 V4 →D4 φ υ t y x
x1 → 3 x2 → 1 x3 → 1 x4 → 1 2 3 1900 4 30
... ... ... ... ... ... ... ... ...
x1 → 3 x2 → 1 x3 → 3 x4 → 2 2 3 1904 .. 75.92
... ... ... ... ... ... ... ... ...
W V → D Pr
x1 → 2 .33
x1 → 3 .33
x1 → 3 .33
x2 → 1 1
x3 → 1 .33
x3 → 2 .33
x3 → 3 .33
x4 → 1 .5
x4 → 2 .5
Operation conf();
θ = { x1 →3, x2 →1, x3 →3, x4 →2 },
Pr = .33 ∗ 1 ∗ .33 ∗ .5 ≈ .055
Bernardo Gon¸calves Data-driven Hypothesis Management 25 / 54
Data ETL (Extract, Transform, Load) DB Synthesis
Bob’s Prototype System (1)
Bernardo Gon¸calves Data-driven Hypothesis Management 26 / 54
Data ETL (Extract, Transform, Load) DB Synthesis
Bob’s Prototype System (2)
Bernardo Gon¸calves Data-driven Hypothesis Management 27 / 54
Data Retrieval (Imprecise DB Queries)
1 Research Vision
Hypothesis Management
The Big Data Revolution
Data-driven Discovery
2 Data ETL
Use Case
Design-by-Synthesis Pipeline
Hypothesis Encoding
Probabilistic DB Synthesis
3 Data Retrieval (Imprecise DB Queries)
4 Future Directions
5 Appendix
Probabilistic DB Synthesis
Bernardo Gon¸calves Data-driven Hypothesis Management 28 / 54
Data Retrieval (Imprecise DB Queries)
(Bob likes to issue) Search Queries over Database Systems
Bayesian smoothing to enable imprecise database queries;
Learning from implicit user feedback;
(Source: Li and Jagadish, SIGMOD 2014).
Bernardo Gon¸calves Data-driven Hypothesis Management 29 / 54
Data Retrieval (Imprecise DB Queries)
Takeaways
Datasets have an implicit conceptual structure that may be unveiled.
We have seen an example in support of hypothesis management.
Synthesis (out of math equations) of a probabilistic DB: ready for
predictive analytics.
Needed: such usable systems for fast data-driven decision making.
Bernardo Gon¸calves Data-driven Hypothesis Management 30 / 54
Future Directions
Future: More on Tools for the Data-driven Discovery Cycle
HYPOTHESIS
synthesis, deduction, prediction
DATA
learning, induction, fitting
Design for hybrid data systems ( structured + unstructured ):
Integrity and low latency (relational) with high throughput (big data).
More on modeling, extracting and transforming data;
Usable tools for end-users — no DBA or ‘data scientist’ in the loop.
Crowd hypotheses on the social web...
Bernardo Gon¸calves Data-driven Hypothesis Management 31 / 54
Future Directions
Thank you.
Questions?
Bernardo Gon¸calves
http://umich.edu/~benunesg
benunesg@umich.edu
Bernardo Gon¸calves Data-driven Hypothesis Management 32 / 54
Appendix
Design-by-Synthesis Pipeline
files
Sk
D1
k D2
k
... Dp
k
h
n
k=1 Hk
y
n
k=1
m
=1 Yk
ETL
(big table)
p-ETL
(u-factors)
Conditioning
Technical challenges:
1 Encoding: math equations → structural eqs. → functional deps.;
2 Causal reasoning: inferring the causal ordering and u-factors;
3 Probabilistic DB synthesis: normalization based on the u-factors;
4 Conditioning: probability distribution update in face of evidence.
Bernardo Gon¸calves Data-driven Hypothesis Management 33 / 54
Appendix
The Folding Σ of Σ
Acyclic pseudo-transitive reasoning
Algorithm 1 Folding of an fd set.
1: procedure folding(Σ: fd set)
Require: Σ given encodes complete structure S
Ensure: Returns fd set Σ , the folding of Σ
2: Σ ← ∅
3: for all X, A ∈ Σ do
4: Z ← AFolding(Σ, A)
5: Σ ← Σ ∪ Z, A
6: return Σ
Bernardo Gon¸calves Data-driven Hypothesis Management 34 / 54
Appendix
The Folding Σ of Σ
Acyclic pseudo-transitive reasoning
Σ = { φ → x0,
φ → b,
φ → p,
φ → y0,
φ → d,
φ → r,
x0 b p t υ y → x,
y0 d r t υ x → y }.
folding
−−−→
Υ(Σ) = {
x0 b p t υ y y0 d r → x,
y0 d r t υ x x0 b p → y }.
Bernardo Gon¸calves Data-driven Hypothesis Management 35 / 54
Appendix
FD set Σ89 (a real Physiology Model that “fits the screen”)
Σ89 = { φ → C1a C1p C2a C2p C3a Cglobal Cmyo
Dp100 Pc t delta t max t min taua taud,
φ t → DelP,
C1a C1p C2a C2p C3a Cglobal Cmyo Dp100 Pc υ → Dc,
Dc Pc υ → Tc,
Cglobal Cmyo Dc Pc υ → Ac,
Dc υ → D t min,
Ac υ → A t min,
DelP Pc υ → P,
D P υ → T,
A C1a C1p C2a C2p C3a Dp100 P T υ → Ttarget ,
Cglobal Cmyo D P υ → Atarget,
D t min Dc T Tc Ttarget t taud υ → D,
A t min Atarget t taua υ → A }.
Bernardo Gon¸calves Data-driven Hypothesis Management 36 / 54
Appendix
Its Folding Σ89
Υ(Σ89) = { C1a C2a φ υ → A t min Ac D t min Dc Tc,
C1a DelP φ υ → P,
C1a C2a DelP φ t υ T → A Atarget D Ttarget }.
Bernardo Gon¸calves Data-driven Hypothesis Management 37 / 54
Appendix
Experiments: Folding (logscale base 2)
27 28 29 210 211 212 213 214 215 216 217 218 219 220 221
26
210
214
218
structure length |S|
time[ms]
folding
Bernardo Gon¸calves Data-driven Hypothesis Management 38 / 54
Appendix DB Synthesis
Design-by-Synthesis Pipeline
files
Sk
D1
k D2
k
... Dp
k
h
n
k=1 Hk
y
n
k=1
m
=1 Yk
ETL
(big table)
p-ETL
(u-factors)
Conditioning
Technical challenges:
1 Encoding: math equations → structural eqs. → functional deps.;
2 Causal reasoning: inferring the causal ordering and u-factors;
3 Probabilistic DB synthesis: normalization based on the u-factors;
4 Conditioning: probability distribution update in face of evidence.
Bernardo Gon¸calves Data-driven Hypothesis Management 39 / 54
Appendix DB Synthesis
Defining the Theoretical U-factor
Y0 := πφ, υ (repair-keyφ(H0) ).
H0 φ υ
1 1
1 2
1 3
Y0 V → D φ υ
x1 → 1 1 1
x1 → 2 1 2
x1 → 3 1 3
W V → D Pr
x1 → 1 .33
x1 → 2 .33
x1 → 3 .33
Bernardo Gon¸calves Data-driven Hypothesis Management 40 / 54
Appendix DB Synthesis
U-factor Learning
Discovery of contingent functional dependencies
‘Input’ relation of the Lotka-Volterra hypothesis. Observe:
Multiplicity of parameter values;
Correlations between parameter values.
H1
3 tid φ x0 b p y0 d r
1 1 30 .5 .020 4 .75 .020
2 1 30 .5 .018 4 .75 .023
3 1 30 .4 .020 4 .8 .020
4 1 30 .4 .018 4 .8 .023
5 1 30 .397 .020 4 .786 .020
6 1 30 .397 .018 4 .786 .023
Ω = { φ x0 → y0,
φ b → d,
φ p → r }.
Bernardo Gon¸calves Data-driven Hypothesis Management 41 / 54
Appendix DB Synthesis
U-factorization
Defining the Empirical U-factors
Y i
k := πφ A G (repair-keyφ @ count ( γφ, A, G, count(∗)(Hk) ) ).
Y1
3 V →D φ x0 y0
x2 →1 1 30 4
Y2
3 V →D φ b p
x3 →1 1 .5 .5
x3 →2 1 .4 .8
x3 →3 1 .397 .786
Y3
3 V →D φ p r
x4 →1 1 .020 .020
x4 →2 1 .018 .023
W V → D Pr
... ...
x2 → 1 1
x3 → 1 .33
x3 → 2 .33
x3 → 3 .33
x4 → 1 .5
x4 → 2 .5
Bernardo Gon¸calves Data-driven Hypothesis Management 42 / 54
Appendix DB Synthesis
Design-Theoretic Properties
Desirable properties for probability update are ensured;
Claim-centered decomposition.
Theorem 6: BCNF w.r.t. causal dependencies.
Correctness of uncertainty decomposition.
Theorem 7: Lossless join w.r.t. causal dependencies.
Bernardo Gon¸calves Data-driven Hypothesis Management 43 / 54
Appendix DB Synthesis
Design-by-Synthesis Pipeline
files
Sk
D1
k D2
k
... Dp
k
h
n
k=1 Hk
y
n
k=1
m
=1 Yk
ETL
(big table)
p-ETL
(u-factors)
Conditioning
Technical challenges:
1 Encoding: math equations → structural eqs. → functional deps.;
2 Causal reasoning: inferring the causal ordering and u-factors;
3 Probabilistic DB synthesis: normalization based on the u-factors;
4 Conditioning: probability distribution update in face of evidence.
Bernardo Gon¸calves Data-driven Hypothesis Management 44 / 54
Appendix DB Synthesis
Systematic Application of Bayesian Inference
1 User selection of the observation sample;
2 System selection of the competing prediction samples;
3 Bayesian inference ;
4 Probability distribution update ;
Bernardo Gon¸calves Data-driven Hypothesis Management 45 / 54
Appendix DB Synthesis
Bayes’ Rule
Normal density likelihood function:
f (y | µk) =
1
√
2πσ2
e− 1
2σ2 (y−µk )2
(1)
Bayes ’ rule:
p(µk | y1, . . . , yn) =
n
j=1 f (yj | µkj ) p(µk)
m
i=1
n
j=1
f (yj | µij ) p(µi )
(2)
where y1, . . . , yn is the observation sample, µk is prediction k and σ is the
standard deviation parameter.
Bernardo Gon¸calves Data-driven Hypothesis Management 46 / 54
Appendix DB Synthesis
Probability Update in Face of Evidence
STUDY φ υ pO2 SHbO2 Prior Posterior
1 32 100 9.72764121981342E-1 .333 .335441
1 28 100 9.74346796798538E-1 .333 .335398
1 31 100 9.90781330988763E-1 .333 .329161
Bernardo Gon¸calves Data-driven Hypothesis Management 47 / 54
Appendix DB Synthesis
Prototype System (1)
Bernardo Gon¸calves Data-driven Hypothesis Management 48 / 54
Appendix DB Synthesis
Prototype System (2)
Bernardo Gon¸calves Data-driven Hypothesis Management 49 / 54
Appendix DB Synthesis
Prototype System (3)
Bernardo Gon¸calves Data-driven Hypothesis Management 50 / 54
Appendix DB Synthesis
Prototype System (4)
Bernardo Gon¸calves Data-driven Hypothesis Management 51 / 54
Appendix DB Synthesis
Prototype System (5)
Bernardo Gon¸calves Data-driven Hypothesis Management 52 / 54
Appendix DB Synthesis
Prototype System (6)
Bernardo Gon¸calves Data-driven Hypothesis Management 53 / 54
Appendix DB Synthesis
Viewpoint: Why Hypothesis Management?
“Numerical simulations and ‘big data’ are essential in modern science, but they do not
alone yield understanding. Building a massive database to
feed simulations without corrective loops between hypotheses and experimental tests
seems, at best, a waste of time and money.” Nature 513, Sept 2014.
Bernardo Gon¸calves Data-driven Hypothesis Management 54 / 54

Weitere ähnliche Inhalte

Was ist angesagt?

An algorithm for generating new mandelbrot and julia sets
An algorithm for generating new mandelbrot and julia setsAn algorithm for generating new mandelbrot and julia sets
An algorithm for generating new mandelbrot and julia setsAlexander Decker
 
Building a cutting-edge data processing environment on a budget
Building a cutting-edge data processing environment on a budgetBuilding a cutting-edge data processing environment on a budget
Building a cutting-edge data processing environment on a budgetGael Varoquaux
 
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thịDistance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thịHong Ong
 
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imagingScikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imagingGael Varoquaux
 
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...David Gleich
 
Creative component alexzajichek
Creative component alexzajichekCreative component alexzajichek
Creative component alexzajichekAlexZajichek
 
Towards Parallel Nonmonotonic Reasoning with Billions of Facts
Towards Parallel Nonmonotonic Reasoning with Billions of FactsTowards Parallel Nonmonotonic Reasoning with Billions of Facts
Towards Parallel Nonmonotonic Reasoning with Billions of FactsPlanetData Network of Excellence
 
A STRATEGIC HYBRID TECHNIQUE TO DEVELOP A GAME PLAYER
A STRATEGIC HYBRID TECHNIQUE TO DEVELOP A GAME PLAYERA STRATEGIC HYBRID TECHNIQUE TO DEVELOP A GAME PLAYER
A STRATEGIC HYBRID TECHNIQUE TO DEVELOP A GAME PLAYERijcseit
 
Big data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsBig data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsDavid Gleich
 

Was ist angesagt? (10)

An algorithm for generating new mandelbrot and julia sets
An algorithm for generating new mandelbrot and julia setsAn algorithm for generating new mandelbrot and julia sets
An algorithm for generating new mandelbrot and julia sets
 
Building a cutting-edge data processing environment on a budget
Building a cutting-edge data processing environment on a budgetBuilding a cutting-edge data processing environment on a budget
Building a cutting-edge data processing environment on a budget
 
ML基本からResNetまで
ML基本からResNetまでML基本からResNetまで
ML基本からResNetまで
 
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thịDistance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
 
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imagingScikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
 
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
 
Creative component alexzajichek
Creative component alexzajichekCreative component alexzajichek
Creative component alexzajichek
 
Towards Parallel Nonmonotonic Reasoning with Billions of Facts
Towards Parallel Nonmonotonic Reasoning with Billions of FactsTowards Parallel Nonmonotonic Reasoning with Billions of Facts
Towards Parallel Nonmonotonic Reasoning with Billions of Facts
 
A STRATEGIC HYBRID TECHNIQUE TO DEVELOP A GAME PLAYER
A STRATEGIC HYBRID TECHNIQUE TO DEVELOP A GAME PLAYERA STRATEGIC HYBRID TECHNIQUE TO DEVELOP A GAME PLAYER
A STRATEGIC HYBRID TECHNIQUE TO DEVELOP A GAME PLAYER
 
Big data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsBig data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphs
 

Andere mochten auch

TripShow.com - New Chinese Travel Website
TripShow.com - New Chinese Travel WebsiteTripShow.com - New Chinese Travel Website
TripShow.com - New Chinese Travel WebsiteDr Jens Thraenhart
 
Case study - yebhi, an eCommerce solution
Case study - yebhi, an eCommerce solutionCase study - yebhi, an eCommerce solution
Case study - yebhi, an eCommerce solutionVineela Kanapala
 
2012 feb 25 agile ux nyc, seiden, requirements to hypotheses
2012 feb 25 agile ux nyc, seiden, requirements to hypotheses2012 feb 25 agile ux nyc, seiden, requirements to hypotheses
2012 feb 25 agile ux nyc, seiden, requirements to hypothesesJoshua Seiden
 
Product roadmap-guide-by-product plan
Product roadmap-guide-by-product planProduct roadmap-guide-by-product plan
Product roadmap-guide-by-product planLewis Lin 🦊
 
Powerpoint final case study presentation
Powerpoint final case study presentationPowerpoint final case study presentation
Powerpoint final case study presentationJLUM13
 
Case study solving technique
Case study solving techniqueCase study solving technique
Case study solving techniqueTriptisahu
 

Andere mochten auch (9)

TripShow.com - New Chinese Travel Website
TripShow.com - New Chinese Travel WebsiteTripShow.com - New Chinese Travel Website
TripShow.com - New Chinese Travel Website
 
Apostila estruturas
Apostila estruturasApostila estruturas
Apostila estruturas
 
Case study - yebhi, an eCommerce solution
Case study - yebhi, an eCommerce solutionCase study - yebhi, an eCommerce solution
Case study - yebhi, an eCommerce solution
 
An Approach To Case Analysis
An Approach To Case AnalysisAn Approach To Case Analysis
An Approach To Case Analysis
 
Case study - jabong
Case study -  jabongCase study -  jabong
Case study - jabong
 
2012 feb 25 agile ux nyc, seiden, requirements to hypotheses
2012 feb 25 agile ux nyc, seiden, requirements to hypotheses2012 feb 25 agile ux nyc, seiden, requirements to hypotheses
2012 feb 25 agile ux nyc, seiden, requirements to hypotheses
 
Product roadmap-guide-by-product plan
Product roadmap-guide-by-product planProduct roadmap-guide-by-product plan
Product roadmap-guide-by-product plan
 
Powerpoint final case study presentation
Powerpoint final case study presentationPowerpoint final case study presentation
Powerpoint final case study presentation
 
Case study solving technique
Case study solving techniqueCase study solving technique
Case study solving technique
 

Ähnlich wie Data-Driven Hypothesis Management Techniques

Ijarcet vol-2-issue-4-1579-1582
Ijarcet vol-2-issue-4-1579-1582Ijarcet vol-2-issue-4-1579-1582
Ijarcet vol-2-issue-4-1579-1582Editor IJARCET
 
Using model-based statistical inference to learn about evolution
Using model-based statistical inference to learn about evolutionUsing model-based statistical inference to learn about evolution
Using model-based statistical inference to learn about evolutionErick Matsen
 
A practical Introduction to Machine(s) Learning
A practical Introduction to Machine(s) LearningA practical Introduction to Machine(s) Learning
A practical Introduction to Machine(s) LearningBruno Gonçalves
 
Predictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataPredictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataArthur Charpentier
 
Imprecision in learning: an overview
Imprecision in learning: an overviewImprecision in learning: an overview
Imprecision in learning: an overviewSebastien Destercke
 
8. Recursion.pptx
8. Recursion.pptx8. Recursion.pptx
8. Recursion.pptxAbid523408
 
Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...
Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...
Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...Magnify Analytic Solutions
 
20170415 當julia遇上資料科學
20170415 當julia遇上資料科學20170415 當julia遇上資料科學
20170415 當julia遇上資料科學岳華 杜
 
20171127 當julia遇上資料科學
20171127 當julia遇上資料科學20171127 當julia遇上資料科學
20171127 當julia遇上資料科學岳華 杜
 
Developing fast low-rank tensor methods for solving PDEs with uncertain coef...
Developing fast  low-rank tensor methods for solving PDEs with uncertain coef...Developing fast  low-rank tensor methods for solving PDEs with uncertain coef...
Developing fast low-rank tensor methods for solving PDEs with uncertain coef...Alexander Litvinenko
 
My presentation at University of Nottingham "Fast low-rank methods for solvin...
My presentation at University of Nottingham "Fast low-rank methods for solvin...My presentation at University of Nottingham "Fast low-rank methods for solvin...
My presentation at University of Nottingham "Fast low-rank methods for solvin...Alexander Litvinenko
 
Reproducible, Open Data Science in the Life Sciences
Reproducible, Open  Data Science in the  Life SciencesReproducible, Open  Data Science in the  Life Sciences
Reproducible, Open Data Science in the Life SciencesEamonn Maguire
 
Neural ODEs - A state-of-the-art Deep Learning approach to process time serie...
Neural ODEs - A state-of-the-art Deep Learning approach to process time serie...Neural ODEs - A state-of-the-art Deep Learning approach to process time serie...
Neural ODEs - A state-of-the-art Deep Learning approach to process time serie...Fabian Hadiji
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache SparkCloudera, Inc.
 
A scalable ontology reasoner via incremental materialization
A scalable ontology reasoner via incremental materializationA scalable ontology reasoner via incremental materialization
A scalable ontology reasoner via incremental materializationRokan Uddin Faruqui
 
A dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportationA dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportationDavid Gleich
 
When Classifier Selection meets Information Theory: A Unifying View
When Classifier Selection meets Information Theory: A Unifying ViewWhen Classifier Selection meets Information Theory: A Unifying View
When Classifier Selection meets Information Theory: A Unifying ViewMohamed Farouk
 
diffusion 모델부터 DALLE2까지.pdf
diffusion 모델부터 DALLE2까지.pdfdiffusion 모델부터 DALLE2까지.pdf
diffusion 모델부터 DALLE2까지.pdf수철 박
 

Ähnlich wie Data-Driven Hypothesis Management Techniques (20)

Ijarcet vol-2-issue-4-1579-1582
Ijarcet vol-2-issue-4-1579-1582Ijarcet vol-2-issue-4-1579-1582
Ijarcet vol-2-issue-4-1579-1582
 
Using model-based statistical inference to learn about evolution
Using model-based statistical inference to learn about evolutionUsing model-based statistical inference to learn about evolution
Using model-based statistical inference to learn about evolution
 
A practical Introduction to Machine(s) Learning
A practical Introduction to Machine(s) LearningA practical Introduction to Machine(s) Learning
A practical Introduction to Machine(s) Learning
 
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
 
Predictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataPredictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big data
 
Final Report-1-(1)
Final Report-1-(1)Final Report-1-(1)
Final Report-1-(1)
 
Imprecision in learning: an overview
Imprecision in learning: an overviewImprecision in learning: an overview
Imprecision in learning: an overview
 
8. Recursion.pptx
8. Recursion.pptx8. Recursion.pptx
8. Recursion.pptx
 
Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...
Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...
Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...
 
20170415 當julia遇上資料科學
20170415 當julia遇上資料科學20170415 當julia遇上資料科學
20170415 當julia遇上資料科學
 
20171127 當julia遇上資料科學
20171127 當julia遇上資料科學20171127 當julia遇上資料科學
20171127 當julia遇上資料科學
 
Developing fast low-rank tensor methods for solving PDEs with uncertain coef...
Developing fast  low-rank tensor methods for solving PDEs with uncertain coef...Developing fast  low-rank tensor methods for solving PDEs with uncertain coef...
Developing fast low-rank tensor methods for solving PDEs with uncertain coef...
 
My presentation at University of Nottingham "Fast low-rank methods for solvin...
My presentation at University of Nottingham "Fast low-rank methods for solvin...My presentation at University of Nottingham "Fast low-rank methods for solvin...
My presentation at University of Nottingham "Fast low-rank methods for solvin...
 
Reproducible, Open Data Science in the Life Sciences
Reproducible, Open  Data Science in the  Life SciencesReproducible, Open  Data Science in the  Life Sciences
Reproducible, Open Data Science in the Life Sciences
 
Neural ODEs - A state-of-the-art Deep Learning approach to process time serie...
Neural ODEs - A state-of-the-art Deep Learning approach to process time serie...Neural ODEs - A state-of-the-art Deep Learning approach to process time serie...
Neural ODEs - A state-of-the-art Deep Learning approach to process time serie...
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 
A scalable ontology reasoner via incremental materialization
A scalable ontology reasoner via incremental materializationA scalable ontology reasoner via incremental materialization
A scalable ontology reasoner via incremental materialization
 
A dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportationA dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportation
 
When Classifier Selection meets Information Theory: A Unifying View
When Classifier Selection meets Information Theory: A Unifying ViewWhen Classifier Selection meets Information Theory: A Unifying View
When Classifier Selection meets Information Theory: A Unifying View
 
diffusion 모델부터 DALLE2까지.pdf
diffusion 모델부터 DALLE2까지.pdfdiffusion 모델부터 DALLE2까지.pdf
diffusion 모델부터 DALLE2까지.pdf
 

Kürzlich hochgeladen

Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 

Kürzlich hochgeladen (20)

Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 

Data-Driven Hypothesis Management Techniques

  • 1. Data-driven Hypothesis Management Bernardo Gon¸calves benunesg@umich.edu @ IBM Research Brazil – August 27, 2015 Bernardo Gon¸calves Data-driven Hypothesis Management 1 / 54
  • 2. Research Vision Hypothesis Management Hypothesis Management Scientific research is based on the central idea of a hypothesis , meant to be established or refuted. Bernardo Gon¸calves Data-driven Hypothesis Management 2 / 54
  • 3. Research Vision Hypothesis Management Hypothesis Management Scientific research is based on the central idea of a hypothesis , meant to be established or refuted. Over time and from multiple sources, we may collect evidence that support their (gray-shaded) truth or falsehood. Bernardo Gon¸calves Data-driven Hypothesis Management 2 / 54
  • 4. Research Vision Hypothesis Management Hypothesis Management Scientific research is based on the central idea of a hypothesis , meant to be established or refuted. Over time and from multiple sources, we may collect evidence that support their (gray-shaded) truth or falsehood. Hypothesis management is, therefore, closely related to the management of probabilistic data. Bernardo Gon¸calves Data-driven Hypothesis Management 2 / 54
  • 5. Research Vision Hypothesis Management Hypothesis Management Scientific research is based on the central idea of a hypothesis , meant to be established or refuted. Over time and from multiple sources, we may collect evidence that support their (gray-shaded) truth or falsehood. Hypothesis management is, therefore, closely related to the management of probabilistic data. ...And it should no more be restricted to the traditional sciences. Bernardo Gon¸calves Data-driven Hypothesis Management 2 / 54
  • 6. Research Vision The Big Data Revolution The Big Data Revolution “You can’t manage what you don’t measure.” Big data: the management revolution McAfee and Brynjolfsson. Harvard Business Review 90(10):60–6, 2012. Bernardo Gon¸calves Data-driven Hypothesis Management 3 / 54
  • 7. Research Vision The Big Data Revolution Data-Driven Decision Making “ Decisions that previously were based on guesswork, or on painstakingly handcrafted models of reality, can now be made using data-driven mathematical models .” Jagadish et al. Communications of the ACM 57(7):86–94, 2014. But it’s not just the analytics... (cf. ACM SIGMOD Blog – May 5, 2012) Bernardo Gon¸calves Data-driven Hypothesis Management 4 / 54
  • 8. Research Vision Data-driven Discovery Engineering the Data-driven Discovery Cycle (1) HYPOTHESIS synthesis, deduction, prediction DATA learning, induction, fitting Challenges for data-driven discovery: How to remove friction and go back and forth more smoothly? How to make systems more usable so that data is really accessible? Bernardo Gon¸calves Data-driven Hypothesis Management 5 / 54
  • 9. Research Vision Data-driven Discovery Engineering the Data-driven Discovery Cycle (2) Hypothesis management in the big data era is a broad field. Some specific topics: 1 Data ETL: automatic synthesis of probabilistic DB’s; (Ph.D.) Υ-DB: Managing scientific hypotheses as uncertain data. PVLDB 7(11):959-62, 2014 . Managing scientific hypotheses as data with support for predictive analytics. IEEE Computing in Science & Eng. 17(5):35-43, 2015 . (IF 1.248, special issue Open Simulation Labs). Synthesis method for probabilistic database design. ACM Sigmod Record (IF 0.955, under review). A note on the complexity of the causal ordering problem. Artif. Intell. (IF 3.371; under review, preprint at CoRR abs/1508.05804). 2 Data retrieval: disambiguation of imprecise DB queries; (postdoc) Bernardo Gon¸calves Data-driven Hypothesis Management 6 / 54
  • 10. Data ETL (Extract, Transform, Load) Use Case 1 Research Vision Hypothesis Management The Big Data Revolution Data-driven Discovery 2 Data ETL Use Case Design-by-Synthesis Pipeline Hypothesis Encoding Probabilistic DB Synthesis 3 Data Retrieval (Imprecise DB Queries) 4 Future Directions 5 Appendix Probabilistic DB Synthesis Bernardo Gon¸calves Data-driven Hypothesis Management 7 / 54
  • 11. Data ETL (Extract, Transform, Load) Use Case From Hypotheses to Data Law of free fall “If a body falls from rest, then its velocity at any point is proportional to the time it has been falling.” (i) for k = 0:n; t = k * dt; v = -g*t + v_0; s = -(g/2)*t^2 + v_0*t + s_0; t_plot(k) = t; v_plot(k) = v; s_plot(k) = s; end (iii) a = −g v = −g t + v0 s = −(g/2) t2 + v0 t + s0 (ii) FALL t v s 0 0 5000 1 −32 4984 2 −64 4936 3 −96 4856 4 −128 4744 · · · · · · · · · (iv) Bernardo Gon¸calves Data-driven Hypothesis Management 8 / 54
  • 12. Data ETL (Extract, Transform, Load) Use Case Competing Hypotheses: Probability Distribution Alternative hypotheses supposed to explain the same phenomenon. H1. Free fall law a = −g v = −g t + v0 s = −(g/2) t2 + v0 t + s0 H2. Stokes’ law a = 0 v =− gD/ 4.6×10−4 s =−t gD/4.6×10−4 + s0 H3. Velocity-squared law a = 0 v =−gD2 / 3.29×10−6 s =−(gD2 / 3.29×10−6 ) t + s0 Bernardo Gon¸calves Data-driven Hypothesis Management 9 / 54
  • 13. Data ETL (Extract, Transform, Load) Use Case Concrete Use Scenario in Computational Science Example 1. Bob is a computational scientist who is playing with a number of models and different parameter settings to see which one gives a best fit to his observation samples. Each run constitutes a specific model instantiation that is associated with a unique file (‘big table’). In the end of the day his resulting datasets are spread over many files and folders. N.B.: most pronounced V’s of Veracity and Variety, as opposed to Volume or Velocity. Bernardo Gon¸calves Data-driven Hypothesis Management 10 / 54
  • 14. Data ETL (Extract, Transform, Load) Use Case Beyond Files: a Big Table Database User’s default choice: struggle with the files to find relevant data. “There is life beyond files.” (Jim Gray) FALL tid t g v0 s0 a v s 1 0 32 0 5000 −32 0 5000 1 1 32 0 5000 −32 −32 4984 1 2 32 0 5000 −32 −64 4936 · · · · · · · · · · · · · · · · · · · · · · · · 2 0 32.2 0 5000 −32.2 0 5000 2 1 32.2 0 5000 −32.2 −32.2 4983.9 2 2 32.2 0 5000 −32.2 −64.4 4935.6 · · · · · · · · · · · · · · · · · · · · · · · · Bernardo Gon¸calves Data-driven Hypothesis Management 11 / 54
  • 15. Data ETL (Extract, Transform, Load) Use Case Do Better: a Hypothesis Database a = −g v = −g t + v0 s = −(g/2) t2 + v0 t + s0 FALL tid t g v0 s0 a v s 1 0 32 0 5000 −32 0 5000 1 1 32 0 5000 −32 −32 4984 1 2 32 0 5000 −32 −64 4936 · · · · · · · · · · · · · · · · · · · · · · · · 2 0 32.2 0 5000 −32.2 0 5000 2 1 32.2 0 5000 −32.2 −32.2 4983.9 2 2 32.2 0 5000 −32.2 −64.4 4935.6 · · · · · · · · · · · · · · · · · · · · · · · · Predictive structure: strong correlations from the math models (!). Project-level standardization: pick a favorite MathML editor to report and manage model equations declaratively. Bernardo Gon¸calves Data-driven Hypothesis Management 12 / 54
  • 16. Data ETL (Extract, Transform, Load) Design-by-Synthesis Pipeline Design-by-Synthesis Pipeline files Sk D1 k D2 k ... Dp k h n k=1 Hk y n k=1 m =1 Yk ETL (big table) p-ETL (u-factors) Conditioning Technical challenges: 1 Encoding : math equations → structural eqs. → functional deps.; 2 Causal reasoning: inferring the causal ordering and u-factors; 3 Probabilistic DB synthesis: normalization based on the u-factors; 4 Conditioning: probability distribution update in face of evidence. Bernardo Gon¸calves Data-driven Hypothesis Management 13 / 54
  • 17. Data ETL (Extract, Transform, Load) Encoding 1 Research Vision Hypothesis Management The Big Data Revolution Data-driven Discovery 2 Data ETL Use Case Design-by-Synthesis Pipeline Hypothesis Encoding Probabilistic DB Synthesis 3 Data Retrieval (Imprecise DB Queries) 4 Future Directions 5 Appendix Probabilistic DB Synthesis Bernardo Gon¸calves Data-driven Hypothesis Management 14 / 54
  • 18. Data ETL (Extract, Transform, Load) Encoding Given set E of equations over set V of variables... (1) Law of free fall: H = { g = 32, v0 = 0, s0 = 5000, a = −g, v = −g t + v0, s = −(g/2)t2 + v0 t + s0 }. h-encode −−−−−→ Σ = { φ → g, φ → v0, φ → s0, g υ → a, g v0 t υ → v, g v0 s0 t υ → s }. Bernardo Gon¸calves Data-driven Hypothesis Management 15 / 54
  • 19. Data ETL (Extract, Transform, Load) Encoding Given set E of equations over set V of variables... (2) Law of free fall: S = { f1(g) = 0, f2(v0) = 0, f3(s0) = 0, f4(a, g) = 0, f5(v, g, t, v0) = 0, f6(s, g, t, v0, s0) = 0 }. h-encode −−−−−→ Σ = { φ → g, φ → v0, φ → s0, g υ → a, g v0 t υ → v, g v0 s0 t υ → s }. Bernardo Gon¸calves Data-driven Hypothesis Management 16 / 54
  • 20. Data ETL (Extract, Transform, Load) Encoding Causal Ordering Algorithm (AI Literature) 1 Identify ‘minimal substructures’ at step k; 2 Reduce the matrix by eliminating them; 3 Call step k+1 recursively. x1 x2 x3 x4 x5 x6 x7 f1 1 0 0 0 0 0 0 f2 0 1 0 0 0 0 0 f3 0 0 1 0 0 0 0 f4 1 1 1 1 1 0 0 f5 1 0 1 1 1 0 0 f6 0 0 0 1 0 1 0 f7 0 0 0 0 1 0 1 → x1 x2 x3 x4 x5 x6 x7 f1 1 0 0 0 0 0 0 f2 0 1 0 0 0 0 0 f3 0 0 1 0 0 0 0 f4 1 1 1 1 1 0 0 f5 1 0 1 1 1 0 0 f6 0 0 0 1 0 1 0 f7 0 0 0 0 1 0 1 Bernardo Gon¸calves Data-driven Hypothesis Management 17 / 54
  • 21. Data ETL (Extract, Transform, Load) Encoding Equivalent to Finding K-Balanced Pseudo-Bicliques S = { f1(x1) = 0, f2(x2) = 0, f3(x3) = 0, f4(x1, x2, x3, x4, x5) = 0, f5(x1, x3, x4, x5) = 0, f6(x4, x6) = 0, f7(x5, x7) = 0 }. E V f1 x1 f2 x2 f3 x3 f4 x4 f5 x5 f6 x6 f7 x7 Theorem 1. Let S(E, V) be a complete structure. The extraction of its causal ordering by COA(S) is NP-Hard. Bernardo Gon¸calves Data-driven Hypothesis Management 18 / 54
  • 22. Data ETL (Extract, Transform, Load) Encoding Easier: Complete Matching in a Bipartite Graph Hopcroft-Karp algorithm to solve it in O( |E| |S|); E V f1 x1 f2 x2 f3 x3 f4 x4 f5 x5 f6 x6 f7 x7 Proposition 2. Let S(E, V) be a structure. Then a total causal mapping ϕ: E → V over S exists iff S is complete . Bernardo Gon¸calves Data-driven Hypothesis Management 19 / 54
  • 23. Data ETL (Extract, Transform, Load) Encoding Provably Correct Approach to Hypothesis Encoding Cϕ = { (xa, xb) | there exists f ∈ E such that ϕ(f ) = xb and xa ∈ Vars(f ) } Proposition 1. Let S(E, V) be a structure, and ϕ1 : E → V and ϕ2 : E → V be any two total causal mappings over S. Then C+ ϕ1 = C+ ϕ2 . Bernardo Gon¸calves Data-driven Hypothesis Management 20 / 54
  • 24. Data ETL (Extract, Transform, Load) Encoding Experiments: logscale base 2 on the structure length |S| 27 28 29 210 211 212 213 214 215 216 217 218 219 220 221 25 27 29 211 213 215 structure length |S| time[ms] h-encode Corollary 1. Let S(E, V) be a complete structure. Then a total causal mapping ϕ: E → V over S can be found by TCM(S) in time that is bounded by O( |E| |S|) . Bernardo Gon¸calves Data-driven Hypothesis Management 21 / 54
  • 25. Data ETL (Extract, Transform, Load) DB Synthesis Coming Back from our Dive files Sk D1 k D2 k ... Dp k h n k=1 Hk y n k=1 m =1 Yk ETL (big table) p-ETL (u-factors) Conditioning Technical challenges: 1 Encoding : math equations → structural eqs. → functional deps.; 2 Causal reasoning: inferring the causal ordering and u-factors; 3 Probabilistic DB synthesis: normalization based on the u-factors; 4 Conditioning: probability distribution update in face of evidence. Bernardo Gon¸calves Data-driven Hypothesis Management 22 / 54
  • 26. Data ETL (Extract, Transform, Load) DB Synthesis Bob’s Big Tables (before our help) H3 tid φ υ t x0 b p y0 d r x y 1 1 3 0 30 .5 .02 4 .75 .02 30 4 1 1 3 ... 30 .5 .02 4 .75 .02 ... ... 2 1 3 0 30 .5 .018 4 .75 .023 30 4 2 1 3 ... 30 .5 .018 4 .75 .023 ... ... 3 1 3 0 30 .4 .02 4 .8 .02 30 4 3 1 3 ... 30 .4 .02 4 .8 .02 ... ... 4 1 3 0 30 .4 .018 4 .8 .023 30 4 4 1 3 ... 30 .4 .018 4 .8 .023 ... ... 5 1 3 0 30 .397 .02 4 .786 .02 30 4 5 1 3 ... 30 .397 .02 4 .786 .02 ... ... 6 1 3 0 30 .397 .018 4 .786 .023 30 4 6 1 3 5 30 .397 .018 4 .786 .023 50.1 62.9 6 1 3 10 30 .397 .018 4 .786 .023 13.8 8.65 6 1 3 15 30 .397 .018 4 .786 .023 79.3 8.23 6 1 3 20 30 .397 .018 4 .786 .023 12.6 30.7 6 1 3 ... 30 .397 .018 4 .786 .023 ... ... Bernardo Gon¸calves Data-driven Hypothesis Management 23 / 54
  • 27. Data ETL (Extract, Transform, Load) DB Synthesis Synthesized Tables: Ready for Predictive Analysis Y0 V → D φ υ x1 → 1 1 1 x1 → 2 1 2 x1 → 3 1 3 Y1 3 V →D φ x0 x2 →1 1 30 Y2 3 V →D φ b x3 →1 1 .5 x3 →2 1 .4 x3 →3 1 .397 Y3 3 V →D φ p x4 →1 1 .020 x4 →2 1 .018 Y4 3 V1 →D1 V2 →D2 V3 →D3 V4 →D4 φ υ t y x x1 → 3 x2 → 1 x3 → 1 x4 → 1 1 3 1900 4 30 x1 → 3 x2 → 1 x3 → 1 x4 → 1 1 3 ... ... ... ... ... ... ... 1 3 ... ... ... x1 → 3 x2 → 1 x3 → 3 x4 → 2 1 3 1900 4 30 x1 → 3 x2 → 1 x3 → 3 x4 → 2 1 3 1901 4.12 41.5 x1 → 3 x2 → 1 x3 → 3 x4 → 2 1 3 1902 5.78 56.7 x1 → 3 x2 → 1 x3 → 3 x4 → 2 1 3 1903 11.7 72.8 x1 → 3 x2 → 1 x3 → 3 x4 → 2 1 3 1904 31.1 75.9 x1 → 3 x2 → 1 x3 → 3 x4 → 2 1 3 .. .. .. Bernardo Gon¸calves Data-driven Hypothesis Management 24 / 54
  • 28. Data ETL (Extract, Transform, Load) DB Synthesis Querying Alternative Predictions with Probabilities Y4 3 V1 →D1 V2 →D2 V3 →D3 V4 →D4 φ υ t y x x1 → 3 x2 → 1 x3 → 1 x4 → 1 2 3 1900 4 30 ... ... ... ... ... ... ... ... ... x1 → 3 x2 → 1 x3 → 3 x4 → 2 2 3 1904 .. 75.92 ... ... ... ... ... ... ... ... ... W V → D Pr x1 → 2 .33 x1 → 3 .33 x1 → 3 .33 x2 → 1 1 x3 → 1 .33 x3 → 2 .33 x3 → 3 .33 x4 → 1 .5 x4 → 2 .5 Operation conf(); θ = { x1 →3, x2 →1, x3 →3, x4 →2 }, Pr = .33 ∗ 1 ∗ .33 ∗ .5 ≈ .055 Bernardo Gon¸calves Data-driven Hypothesis Management 25 / 54
  • 29. Data ETL (Extract, Transform, Load) DB Synthesis Bob’s Prototype System (1) Bernardo Gon¸calves Data-driven Hypothesis Management 26 / 54
  • 30. Data ETL (Extract, Transform, Load) DB Synthesis Bob’s Prototype System (2) Bernardo Gon¸calves Data-driven Hypothesis Management 27 / 54
  • 31. Data Retrieval (Imprecise DB Queries) 1 Research Vision Hypothesis Management The Big Data Revolution Data-driven Discovery 2 Data ETL Use Case Design-by-Synthesis Pipeline Hypothesis Encoding Probabilistic DB Synthesis 3 Data Retrieval (Imprecise DB Queries) 4 Future Directions 5 Appendix Probabilistic DB Synthesis Bernardo Gon¸calves Data-driven Hypothesis Management 28 / 54
  • 32. Data Retrieval (Imprecise DB Queries) (Bob likes to issue) Search Queries over Database Systems Bayesian smoothing to enable imprecise database queries; Learning from implicit user feedback; (Source: Li and Jagadish, SIGMOD 2014). Bernardo Gon¸calves Data-driven Hypothesis Management 29 / 54
  • 33. Data Retrieval (Imprecise DB Queries) Takeaways Datasets have an implicit conceptual structure that may be unveiled. We have seen an example in support of hypothesis management. Synthesis (out of math equations) of a probabilistic DB: ready for predictive analytics. Needed: such usable systems for fast data-driven decision making. Bernardo Gon¸calves Data-driven Hypothesis Management 30 / 54
  • 34. Future Directions Future: More on Tools for the Data-driven Discovery Cycle HYPOTHESIS synthesis, deduction, prediction DATA learning, induction, fitting Design for hybrid data systems ( structured + unstructured ): Integrity and low latency (relational) with high throughput (big data). More on modeling, extracting and transforming data; Usable tools for end-users — no DBA or ‘data scientist’ in the loop. Crowd hypotheses on the social web... Bernardo Gon¸calves Data-driven Hypothesis Management 31 / 54
  • 35. Future Directions Thank you. Questions? Bernardo Gon¸calves http://umich.edu/~benunesg benunesg@umich.edu Bernardo Gon¸calves Data-driven Hypothesis Management 32 / 54
  • 36. Appendix Design-by-Synthesis Pipeline files Sk D1 k D2 k ... Dp k h n k=1 Hk y n k=1 m =1 Yk ETL (big table) p-ETL (u-factors) Conditioning Technical challenges: 1 Encoding: math equations → structural eqs. → functional deps.; 2 Causal reasoning: inferring the causal ordering and u-factors; 3 Probabilistic DB synthesis: normalization based on the u-factors; 4 Conditioning: probability distribution update in face of evidence. Bernardo Gon¸calves Data-driven Hypothesis Management 33 / 54
  • 37. Appendix The Folding Σ of Σ Acyclic pseudo-transitive reasoning Algorithm 1 Folding of an fd set. 1: procedure folding(Σ: fd set) Require: Σ given encodes complete structure S Ensure: Returns fd set Σ , the folding of Σ 2: Σ ← ∅ 3: for all X, A ∈ Σ do 4: Z ← AFolding(Σ, A) 5: Σ ← Σ ∪ Z, A 6: return Σ Bernardo Gon¸calves Data-driven Hypothesis Management 34 / 54
  • 38. Appendix The Folding Σ of Σ Acyclic pseudo-transitive reasoning Σ = { φ → x0, φ → b, φ → p, φ → y0, φ → d, φ → r, x0 b p t υ y → x, y0 d r t υ x → y }. folding −−−→ Υ(Σ) = { x0 b p t υ y y0 d r → x, y0 d r t υ x x0 b p → y }. Bernardo Gon¸calves Data-driven Hypothesis Management 35 / 54
  • 39. Appendix FD set Σ89 (a real Physiology Model that “fits the screen”) Σ89 = { φ → C1a C1p C2a C2p C3a Cglobal Cmyo Dp100 Pc t delta t max t min taua taud, φ t → DelP, C1a C1p C2a C2p C3a Cglobal Cmyo Dp100 Pc υ → Dc, Dc Pc υ → Tc, Cglobal Cmyo Dc Pc υ → Ac, Dc υ → D t min, Ac υ → A t min, DelP Pc υ → P, D P υ → T, A C1a C1p C2a C2p C3a Dp100 P T υ → Ttarget , Cglobal Cmyo D P υ → Atarget, D t min Dc T Tc Ttarget t taud υ → D, A t min Atarget t taua υ → A }. Bernardo Gon¸calves Data-driven Hypothesis Management 36 / 54
  • 40. Appendix Its Folding Σ89 Υ(Σ89) = { C1a C2a φ υ → A t min Ac D t min Dc Tc, C1a DelP φ υ → P, C1a C2a DelP φ t υ T → A Atarget D Ttarget }. Bernardo Gon¸calves Data-driven Hypothesis Management 37 / 54
  • 41. Appendix Experiments: Folding (logscale base 2) 27 28 29 210 211 212 213 214 215 216 217 218 219 220 221 26 210 214 218 structure length |S| time[ms] folding Bernardo Gon¸calves Data-driven Hypothesis Management 38 / 54
  • 42. Appendix DB Synthesis Design-by-Synthesis Pipeline files Sk D1 k D2 k ... Dp k h n k=1 Hk y n k=1 m =1 Yk ETL (big table) p-ETL (u-factors) Conditioning Technical challenges: 1 Encoding: math equations → structural eqs. → functional deps.; 2 Causal reasoning: inferring the causal ordering and u-factors; 3 Probabilistic DB synthesis: normalization based on the u-factors; 4 Conditioning: probability distribution update in face of evidence. Bernardo Gon¸calves Data-driven Hypothesis Management 39 / 54
  • 43. Appendix DB Synthesis Defining the Theoretical U-factor Y0 := πφ, υ (repair-keyφ(H0) ). H0 φ υ 1 1 1 2 1 3 Y0 V → D φ υ x1 → 1 1 1 x1 → 2 1 2 x1 → 3 1 3 W V → D Pr x1 → 1 .33 x1 → 2 .33 x1 → 3 .33 Bernardo Gon¸calves Data-driven Hypothesis Management 40 / 54
  • 44. Appendix DB Synthesis U-factor Learning Discovery of contingent functional dependencies ‘Input’ relation of the Lotka-Volterra hypothesis. Observe: Multiplicity of parameter values; Correlations between parameter values. H1 3 tid φ x0 b p y0 d r 1 1 30 .5 .020 4 .75 .020 2 1 30 .5 .018 4 .75 .023 3 1 30 .4 .020 4 .8 .020 4 1 30 .4 .018 4 .8 .023 5 1 30 .397 .020 4 .786 .020 6 1 30 .397 .018 4 .786 .023 Ω = { φ x0 → y0, φ b → d, φ p → r }. Bernardo Gon¸calves Data-driven Hypothesis Management 41 / 54
  • 45. Appendix DB Synthesis U-factorization Defining the Empirical U-factors Y i k := πφ A G (repair-keyφ @ count ( γφ, A, G, count(∗)(Hk) ) ). Y1 3 V →D φ x0 y0 x2 →1 1 30 4 Y2 3 V →D φ b p x3 →1 1 .5 .5 x3 →2 1 .4 .8 x3 →3 1 .397 .786 Y3 3 V →D φ p r x4 →1 1 .020 .020 x4 →2 1 .018 .023 W V → D Pr ... ... x2 → 1 1 x3 → 1 .33 x3 → 2 .33 x3 → 3 .33 x4 → 1 .5 x4 → 2 .5 Bernardo Gon¸calves Data-driven Hypothesis Management 42 / 54
  • 46. Appendix DB Synthesis Design-Theoretic Properties Desirable properties for probability update are ensured; Claim-centered decomposition. Theorem 6: BCNF w.r.t. causal dependencies. Correctness of uncertainty decomposition. Theorem 7: Lossless join w.r.t. causal dependencies. Bernardo Gon¸calves Data-driven Hypothesis Management 43 / 54
  • 47. Appendix DB Synthesis Design-by-Synthesis Pipeline files Sk D1 k D2 k ... Dp k h n k=1 Hk y n k=1 m =1 Yk ETL (big table) p-ETL (u-factors) Conditioning Technical challenges: 1 Encoding: math equations → structural eqs. → functional deps.; 2 Causal reasoning: inferring the causal ordering and u-factors; 3 Probabilistic DB synthesis: normalization based on the u-factors; 4 Conditioning: probability distribution update in face of evidence. Bernardo Gon¸calves Data-driven Hypothesis Management 44 / 54
  • 48. Appendix DB Synthesis Systematic Application of Bayesian Inference 1 User selection of the observation sample; 2 System selection of the competing prediction samples; 3 Bayesian inference ; 4 Probability distribution update ; Bernardo Gon¸calves Data-driven Hypothesis Management 45 / 54
  • 49. Appendix DB Synthesis Bayes’ Rule Normal density likelihood function: f (y | µk) = 1 √ 2πσ2 e− 1 2σ2 (y−µk )2 (1) Bayes ’ rule: p(µk | y1, . . . , yn) = n j=1 f (yj | µkj ) p(µk) m i=1 n j=1 f (yj | µij ) p(µi ) (2) where y1, . . . , yn is the observation sample, µk is prediction k and σ is the standard deviation parameter. Bernardo Gon¸calves Data-driven Hypothesis Management 46 / 54
  • 50. Appendix DB Synthesis Probability Update in Face of Evidence STUDY φ υ pO2 SHbO2 Prior Posterior 1 32 100 9.72764121981342E-1 .333 .335441 1 28 100 9.74346796798538E-1 .333 .335398 1 31 100 9.90781330988763E-1 .333 .329161 Bernardo Gon¸calves Data-driven Hypothesis Management 47 / 54
  • 51. Appendix DB Synthesis Prototype System (1) Bernardo Gon¸calves Data-driven Hypothesis Management 48 / 54
  • 52. Appendix DB Synthesis Prototype System (2) Bernardo Gon¸calves Data-driven Hypothesis Management 49 / 54
  • 53. Appendix DB Synthesis Prototype System (3) Bernardo Gon¸calves Data-driven Hypothesis Management 50 / 54
  • 54. Appendix DB Synthesis Prototype System (4) Bernardo Gon¸calves Data-driven Hypothesis Management 51 / 54
  • 55. Appendix DB Synthesis Prototype System (5) Bernardo Gon¸calves Data-driven Hypothesis Management 52 / 54
  • 56. Appendix DB Synthesis Prototype System (6) Bernardo Gon¸calves Data-driven Hypothesis Management 53 / 54
  • 57. Appendix DB Synthesis Viewpoint: Why Hypothesis Management? “Numerical simulations and ‘big data’ are essential in modern science, but they do not alone yield understanding. Building a massive database to feed simulations without corrective loops between hypotheses and experimental tests seems, at best, a waste of time and money.” Nature 513, Sept 2014. Bernardo Gon¸calves Data-driven Hypothesis Management 54 / 54