SlideShare a Scribd company logo
1 of 35
Download to read offline
Provenance for Data Mining
Boris Glavic1 Javed Siddique2 Periklis Andritsos3
Ren´ee J. Miller2
Illinois Institute of
Technology1
DBGroup
University of Toronto2
Miller Lab
University of Toronto3
iSchool
TaPP 2013, June 18, 2013
Outline
1 Motivation
2 Provenance for Data Mining
3 Frequent Itemset Provenance
4 Multidimensional Scaling Provenance
5 Conclusions
Data Mining / KDD
Goals
Extract useful information from data
Approach
1 Preprocessing
Cleaning
Feature Extraction
. . .
2 Apply algorithms to extract information
Clustering, Frequent Itemset Mining, Classification, . . .
Most approaches: Reduce size/Summarize data
Slide 1 of 18 Boris Glavic Datamining Provenance: Motivation
Provenance for Data Mining?
Dilemmas
Purpose of data mining necessitates summarization
“Needle in the haystack”
Loss of information
Makes interpreting “raw” results harder
User point of view: DM algorithm is black box
Slide 2 of 18 Boris Glavic Datamining Provenance: Motivation
Provenance for Data Mining?
How to solve Dilemmas?
Selective access to input data result is based on
Inputs to mining algorithm
Inputs to preprocessing
Contextual information
Understand importance of inputs for results
Input data
Parameter settings
Understand how data mining algorithm generates result from inputs
Understand missing results
Slide 2 of 18 Boris Glavic Datamining Provenance: Motivation
Provenance for Data Mining?
How to solve Dilemmas?
Selective access to input data result is based on (Data Provenance+)
Inputs to mining algorithm
Inputs to preprocessing
Contextual information
Understand importance of inputs for results (Responsibility)
Input data
Parameter settings
Understand how data mining algorithm generates result from inputs
(Process provenance)
Understand missing results (Missing answers)
Slide 2 of 18 Boris Glavic Datamining Provenance: Motivation
Related Work
Provenance
Database provenance, Workflow provenance, Missing answers,
Responsibility, . . .
Data Mining
Enriching mining results with additional information
Contextual information for frequent itemsetsa
Visualization techniquesb
Determining effect of parameter settings/inputs on result
e.g., K-means cluster stability based on parameter settingsc
a
Q. Mei et al. “Generating semantic annotations for frequent patterns with context analysis”. In: SIGKDD.
2006, pp. 337–346.
b
D.A. Keim and H.P. Kriegel. “Visualization techniques for mining large databases: A comparison”. In: TKDE
8.6 (1996), pp. 923–938.
c
L.I. Kuncheva and D.P. Vetrov. “Evaluation of stability of k-means cluster ensembles with respect to random
initialization”. In: TPAMI 28.11 (2006), pp. 1798–1808.
Slide 3 of 18 Boris Glavic Datamining Provenance: Motivation
Contributions
Analyze requirements and use cases for data mining provenance
(DMProv)
Discuss applicability of existing approaches
Outline challenges and sketch research directions
Exemplify concepts on two concrete mining algorithms
Frequent Itemset Mining (FIM)
Multidimensional Scaling (MDS)
Slide 4 of 18 Boris Glavic Datamining Provenance: Motivation
Outline
1 Motivation
2 Provenance for Data Mining
3 Frequent Itemset Provenance
4 Multidimensional Scaling Provenance
5 Conclusions
Fine-Grained Provenance
Why-Provenance
Here Why-Provenance means all models based on influence
Subset of the input that caused output to appear in result
“Caused by” modelled as
Sufficiency
Necessity
Preservation of Equivalence / Computability
Causality
Useful for Data Mining?
Provenance concepts seem applicable to data mining
Have to deal with large provenance size (summarization)
Can abstract processing of classes of mining algorithms?
Do not reinvent provenance tracking for each algorithm!
Slide 5 of 18 Boris Glavic Datamining Provenance: Provenance for Data Mining
Fine-Grained Provenance
Tracing Through Preprocessing Steps
Track back data mining results to inputs of preprocessing
ETL tools are used for preprocessing
⇒ can use database or workflow provenance approaches?
Slide 5 of 18 Boris Glavic Datamining Provenance: Provenance for Data Mining
Enriching Provenance with Context
Contextual Information as Provenance
Mining algorithms often applied to a subset of available data
Contextual Data: data related to the mining inputs
Automatic detection
User provided
Contextual data often more usable and concise than provenance
Which contextual data is of interest will differ
per use-case
maybe even per query
⇒ Should support contextual data per provenance query/generation
Need flexible mechanism to select context (declarative?)
Slide 6 of 18 Boris Glavic Datamining Provenance: Provenance for Data Mining
DMProv and Data Responsibility
Measuring Amount of Influence
Single mining result influenced by large subset of input (all)
e.g., clustering
Amount of influence differs significantly (Responsibility)
DB Responsibility Model
Causalitya
Counterfactual cause i for output o
Removing i removes o from result
Actual cause i for output o
Contingency C: Set of inputs to remove before i becomes CC for o
Responsibility: 1
size of minimal contingency
a
A. Meliou et al. “Causality in databases”. In: IEEE Data Engineering Bulletin 33.3 (2010), pp. 59–67;
James Cheney. “Causality and the Semantics of Provenance”. In: DCM. 2010, pp. 63–74.
Slide 7 of 18 Boris Glavic Datamining Provenance: Provenance for Data Mining
DMProv and Data Responsibility
Applicability for Data Mining
Reduce size of provenance
only return top-k responsible inputs in provenance
only return input with responsibility over threshold
However: Output variables are not boolean
Solution Sketch
Consider every input as a cause
Consider every set of inputs as a contingency
Measure amount of change
e.g., distance between cluster means
Responsibility is sum of 1
size of contingency · d(o, o ) over all
contingencies normalized by number of contingencies
Slide 7 of 18 Boris Glavic Datamining Provenance: Provenance for Data Mining
Data vs. Parameter Responsibility
Effect of Parameter Settings
Mining results do depend on
Data
Parameter settings
Define new responsibility type using both data and parameters
Related Work: Robustness against parameter changes only
stability of clusteringsa
a
L.I. Kuncheva and D.P. Vetrov. “Evaluation of stability of k-means cluster ensembles with respect to random
initialization”. In: TPAMI 28.11 (2006), pp. 1798–1808.
Slide 8 of 18 Boris Glavic Datamining Provenance: Provenance for Data Mining
How/Process Provenance
So far only data provenance
Understand how mining algorithm combines inputs to produce an
output
Applicability of workflow and program analysis provenance techniques
Either too detailed or too coarse grained
Slide 9 of 18 Boris Glavic Datamining Provenance: Provenance for Data Mining
Outline
1 Motivation
2 Provenance for Data Mining
3 Frequent Itemset Provenance
4 Multidimensional Scaling Provenance
5 Conclusions
Frequent Itemset Mining (FIM)
One of the most prevalent mining tasks
Input: set of transactions (sets of items)
Fixed domain D
Output: subsets of D (frequent itemsets)
appear in fraction larger σ (minimum support) of the transactions
Slide 10 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
FIM Example
Transaction
TID Items CID
1 {Coffee-mate, Coffee, Diaper, Beer} 1
2 {Diaper, Bread, Beer} 2
3 {Coffee-mate, Diaper, Coffee, Beer} 3
4 {Bread, Coffee} 4
5 {Coffee-mate, Coffee} 4
6 {Coffee-mate, Sugar} 4
FIM
FID Frequent Items Support
1 {Coffee} 4
2 {Coffee-mate} 4
3 {Diaper} 3
4 {Beer} 3
5 {Diaper, Beer} 3
6 {Coffee, Coffee-mate} 3
Slide 11 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
FIM Example with Contextual Data
Transaction
TID Items CID
1 {Coffee-mate, Coffee, Diaper, Beer} 1
2 {Diaper, Bread, Beer} 2
3 {Coffee-mate, Diaper, Coffee, Beer} 3
4 {Bread, Coffee} 4
5 {Coffee-mate, Coffee} 4
6 {Coffee-mate, Sugar} 4
FIM
FID Frequent Items Support
1 {Coffee} 4
2 {Coffee-mate} 4
3 {Diaper} 3
4 {Beer} 3
5 {Diaper, Beer} 3
6 {Coffee, Coffee-mate} 3
Customer
CID AgeGroup Sex
1 20-40 m
2 20-40 m
3 20-40 m
4 50-60 f
Slide 11 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
Why-Provenance for FIM
Intuition
The transactions containing a frequent itemset I caused I to be
frequent
Define the Why-provenance as this set
Definition (Why-Provenance for FI)
Given transaction base D, minimum support σ, itemset I
W(I) = {t | I ⊆ t ∧ t ∈ D}
Slide 12 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
FIM Provenance
Transaction
TID Items CID
1 {Coffee-mate, Coffee, Diaper, Beer} 1
2 {Diaper, Bread, Beer} 2
3 {Coffee-mate, Diaper, Coffee, Beer} 3
4 {Bread, Coffee} 4
5 {Coffee-mate, Coffee} 4
6 {Coffee-mate, Sugar} 4
FIM
FID Frequent Items Support
1 {Coffee} 4
2 {Coffee-mate} 4
3 {Diaper} 3
4 {Beer} 3
5 {Diaper, Beer} 3
6 {Coffee, Coffee-mate} 3
Customer
CID AgeGroup Sex
1 20-40 m
2 20-40 m
3 20-40 m
4 50-60 f
Why-Provenance
FID TIDs
1 {1,3,5,6}
2 {1,3,5,6}
3 {1,2,3}
4 {1,2,3}
5 {1,2,3}
6 {1,3,5,6}
Slide 13 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
FIM Provenance with Context
Customer
CID AgeGroup Sex
1 20-40 m
2 20-40 m
3 20-40 m
4 50-60 f
Why-Provenance
FID TIDs
1 {1,3,5,6}
2 {1,3,5,6}
3 {1,2,3}
4 {1,2,3}
5 {1,2,3}
6 {1,3,5,6}
Example (Beer and Diaper)
Beer and Diaper is frequent
but why?
Slide 13 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
FIM Provenance with Context
Customer
CID AgeGroup Sex
1 20-40 m
2 20-40 m
3 20-40 m
4 50-60 f
Why-Provenance
FID TIDs
1 {1,3,5,6}
2 {1,3,5,6}
3 {1,2,3}
4 {1,2,3}
5 {1,2,3}
6 {1,3,5,6}
Example (Beer and Diaper)
Beer and Diaper is frequent
but why?
Why-provenance ⇒ it appeared in this set of transactions
Slide 13 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
FIM Provenance with Context
Customer
CID AgeGroup Sex
1 20-40 m
2 20-40 m
3 20-40 m
4 50-60 f
Why-Provenance
FID TIDs
1 {1,3,5,6}
2 {1,3,5,6}
3 {1,2,3}
4 {1,2,3}
5 {1,2,3}
6 {1,3,5,6}
Example (Beer and Diaper)
Beer and Diaper is frequent
but why?
Why-provenance ⇒ it appeared in this set of transactions
Not very useful!
Unfeasible if D is large
Slide 13 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
FIM Provenance with Context
Customer
CID AgeGroup Sex
1 20-40 m
2 20-40 m
3 20-40 m
4 50-60 f
Why-Provenance
FID TIDs
1 {1,3,5,6}
2 {1,3,5,6}
3 {1,2,3}
4 {1,2,3}
5 {1,2,3}
6 {1,3,5,6}
Example (Beer and Diaper)
Beer and Diaper is frequent
but why?
Why-provenance ⇒ it appeared in this set of transactions
Not very useful!
Unfeasible if D is large
Because male customers in age group 20 − 40 bought it
Slide 13 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
FIM Provenance with Context
Customer
CID AgeGroup Sex
1 20-40 m
2 20-40 m
3 20-40 m
4 50-60 f
Why-Provenance
FID TIDs
1 {1,3,5,6}
2 {1,3,5,6}
3 {1,2,3}
4 {1,2,3}
5 {1,2,3}
6 {1,3,5,6}
Example (Beer and Diaper)
Beer and Diaper is frequent
but why?
Why-provenance ⇒ it appeared in this set of transactions
Not very useful!
Unfeasible if D is large
Because male customers in age group 20 − 40 bought it
More useful and concise
Summarization of inputs in provenance using contextual information
Will need different context for different use cases
Slide 13 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
Preliminary Results
FIM Provenance
Why-Provenance for FIM
Declarative selection of context
I-Provenance
Prefix compressed tree representation of provenance
Precise modelling of interdependencies of items in provenance within
transactions
Database-based provenance generation and querying
Slide 14 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
Outline
1 Motivation
2 Provenance for Data Mining
3 Frequent Itemset Provenance
4 Multidimensional Scaling Provenance
5 Conclusions
Multidimensional Scaling (MDS)
Approach
Input: Set of observation with pair-wise similarities
Output: Mapping into n-dim space that tries to preserve similarities
Optimization problem
Use-case Marketing:
Customers rate products pairs according to similarity
MDS to generate layout (perceptual map) depicting similarity of
products
Example
A B C D
A - - - -
B 2 - - -
C 2 4 - -
D 3 3 3 -
A
C
D
B
2
3
5.5
3.5
3.7
2.8
Slide 15 of 18 Boris Glavic Datamining Provenance: Multidimensional Scaling Provenance
Data vs. Parameter Responsibility
Problem
If two items are close in the layout then
either they are similar
or because it minimized the fitness function
or some combination of both
Using Provenance
Why-provenance
Show (difference to) original similarities for subset of the data
Data vs. Parameter Responsibility
Influence of actual data properties
Parameter settings
Idiosyncrasies of the algorithm
Slide 16 of 18 Boris Glavic Datamining Provenance: Multidimensional Scaling Provenance
Outline
1 Motivation
2 Provenance for Data Mining
3 Frequent Itemset Provenance
4 Multidimensional Scaling Provenance
5 Conclusions
Challenges
Why-Provenance
Common model that generalizes processing of large classes of mining
algorithms
Dealing with large (potentially overlapping) provenance
Context and Preprocessing
Dynamic handling of contextual data
Tracing through preprocessing steps
Responsibility
Computational complexity
How to model parameter vs. data responsibility?
Slide 17 of 18 Boris Glavic Datamining Provenance: Conclusions
Conclusions
Take Away Messages
Data Mining is interesting and challenging application domain for
provenance
No previous work
Future Work
Extend preliminary results on FIM
Clustering (responsibility)
MDS (parameter vs. data responsibility)
Slide 18 of 18 Boris Glavic Datamining Provenance: Conclusions
Questions?
This is a vision paper
so let’s discuss!
Provenance for Data Mining

More Related Content

Similar to TaPP 2013 - Provenance for Data Mining

Scalable frequent itemset mining using heterogeneous computing par apriori a...
Scalable frequent itemset mining using heterogeneous computing  par apriori a...Scalable frequent itemset mining using heterogeneous computing  par apriori a...
Scalable frequent itemset mining using heterogeneous computing par apriori a...ijdpsjournal
 
Introduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptIntroduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptSangrangBargayary3
 
Machine Learning, Data Mining, and
Machine Learning, Data Mining, and Machine Learning, Data Mining, and
Machine Learning, Data Mining, and butest
 
Matt Archer - How To Regression Test A Billion Rows Of Financial Data Every S...
Matt Archer - How To Regression Test A Billion Rows Of Financial Data Every S...Matt Archer - How To Regression Test A Billion Rows Of Financial Data Every S...
Matt Archer - How To Regression Test A Billion Rows Of Financial Data Every S...TEST Huddle
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data SciencePaolo Missier
 
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...Subrata Kumer Paul
 
UNIT 3.2 -Mining Frquent Patterns (part1).ppt
UNIT 3.2 -Mining Frquent Patterns (part1).pptUNIT 3.2 -Mining Frquent Patterns (part1).ppt
UNIT 3.2 -Mining Frquent Patterns (part1).pptRaviKiranVarma4
 
Big Data, Big Deal: For Future Big Data Scientists
Big Data, Big Deal: For Future Big Data ScientistsBig Data, Big Deal: For Future Big Data Scientists
Big Data, Big Deal: For Future Big Data ScientistsWay-Yen Lin
 
Overview of Data Mining
Overview of Data MiningOverview of Data Mining
Overview of Data MiningBowo Prasetyo
 
Data mining and Machine learning expained in jargon free & lucid language
Data mining and Machine learning expained in jargon free & lucid languageData mining and Machine learning expained in jargon free & lucid language
Data mining and Machine learning expained in jargon free & lucid languageq-Maxim
 
Data Mining Concepts 15061
Data Mining Concepts 15061Data Mining Concepts 15061
Data Mining Concepts 15061badirh
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Conceptsdataminers.ir
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining ConceptsDung Nguyen
 

Similar to TaPP 2013 - Provenance for Data Mining (20)

05
0505
05
 
Scalable frequent itemset mining using heterogeneous computing par apriori a...
Scalable frequent itemset mining using heterogeneous computing  par apriori a...Scalable frequent itemset mining using heterogeneous computing  par apriori a...
Scalable frequent itemset mining using heterogeneous computing par apriori a...
 
Why am I doing this???
Why am I doing this???Why am I doing this???
Why am I doing this???
 
Introduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptIntroduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .ppt
 
Machine Learning, Data Mining, and
Machine Learning, Data Mining, and Machine Learning, Data Mining, and
Machine Learning, Data Mining, and
 
ConQueSt
ConQueStConQueSt
ConQueSt
 
Matt Archer - How To Regression Test A Billion Rows Of Financial Data Every S...
Matt Archer - How To Regression Test A Billion Rows Of Financial Data Every S...Matt Archer - How To Regression Test A Billion Rows Of Financial Data Every S...
Matt Archer - How To Regression Test A Billion Rows Of Financial Data Every S...
 
10probs.ppt
10probs.ppt10probs.ppt
10probs.ppt
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
06 fp basic
06 fp basic06 fp basic
06 fp basic
 
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
 
UNIT 3.2 -Mining Frquent Patterns (part1).ppt
UNIT 3.2 -Mining Frquent Patterns (part1).pptUNIT 3.2 -Mining Frquent Patterns (part1).ppt
UNIT 3.2 -Mining Frquent Patterns (part1).ppt
 
Big Data, Big Deal: For Future Big Data Scientists
Big Data, Big Deal: For Future Big Data ScientistsBig Data, Big Deal: For Future Big Data Scientists
Big Data, Big Deal: For Future Big Data Scientists
 
06FPBasic.ppt
06FPBasic.ppt06FPBasic.ppt
06FPBasic.ppt
 
06FPBasic.ppt
06FPBasic.ppt06FPBasic.ppt
06FPBasic.ppt
 
Overview of Data Mining
Overview of Data MiningOverview of Data Mining
Overview of Data Mining
 
Data mining and Machine learning expained in jargon free & lucid language
Data mining and Machine learning expained in jargon free & lucid languageData mining and Machine learning expained in jargon free & lucid language
Data mining and Machine learning expained in jargon free & lucid language
 
Data Mining Concepts 15061
Data Mining Concepts 15061Data Mining Concepts 15061
Data Mining Concepts 15061
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
 

More from Boris Glavic

2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...
2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...
2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...Boris Glavic
 
2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...
2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...
2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...Boris Glavic
 
2016 VLDB - The iBench Integration Metadata Generator
2016 VLDB - The iBench Integration Metadata Generator2016 VLDB - The iBench Integration Metadata Generator
2016 VLDB - The iBench Integration Metadata GeneratorBoris Glavic
 
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...Boris Glavic
 
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...Boris Glavic
 
2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers
2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers
2015 TaPP - Towards Constraint-based Explanations for Answers and Non-AnswersBoris Glavic
 
2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON
2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON
2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSONBoris Glavic
 
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-Answers
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-AnswersTaPP 2015 - Towards Constraint-based Explanations for Answers and Non-Answers
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-AnswersBoris Glavic
 
ICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database VirtualizationICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database VirtualizationBoris Glavic
 
TaPP 2011 Talk Boris - Reexamining some Holy Grails of Provenance
TaPP 2011 Talk Boris - Reexamining some Holy Grails of ProvenanceTaPP 2011 Talk Boris - Reexamining some Holy Grails of Provenance
TaPP 2011 Talk Boris - Reexamining some Holy Grails of ProvenanceBoris Glavic
 
EDBT 2009 - Provenance for Nested Subqueries
EDBT 2009 - Provenance for Nested SubqueriesEDBT 2009 - Provenance for Nested Subqueries
EDBT 2009 - Provenance for Nested SubqueriesBoris Glavic
 
ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...
ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...
ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...Boris Glavic
 
2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...
2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...
2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...Boris Glavic
 
WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"
WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"
WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"Boris Glavic
 
DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams"
DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams"DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams"
DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams"Boris Glavic
 
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"Boris Glavic
 
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...Boris Glavic
 
Ipaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, IanIpaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, IanBoris Glavic
 

More from Boris Glavic (18)

2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...
2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...
2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...
 
2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...
2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...
2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...
 
2016 VLDB - The iBench Integration Metadata Generator
2016 VLDB - The iBench Integration Metadata Generator2016 VLDB - The iBench Integration Metadata Generator
2016 VLDB - The iBench Integration Metadata Generator
 
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...
 
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
 
2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers
2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers
2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers
 
2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON
2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON
2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON
 
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-Answers
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-AnswersTaPP 2015 - Towards Constraint-based Explanations for Answers and Non-Answers
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-Answers
 
ICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database VirtualizationICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database Virtualization
 
TaPP 2011 Talk Boris - Reexamining some Holy Grails of Provenance
TaPP 2011 Talk Boris - Reexamining some Holy Grails of ProvenanceTaPP 2011 Talk Boris - Reexamining some Holy Grails of Provenance
TaPP 2011 Talk Boris - Reexamining some Holy Grails of Provenance
 
EDBT 2009 - Provenance for Nested Subqueries
EDBT 2009 - Provenance for Nested SubqueriesEDBT 2009 - Provenance for Nested Subqueries
EDBT 2009 - Provenance for Nested Subqueries
 
ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...
ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...
ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...
 
2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...
2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...
2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...
 
WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"
WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"
WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"
 
DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams"
DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams"DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams"
DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams"
 
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
 
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...
 
Ipaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, IanIpaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, Ian
 

Recently uploaded

Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsSérgio Sacani
 
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXDole Philippines School
 
trihybrid cross , test cross chi squares
trihybrid cross , test cross chi squarestrihybrid cross , test cross chi squares
trihybrid cross , test cross chi squaresusmanzain586
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxmalonesandreagweneth
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxNandakishor Bhaurao Deshmukh
 
Bioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptxBioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptx023NiWayanAnggiSriWa
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024AyushiRastogi48
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)riyaescorts54
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptxpallavirawat456
 
Servosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by PetrovicServosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by PetrovicAditi Jain
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentationtahreemzahra82
 
basic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomybasic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomyDrAnita Sharma
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationColumbia Weather Systems
 
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxThermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxuniversity
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxMurugaveni B
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptxGENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptxRitchAndruAgustin
 

Recently uploaded (20)

Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive stars
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
 
trihybrid cross , test cross chi squares
trihybrid cross , test cross chi squarestrihybrid cross , test cross chi squares
trihybrid cross , test cross chi squares
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
 
Bioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptxBioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptx
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdf
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptx
 
Servosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by PetrovicServosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by Petrovic
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentation
 
basic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomybasic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomy
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather Station
 
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxThermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptxGENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
 

TaPP 2013 - Provenance for Data Mining

  • 1. Provenance for Data Mining Boris Glavic1 Javed Siddique2 Periklis Andritsos3 Ren´ee J. Miller2 Illinois Institute of Technology1 DBGroup University of Toronto2 Miller Lab University of Toronto3 iSchool TaPP 2013, June 18, 2013
  • 2. Outline 1 Motivation 2 Provenance for Data Mining 3 Frequent Itemset Provenance 4 Multidimensional Scaling Provenance 5 Conclusions
  • 3. Data Mining / KDD Goals Extract useful information from data Approach 1 Preprocessing Cleaning Feature Extraction . . . 2 Apply algorithms to extract information Clustering, Frequent Itemset Mining, Classification, . . . Most approaches: Reduce size/Summarize data Slide 1 of 18 Boris Glavic Datamining Provenance: Motivation
  • 4. Provenance for Data Mining? Dilemmas Purpose of data mining necessitates summarization “Needle in the haystack” Loss of information Makes interpreting “raw” results harder User point of view: DM algorithm is black box Slide 2 of 18 Boris Glavic Datamining Provenance: Motivation
  • 5. Provenance for Data Mining? How to solve Dilemmas? Selective access to input data result is based on Inputs to mining algorithm Inputs to preprocessing Contextual information Understand importance of inputs for results Input data Parameter settings Understand how data mining algorithm generates result from inputs Understand missing results Slide 2 of 18 Boris Glavic Datamining Provenance: Motivation
  • 6. Provenance for Data Mining? How to solve Dilemmas? Selective access to input data result is based on (Data Provenance+) Inputs to mining algorithm Inputs to preprocessing Contextual information Understand importance of inputs for results (Responsibility) Input data Parameter settings Understand how data mining algorithm generates result from inputs (Process provenance) Understand missing results (Missing answers) Slide 2 of 18 Boris Glavic Datamining Provenance: Motivation
  • 7. Related Work Provenance Database provenance, Workflow provenance, Missing answers, Responsibility, . . . Data Mining Enriching mining results with additional information Contextual information for frequent itemsetsa Visualization techniquesb Determining effect of parameter settings/inputs on result e.g., K-means cluster stability based on parameter settingsc a Q. Mei et al. “Generating semantic annotations for frequent patterns with context analysis”. In: SIGKDD. 2006, pp. 337–346. b D.A. Keim and H.P. Kriegel. “Visualization techniques for mining large databases: A comparison”. In: TKDE 8.6 (1996), pp. 923–938. c L.I. Kuncheva and D.P. Vetrov. “Evaluation of stability of k-means cluster ensembles with respect to random initialization”. In: TPAMI 28.11 (2006), pp. 1798–1808. Slide 3 of 18 Boris Glavic Datamining Provenance: Motivation
  • 8. Contributions Analyze requirements and use cases for data mining provenance (DMProv) Discuss applicability of existing approaches Outline challenges and sketch research directions Exemplify concepts on two concrete mining algorithms Frequent Itemset Mining (FIM) Multidimensional Scaling (MDS) Slide 4 of 18 Boris Glavic Datamining Provenance: Motivation
  • 9. Outline 1 Motivation 2 Provenance for Data Mining 3 Frequent Itemset Provenance 4 Multidimensional Scaling Provenance 5 Conclusions
  • 10. Fine-Grained Provenance Why-Provenance Here Why-Provenance means all models based on influence Subset of the input that caused output to appear in result “Caused by” modelled as Sufficiency Necessity Preservation of Equivalence / Computability Causality Useful for Data Mining? Provenance concepts seem applicable to data mining Have to deal with large provenance size (summarization) Can abstract processing of classes of mining algorithms? Do not reinvent provenance tracking for each algorithm! Slide 5 of 18 Boris Glavic Datamining Provenance: Provenance for Data Mining
  • 11. Fine-Grained Provenance Tracing Through Preprocessing Steps Track back data mining results to inputs of preprocessing ETL tools are used for preprocessing ⇒ can use database or workflow provenance approaches? Slide 5 of 18 Boris Glavic Datamining Provenance: Provenance for Data Mining
  • 12. Enriching Provenance with Context Contextual Information as Provenance Mining algorithms often applied to a subset of available data Contextual Data: data related to the mining inputs Automatic detection User provided Contextual data often more usable and concise than provenance Which contextual data is of interest will differ per use-case maybe even per query ⇒ Should support contextual data per provenance query/generation Need flexible mechanism to select context (declarative?) Slide 6 of 18 Boris Glavic Datamining Provenance: Provenance for Data Mining
  • 13. DMProv and Data Responsibility Measuring Amount of Influence Single mining result influenced by large subset of input (all) e.g., clustering Amount of influence differs significantly (Responsibility) DB Responsibility Model Causalitya Counterfactual cause i for output o Removing i removes o from result Actual cause i for output o Contingency C: Set of inputs to remove before i becomes CC for o Responsibility: 1 size of minimal contingency a A. Meliou et al. “Causality in databases”. In: IEEE Data Engineering Bulletin 33.3 (2010), pp. 59–67; James Cheney. “Causality and the Semantics of Provenance”. In: DCM. 2010, pp. 63–74. Slide 7 of 18 Boris Glavic Datamining Provenance: Provenance for Data Mining
  • 14. DMProv and Data Responsibility Applicability for Data Mining Reduce size of provenance only return top-k responsible inputs in provenance only return input with responsibility over threshold However: Output variables are not boolean Solution Sketch Consider every input as a cause Consider every set of inputs as a contingency Measure amount of change e.g., distance between cluster means Responsibility is sum of 1 size of contingency · d(o, o ) over all contingencies normalized by number of contingencies Slide 7 of 18 Boris Glavic Datamining Provenance: Provenance for Data Mining
  • 15. Data vs. Parameter Responsibility Effect of Parameter Settings Mining results do depend on Data Parameter settings Define new responsibility type using both data and parameters Related Work: Robustness against parameter changes only stability of clusteringsa a L.I. Kuncheva and D.P. Vetrov. “Evaluation of stability of k-means cluster ensembles with respect to random initialization”. In: TPAMI 28.11 (2006), pp. 1798–1808. Slide 8 of 18 Boris Glavic Datamining Provenance: Provenance for Data Mining
  • 16. How/Process Provenance So far only data provenance Understand how mining algorithm combines inputs to produce an output Applicability of workflow and program analysis provenance techniques Either too detailed or too coarse grained Slide 9 of 18 Boris Glavic Datamining Provenance: Provenance for Data Mining
  • 17. Outline 1 Motivation 2 Provenance for Data Mining 3 Frequent Itemset Provenance 4 Multidimensional Scaling Provenance 5 Conclusions
  • 18. Frequent Itemset Mining (FIM) One of the most prevalent mining tasks Input: set of transactions (sets of items) Fixed domain D Output: subsets of D (frequent itemsets) appear in fraction larger σ (minimum support) of the transactions Slide 10 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
  • 19. FIM Example Transaction TID Items CID 1 {Coffee-mate, Coffee, Diaper, Beer} 1 2 {Diaper, Bread, Beer} 2 3 {Coffee-mate, Diaper, Coffee, Beer} 3 4 {Bread, Coffee} 4 5 {Coffee-mate, Coffee} 4 6 {Coffee-mate, Sugar} 4 FIM FID Frequent Items Support 1 {Coffee} 4 2 {Coffee-mate} 4 3 {Diaper} 3 4 {Beer} 3 5 {Diaper, Beer} 3 6 {Coffee, Coffee-mate} 3 Slide 11 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
  • 20. FIM Example with Contextual Data Transaction TID Items CID 1 {Coffee-mate, Coffee, Diaper, Beer} 1 2 {Diaper, Bread, Beer} 2 3 {Coffee-mate, Diaper, Coffee, Beer} 3 4 {Bread, Coffee} 4 5 {Coffee-mate, Coffee} 4 6 {Coffee-mate, Sugar} 4 FIM FID Frequent Items Support 1 {Coffee} 4 2 {Coffee-mate} 4 3 {Diaper} 3 4 {Beer} 3 5 {Diaper, Beer} 3 6 {Coffee, Coffee-mate} 3 Customer CID AgeGroup Sex 1 20-40 m 2 20-40 m 3 20-40 m 4 50-60 f Slide 11 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
  • 21. Why-Provenance for FIM Intuition The transactions containing a frequent itemset I caused I to be frequent Define the Why-provenance as this set Definition (Why-Provenance for FI) Given transaction base D, minimum support σ, itemset I W(I) = {t | I ⊆ t ∧ t ∈ D} Slide 12 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
  • 22. FIM Provenance Transaction TID Items CID 1 {Coffee-mate, Coffee, Diaper, Beer} 1 2 {Diaper, Bread, Beer} 2 3 {Coffee-mate, Diaper, Coffee, Beer} 3 4 {Bread, Coffee} 4 5 {Coffee-mate, Coffee} 4 6 {Coffee-mate, Sugar} 4 FIM FID Frequent Items Support 1 {Coffee} 4 2 {Coffee-mate} 4 3 {Diaper} 3 4 {Beer} 3 5 {Diaper, Beer} 3 6 {Coffee, Coffee-mate} 3 Customer CID AgeGroup Sex 1 20-40 m 2 20-40 m 3 20-40 m 4 50-60 f Why-Provenance FID TIDs 1 {1,3,5,6} 2 {1,3,5,6} 3 {1,2,3} 4 {1,2,3} 5 {1,2,3} 6 {1,3,5,6} Slide 13 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
  • 23. FIM Provenance with Context Customer CID AgeGroup Sex 1 20-40 m 2 20-40 m 3 20-40 m 4 50-60 f Why-Provenance FID TIDs 1 {1,3,5,6} 2 {1,3,5,6} 3 {1,2,3} 4 {1,2,3} 5 {1,2,3} 6 {1,3,5,6} Example (Beer and Diaper) Beer and Diaper is frequent but why? Slide 13 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
  • 24. FIM Provenance with Context Customer CID AgeGroup Sex 1 20-40 m 2 20-40 m 3 20-40 m 4 50-60 f Why-Provenance FID TIDs 1 {1,3,5,6} 2 {1,3,5,6} 3 {1,2,3} 4 {1,2,3} 5 {1,2,3} 6 {1,3,5,6} Example (Beer and Diaper) Beer and Diaper is frequent but why? Why-provenance ⇒ it appeared in this set of transactions Slide 13 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
  • 25. FIM Provenance with Context Customer CID AgeGroup Sex 1 20-40 m 2 20-40 m 3 20-40 m 4 50-60 f Why-Provenance FID TIDs 1 {1,3,5,6} 2 {1,3,5,6} 3 {1,2,3} 4 {1,2,3} 5 {1,2,3} 6 {1,3,5,6} Example (Beer and Diaper) Beer and Diaper is frequent but why? Why-provenance ⇒ it appeared in this set of transactions Not very useful! Unfeasible if D is large Slide 13 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
  • 26. FIM Provenance with Context Customer CID AgeGroup Sex 1 20-40 m 2 20-40 m 3 20-40 m 4 50-60 f Why-Provenance FID TIDs 1 {1,3,5,6} 2 {1,3,5,6} 3 {1,2,3} 4 {1,2,3} 5 {1,2,3} 6 {1,3,5,6} Example (Beer and Diaper) Beer and Diaper is frequent but why? Why-provenance ⇒ it appeared in this set of transactions Not very useful! Unfeasible if D is large Because male customers in age group 20 − 40 bought it Slide 13 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
  • 27. FIM Provenance with Context Customer CID AgeGroup Sex 1 20-40 m 2 20-40 m 3 20-40 m 4 50-60 f Why-Provenance FID TIDs 1 {1,3,5,6} 2 {1,3,5,6} 3 {1,2,3} 4 {1,2,3} 5 {1,2,3} 6 {1,3,5,6} Example (Beer and Diaper) Beer and Diaper is frequent but why? Why-provenance ⇒ it appeared in this set of transactions Not very useful! Unfeasible if D is large Because male customers in age group 20 − 40 bought it More useful and concise Summarization of inputs in provenance using contextual information Will need different context for different use cases Slide 13 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
  • 28. Preliminary Results FIM Provenance Why-Provenance for FIM Declarative selection of context I-Provenance Prefix compressed tree representation of provenance Precise modelling of interdependencies of items in provenance within transactions Database-based provenance generation and querying Slide 14 of 18 Boris Glavic Datamining Provenance: Frequent Itemset Provenance
  • 29. Outline 1 Motivation 2 Provenance for Data Mining 3 Frequent Itemset Provenance 4 Multidimensional Scaling Provenance 5 Conclusions
  • 30. Multidimensional Scaling (MDS) Approach Input: Set of observation with pair-wise similarities Output: Mapping into n-dim space that tries to preserve similarities Optimization problem Use-case Marketing: Customers rate products pairs according to similarity MDS to generate layout (perceptual map) depicting similarity of products Example A B C D A - - - - B 2 - - - C 2 4 - - D 3 3 3 - A C D B 2 3 5.5 3.5 3.7 2.8 Slide 15 of 18 Boris Glavic Datamining Provenance: Multidimensional Scaling Provenance
  • 31. Data vs. Parameter Responsibility Problem If two items are close in the layout then either they are similar or because it minimized the fitness function or some combination of both Using Provenance Why-provenance Show (difference to) original similarities for subset of the data Data vs. Parameter Responsibility Influence of actual data properties Parameter settings Idiosyncrasies of the algorithm Slide 16 of 18 Boris Glavic Datamining Provenance: Multidimensional Scaling Provenance
  • 32. Outline 1 Motivation 2 Provenance for Data Mining 3 Frequent Itemset Provenance 4 Multidimensional Scaling Provenance 5 Conclusions
  • 33. Challenges Why-Provenance Common model that generalizes processing of large classes of mining algorithms Dealing with large (potentially overlapping) provenance Context and Preprocessing Dynamic handling of contextual data Tracing through preprocessing steps Responsibility Computational complexity How to model parameter vs. data responsibility? Slide 17 of 18 Boris Glavic Datamining Provenance: Conclusions
  • 34. Conclusions Take Away Messages Data Mining is interesting and challenging application domain for provenance No previous work Future Work Extend preliminary results on FIM Clustering (responsibility) MDS (parameter vs. data responsibility) Slide 18 of 18 Boris Glavic Datamining Provenance: Conclusions
  • 35. Questions? This is a vision paper so let’s discuss! Provenance for Data Mining