Generating Example Data For Dataflow Programs
Chris Olston, Shubham Chopra, Utkarsh Srivastava
Research
Data Processing Renaissance
  • Lots of queries and programs to analyze that data
  • New data flow languages: Map-Reduce, Pig Latin, Dryad
  • Other data flow systems: Aurora, Tioga, River
Iterative Process
  • Program: LOAD (user, url); TRANSFORM user, canonicalize(url); LOAD (url, pagerank); JOIN on url; GROUP on user; TRANSFORM user, AVG(pagerank); FILTER avgPR > 0.5
  • Result: No Output
  • Questions this raises: Joining on right attribute? Bug in UDF canonicalize? Everything being filtered out?
How to do test runs?
  • Run with real data
      • Too inefficient (TBs of data)
  • Create smaller data sets (e.g., by sampling)
      • Empty results due to joins [Chaudhuri et al. 99] and selective filters
  • Biased sampling for joins
      • Indexes not always present
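The empty-join problem with sampled inputs can be reproduced in a few lines. The following is only a sketch: the table contents, the deterministic 1-in-100 sampling scheme, and the helper names `sample_every` and `join_on_key` are assumptions made for illustration and do not come from the slides.

```python
def sample_every(records, k, offset):
    """Keep every k-th record starting at `offset` (a deterministic stand-in for sampling)."""
    return records[offset::k]

def join_on_key(left, right):
    """Equi-join two lists of (key, value) pairs on the key."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in right_by_key.get(key, [])]

# Hypothetical inputs: 1000 urls, each with one visit record and one pagerank record.
visits = [(f"url{i}", f"user{i % 10}") for i in range(1000)]
pageranks = [(f"url{i}", (i % 100) / 100) for i in range(1000)]

full_join = join_on_key(visits, pageranks)

# Sample 1% of each input independently; the two samples pick different urls.
visits_sample = sample_every(visits, 100, offset=0)
pageranks_sample = sample_every(pageranks, 100, offset=50)
sampled_join = join_on_key(visits_sample, pageranks_sample)

print(len(full_join), len(sampled_join))
```

With independent uniform 1% samples instead of the fixed offsets used here, each url would survive in both samples with probability 0.01 × 0.01, so the expected join size for 1000 urls would be about 0.1 records.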
Examples to Illustrate Program
  • LOAD (user, url): (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html)
  • TRANSFORM user, canonicalize(url): (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
  • LOAD (url, pagerank): (www.cnn.com, 0.9), (www.frogs.com, 0.3), (www.snails.com, 0.4)
  • JOIN on url: (Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3), (Fred, www.snails.com, 0.4)
  • GROUP on user: (Amy, {(Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3)}), (Fred, {(Fred, www.snails.com, 0.4)})
  • TRANSFORM user, AVG(pagerank): (Amy, 0.6), (Fred, 0.4)
  • FILTER avgPR > 0.5: (Amy, 0.6)
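The example tables on this slide can be reproduced with a small Python sketch of the same dataflow. It uses plain Python rather than Pig Latin, and the body of `canonicalize` is a hypothetical stand-in for the UDF, whose actual logic the slides do not show.

```python
from collections import defaultdict

def canonicalize(url):
    """Hypothetical stand-in for the UDF: strip scheme and path, ensure a www. prefix."""
    host = url.split("://")[-1].split("/")[0]
    return host if host.startswith("www.") else "www." + host

visits = [("Amy", "cnn.com"), ("Amy", "http://www.frogs.com"),
          ("Fred", "www.snails.com/index.html")]
pageranks = [("www.cnn.com", 0.9), ("www.frogs.com", 0.3), ("www.snails.com", 0.4)]

# TRANSFORM user, canonicalize(url)
canonical = [(user, canonicalize(url)) for user, url in visits]

# JOIN on url
rank_of = dict(pageranks)
joined = [(user, url, rank_of[url]) for user, url in canonical if url in rank_of]

# GROUP on user
groups = defaultdict(list)
for user, url, rank in joined:
    groups[user].append((user, url, rank))

# TRANSFORM user, AVG(pagerank); rounded to sidestep floating-point noise
averages = [(user, round(sum(r for _, _, r in rows) / len(rows), 2))
            for user, rows in groups.items()]

# FILTER avgPR > 0.5
result = [(user, avg) for user, avg in averages if avg > 0.5]

for name, table in [("canonical", canonical), ("joined", joined),
                    ("averages", averages), ("result", result)]:
    print(name, table)
```

Printing every intermediate table, not just the final one, is what lets a reader answer the three debugging questions from the Iterative Process slide.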
Value Addition From Examples
  Examples can be used for:
  • Debugging
  • Understanding a program written by someone else
  • Learning a new operator or language
Outline
  • Formalization of good examples
  • Example Generation Algorithm
  • Performance Evaluation
Good Examples: Consistency
  0. Consistency: output example = operator applied on input example
  • Example records shown:
      LOAD (user, url): (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html)
      TRANSFORM user, canonicalize(url): (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
Good Examples: Realism
  1. Realism
  • Formalization: fraction of examples that are real or are derived from real records
  • Example records shown:
      LOAD (user, url): (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html)
      TRANSFORM user, canonicalize(url): (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
Good Examples: Completeness
  2. Completeness: demonstrate the salient properties of each operator, e.g., FILTER
  • Example records shown:
      TRANSFORM user, AVG(pagerank): (Amy, 0.6), (Fred, 0.4)
      FILTER avgPR > 0.5: (Amy, 0.6)
Good Examples: Completeness
  2. Completeness: demonstrate the salient properties of each operator, e.g., JOIN
  • Example records shown:
      LOAD (url, pagerank): (www.cnn.com, 0.9), (www.frogs.com, 0.3), (www.snails.com, 0.4)
      TRANSFORM user, canonicalize(url): (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
      JOIN on url: (Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3), (Fred, www.snails.com, 0.4)
Formalizing Completeness
  • Each equivalence class demonstrates one property of the operator.
  • Try to have at least one example from each class.
Formalizing Completeness
  • Operator Completeness: fraction of equivalence classes that have at least one example record.
  • Overall Completeness: average of per-operator completeness.
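As a rough illustration of these two definitions, the sketch below computes completeness for a FILTER. The choice of equivalence classes ("passes" and "fails") and the function names are assumptions for this example; the slides do not list the classes for each operator.

```python
def operator_completeness(classes, examples):
    """Fraction of equivalence classes with at least one example record.

    `classes` maps a class name to a predicate over records.
    """
    covered = sum(1 for belongs in classes.values()
                  if any(belongs(rec) for rec in examples))
    return covered / len(classes)

def overall_completeness(per_operator):
    """Average of the per-operator completeness values."""
    return sum(per_operator) / len(per_operator)

# Assumed classes for FILTER avgPR > 0.5: records that pass and records that fail.
filter_classes = {
    "passes": lambda rec: rec[1] > 0.5,
    "fails": lambda rec: rec[1] <= 0.5,
}

both = operator_completeness(filter_classes, [("Amy", 0.6), ("Fred", 0.4)])
only_pass = operator_completeness(filter_classes, [("Amy", 0.6)])
print(both, only_pass, overall_completeness([both, only_pass]))
```

The second call drops the failing record, so only one of the two assumed classes is covered and the operator's completeness falls to one half.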
Good Examples: Conciseness
  3. Conciseness
  • Operator Conciseness: (# equivalence classes) / (# example records)
  • Overall Conciseness: average of per-operator conciseness
  • Example records shown:
      LOAD (user, url): (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html)
      TRANSFORM user, canonicalize(url): (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
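The conciseness ratio can be sketched the same way. The numbers below are invented for illustration, and the assumption that a FILTER has two equivalence classes is not stated on the slide.

```python
def operator_conciseness(num_equivalence_classes, num_example_records):
    """Ratio of equivalence classes to example records, as written on the slide."""
    return num_equivalence_classes / num_example_records

def overall_conciseness(per_operator):
    """Average of the per-operator conciseness values."""
    return sum(per_operator) / len(per_operator)

# Assumption: a FILTER has two equivalence classes (pass, fail).
concise = operator_conciseness(2, 2)   # one example record per class
verbose = operator_conciseness(2, 8)   # many redundant example records
print(concise, verbose, overall_conciseness([concise, verbose]))
```

Adding records that fall into an already covered class lowers the ratio without improving completeness, which is why the two metrics pull against each other.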
Outline
  • Formalization of good examples
  • Example Generation Algorithm
  • Performance Evaluation
Related Work
  Related areas:
  • Reverse Query Processing
  • Database Testing
  • Software and Hardware Verification
  Differences:
  • Realism not a concern
  • Notion of conciseness is different
  • Intermediate result size is immaterial
Strawman I: Downstream Propagation
  • Take some portion of input data and run the program over it.
  • Criteria: 1. Realism, 2. Completeness, 3. Conciseness
Strawman II: Upstream Propagation
  • Start from what output is desired, and work backwards.
  • Criteria: 1. Realism, 2. Completeness, 3. Conciseness
Our Algorithm
  Algorithm passes:
  1. Downstream
  2. Pruning
  3. Upstream
  4. Pruning
Our Algorithm: Pass 1 (Downstream)
  • Take a subset of input and propagate through the program.
  • Example records shown:
      LOAD (user, age), first input:  (Jack, 30)
      LOAD (user, age), second input: (Amy, 20), (Fred, 25)
      FILTER udf(user):               (Amy, 20), (Fred, 25)
      UNION:                          (Amy, 20), (Fred, 25), (Jack, 30)
      FILTER age>18:                  (Amy, 20), (Fred, 25), (Jack, 30)
Our Algorithm: Pass 2 (Pruning)
  • Prune redundant examples, i.e., improve conciseness without hurting completeness.
  • Example records before pruning:
      LOAD (user, age), first input:  (Jack, 30)
      LOAD (user, age), second input: (Amy, 20), (Fred, 25)
      FILTER udf(user):               (Amy, 20), (Fred, 25)
      UNION:                          (Amy, 20), (Fred, 25), (Jack, 30)
      FILTER age>18:                  (Amy, 20), (Fred, 25), (Jack, 30)
  • Example records after pruning:
      LOAD (user, age), first input:  (Jack, 30)
      LOAD (user, age), second input: (Amy, 20)
      FILTER udf(user):               (Amy, 20)
      UNION:                          (Amy, 20), (Jack, 30)
      FILTER age>18:                  (Amy, 20), (Jack, 30)
Formalization of Pruning
  • Example records correspond to elements; equivalence classes correspond to sets.
  • Pick the minimum number of records to cover every equivalence class: the Set-Cover Problem.
  • More involved because completeness of other operators must be maintained; details in paper.
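A minimal sketch of the pruning idea for a single operator, using the textbook greedy set-cover heuristic. This is not the paper's procedure: it ignores the requirement that other operators stay complete, and the function name, the equivalence classes, and the sample records are assumptions for illustration.

```python
def greedy_prune(records, classes):
    """Pick a small subset of `records` that still covers every coverable class.

    `classes` maps a class name to a predicate over records. Uses the
    standard greedy set-cover heuristic for one operator in isolation.
    """
    covers = {rec: {name for name, belongs in classes.items() if belongs(rec)}
              for rec in records}
    uncovered = set().union(*covers.values())
    kept = []
    while uncovered:
        # Choose the record that covers the most still-uncovered classes.
        best = max(records, key=lambda rec: len(covers[rec] & uncovered))
        kept.append(best)
        uncovered -= covers[best]
    return kept

# Assumed classes for FILTER age > 18: records that pass and records that fail.
age_filter_classes = {
    "passes": lambda rec: rec[1] > 18,
    "fails": lambda rec: rec[1] <= 18,
}

examples = [("Amy", 20), ("Fred", 25), ("Jack", 30), ("Bill", 17)]
print(greedy_prune(examples, age_filter_classes))
```

Ties are broken by input order, so the first passing record is kept and the other passing records are dropped as redundant.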
Our Algorithm: Pass 3 (Upstream)
  • Enhance completeness by inserting constraint records (best effort; details in paper).
  • Example records at the start of the pass:
      LOAD (user, age), first input:  (Jack, 30)
      LOAD (user, age), second input: (Amy, 20)
      FILTER udf(user):               (Amy, 20)
      UNION:                          (Amy, 20), (Jack, 30)
      FILTER age>18:                  (Amy, 20), (Jack, 30)
  • After inserting the constraint record (--, 17):
      LOAD (user, age), first input:  (Jack, 30)
      LOAD (user, age), second input: (Amy, 20)
      FILTER udf(user):               (Amy, 20)
      UNION:                          (Amy, 20), (Jack, 30), (--, 17)
      FILTER age>18:                  (Amy, 20), (Jack, 30)
  • After propagating the constraint record upstream:
      LOAD (user, age), first input:  (Jack, 30), (--, 17)
      LOAD (user, age), second input: (Amy, 20), (Bill, 17)
      FILTER udf(user):               (Amy, 20), (--, 17)
      UNION:                          (Amy, 20), (Jack, 30), (--, 17)
      FILTER age>18:                  (Amy, 20), (Jack, 30)
  • After filling in the unknown fields:
      LOAD (user, age), first input:  (Jack, 30), (Bob, 17)
      LOAD (user, age), second input: (Amy, 20), (Bill, 17)
      FILTER udf(user):               (Amy, 20), (Bill, 17)
      UNION:                          (Amy, 20), (Jack, 30), (Bill, 17), (Bob, 17)
      FILTER age>18:                  (Amy, 20), (Jack, 30)
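The constraint-record step can be sketched for the one case shown here, a FILTER of the form age > threshold that has no failing example. Everything in this block is an assumption for illustration: the function names, the use of `None` for the unknown field written as "--" on the slide, and the way unknowns are filled from a list of names. The slides defer the actual mechanism to the paper.

```python
def add_failing_constraint(examples, threshold):
    """If no example fails `age > threshold`, append a constraint record that does.

    The user field is left unknown (None), mirroring the slide's (--, 17).
    """
    if any(age <= threshold for _, age in examples):
        return list(examples)
    return list(examples) + [(None, threshold - 1)]

def fill_unknowns(examples, placeholder_names):
    """Replace unknown user fields with values drawn from `placeholder_names`."""
    names = iter(placeholder_names)
    return [(next(names) if user is None else user, age) for user, age in examples]

# Input of FILTER age>18 after the first pruning pass.
filter_input = [("Amy", 20), ("Jack", 30)]

with_constraint = add_failing_constraint(filter_input, 18)
filled = fill_unknowns(with_constraint, ["Bill"])
print(with_constraint)
print(filled)
```

Calling `add_failing_constraint` on an input that already contains a failing record returns it unchanged, which matches the best-effort goal of inserting records only where completeness is lacking.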
Our Algorithm: Pass 4 (Pruning)
  • Prune redundant examples (as in Pass 2). Favor real examples over synthetic ones.
  • Example records before pruning:
      LOAD (user, age), first input:  (Jack, 30), (Bob, 17)
      LOAD (user, age), second input: (Amy, 20), (Bill, 17)
      FILTER udf(user):               (Amy, 20), (Bill, 17)
      UNION:                          (Amy, 20), (Jack, 30), (Bill, 17), (Bob, 17)
      FILTER age>18:                  (Amy, 20), (Jack, 30)
  • Example records after pruning:
      LOAD (user, age), first input:  (Jack, 30)
      LOAD (user, age), second input: (Bill, 17)
      FILTER udf(user):               (Bill, 17)
      UNION:                          (Jack, 30), (Bill, 17)
      FILTER age>18:                  (Jack, 30)
Implementation Status
  • Available as the ILLUSTRATE command in the open-source release of Pig
  • Available as an Eclipse plugin (PigPen)
PigPen Snapshot
Performance Evaluation
  Program I (Web Search Result Viewing Statistics):
  • LOAD
  • FILTER by compound arithmetic expression
  • GROUP
  • TRANSFORM using built-in aggregate function

Generating Example Data For Dataflow Programs

  • 1. Generating Example Data For Dataflow Programs Chris Olston Shubham Chopra Utkarsh Srivastava Research
  • 2.
  • 3. Lots of queries and programs to analyze that data
  • 4. New data flow languages
  • 7.
  • 8. Iterative Process LOAD (user, url) LOAD (url, pagerank) Joining on right attribute? JOIN on url GROUP on user TRANSFORM user, canonicalize(url) TRANSFORM user, AVG(pagerank) Bug in UDF canonicalize? Everything being filtered out? FILTER avgPR> 0.5 No Output 
  • 9. How to do test runs? Run with real data Too inefficient (TBs of data) Create smaller data sets (e.g., by sampling) Empty results due to joins [Chaudhuri et. al. 99], and selective filters Biased sampling for joins Indexes not always present
  • 10. Examples to Illustrate Program (www.cnn.com, 0.9) (www.frogs.com, 0.3) (www.snails.com, 0.4) LOAD (user, url) LOAD (url, pagerank) (Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html) JOIN on url (Amy, www.cnn.com, 0.9) (Amy, www.frogs.com, 0.3) (Fred, www.snails.com, 0.4) GROUP on user TRANSFORM user, canonicalize(url) (Amy, www.cnn.com, 0.9) (Amy, www.frogs.com, 0.3) (Fred, www.snails.com, 0.4) ) ( Amy, ( Fred, ) TRANSFORM user, AVG(pagerank) (Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com) (Amy, 0.6) (Fred, 0.4) FILTER avgPR> 0.5 (Amy, 0.6)
  • 11. Value Addition From Examples Examples can be used for Debugging Understanding a program written by someone else Learning a new operator, or language
  • 12. Outline Formalization of good examples Example Generation Algorithm Performance Evaluation
  • 13. Good Examples: Consistency LOAD (user, url) LOAD (url, pagerank) (Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html) JOIN on url GROUP on user 0. Consistency TRANSFORM user, canonicalize(url) TRANSFORM user, AVG(pagerank) output example = operator applied on input example (Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com) FILTER avgPR> 0.5
  • 14. Good Examples: Realism LOAD (user, url) LOAD (url, pagerank) (Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html) JOIN on url GROUP on user 1. Realism TRANSFORM user, canonicalize(url) TRANSFORM user, AVG(pagerank) (Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com) Formalization: Fraction of examples that are real or are derived from real records FILTER avgPR> 0.5
  • 15. Good Examples: Completeness LOAD (user, url) LOAD (url, pagerank) 2. Completeness JOIN on url Demonstrate the salient properties of each operator, e.g., FILTER GROUP on user TRANSFORM user, canonicalize(url) TRANSFORM user, AVG(pagerank) (Amy, 0.6) (Fred, 0.4) FILTER avgPR> 0.5 (Amy, 0.6)
  • 16. Good Examples: Completeness (www.cnn.com, 0.9) (www.frogs.com, 0.3) (www.snails.com, 0.4) LOAD (user, url) LOAD (url, pagerank) JOIN on url (Amy, www.cnn.com, 0.9) (Amy, www.frogs.com, 0.3) (Fred, www.snails.com, 0.4) GROUP on user TRANSFORM user, canonicalize(url) 2. Completeness TRANSFORM user, AVG(pagerank) (Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com) Demonstrate the salient properties of each operator, e.g., JOIN FILTER avgPR> 0.5
  • 17.
  • 18. Each equivalence class demonstrates one property of the operator.
  • 19.
  • 20. Formalizing Completeness Operator Completeness: Fraction of equivalence classes that have at least one example record. Overall Completeness: Average of per-operator completeness.
  • 21. Good Examples: Conciseness LOAD (user, url) LOAD (url, pagerank) 3. Conciseness (Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html) JOIN on url Operator Conciseness: # equivalence classes # example records GROUP on user TRANSFORM user, canonicalize(url) Overall Conciseness: Average of per-operator conciseness TRANSFORM user, AVG(pagerank) (Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com) FILTER avgPR> 0.5
  • 22. Outline Formalization of good examples Example Generation Algorithm Performance Evaluation
  • 23. Related Work Related Areas: Reverse Query Processing Database Testing Software and Hardware Verification Differences Realism not a concern Notion of conciseness is different Intermediate result size is immaterial
  • 24. Strawman I: Downstream Propagation Take some portion of input data and run the program over it. 1. Realism 2. Completeness 3. Conciseness
  • 25. Strawman II: Upstream Propagation Start from what output is desired, and work backwards 1. Realism 2. Completeness 3. Conciseness
  • 26. Our Algorithm Algorithm Passes Downstream Pruning Upstream Pruning
  • 27. Our Algorithm Algorithm Passes Downstream Pruning Upstream Pruning Take a subset of input and propagate through the program. (Jack, 30) LOAD (user, age) UNION FILTER age>18 LOAD (user, age) FILTER udf(user) (Amy, 20) (Fred, 25) (Jack, 30) (Amy, 20) (Fred, 25) (Jack, 30) (Amy, 20) (Fred, 25) (Amy, 20) (Fred, 25)
  • 28. Our Algorithm Algorithm Passes Downstream Pruning Upstream Pruning Prune redundant examples, i.e., improve conciseness without hurting completeness. (Jack, 30) LOAD (user, age) UNION FILTER age>18 LOAD (user, age) FILTER udf(user) (Amy, 20) (Fred, 25) (Jack, 30) (Amy, 20) (Fred, 25) (Jack, 30) (Amy, 20) (Fred, 25) (Amy, 20) (Fred, 25)
  • 30. Our Algorithm Algorithm Passes Downstream Pruning Upstream Pruning Prune redundant examples, i.e., improve conciseness without hurting completeness. (Jack, 30) LOAD (user, age) UNION FILTER age>18 LOAD (user, age) FILTER udf(user) (Amy, 20) (Jack, 30) (Amy, 20) (Jack, 30) (Amy, 20) (Amy, 20)
  • 31. Formalization of Pruning. Example records map to elements; equivalence classes map to sets. Picking the minimum number of records that covers every equivalence class is the Set-Cover Problem. More involved because completeness of other operators must be maintained; details in paper.
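The set-cover view can be sketched with the textbook greedy heuristic: repeatedly keep the record that covers the most still-uncovered equivalence classes. This is a stand-in technique chosen for illustration, not necessarily the method used in the paper, and it ignores the extra requirement that other operators' completeness be maintained. The class labels and record data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Record = Tuple[str, int]


def greedy_prune(record_classes: Dict[Record, Set[str]]) -> List[Record]:
    """Greedy set-cover heuristic over example records.

    record_classes maps each example record to the equivalence classes it
    falls into. Returns a subset of records covering every class that any
    record covers. Ties are broken by dictionary insertion order.
    """
    uncovered: Set[str] = set().union(*record_classes.values())
    kept: List[Record] = []
    while uncovered:
        best = max(record_classes, key=lambda r: len(record_classes[r] & uncovered))
        kept.append(best)
        uncovered -= record_classes[best]
    return kept


# Hypothetical labels for the classes of FILTER age > 18.
examples: Dict[Record, Set[str]] = {
    ("Amy", 20): {"age>18 passes"},
    ("Fred", 25): {"age>18 passes"},
    ("Jack", 30): {"age>18 passes"},
    ("Bill", 17): {"age>18 fails"},
}

kept = greedy_prune(examples)
print(kept)  # [('Amy', 20), ('Bill', 17)]
```

Four records shrink to two while both classes stay covered, so conciseness improves and completeness is unchanged.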
  • 32. Our Algorithm Algorithm Passes Downstream Pruning Upstream Pruning Enhance completeness by inserting constraint records (best effort; details in paper) (Jack, 30) LOAD (user, age) UNION FILTER age>18 LOAD (user, age) FILTER udf(user) (Amy, 20) (Jack, 30) (Amy, 20) (Jack, 30) (Amy, 20) (Amy, 20)
  • 33. Our Algorithm Algorithm Passes Downstream Pruning Upstream Pruning Enhance completeness by inserting constraint records (best effort; details in paper) (Jack, 30) LOAD (user, age) UNION FILTER age>18 LOAD (user, age) FILTER udf(user) (Amy, 20) (Jack, 30) (Amy, 20) (Jack, 30) (--, 17) (Amy, 20) (Amy, 20)
  • 34. Our Algorithm Algorithm Passes Downstream Pruning Upstream Pruning Enhance completeness by inserting constraint records (best effort; details in paper) (Jack, 30) (--, 17) LOAD (user, age) UNION FILTER age>18 LOAD (user, age) FILTER udf(user) (Amy, 20) (Jack, 30) (Amy, 20) (Jack, 30) (--, 17) (Amy, 20) (--, 17) (Amy, 20) (Bill, 17)
  • 35. Our Algorithm Algorithm Passes Downstream Pruning Upstream Pruning Enhance completeness by inserting constraint records (best effort; details in paper) (Jack, 30) (Bob, 17) LOAD (user, age) UNION FILTER age>18 LOAD (user, age) FILTER udf(user) (Amy, 20) (Jack, 30) (Amy, 20) (Jack, 30) (Bill, 17) (Bob, 17) (Amy, 20) (Bill, 17) (Amy, 20) (Bill, 17)
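The upstream idea for a single `FILTER age > 18` can be sketched as follows: find which equivalence class has no example, synthesize a record whose constrained field satisfies it, then fill the unconstrained field. The pass/fail classes, the choice of a value one away from the threshold, and the caller-supplied user name are all assumptions for illustration; the slides defer the actual procedure to the paper.

```python
from typing import List, Optional, Tuple

# The user field is None while it is still unconstrained, like "--" on the slides.
Record = Tuple[Optional[str], int]


def missing_filter_classes(records: List[Record], threshold: int) -> List[str]:
    """Names of the (assumed) pass/fail classes that have no example record."""
    missing = []
    if not any(age > threshold for _, age in records):
        missing.append("passes")
    if not any(age <= threshold for _, age in records):
        missing.append("fails")
    return missing


def synthesize_constraint_record(missing_class: str, threshold: int) -> Record:
    """Build a record that constrains only the age field."""
    age = threshold + 1 if missing_class == "passes" else threshold - 1
    return (None, age)


def fill_unconstrained(record: Record, user: str) -> Record:
    """Replace the unconstrained user field with a concrete value."""
    name, age = record
    return (user if name is None else name, age)


examples: List[Record] = [("Amy", 20), ("Jack", 30)]
missing = missing_filter_classes(examples, threshold=18)
constraint = synthesize_constraint_record(missing[0], threshold=18)
filled = fill_unconstrained(constraint, user="Bill")

print(missing)     # ['fails']
print(constraint)  # (None, 17)
print(filled)      # ('Bill', 17)
```

This mirrors the progression shown on the slides, where a partially specified record (--, 17) later appears with a concrete user name.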
  • 36. Our Algorithm Algorithm Passes Downstream Pruning Upstream Pruning Prune redundant examples (as in Pass 2). Favor real examples over synthetic ones. (Jack, 30) (Bob, 17) LOAD (user, age) UNION FILTER age>18 LOAD (user, age) FILTER udf(user) (Amy, 20) (Jack, 30) (Amy, 20) (Jack, 30) (Bill, 17) (Bob, 17) (Amy, 20) (Bill, 17) (Amy, 20) (Bill, 17)
  • 38. Our Algorithm Algorithm Passes Downstream Pruning Upstream Pruning Prune redundant examples (as in Pass 2). Favor real examples over synthetic ones. (Jack, 30) LOAD (user, age) UNION FILTER age>18 LOAD (user, age) FILTER udf(user) (Jack, 30) (Jack, 30) (Bill, 17) (Bill, 17) (Bill, 17)
  • 39. Implementation Status Available as ILLUSTRATE command in open-source release of Pig Available as Eclipse Plugin (PigPen)
  • 41. Performance Evaluation. Program I (Web Search Result Viewing Statistics): LOAD; FILTER by compound arithmetic expression; GROUP; TRANSFORM using built-in aggregate function.
  • 43. Performance Evaluation. Program II (Web Advertising Activity): LOAD table A; FILTER A by compound logical expression; JOIN with table B (highly selective); TRANSFORM using 4 string manipulation UDFs (non-invertible).
  • 47. Actual dataset too large for test runs.
  • 48. Our algorithm can automatically generate examples that illustrate the program through:

Editor's notes

  1. remove Y! logo, slide for related work
  2. say what canonicalize does; filter is like a HAVING clause in SQL
  3. cite Surajit, Motwani
  4. this is what someone would write by hand, or when teaching a class
  5. skip
  6. input or output records?, give rule for UNION
  7. mention that this is only a high-level description
  8. call out which filter every time you say it
  9. nice to point out that real ones can be pruned too
  10. say 8 programs, a sampling of the workload at Yahoo; going to show one easy one and one hard one
  11. say downstream was run with 10,000 initial samples, same as our algorithm
  12. investigate completeness with downstream
  13. put up Y! logo, Pig logo