Generating Example Data For Dataflow Programs
Chris Olston, Shubham Chopra, Utkarsh Srivastava
Research
Data Processing Renaissance
Lots of queries and programs to analyze that data
New data flow languages
Map-Reduce, Pig Latin, Dryad
Other data flow systems
Aurora, Tioga, River
Iterative Process
Program: LOAD (user, url) → TRANSFORM user, canonicalize(url) → JOIN on url with LOAD (url, pagerank) → GROUP on user → TRANSFORM user, AVG(pagerank) → FILTER avgPR > 0.5
Result: no output
Questions raised: Joining on the right attribute? Bug in UDF canonicalize? Everything being filtered out?
How to do test runs?
Run with real data: too inefficient (TBs of data)
Create smaller data sets (e.g., by sampling): empty results due to joins [Chaudhuri et al. 99] and selective filters
Biased sampling for joins: indexes not always present
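The sampling problem above can be made concrete with a small sketch. The tables, sizes, and the 1% sampling rate below are invented for illustration; the point is only that independently sampling both join inputs leaves very few matching keys.

```python
import random

def sample(rows, fraction, rng):
    """Keep each row independently with probability `fraction`."""
    return [r for r in rows if rng.random() < fraction]

def join_on_url(visits, pageranks):
    """Equi-join (user, url) rows with (url, pagerank) rows on url."""
    rank_by_url = dict(pageranks)
    return [(user, url, rank_by_url[url]) for user, url in visits if url in rank_by_url]

rng = random.Random(0)

# Hypothetical data: 10,000 distinct urls, one visit row and one pagerank row each.
visits = [(f"user{i}", f"url{i}") for i in range(10_000)]
pageranks = [(f"url{i}", 0.5) for i in range(10_000)]

full_join = join_on_url(visits, pageranks)

# Sampling 1% of each input keeps a matching pair with probability 0.01 * 0.01,
# so the sampled join is expected to contain about one row.
sampled_join = join_on_url(sample(visits, 0.01, rng), sample(pageranks, 0.01, rng))

print(len(full_join), len(sampled_join))
```

A selective FILTER placed after such a join would shrink the sampled output further, which is the situation the slide describes.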
Examples to Illustrate Program
LOAD (user, url): (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html)
TRANSFORM user, canonicalize(url): (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
LOAD (url, pagerank): (www.cnn.com, 0.9), (www.frogs.com, 0.3), (www.snails.com, 0.4)
JOIN on url: (Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3), (Fred, www.snails.com, 0.4)
GROUP on user: (Amy, ((Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3))), (Fred, ((Fred, www.snails.com, 0.4)))
TRANSFORM user, AVG(pagerank): (Amy, 0.6), (Fred, 0.4)
FILTER avgPR > 0.5: (Amy, 0.6)
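To follow the example records step by step, the program can be mimicked in a few lines of Python. This is only a sketch: the slides do not define `canonicalize`, so the version here (strip the scheme and path, ensure a `www.` prefix) is an assumption chosen to reproduce the records shown.

```python
from collections import defaultdict

def canonicalize(url):
    """Hypothetical stand-in for the UDF: strip scheme and path, ensure a www. prefix."""
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return host if host.startswith("www.") else "www." + host

# LOAD (user, url) and LOAD (url, pagerank), using the records from the slide.
visits = [("Amy", "cnn.com"), ("Amy", "http://www.frogs.com"), ("Fred", "www.snails.com/index.html")]
pageranks = [("www.cnn.com", 0.9), ("www.frogs.com", 0.3), ("www.snails.com", 0.4)]

# TRANSFORM user, canonicalize(url)
canonical = [(user, canonicalize(url)) for user, url in visits]

# JOIN on url
rank_by_url = dict(pageranks)
joined = [(user, url, rank_by_url[url]) for user, url in canonical if url in rank_by_url]

# GROUP on user
groups = defaultdict(list)
for user, url, rank in joined:
    groups[user].append(rank)

# TRANSFORM user, AVG(pagerank)
averages = [(user, sum(ranks) / len(ranks)) for user, ranks in groups.items()]

# FILTER avgPR > 0.5
result = [(user, avg) for user, avg in averages if avg > 0.5]
print(result)
```

With a `canonicalize` that failed to add the `www.` prefix, the join would find no match for (Amy, cnn.com), which is the kind of bug the example records are meant to expose.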
Value Addition From Examples
Examples can be used for:
Debugging
Understanding a program written by someone else
Learning a new operator or language
Outline
Formalization of good examples
Example Generation Algorithm
Performance Evaluation
Good Examples: Consistency
0. Consistency: output example = operator applied on input example
E.g., TRANSFORM user, canonicalize(url) maps (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html) to (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
Good Examples: Realism
1. Realism
Formalization: fraction of examples that are real or are derived from real records
Shown on the LOAD (user, url) records (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html) and their canonicalized versions
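The realism measure is a plain fraction, which a short sketch makes explicit. The `(record, is_real)` pairing and the choice to return 0.0 for an empty example set are assumptions of this sketch, not something the slides specify.

```python
def realism(example_records):
    """Fraction of example records flagged as real (or derived from real records).

    Each entry is a (record, is_real) pair; the flag is assumed to be tracked
    by whatever generates the examples.
    """
    if not example_records:
        return 0.0  # assumption: an empty example set scores zero
    real = sum(1 for _, is_real in example_records if is_real)
    return real / len(example_records)

# Two records taken from real data and one synthetic record with an unknown user.
examples = [(("Amy", 20), True), (("Jack", 30), True), ((None, 17), False)]
score = realism(examples)
print(score)
```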
Good Examples: Completeness
2. Completeness: demonstrate the salient properties of each operator, e.g., FILTER
FILTER avgPR > 0.5 receives (Amy, 0.6), (Fred, 0.4) and outputs (Amy, 0.6)
Good Examples: Completeness
2. Completeness: demonstrate the salient properties of each operator, e.g., JOIN
JOIN on url combines (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com) with (www.cnn.com, 0.9), (www.frogs.com, 0.3), (www.snails.com, 0.4) into (Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3), (Fred, www.snails.com, 0.4)
Formalizing Completeness
Each equivalence class demonstrates one property of the operator.
Try to have at least one example from each class
Formalizing Completeness
Operator Completeness: fraction of equivalence classes that have at least one example record.
Overall Completeness: average of per-operator completeness.
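The two completeness definitions translate directly into code. In the sketch below, the equivalence classes for FILTER (records that pass, records that fail) are an assumption based on the FILTER illustration earlier in the deck, and they are defined over the operator's input records; the slides leave the classes of other operators to the paper.

```python
def operator_completeness(records, classes):
    """Fraction of equivalence classes containing at least one example record.

    `classes` maps a class name to a predicate that tests membership.
    """
    covered = sum(1 for member in classes.values() if any(member(r) for r in records))
    return covered / len(classes)

def overall_completeness(per_operator_scores):
    """Average of the per-operator completeness values."""
    return sum(per_operator_scores) / len(per_operator_scores)

# Assumed equivalence classes for FILTER avgPR > 0.5, over its input records.
filter_classes = {
    "passes": lambda r: r[1] > 0.5,
    "fails": lambda r: r[1] <= 0.5,
}

both_classes = operator_completeness([("Amy", 0.6), ("Fred", 0.4)], filter_classes)
one_class = operator_completeness([("Amy", 0.6)], filter_classes)

# Overall score for a hypothetical two-operator program with these per-operator values.
overall = overall_completeness([both_classes, one_class])
print(both_classes, one_class, overall)
```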
Good Examples: Conciseness
3. Conciseness
Operator Conciseness: (# equivalence classes) / (# example records)
Overall Conciseness: average of per-operator conciseness
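As a worked instance of the conciseness ratio, here is a minimal sketch. Capping the ratio at 1 (so that having fewer records than classes is not rewarded) and rejecting an empty example set are assumptions of this sketch; the slide states only the ratio.

```python
def operator_conciseness(num_equivalence_classes, num_example_records):
    """Ratio of equivalence classes to example records for one operator."""
    if num_example_records <= 0:
        raise ValueError("need at least one example record")
    # Assumption: cap at 1 so the score stays within [0, 1].
    return min(1.0, num_equivalence_classes / num_example_records)

def overall_conciseness(per_operator_scores):
    """Average of the per-operator conciseness values."""
    return sum(per_operator_scores) / len(per_operator_scores)

# A FILTER with two equivalence classes, illustrated by three records and by two.
three_records = operator_conciseness(2, 3)
two_records = operator_conciseness(2, 2)
overall = overall_conciseness([three_records, two_records])
print(three_records, two_records, overall)
```

Dropping the redundant third record raises the operator's score from 2/3 to 1, which is the effect the pruning passes aim for.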
Outline
Formalization of good examples
Example Generation Algorithm
Performance Evaluation
Related Work
Related areas: Reverse Query Processing, Database Testing, Software and Hardware Verification
Differences: realism not a concern; notion of conciseness is different; intermediate result size is immaterial
Strawman I: Downstream Propagation
Take some portion of the input data and run the program over it.
Criteria rated on the slide: 1. Realism, 2. Completeness, 3. Conciseness
Strawman II: Upstream Propagation
Start from what output is desired, and work backwards.
Criteria rated on the slide: 1. Realism, 2. Completeness, 3. Conciseness
Our Algorithm
Algorithm passes: 1. Downstream, 2. Pruning, 3. Upstream, 4. Pruning
Example program (from the diagram): two LOAD (user, age) inputs, FILTER udf(user), UNION, FILTER age>18
Pass 1 (Downstream): Take a subset of input and propagate through the program. Example records: (Jack, 30), (Amy, 20), (Fred, 25)
Pass 2 (Pruning): Prune redundant examples, i.e., improve conciseness without hurting completeness. Records after pruning: (Jack, 30), (Amy, 20)
Formalization of Pruning
Example records correspond to elements; equivalence classes correspond to sets
Pick the minimum number of records that covers every equivalence class: the Set-Cover problem
More involved because completeness of other operators must be maintained; details in paper
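The covering formulation can be sketched with the standard greedy heuristic: repeatedly keep the record that belongs to the most classes not yet covered. This is a generic approximation written for a single operator, not necessarily the procedure used in the paper, and it ignores the cross-operator completeness constraints the slide mentions. The pass/fail classes for FILTER age>18 are an assumption.

```python
def prune(records, classes):
    """Greedy cover: keep picking the record that hits the most uncovered classes.

    `classes` maps a class name to a membership predicate. Classes that no
    record belongs to are ignored, since pruning cannot cover them.
    """
    membership = {r: {name for name, member in classes.items() if member(r)} for r in records}
    uncovered = set().union(*membership.values())
    kept = []
    while uncovered:
        # Ties are broken by input order, because max() returns the first maximum.
        best = max(records, key=lambda r: len(membership[r] & uncovered))
        kept.append(best)
        uncovered -= membership[best]
    return kept

# Assumed equivalence classes for FILTER age>18: input records that pass or fail.
age_filter_classes = {
    "passes": lambda r: r[1] > 18,
    "fails": lambda r: r[1] <= 18,
}

records = [("Amy", 20), ("Fred", 25), ("Jack", 30), ("Bill", 17)]
kept = prune(records, age_filter_classes)
print(kept)
```

Here three records pass the filter and one fails, so two records suffice: one from each class.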
Our Algorithm, Pass 3 (Upstream): Enhance completeness by inserting constraint records (best effort; details in paper)
Before: (Jack, 30), (Amy, 20)
A constraint record (--, 17) is inserted and propagated upstream
After: (Jack, 30), (Amy, 20), (Bill, 17), (Bob, 17)
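A toy version of constraint-record insertion for one filter might look like the following. The slides defer the real mechanism to the paper, so everything here is an assumption: the rule handles only integer "greater than" filters, picks the value just below the threshold, and leaves the user field unknown, mirroring the (--, 17) record shown in the diagram.

```python
def add_failing_record(records, threshold):
    """If every record passes `age > threshold`, append a synthetic one that fails.

    The synthetic record leaves the user unknown (None), like "--" in the diagram.
    """
    if all(age > threshold for _, age in records):
        return list(records) + [(None, threshold - 1)]
    return list(records)

# After pruning, both remaining records pass FILTER age>18, so the
# "record is filtered out" behavior is not illustrated yet.
examples = [("Amy", 20), ("Jack", 30)]
augmented = add_failing_record(examples, 18)
print(augmented)
```

The unknown field would then have to be filled in further upstream, which is where the best-effort caveat on the slide applies.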
Our Algorithm, Pass 4 (Pruning): Prune redundant examples (as in Pass 2). Favor real examples over synthetic ones.
Before: (Jack, 30), (Amy, 20), (Bill, 17), (Bob, 17)
After: (Jack, 30), (Bill, 17)
Implementation Status
Available as ILLUSTRATE command in open-source release of Pig
Available as Eclipse Plugin (PigPen)
PigPen Snapshot
Performance Evaluation
Program I (Web Search Result Viewing Statistics): LOAD, FILTER by compound arithmetic expression, GROUP, TRANSFORM using built-in aggregate function

Performance Evaluation
Program II (Web Advertising Activity): LOAD table A, FILTER A by compound logical expression, JOIN with table B (highly selective), TRANSFORM using 4 string manipulation UDFs (non-invertible)
Actual dataset too large for test runs.
Our algorithm can automatically generate examples that illustrate the program through:

Editor's Notes

  1. remove Y! logo, slide for related work
  2. say what canonicalize does, filter like having clause in SQL
  3. cite surajit, motwani
  4. this is what someone would write by hand, or when teaching a class
  5. skip
  6. input or output records?, give rule for UNION
  7. mention that only highlevel description
  8. call out which filter every time you say it
  9. nice to point out that real ones can be pruned too
  10. say 8 programs, sampling of workload at yahoo, going to show one easy one and 1 hard one
  11. say downstream run with 10000 initial samples, same as our algo
  12. investigate completeness with downstream
  13. put up y! logo, pig logo