SampleClean: Bringing Data Cleaning into the BDAS Stack
Sanjay Krishnan and Daniel Haas
In Collaboration With: Juan Sanchez, Wenbo Tao, Jiannan Wang, Tim Kraska, Michael Franklin, Tova Milo, Ken Goldberg

Who publishes more?

Microsoft Academic Search
Paper Id   Affiliation
16         Computer Science Division--University of California Berkeley CA
101        University of California at Berkeley
102        Department of Physics Stanford University California
116        Lawrence Berkeley National Labs <ref>California</ref>

Microsoft Academic Search
University of California at Berkeley
Computer Science Division
University of California at Berkeley
Department of Physics Stanford University California

• Data cleaning in BDAS.
  – Problem 1. Scale
  – Problem 2. Latency
• Sampling to cope with scale.
• Asynchrony to cope with latency.
Enter SampleClean

Now it’s your turn!
Be the crowd and help us decide

Dirty Data is Ubiquitous
Example: Missing, incomplete, inconsistent data

Data Cleaning is Hard
• Time consuming
• Costly
• Domain-specific

A New Data Cleaning Architecture
[Figure: diagram relating Data, Data Cleaning, and Analytics]

Can it Scale?
People are slow and expensive
[Figure: Regex, Machine Learning, and Crowd compared along a Time axis]

Insight 1: Asynchrony Hides Latency

Insight 2: Sampling Hides Scale
[Figure: plot with axes labeled Query Error, Data Error, and Time; BlinkDB and SampleClean are marked on it]

SampleClean Data Flow
[Figure: data-flow diagram with Dirty Data, Dirty Sample, Clean Sample, Query, and Data Cleaning; the steps are annotated Sampling and Asynchrony]

The SampleClean Architecture
[Figure: architecture diagram. Components: Data Cleaning Library, Approximate Query Processing, Asynchronous Pipelines. Data: Dirty Sample, Clean Sample. User interactions: Declare Cleaning Operations; Issue Queries, Get Results.]

Approximate Query Processing
• Estimate early results and bound with error bars
[Figure: Query Error plotted against Time]
SampleClean: Fast and Accurate Query Processing on Dirty Data. SIGMOD 2014
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. EuroSys 2013

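The idea of estimating early results and bounding them with error bars can be illustrated with a minimal sketch. It assumes a uniform random sample and a normal approximation (z = 1.96, roughly a 95% interval). The names `estimate_mean` and `early_results` are hypothetical, and this is only the generic sample-mean error bar, not the SampleClean or BlinkDB estimator.

```python
import math
import random
import statistics


def estimate_mean(sample, z=1.96):
    """Return (estimate, half_width) for the mean, using a normal approximation.

    Assumes `sample` is a uniform random sample with at least two values.
    No finite-population correction is applied.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(n)
    return mean, z * stderr


def early_results(population, sizes, seed=0):
    """Estimate the mean from samples of increasing size.

    Returns a list of (sample_size, estimate, half_width) tuples, one per size.
    """
    rng = random.Random(seed)
    results = []
    for n in sizes:
        sample = rng.sample(population, n)  # sampling without replacement
        estimate, half_width = estimate_mean(sample)
        results.append((n, estimate, half_width))
    return results


if __name__ == "__main__":
    for n, estimate, half_width in early_results(list(range(1000)), [10, 100, 1000]):
        print(f"n={n}: {estimate:.1f} +/- {half_width:.1f}")
```

Under these assumptions the half-width shrinks roughly as 1/sqrt(n), which is why a small sample can already give a usable answer.
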
Crowds and Machines Work Together
• Extensible library of data cleaning tools
• Tools are:
  – Automated
  – Human-powered
  – Hybrid
[Figure: Regex, Machine Learning, and Crowd compared along a Time axis]

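A hybrid tool might be sketched as follows: an automated rule handles the records it recognizes, and everything else is queued for people. The rule table, the canonical names, and the functions `automated_clean` and `hybrid_clean` are all assumptions made for illustration; they are not part of the SampleClean library.

```python
import re

# Hypothetical rule table: (pattern, canonical name). Illustrative only.
RULES = [
    (re.compile(r"\b(uc|university of california)\b.*\bberkeley\b", re.I),
     "University of California, Berkeley"),
    (re.compile(r"\bstanford\b", re.I), "Stanford University"),
]


def automated_clean(value):
    """Return the canonical name for `value`, or None if no rule matches."""
    for pattern, canonical in RULES:
        if pattern.search(value):
            return canonical
    return None


def hybrid_clean(values):
    """Clean what the rules can handle; queue the rest for human review.

    Returns (cleaned, crowd_queue): a dict of original -> canonical name,
    and a list of values that no automated rule could resolve.
    """
    cleaned, crowd_queue = {}, []
    for value in values:
        result = automated_clean(value)
        if result is None:
            crowd_queue.append(value)
        else:
            cleaned[value] = result
    return cleaned, crowd_queue


if __name__ == "__main__":
    affiliations = [
        "Computer Science Division--University of California Berkeley CA",
        "University of California at Berkeley",
        "Department of Physics Stanford University California",
        "Lawrence Berkeley National Labs <ref>California</ref>",
    ]
    cleaned, crowd_queue = hybrid_clean(affiliations)
    print(cleaned)
    print("Needs a human:", crowd_queue)
```

The design choice here is that the cheap automated step runs first, so the slow and expensive human step only sees the records that remain ambiguous.
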
Active Learning and Crowds
• Choose informative training points

Not informative:
  Are these the same?
    Stanford Department of IEOR
    UC Berkeley Stats
  Options: Yes / No

Informative:
  Are these the same?
    Department of Mathematics Stanford University
    University of California Berkeley Department of Mathematics
  Options: Yes / No

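One common way to choose informative points is uncertainty sampling: ask the crowd about the pairs whose score lies closest to the decision boundary. The sketch below assumes token-level Jaccard similarity as a stand-in for a trained model's score and a boundary of 0.5. The slides do not state which selection criterion SampleClean uses, so `jaccard` and `most_informative` are hypothetical helpers.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two strings, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def most_informative(pairs, k=1, boundary=0.5):
    """Return the k pairs whose similarity is closest to the boundary.

    Pairs near the boundary are the ones an automated matcher is least
    sure about, so a human label for them is assumed to help the most.
    """
    return sorted(pairs, key=lambda pair: abs(jaccard(*pair) - boundary))[:k]


if __name__ == "__main__":
    pairs = [
        ("Stanford Department of IEOR", "UC Berkeley Stats"),
        ("Department of Mathematics Stanford University",
         "University of California Berkeley Department of Mathematics"),
    ]
    for pair in most_informative(pairs, k=1):
        print("Ask the crowd about:", pair)
```

With the two example pairs from the slide, the first pair shares no tokens (similarity 0), so it is easy to reject automatically, while the second shares four of seven distinct tokens and is the one selected.
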
Putting it all together: Asynchronous Pipelines
• Users group data cleaning operations into pipelines

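A pipeline of cleaning operations that delivers results as they finish could look like the sketch below. It uses Python's `asyncio` purely as a stand-in; the operation names, the `CANONICAL` lookup table, and the simulated crowd delay are assumptions, not the SampleClean programming model.

```python
import asyncio

# Hypothetical lookup standing in for answers returned by crowd workers.
CANONICAL = {"uc berkeley": "University of California, Berkeley"}


async def trim(record):
    """Automated step: strip surrounding whitespace."""
    return record.strip()


async def crowd_normalize(record):
    """Simulated slow, human-powered step."""
    await asyncio.sleep(0.01)  # stand-in for crowd latency
    return CANONICAL.get(record.lower(), record)


async def run_pipeline(record, operations):
    """Apply each cleaning operation to one record, in order."""
    for operation in operations:
        record = await operation(record)
    return record


async def clean_sample(dirty_sample, operations, on_result):
    """Clean all records concurrently; report each one as soon as it is done."""
    tasks = [asyncio.create_task(run_pipeline(r, operations)) for r in dirty_sample]
    for finished in asyncio.as_completed(tasks):
        on_result(await finished)


def demo():
    cleaned = []
    dirty_sample = ["  uc berkeley ", "stanford university  "]
    asyncio.run(clean_sample(dirty_sample, [trim, crowd_normalize], cleaned.append))
    return cleaned


if __name__ == "__main__":
    print(demo())
```

Because results arrive through a callback, a query could in principle be re-evaluated over whatever part of the sample has been cleaned so far instead of waiting for the slowest record.
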
Great, Now What?
• Prototype implementation complete!
• Significant research challenges remain:
  – Crowd worker performance and quality
  – Pipeline semantics and optimization
  – Programming model and interface
• Open source release targeted for next year

Summary
• Data Cleaning is slow, costly, and domain-specific
• SampleClean brings data cleaning into the BDAS stack
• SampleClean uses asynchrony to hide latency, and sampling to hide scale
• SampleClean combines Algorithms, Machines, and People, all in one system

Asynchrony in Spark
• The Spark abstraction: blocking BSP (bulk synchronous parallel)
• So how do we achieve asynchrony?
  – Multithreaded master
  – Intermediate results materialized in Hive
  – Standalone Finagle HTTP server for crowd work

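The multithreaded-master idea can be sketched in general terms: each blocking job runs in its own thread, so one slow job does not stall the others. This is a plain Python illustration with hypothetical names (`blocking_job`, `run_master`); it does not use Spark, Hive, or Finagle and is not the authors' implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_job(name, seconds):
    """Stand-in for a blocking job: it returns only when all work is done."""
    time.sleep(seconds)
    return f"{name} done"


def run_master(jobs):
    """Submit every blocking job from its own thread and collect the results.

    `jobs` is a list of (name, seconds) tuples. Results are returned in
    submission order, but the jobs themselves run concurrently.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        futures = [pool.submit(blocking_job, name, secs) for name, secs in jobs]
        return [future.result() for future in futures]


if __name__ == "__main__":
    print(run_master([("clean", 0.02), ("query", 0.01)]))
```
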

SampleClean: Bringing Data Cleaning into the BDAS Stack

  • 1. SampleClean: Bringing Data Cleaning into the BDAS Stack. Sanjay Krishnan and Daniel Haas. In Collaboration With: Juan Sanchez, Wenbo Tao, Jiannan Wang, Tim Kraska, Michael Franklin, Tova Milo, Ken Goldberg
  • 3. Microsoft Academic Search. Paper Id, Affiliation: 16, Computer Science Division--University of California Berkeley CA; 101, University of California at Berkeley; 102, Department of Physics Stanford University California; 116, Lawrence Berkeley National Labs <ref>California</ref>
  • 4. Microsoft Academic Search. Paper Id, Affiliation: 16, Computer Science Division--University of California Berkeley CA; 101, University of California at Berkeley; 102, Department of Physics Stanford University California; 116, Lawrence Berkeley National Labs <ref>California</ref> X
  • 5. Microsoft Academic Search. University of California at Berkeley; Computer Science Division; University of California at Berkeley; Department of Physics Stanford University California
  • 6. • Data cleaning in BDAS. – Problem 1. Scale – Problem 2. Latency • Sampling to cope with scale. • Asynchrony to cope with latency. Enter SampleClean
  • 7. Now it’s your turn! Be the crowd and help us decide
  • 8. Dirty Data is Ubiquitous. Example: Missing, incomplete, inconsistent data
  • 9. Data Cleaning is Hard: Time consuming
  • 10. Data Cleaning is Hard: Time consuming, Costly
  • 11. Data Cleaning is Hard: Time consuming, Costly, Domain-specific
  • 12. Data Cleaning is Hard: Time consuming, Costly, Domain-specific
  • 13. A New Data Cleaning Architecture (diagram labels: Analytics, Data, Data Cleaning)
  • 14. A New Data Cleaning Architecture (diagram labels: Analytics, Data Cleaning, Data)
  • 15. Can it Scale? People are slow and expensive (diagram labels: Crowd, Machine Learning, Regex; axis: Time)
  • 16. Insight 1: Asynchrony Hides Latency
  • 17. Insight 2: Sampling Hides Scale (plot: Query Error vs. Time; label: BlinkDB)
  • 18. Insight 2: Sampling Hides Scale (plot: Query Error vs. Time; labels: Data Error, BlinkDB)
  • 19. Insight 2: Sampling Hides Scale (plot: Query Error vs. Time; labels: Data Error, SampleClean, BlinkDB)
  • 20. SampleClean Data Flow (diagram labels: Dirty Data, Dirty Sample, Query, Clean Sample, Data Cleaning, Sampling, Asynchrony)
  • 21. SampleClean Data Flow (diagram labels: Query, Clean Sample, Data Cleaning, Asynchrony)
  • 22. The SampleClean Architecture (diagram labels: Data Cleaning Library; Issue Queries, Get Results; Approximate Query Processing; Asynchronous Pipelines; Clean Sample; Declare Cleaning Operations; Dirty Sample)
  • 23. The SampleClean Architecture (diagram labels: Data Cleaning Library; Issue Queries, Get Results; Approximate Query Processing; Asynchronous Pipelines; Clean Sample; Declare Cleaning Operations; Dirty Sample)
  • 24. Approximate Query Processing • Estimate early results and bound with error bars (plot: Query Error vs. Time). SampleClean: Fast and Accurate Query Processing on Dirty Data. SIGMOD 2014. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. EuroSys 2013.
  • 25. The SampleClean Architecture (diagram labels: Issue Queries, Get Results; Approximate Query Processing; Asynchronous Pipelines; Clean Sample; Declare Cleaning Operations; Dirty Sample; Data Cleaning Library)
  • 26. Crowds and Machines Work Together • Extensible library of data cleaning tools • Tools are: – Automated – Human-powered – Hybrid (diagram labels: Crowd, Machine Learning, Regex; axis: Time)
  • 27. Active Learning and Crowds • Choose informative training points. Not Informative: Are these the same? Stanford Department of IEOR / UC Berkeley Stats (Yes / No). Informative: Are these the same? Department of Mathematics Stanford University / University of California Berkeley Department of Mathematics (Yes / No)
  • 28. Active Learning and Crowds • Choose informative training points. Not Informative: Are these the same? Stanford Department of IEOR / UC Berkeley Stats (Yes / No). Informative: Are these the same? Department of Mathematics Stanford University / University of California Berkeley Department of Mathematics (Yes / No)
  • 29. The SampleClean Architecture (diagram labels: Data Cleaning Library; Issue Queries, Get Results; Clean Sample; Declare Cleaning Operations; Dirty Sample; Approximate Query Processing; Asynchronous Pipelines)
  • 30. Putting it all together: Asynchronous Pipelines • Users group data cleaning operations into pipelines
  • 31. The SampleClean Architecture (diagram labels: Data Cleaning Library; Issue Queries, Get Results; Approximate Query Processing; Asynchronous Pipelines; Clean Sample; Declare Cleaning Operations; Dirty Sample)
  • 32. Great, Now What? • Prototype implementation complete! • Significant research challenges remain: – Crowd worker performance and quality – Pipeline semantics and optimization – Programming model and interface • Open source release targeted for next year
  • 33. Summary • Data Cleaning is slow, costly, and domain-specific • SampleClean brings data cleaning into the BDAS stack • SampleClean uses asynchrony to hide latency, and sampling to hide scale • SampleClean combines Algorithms, Machines, and People, all in one system
  • 34. Asynchrony in Spark • The Spark abstraction: blocking BSP • So how do we achieve asynchrony? – Multithreaded master – Intermediate results materialized in Hive – Standalone Finagle HTTP server for crowd work
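
The "asynchrony hides latency" idea from the slides (cleaning proceeds in the background while queries run against whatever has been cleaned so far) might be sketched roughly as below. This is a single-process Python illustration using threads, not the Spark, Hive, and Finagle setup the last slide describes. `AsyncCleaner`, `clean_affiliation`, and the `CANONICAL` mapping are hypothetical names invented for this sketch and are not part of SampleClean.

```python
import threading

# Hypothetical cleaning rule (an assumption for illustration only): map
# spelling variants of an affiliation to one canonical name.
CANONICAL = {
    "Computer Science Division--University of California Berkeley CA": "UC Berkeley",
    "University of California at Berkeley": "UC Berkeley",
}


def clean_affiliation(raw):
    """Return the canonical name if one is known, else the input unchanged."""
    return CANONICAL.get(raw, raw)


class AsyncCleaner:
    """Cleans a sample in a background thread; queries see the cleaned prefix."""

    def __init__(self, dirty_sample):
        self._dirty = list(dirty_sample)
        self._clean = []
        self._lock = threading.Lock()
        self._worker = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._worker.start()

    def _run(self):
        for record in self._dirty:
            cleaned = clean_affiliation(record)
            with self._lock:
                self._clean.append(cleaned)

    def count(self, value):
        """Return (matching records, records cleaned so far)."""
        with self._lock:
            return self._clean.count(value), len(self._clean)

    def wait(self):
        self._worker.join()


if __name__ == "__main__":
    cleaner = AsyncCleaner([
        "Computer Science Division--University of California Berkeley CA",
        "University of California at Berkeley",
        "Department of Physics Stanford University California",
        "Lawrence Berkeley National Labs",
    ])
    cleaner.start()
    print(cleaner.count("UC Berkeley"))  # may reflect a partially cleaned sample
    cleaner.wait()
    print(cleaner.count("UC Berkeley"))  # reflects the fully cleaned sample
```

The lock keeps each query consistent with a well-defined prefix of the cleaned sample. A query issued early simply sees fewer cleaned records than one issued later.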

Editor's Notes

  1. Start with Berkeley vs. Stanford, not the dataset
  2. Talk more about the dataset/problem/query before jumping into the sources of error
  3. Do *not* say ‘algorithms only go so far’!
  4. …and can’t be ignored! Analytics on dirty data can lead to incorrect decision-making.
  5. Asynchrony: We allow data cleaning to proceed in the background while analysts make use of the already cleaned data. Approximation: approximate results are often sufficient, especially for early data analysis tasks such as exploratory data analysis. We leverage sampling / machine learning to provide approximations quickly, then improve our answers as more of the data is cleaned.
  6. Asynchrony: We allow data cleaning to proceed in the background while analysts make use of the already cleaned data. You saw this in the demo just now: the dashboard issued queries in real time as the data updated.
  7. Approximation: approximate results are often sufficient, especially for early data analysis tasks such as exploratory data analysis. We leverage sampling / machine learning to provide approximations quickly, then improve our answers as more of the data is cleaned.
  8. Asynchrony: We allow data cleaning to proceed in the background while analysts make use of the already cleaned data. Approximation: approximate results are often sufficient, especially for early data analysis tasks such as exploratory data analysis. We leverage sampling / machine learning to provide approximations quickly, then improve our answers as more of the data is cleaned.
  9. Asynchrony: We allow data cleaning to proceed in the background while analysts make use of the already cleaned data. Approximation: approximate results are often sufficient, especially for early data analysis tasks such as exploratory data analysis. We leverage sampling / machine learning to provide approximations quickly, then improve our answers as more of the data is cleaned.
  10. Imagine a scenario where you have a large, dirty dataset, and cleaning all of it would cost you a lot of time and money. With our system, you don’t have to clean the entire dataset. You can clean only a small sample, and our system will use the results of that cleaning to understand the data error and return a better query result. Even better, our system can bound the query results and tell you the range in which they would fall if you cleaned the entire dataset. For more details about this sampling feature, see our latest SIGMOD paper. We follow the BlinkDB path and only support aggregate queries. We can extend this approach to support more complex queries using the non-parametric bootstrap and diagnostics. We also extend the BlinkDB approach to handle data error as well as query error.
  11. So in order to require as little work from humans as possible, we use humans to train models that can extrapolate human work to the rest of our data. In particular, we use a technique called active learning, where we have humans clean the most informative bits of data so we can train a better model faster.
  12. Point out that we have an extensible, general-purpose active learning library built on MLlib that can talk to multiple crowds.
  13. Talk about executing on a sample. Talk about arguments to the pipeline.
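
The sampling idea in the notes above (clean only a small sample, then answer an aggregate query with a bound) can be illustrated with a minimal sketch. It assumes a simple random sample and a standard normal-approximation confidence interval for a mean; this is not necessarily the estimator used in the SIGMOD paper. The `clean` rule and the function name `estimate_mean_from_clean_sample` are hypothetical.

```python
import math
import random


def clean(record):
    # Hypothetical cleaning rule (assumption): a negative count is a
    # sign-entry error, so take the absolute value.
    return abs(record)


def estimate_mean_from_clean_sample(dirty_data, sample_size, seed=0, z=1.96):
    """Estimate the mean of the cleaned data from a cleaned random sample.

    Returns (estimate, half_width), where estimate +/- half_width is an
    approximate 95% confidence interval when z = 1.96.
    Requires sample_size >= 2.
    """
    rng = random.Random(seed)
    dirty_sample = rng.sample(dirty_data, sample_size)  # without replacement
    clean_sample = [clean(r) for r in dirty_sample]     # clean only the sample

    n = len(clean_sample)
    mean = sum(clean_sample) / n
    variance = sum((x - mean) ** 2 for x in clean_sample) / (n - 1)
    half_width = z * math.sqrt(variance / n)
    return mean, half_width


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic data for illustration: some values have a flipped sign.
    data = [rng.randint(1, 100) * rng.choice([1, 1, 1, -1]) for _ in range(10_000)]
    estimate, half_width = estimate_mean_from_clean_sample(data, sample_size=500)
    print(f"mean ~ {estimate:.2f} +/- {half_width:.2f}")
```

Only `sample_size` records are ever cleaned, so the cleaning cost is set by the sample size rather than the dataset size. The interval narrows roughly with the square root of the sample size.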