Querying Petabytes of Data in
Seconds using Sampling
Sameer Agarwal
Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica
UC Berkeley, MIT
Can we do better than in-memory?
Can we get more with less?
Can fast get faster?
10 TB on 100 machines
Hard disks: ½ - 1 hour
Memory: 1 - 5 minutes
?: 1 second
Query Execution on Samples
ID City Buff Ratio
1 NYC 0.78
2 NYC 0.13
3 Berkeley 0.25
4 NYC 0.19
5 NYC 0.11
6 Berkeley 0.09
7 NYC 0.18
8 NYC 0.15
9 Berkeley 0.13
10 Berkeley 0.49
11 NYC 0.19
12 Berkeley 0.10
Query Execution on Samples
What is the average buffering ratio in the table?
0.2325
Query Execution on Samples
What is the average buffering ratio in the table?
Uniform sample (sampling rate 1/4):
ID City Buff Ratio Sampling Rate
2 NYC 0.13 1/4
3 Berkeley 0.25 1/4
4 NYC 0.19 1/4
Estimate from the sample: 0.19 +/- 0.05
Exact answer: 0.2325
Query Execution on Samples
What is the average buffering ratio in the table?
Uniform sample (sampling rate 1/2):
ID City Buff Ratio Sampling Rate
2 NYC 0.13 1/2
3 Berkeley 0.25 1/2
4 NYC 0.19 1/2
6 Berkeley 0.09 1/2
7 NYC 0.18 1/2
10 Berkeley 0.49 1/2
Estimate from the sample: 0.22 +/- 0.02 (was 0.19 +/- 0.05 at sampling rate 1/4)
Exact answer: 0.2325
Speed/Accuracy Trade-off
[Figure: error vs. execution time (sample size). Executing on the entire dataset takes 30 mins; interactive queries take 2 sec.]
Sampling Vs. No Sampling
Fraction of full data    Query response time (seconds)
1        1020
10^-1    103
10^-2    18
10^-3    13
10^-4    10
10^-5    8
10x (1020 s to 103 s), as response time is dominated by I/O
Sampling Vs. No Sampling (with error bars)
Fraction of full data    Query response time (seconds)    Error
1        1020
10^-1    103     0.02%
10^-2    18      0.07%
10^-3    13      1.1%
10^-4    10      3.4%
10^-5    8       11%
Sampling Error
Typically, error depends on sample size (n) and not on original data size, i.e., error is proportional to 1/sqrt(n)*
* Conditions Apply
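One standard case where the 1/sqrt(n) rule holds, given here as an illustration rather than as the talk's own derivation, is the mean of a simple random sample of n rows drawn without replacement from N rows. The symbols N (table size), \sigma (population standard deviation) and \bar{x}_n (sample mean) are notation introduced for this sketch.

```latex
\operatorname{SE}(\bar{x}_n)
  = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}
  \approx \frac{\sigma}{\sqrt{n}}
  \qquad \text{when } n \ll N
```

Under these assumptions the table size N drops out once the sample is a small fraction of the data, and quadrupling n halves the error.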
Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city='San Francisco'
WITHIN 1 SECONDS
Result: 234.23 ± 15.32
With WITHIN 2 SECONDS, the result becomes 239.46 ± 4.96
Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city='San Francisco'
WITHIN 1 SECONDS
Aggregates: AVG, COUNT, SUM, STDEV, PERCENTILE, etc.
Clauses: FILTERS, GROUP BY
Speed/Accuracy Trade-off
SELECT my_function(sessionTime)
FROM Table
LEFT OUTER JOIN logs2
ON Table.id = logs2.id
WHERE city='San Francisco'
WITHIN 1 SECONDS
Also applies to: JOINS, nested queries, ML primitives, user-defined functions, etc.
Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city='San Francisco'
ERROR 0.1 CONFIDENCE 95.0%
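To make the ERROR/CONFIDENCE form concrete, here is a minimal sketch of how an error bound could be turned into a sample size for an AVG query using the normal approximation. It treats ERROR 0.1 as an absolute half-width of the confidence interval, which is an assumption, and `required_sample_size` and its `stddev` input are hypothetical names, not BlinkDB's API.

```python
import math
from statistics import NormalDist


def required_sample_size(stddev, max_error, confidence=0.95):
    """Rows needed so that z * stddev / sqrt(n) <= max_error.

    stddev is an assumed estimate of the column's standard deviation
    (for example, taken from a small pilot sample).
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil((z * stddev / max_error) ** 2)


if __name__ == "__main__":
    # Hypothetical column with standard deviation 1.0.
    print(required_sample_size(stddev=1.0, max_error=0.1, confidence=0.95))
```

The required size grows with the square of stddev / max_error and does not depend on the size of the underlying table.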
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and
stratified samples from underlying data
- returns fast, approximate answers with error
bars by executing queries on samples of data
- verifies the correctness of the error bars that it
returns at runtime
Uniform Samples
[Diagram: data blocks 1-4 pass through a uniform sampling operator (U)]
ID City Data
1 NYC 0.78
2 NYC 0.13
3 Berkeley 0.25
4 NYC 0.19
5 NYC 0.11
6 Berkeley 0.09
7 NYC 0.18
8 NYC 0.15
9 Berkeley 0.13
10 Berkeley 0.49
11 NYC 0.19
12 Berkeley 0.10
Uniform Samples
1. FILTER rand() < 1/3
2. Adds per-row weights
3. In-memory Shuffle
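The three steps of the uniform sampling operator can be sketched in plain Python as follows. This is an illustration on in-memory lists, not the Spark implementation. `uniform_sample` and `estimate_sum` are hypothetical helper names, and the weight is stored as the sampling rate, matching the Weight column in the slides.

```python
import random


def uniform_sample(rows, rate, seed=None):
    """Keep each row with probability `rate`, tag it with a weight, then shuffle."""
    rng = random.Random(seed)
    # Steps 1 and 2: FILTER rand() < rate, and attach the per-row weight.
    sample = [dict(row, weight=rate) for row in rows if rng.random() < rate]
    # Step 3: shuffle so that any prefix of the sample is itself a uniform sample.
    rng.shuffle(sample)
    return sample


def estimate_sum(sample, column):
    """Scale each sampled value by 1/weight to estimate the full-table SUM."""
    return sum(row[column] / row["weight"] for row in sample)


if __name__ == "__main__":
    table = [{"id": i, "data": 1.0} for i in range(1, 13)]
    s = uniform_sample(table, rate=1 / 3, seed=42)
    print(len(s), estimate_sum(s, "data"))
```

Carrying the weight on every row lets downstream aggregates rescale their results without knowing how the sample was built.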
Uniform Samples
ID City Data Weight
2 NYC 0.13 1/3
8 NYC 0.15 1/3
6 Berkeley 0.09 1/3
11 NYC 0.19 1/3
Doesn’t change Spark RDD Semantics
Stratified Samples
[Diagram: data blocks 1-4 pass through a stratified sampling operator (S)]
Stratified Samples
S1 (SPLIT): the input table is split into two streams
S2 (GROUP): one stream is grouped by City to get per-group counts and sampling ratios
City Count Ratio
NYC 7 2/7
Berkeley 5 2/5
S2 (JOIN): the ratios are joined back to the rows of the other stream
Stratified Samples
[Diagram: data blocks 1-4 pass through S1, S2 (GROUP), S2 (JOIN) and U]
ID City Data Weight
2 NYC 0.13 2/7
8 NYC 0.15 2/7
6 Berkeley 0.09 2/5
12 Berkeley 0.10 2/5
Doesn’t change Shark RDD Semantics
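The group, ratio and sample steps above can be sketched on the example table as follows. This is a simplified in-memory illustration, not the Spark/Shark operator: it draws exactly `cap` rows per group instead of filtering each row with probability equal to the ratio, and `stratified_sample`, `cap` and `TABLE` are hypothetical names.

```python
import random
from collections import defaultdict

# The 12-row example table from the slides.
TABLE = [
    {"id": i, "city": c, "data": d}
    for i, c, d in [
        (1, "NYC", 0.78), (2, "NYC", 0.13), (3, "Berkeley", 0.25),
        (4, "NYC", 0.19), (5, "NYC", 0.11), (6, "Berkeley", 0.09),
        (7, "NYC", 0.18), (8, "NYC", 0.15), (9, "Berkeley", 0.13),
        (10, "Berkeley", 0.49), (11, "NYC", 0.19), (12, "Berkeley", 0.10),
    ]
]


def stratified_sample(rows, key, cap, seed=None):
    """Sample up to `cap` rows per group and tag each with its group's ratio."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:  # GROUP: collect rows by stratum
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        k = min(cap, len(members))
        ratio = k / len(members)  # e.g. 2/7 for NYC, 2/5 for Berkeley
        for row in rng.sample(members, k):  # JOIN the ratio back as a weight
            sample.append(dict(row, weight=ratio))
    return sample


if __name__ == "__main__":
    for row in stratified_sample(TABLE, "city", cap=2, seed=0):
        print(row)
```

Because every group contributes rows, a query that filters on a rare city still finds data in the sample, which a uniform sample of the same size may not guarantee.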
Error Estimation
Closed Form Aggregate Functions
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM, VARIANCE and STDEV
[Diagram: the aggregate A (AVG, SUM, COUNT, STDEV, VARIANCE) runs on the sample and returns A ± ε]
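For AVG, the closed-form estimate amounts to the textbook normal-approximation interval. The sketch below assumes a uniform sample with equal weights and a sample small enough relative to the table that no finite-population correction is needed. `mean_with_error` is a hypothetical helper, not a BlinkDB function.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_with_error(values, confidence=0.95):
    """Return (sample mean, half-width of the CLT confidence interval)."""
    n = len(values)
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    half_width = z * stdev(values) / math.sqrt(n)
    return mean(values), half_width


if __name__ == "__main__":
    estimate, err = mean_with_error([1, 2, 3, 4, 5])
    print(f"{estimate} ± {err:.4f}")
```

The interval is computed from the sample alone, which is why this approach adds almost no cost to the query itself.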
Error Estimation
Generalized Aggregate Functions
- Statistical Bootstrap
- Applicable to complex and nested queries, UDFs, joins etc.
[Diagram: the query A runs on the sample and on its resamples (A1, A2, …, A100); the bootstrap step B turns these results into ±ε]
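A minimal sketch of the statistical bootstrap for an arbitrary aggregate follows. It resamples the sample with replacement, re-runs the aggregate on each resample, and reports the spread of the results as the error. The function name, the default of 100 resamples (taken from the A1 … A100 labels) and the use of the standard deviation as the error measure are assumptions of this illustration.

```python
import random
import statistics


def bootstrap_error(values, aggregate, num_resamples=100, seed=0):
    """Return (aggregate on the sample, std. dev. of the aggregate over resamples)."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(num_resamples):
        # Draw n rows with replacement from the sample.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(aggregate(resample))
    return aggregate(values), statistics.stdev(estimates)


if __name__ == "__main__":
    data = list(range(1, 102))
    # Works for aggregates without a closed-form error, such as the median.
    print(bootstrap_error(data, statistics.median))
```

The aggregate is treated as a black box, which is what makes the method applicable to UDFs and nested queries, at the price of running it once per resample.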
Error Verification
[Figure: error vs. sample size]
More data → higher accuracy
300 data points → 97% accuracy [KDD’13] [SIGMOD’14]
Single Pass Execution
Approximate query (A) on a sample:
State Age Metric Weight
CA 20 1971 1/4
CA 22 2819 1/4
MA 22 3819 1/4
MA 30 3091 1/4
Single Pass Execution
Resampling Operator (R): placed between the sample and the query (A)
Sample “Pushdown”: R then operates only on the rows and columns the query needs:
Metric Weight
1971 1/4
3819 1/4
Resampling Operator: R produces resamples drawn with replacement, for example:
Metric Weight
1971 1/4
1971 1/4
and
Metric Weight
3819 1/4
3819 1/4
The query runs on the sample (A) and on each resample (A1, …, An)
Single Pass Execution
Leverage Poissonized Resampling to generate samples with replacement
Sample from a Poisson (1) Distribution (resample S1, query result A1):
Metric Weight S1
1971 1/4 2
3819 1/4 1
Construct all Resamples in Single Pass (resamples S1, …, Sk, query results A1, …, Ak):
Metric Weight S1 Sk
1971 1/4 2 1
3819 1/4 1 0
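Poissonized resampling can be sketched as follows: instead of materializing each resample, every row draws an independent Poisson(1) count per resample (the S1 … Sk columns), and all k aggregates are updated in one scan. This is a plain-Python illustration for AVG only. The function names and the use of Knuth's method for Poisson draws are choices made for this sketch.

```python
import math
import random


def poisson1(rng):
    """Draw from Poisson(1) using Knuth's multiplication method."""
    limit = math.exp(-1.0)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def poissonized_means(values, num_resamples=100, seed=0):
    """Compute AVG on `num_resamples` resamples in a single pass over `values`."""
    rng = random.Random(seed)
    sums = [0.0] * num_resamples
    counts = [0] * num_resamples
    for v in values:  # one scan of the sample
        for j in range(num_resamples):
            c = poisson1(rng)  # how often this row appears in resample j
            sums[j] += c * v
            counts[j] += c
    # A resample that happens to be empty has no mean and is skipped.
    return [s / c for s, c in zip(sums, counts) if c > 0]


if __name__ == "__main__":
    means = poissonized_means([1971.0, 2819.0, 3819.0, 3091.0], num_resamples=5)
    print(means)
```

Each row's counts are independent of every other row, so the scan needs no coordination between rows or partitions. The resample sizes vary around n rather than being exactly n.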
Single Pass Execution
Each row of the sample (S) carries bootstrap weights (S1, …, S100) and diagnostics weights (Da1, …, Da100; Db1, …, Db100; Dc1, …, Dc100)
Additional Overhead: 200 bytes/row
Embarrassingly Parallel
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and
stratified samples from underlying data [Offline Process]
- returns fast, approximate answers with error
bars by executing queries on samples of data [Online Process]
- verifies the correctness of the error bars that it
returns at runtime [Online Process]
BlinkDB Architecture
[Diagram: TABLE (original data) feeds the Sampling Module]
Offline-sampling: Creates an optimal set of samples on native tables and materialized views based on query history and …
BlinkDB Architecture
[Diagram: TABLE (original data) feeds the Sampling Module, which produces In-Memory Samples and On-Disk Samples]
Sample Placement: Samples striped over 100s or 1,000s of machines both on disks and in-memory.
BlinkDB Architecture
[Diagram: a HiveQL/SQL query goes through Sample Selection to a Query Plan, using the In-Memory and On-Disk Samples produced by the Sampling Module]
SELECT foo (*) FROM TABLE WITHIN 2
Online sample selection to pick best sample(s) based on query latency and accuracy requirements
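As a rough illustration of latency-driven sample selection, the sketch below picks the largest stored sample whose estimated scan time fits a WITHIN budget. The cost model (rows divided by an assumed scan throughput) and all names here are hypothetical. The slides do not describe BlinkDB's actual selection logic at this level of detail.

```python
def pick_sample(samples, time_budget_s, rows_per_second):
    """Choose a sample for a query with a latency bound.

    samples: list of (name, num_rows) tuples describing the stored samples.
    rows_per_second: assumed scan throughput of the cluster.
    """
    feasible = [s for s in samples if s[1] / rows_per_second <= time_budget_s]
    if not feasible:
        # Nothing fits the budget: fall back to the smallest sample.
        return min(samples, key=lambda s: s[1])
    # A larger sample gives tighter error bars, so take the largest that fits.
    return max(feasible, key=lambda s: s[1])


if __name__ == "__main__":
    stored = [("s_small", 1_000), ("s_medium", 100_000), ("s_large", 10_000_000)]
    print(pick_sample(stored, time_budget_s=2, rows_per_second=1e5))
```

An error-bounded query (ERROR … CONFIDENCE …) would invert the choice: take the smallest sample whose predicted error meets the bound.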
BlinkDB Architecture
[Diagram: a HiveQL/SQL query goes through Sample Selection to a New Query Plan, which runs on Hive/Shark/Presto over the In-Memory and On-Disk Samples and returns a Result with Error Bars & Confidence Intervals]
SELECT foo (*) FROM TABLE WITHIN 2
Result: 182.23 ± 5.56 (95% confidence)
Parallel query execution on multiple samples striped across …
BlinkDB is Fast!
- 5 Queries, 5 machines
- 20 GB samples (0.001%-1% of original data)
- 1-5% Error
[Figure: response time (s) per query, shown first for query execution alone, then for overall query execution including error estimation overhead and error verification overhead]
Coming Soon: Native Spark Integration
BlinkDB Prototype
1. Alpha 0.2.0 released and available at http://blinkdb.org
2. Allows you to create samples on native tables and
materialized views
3. Adds approximate aggregate functions with statistical
closed forms to HiveQL
4. Compatible with Apache Hive, Spark and Facebook’s
Presto (storage, serdes, UDFs, types, metadata)
An Open Question
We still haven’t figured out the right user interface for approximate queries:
- Time/Error Bounds?
- Continuous Error Bars?
- Hide Errors Altogether?
- UI/UX Specific?
- Application Specific?
- …
http://blinkdb.org
Native Spark Integration Coming
Soon!

Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Kürzlich hochgeladen (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

BlinkDB: Querying Petabytes of Data in Seconds using Sampling

  • 1. Querying Petabytes of Data in Seconds using Sampling. Sameer Agarwal. Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica. UC Berkeley, MIT
  • 2. Can we do better than in-memory?
  • 3. Can we get more with less?
  • 4. Can fast get faster?
  • 5. Query Execution on Samples. 10 TB on 100 machines. Hard disks: ½ - 1 hour. Memory: 1 - 5 minutes. ?: 1 second
  • 6. Query Execution on Samples. What is the average buffering ratio in the table? Full-table answer: 0.2325. Table rows (ID, City, Buff Ratio): (1, NYC, 0.78), (2, NYC, 0.13), (3, Berkeley, 0.25), (4, NYC, 0.19), (5, NYC, 0.11), (6, Berkeley, 0.09), (7, NYC, 0.18), (8, NYC, 0.15), (9, Berkeley, 0.13), (10, Berkeley, 0.49), (11, NYC, 0.19), (12, Berkeley, 0.10)
  • 7. Query Execution on Samples (same table and question as slide 6). Uniform sample rows (ID, City, Buff Ratio, Sampling Rate): (2, NYC, 0.13, 1/4), (6, Berkeley, 0.25, 1/4), (8, NYC, 0.19, 1/4). Sample answer: 0.19. Full-table answer: 0.2325
  • 8. Query Execution on Samples (same table and question as slide 6). Uniform sample rows (ID, City, Buff Ratio, Sampling Rate): (2, NYC, 0.13, 1/4), (6, Berkeley, 0.25, 1/4), (8, NYC, 0.19, 1/4). Sample answer: 0.19 +/- 0.05. Full-table answer: 0.2325
  • 9. Query Execution on Samples (same table and question as slide 6). Uniform sample rows (ID, City, Buff Ratio, Sampling Rate): (2, NYC, 0.13, 1/2), (3, Berkeley, 0.25, 1/2), (5, NYC, 0.19, 1/2), (6, Berkeley, 0.09, 1/2), (8, NYC, 0.18, 1/2), (12, Berkeley, 0.49, 1/2). Sample answer: 0.22 +/- 0.02 (with the 1/4 sample: 0.19 +/- 0.05). Full-table answer: 0.2325
  • 10. Speed/Accuracy Trade-off. [Figure: error versus execution time (sample size). Interactive queries: 2 sec. Time to execute on entire dataset: 30 mins.]
  • 11. Sampling vs. No Sampling. [Bar chart: query response time (seconds) versus fraction of full data. 1: 1020; 10^-1: 103; 10^-2: 18; 10^-3: 13; 10^-4: 10; 10^-5: 8.] 10x, as response time is dominated by I/O
  • 12. Sampling vs. No Sampling. [Same bar chart as slide 11, with error bars in order of decreasing sample fraction: (0.02%), (0.07%), (1.1%), (3.4%), (11%).]
  • 13. Sampling Error Typically, error depends on sample size (n) and not on original data size, i.e., error is proportional to (1/sqrt(n))*
  • 14. Sampling Error Typically, error depends on sample size (n) and not on original data size, i.e., error is proportional to (1/sqrt(n))* * Conditions Apply
  • 16. Speed/Accuracy Trade-off. SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 1 SECONDS. Result: 234.23 ± 15.32
  • 17. Speed/Accuracy Trade-off. SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 2 SECONDS. Result: 239.46 ± 4.96 (within 1 second: 234.23 ± 15.32)
  • 18. Speed/Accuracy Trade-off. SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 1 SECONDS. AVG, COUNT, SUM, STDEV, PERCENTILE etc.
  • 19. Speed/Accuracy Trade-off. SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 1 SECONDS. FILTERS, GROUP BY clauses
  • 20. Speed/Accuracy Trade-off. SELECT avg(sessionTime) FROM Table LEFT OUTER JOIN logs2 ON very_big_log.id = logs.id WHERE city='San Francisco' WITHIN 1 SECONDS. JOINS, nested queries etc.
  • 21. Speed/Accuracy Trade-off. SELECT my_function(sessionTime) FROM Table LEFT OUTER JOIN logs2 ON very_big_log.id = logs.id WHERE city='San Francisco' WITHIN 1 SECONDS. ML primitives, user-defined functions
  • 22. Speed/Accuracy Trade-off. SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' ERROR 0.1 CONFIDENCE 95.0%
  • 23. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
  • 27. Uniform Samples. [Diagram labels: 2, 4, 1, 3, U.] Input rows (ID, City, Data): (1, NYC, 0.78), (2, NYC, 0.13), (3, Berkeley, 0.25), (4, NYC, 0.19), (5, NYC, 0.11), (6, Berkeley, 0.09), (7, NYC, 0.18), (8, NYC, 0.15), (9, Berkeley, 0.13), (10, Berkeley, 0.49), (11, NYC, 0.19), (12, Berkeley, 0.10)
  • 28. Uniform Samples. Input rows as on slide 27. U: 1. FILTER rand() < 1/3; 2. Adds per-row weights; 3. In-memory shuffle
  • 29. Uniform Samples. Input rows as on slide 27. Output of U (ID, City, Data, Weight): (2, NYC, 0.13, 1/3), (8, NYC, 0.25, 1/3), (6, Berkeley, 0.09, 1/3), (11, NYC, 0.19, 1/3). Doesn’t change Spark RDD semantics
  • 32. Stratified Samples. [Diagram labels: 2, 4, 1, 3, S.] Input rows as on slide 27
  • 33. Stratified Samples. S1: SPLIT. Input rows as on slide 27, shown twice
  • 34. Stratified Samples. S1: input rows as on slide 27. S2: GROUP. Per-city counts (City, Count): (NYC, 7), (Berkeley, 5)
  • 35. Stratified Samples. S1: input rows as on slide 27. S2: GROUP. Per-city counts and ratios (City, Count, Ratio): (NYC, 7, 2/7), (Berkeley, 5, 2/5)
  • 36. Stratified Samples. S1: input rows as on slide 27. S2: JOIN with the per-city table (City, Count, Ratio): (NYC, 7, 2/7), (Berkeley, 5, 2/5)
  • 37. Stratified Samples. [Diagram labels: S1, S2, S2, U.] Output rows (ID, City, Data, Weight): (2, NYC, 0.13, 2/7), (8, NYC, 0.25, 2/7), (6, Berkeley, 0.09, 2/5), (12, Berkeley, 0.49, 2/5). Doesn’t change Shark RDD semantics
  • 38. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
  • 39. Error Estimation Closed Form Aggregate Functions - Central Limit Theorem - Applicable to AVG, COUNT, SUM, VARIANCE and STDEV
  • 41. Error Estimation Closed Form Aggregate Functions - Central Limit Theorem - Applicable to AVG, COUNT, SUM, VARIANCE and STDEV. [Diagram: an aggregate A (AVG, SUM, COUNT, STDEV, VARIANCE) over the sample yields A ± ε.]
  • 42. Error Estimation Generalized Aggregate Functions - Statistical Bootstrap - Applicable to complex and nested queries, UDFs, joins etc.
  • 43. Error Estimation Generalized Aggregate Functions - Statistical Bootstrap - Applicable to complex and nested queries, UDFs, joins etc. [Diagram labels: Sample, A; Sample, A, A1, A2, …, A100; B; ± ε.]
  • 44. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of random and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
  • 45. Error Verification. [Plot: error versus sample size.] More data → higher accuracy. 300 data points → 97% accuracy. [KDD’13] [SIGMOD’14]
  • 47. Single Pass Execution. [Diagram labels: Sample, A.] Sample rows (State, Age, Metric, Weight): (CA, 20, 1971, 1/4), (CA, 22, 2819, 1/4), (MA, 22, 3819, 1/4), (MA, 30, 3091, 1/4)
  • 48. Single Pass Execution. [Diagram labels: Sample, R, A.] Resampling Operator. Sample rows as on slide 47
  • 49. Single Pass Execution. [Diagram labels: Sample, A, R.] Sample “Pushdown”. Sample rows as on slide 47
  • 50. Single Pass Execution. [Diagram labels: Sample, A, R.] Sample “Pushdown”. Rows (Metric, Weight): (1971, 1/4), (3819, 1/4)
  • 51. Single Pass Execution. [Diagram labels: Sample, A, R.] Resampling Operator. Rows (Metric, Weight): (1971, 1/4), (3819, 1/4)
  • 52. Single Pass Execution. [Diagram labels: Sample, A, R.] Resampling Operator. Input rows (Metric, Weight): (1971, 1/4), (3819, 1/4). Resampled tables (Metric, Weight): (1971, 1/4), (1971, 1/4) and (3819, 1/4), (3819, 1/4)
  • 53. Single Pass Execution. [Diagram labels: Sample, A, R; A, A1, …, An.] Input rows and resampled tables as on slide 52
  • 55. Single Pass Execution. [Diagram labels: Sample, A.] Rows (Metric, Weight): (1971, 1/4), (3819, 1/4). Leverage Poissonized Resampling to generate samples with replacement
  • 56. Single Pass Execution. [Diagram labels: Sample, A, A1.] Rows (Metric, Weight, S1): (1971, 1/4, 2), (3819, 1/4, 1). Sample from a Poisson(1) distribution
  • 57. Single Pass Execution. [Diagram labels: Sample, A, A1, Ak.] Rows (Metric, Weight, S1, Sk): (1971, 1/4, 2, 1), (3819, 1/4, 1, 0). Construct all resamples in a single pass
  • 58. Single Pass Execution. [Diagram labels: Sample, A.] Columns: S (SAMPLE); S1 … S100 (BOOTSTRAP WEIGHTS); Da1 … Da100, Db1 … Db100, Dc1 … Dc100 (DIAGNOSTICS WEIGHTS)
  • 59. Single Pass Execution. Columns as on slide 58. Additional overhead: 200 bytes/row
  • 60. Single Pass Execution. Columns as on slide 58. Embarrassingly parallel
  • 61. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of random and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
  • 62. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime [Offline Process]
  • 63. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime [Online Process]
  • 64. BlinkDB Architecture. [Diagram labels: TABLE, Sampling Module, Original Data.] Offline sampling: creates an optimal set of samples on native tables and materialized views based on query history and […]
  • 65. BlinkDB Architecture. [Diagram labels: TABLE, Sampling Module, In-Memory Samples, On-Disk Samples, Original Data.] Sample placement: samples striped over 100s or 1,000s of machines, both on disks and in memory
  • 66. BlinkDB Architecture. [Diagram labels: HiveQL/SQL Query (SELECT foo (*) FROM TABLE WITHIN 2), Query Plan, Sample Selection, TABLE, Sampling Module, In-Memory Samples, On-Disk Samples, Original Data.]
  • 67. BlinkDB Architecture. [Diagram labels as on slide 66.] Online sample selection to pick best sample(s) based on query latency and accuracy requirements
  • 68. BlinkDB Architecture. [Diagram labels: HiveQL/SQL Query (SELECT foo (*) FROM TABLE WITHIN 2), Sample Selection, New Query Plan, Hive/Shark/Presto, TABLE, Sampling Module, In-Memory Samples, On-Disk Samples, Original Data, Error Bars & Confidence Intervals, Result: 182.23 ± 5.56 (95% confidence).] Parallel query execution on multiple samples striped across […]
  • 69. BlinkDB is Fast! - 5 Queries, 5 machines - 20 GB samples (0.001%-1% of original data) - 1-5% Error
  • 73. Coming Soon: Native Spark Integration
  • 74. BlinkDB Prototype 1. Alpha 0.2.0 released and available at http://blinkdb.org 2. Allows you to create samples on native tables and materialized views 3. Adds approximate aggregate functions with statistical closed forms to HiveQL 4. Compatible with Apache Hive, Spark and Facebook’s Presto (storage, serdes, UDFs, types, metadata)
  • 75. An Open Question. We still haven’t figured out the right user interface for approximate queries: - Time/Error Bounds? - Continuous Error Bars? - Hide Errors Altogether? - UI/UX Specific? - Application Specific? - …
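
The Poissonized resampling of slides 55 to 57 can be sketched roughly as follows. This is a minimal illustration in plain Python, not BlinkDB's implementation: the function names, the choice of the mean as the aggregate, and the percentile-based error bar are all assumptions made for the example, and the per-row sampling weight is ignored because the input is treated as a uniform sample. Each row draws one Poisson(1) count per resample, so all resamples are accumulated in a single pass over the data.

```python
import math
import random


def poisson_one(rng):
    """Draw from a Poisson(1) distribution (Knuth's multiplication method)."""
    threshold = math.exp(-1.0)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poissonized_bootstrap_mean(values, num_resamples=100, confidence=0.95, seed=0):
    """Return (mean, error half-width) using Poissonized bootstrap resamples.

    Hypothetical helper for illustration only; not part of any BlinkDB API.
    """
    rng = random.Random(seed)
    sums = [0.0] * num_resamples
    sizes = [0] * num_resamples
    for value in values:                   # single pass over the sample
        for k in range(num_resamples):
            c = poisson_one(rng)           # multiplicity of this row in resample k
            sums[k] += c * value
            sizes[k] += c
    # Skip the (rare) resamples that ended up empty.
    means = sorted(s / n for s, n in zip(sums, sizes) if n > 0)
    if not means:
        raise ValueError("all resamples were empty")
    alpha = 1.0 - confidence
    lo = means[int((alpha / 2) * (len(means) - 1))]
    hi = means[int((1 - alpha / 2) * (len(means) - 1))]
    return sum(values) / len(values), (hi - lo) / 2


# Buffering ratios from the 12-row table on slide 6.
ratios = [0.78, 0.13, 0.25, 0.19, 0.11, 0.09, 0.18, 0.15, 0.13, 0.49, 0.19, 0.10]
estimate, half_width = poissonized_bootstrap_mean(ratios)
print(f"{estimate:.4f} +/- {half_width:.4f}")
```

Because each (row, resample) pair is independent, the inner loop can be split across rows or across resamples, which is one way to read the "Embarrassingly Parallel" remark on slide 60.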

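The stratified sampling steps of slides 32 to 37 (count rows per group, derive a per-group ratio, join the ratio back onto each row, then filter) can be sketched as below. This is a hedged illustration in plain Python under stated assumptions: `stratified_sample`, `weighted_avg`, and the `cap` parameter are hypothetical names, and the "weight" field follows the slides in storing the sampling rate, so each sampled row stands for 1/weight original rows.

```python
import random
from collections import Counter


def stratified_sample(rows, key, cap, seed=0):
    """Keep roughly `cap` rows per distinct value of `key` (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(row[key] for row in rows)                   # GROUP: rows per stratum
    ratios = {k: min(1.0, cap / n) for k, n in counts.items()}   # e.g. 2/7 and 2/5
    sample = []
    for row in rows:                                             # JOIN ratio onto each row
        ratio = ratios[row[key]]
        if rng.random() < ratio:                                 # filter at the stratum's rate
            sample.append({**row, "weight": ratio})
    return sample


def weighted_avg(sample, column):
    """Average that corrects for the unequal sampling rates of the strata."""
    if not sample:
        raise ValueError("empty sample")
    numerator = sum(r[column] / r["weight"] for r in sample)
    denominator = sum(1.0 / r["weight"] for r in sample)
    return numerator / denominator


# The 12-row table from the slides.
ROWS = [
    {"id": 1, "city": "NYC", "data": 0.78}, {"id": 2, "city": "NYC", "data": 0.13},
    {"id": 3, "city": "Berkeley", "data": 0.25}, {"id": 4, "city": "NYC", "data": 0.19},
    {"id": 5, "city": "NYC", "data": 0.11}, {"id": 6, "city": "Berkeley", "data": 0.09},
    {"id": 7, "city": "NYC", "data": 0.18}, {"id": 8, "city": "NYC", "data": 0.15},
    {"id": 9, "city": "Berkeley", "data": 0.13}, {"id": 10, "city": "Berkeley", "data": 0.49},
    {"id": 11, "city": "NYC", "data": 0.19}, {"id": 12, "city": "Berkeley", "data": 0.10},
]

sample = stratified_sample(ROWS, key="city", cap=2)
if sample:
    print(weighted_avg(sample, "data"))
```

With `cap=2` the per-city ratios come out as 2/7 for NYC and 2/5 for Berkeley, matching slide 35. The filter is probabilistic, so the number of rows kept per city varies from run to run and seed to seed.
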
Editor's Notes

  1. Now, with Shark and Spark really pushing the limits of in-memory computation, one of the natural questions that comes to mind is: “Can we do better than in-memory?” Being better than in-memory could mean one or both of these two things:
  2. And if we focus on this error, the error has an amazing statistical property. Error decreases with Moore’s law: it halves every 36 months!
  3. Original data: 2 TB to 2 PB
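
The statistical property mentioned in the notes, together with the closed-form (Central Limit Theorem) error bars of slide 39, can be illustrated with a small sketch. It assumes a simple uniform sample and a normal approximation; `mean_with_error` and the `z` parameter are hypothetical names chosen for this example. Since the error scales as 1/sqrt(n), quadrupling the sample size roughly halves the error bar.

```python
import math
import statistics


def mean_with_error(sample, z=1.96):
    """Return (mean, half-width) of a CLT-based interval; z=1.96 is roughly 95%."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two rows to estimate the spread")
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)   # shrinks as 1/sqrt(n)
    return mean, z * std_err


# Values of the 1/2 uniform sample shown on slide 9.
half_sample = [0.13, 0.25, 0.19, 0.09, 0.18, 0.49]
mean, err = mean_with_error(half_sample)
print(f"{mean:.2f} +/- {err:.2f}")

# Four times as many rows with the same spread: the error bar is about half as wide.
_, err_4x = mean_with_error(half_sample * 4)
print(f"error ratio after 4x more rows: {err_4x / err:.2f}")
```

The interval depends only on the sample's size and spread, not on the size of the original table, which is the point of slide 13.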