SlideShare ist ein Scribd-Unternehmen logo
1 von 21
Downloaden Sie, um offline zu lesen
Data profiling
with Apache Calcite
DataWorks Summit 2017
SAN JOSE, USA
2017/06/14
Julian Hyde
@julianhyde
SQL
Query planning
Query federation
OLAP
Streaming
Hadoop
ASF member
Original author of Apache Calcite
PMC Apache Arrow, Drill, Eagle, Kylin
Overview
Apache Calcite
Motivating problem: Automatically designing summary tables
What is data profiling?
Naive profiling algorithm
Improving the algorithm using sketches, parallelism, information theory
Applying data profiling to other problems
Apache Calcite
Apache top-level project since October, 2015
Query planning framework
➢ Relational algebra, rewrite rules
➢ Cost model & statistics
➢ Federation via adapters
➢ Extensible
Packaging
➢ Library
➢ Optional SQL parser, JDBC server
➢ Community-authored rules, adapters
Planning queries
MySQL
Splunk
join
Key: productId
group
Key: productName
Agg: count
filter
Condition:
action = 'purchase'
sort
Key: c desc
scan
scan
Table: products
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
Table: splunk
Optimized query
MySQL
Splunk
join
Key: productId
group
Key: productName
Agg: count
filter
Condition:
action = 'purchase'
sort
Key: c desc
scan
scan
Table: splunk
Table: products
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
Optimizing queries
Problem
10 TB database, disk with 1 GB/s throughput, and a query that reads 1 TB data.
Solutions
1. Sequential scan Query takes 1,000s.
2. Parallelize Spread the data over 100 disks in 25 machines. Query takes 10s.
3. Cache Keep the data in memory. 2nd query: 10ms. 3rd query: 10s.
4. Materialize Summarize the data on disk. All queries: 100ms.
5. Materialize + cache + adapt As above, building summaries on demand.
Optimizing data
A materialized view (“materialization”) is a table that contains the result of a
query. The DBMS maintains it, and uses it to answer queries on other tables.
Challenges:
● Design Which materializations to create?
● Populate Load them with data
● Maintain Incrementally populate when data changes
● Rewrite Transparently rewrite queries to use materializations
● Adapt Design and populate new materializations, drop unused ones
● Express Need a rich algebra, to model how data is derived
create materialized view EmpSummary as
select deptno, COUNT(*) as c, SUM(sal) as s
from Emp
group by deptno
Lattice
Key
z zipcode (43k)
s state (50)
g gender (2)
y year (5)
m month (12)
() 1
(z, s, g, y,
m) 912k
(s, g, y,
m) 6k
(z) 43k (s) 50 (g) 2 (y) 5 (m) 12
raw 1m
(y, m)
60
(g, y) 10
(z, s)
43.4k
(g, y, m)
120
Fewer than you would
expect, because 5m
combinations cannot
occur in 1m row table
Fewer than you
would expect,
because state
depends on zipcode
Algorithm: Design summary tables
Given a database with 30 columns, 10M rows. Find X summary tables with under
Y rows that improve query response time the most.
AdaptiveMonteCarlo algorithm [1]:
● Based on research [2]
● Greedy algorithm that takes a combination of summary tables and tries to
find the table that yields the greatest cost/benefit improvement
● Models “benefit” of the table as query time saved over simulated query load
● The “cost” of a table is its size
[1] org.pentaho.aggdes.algorithm.impl.AdaptiveMonteCarloAlgorithm
[2] Harinarayan, Rajaraman, Ullman (1996). “Implementing data cubes efficiently”
Lattice (optimized) () 1
(z, s, g, y,
m) 912k
(s, g, y,
m) 6k
(z) 43k (s) 50 (g) 2 (y) 5 (m) 12
(z, g, y,
m) 909k
(z, s, y,
m) 831k
raw 1m
(z, s, g,
m) 644k
(z, s, g,
y) 392k
(y, m)
60
(z, s)
43.4k
(z, s, g)
83.6k
(g, y) 10
(g, y, m)
120
(g, m)
24
Key
z zipcode (43k)
s state (50)
g gender (2)
y year (5)
m month (12)
Lattice (optimized) () 1
(z, s, g, y,
m) 912k
(s, g, y,
m) 6k
(z) 43k (s) 50 (g) 2 (y) 5 (m) 12
(z, g, y,
m) 909k
(z, s, y,
m) 831k
raw 1m
(z, s, g,
m) 644k
(z, s, g,
y) 392k
(y, m)
60
(z, s)
43.4k
(z, s, g)
83.6k
(g, y) 10
(g, y, m)
120
(g, m)
24
Key
z zipcode (43k)
s state (50)
g gender (2)
y year (5)
m month (12)
Aggregate Cost
(rows)
Benefit (query
rows saved)
% queries
s, g, y, m 6k 497k 50%
z, s, g 87k 304k 33%
g, y 10 1.5k 25%
g, m 24 1.5k 25%
s, g 100 1.5k 25%
y, m 60 1.5k 25%
Data profiling
Algorithm needs count(distinct a, b, ...) for each combination of attributes:
● Previous example had 25
= 32 possible tables
● Schema with 30 attributes has 230
(about 109
) possible tables
● Algorithm considers a significant fraction of these
● Approximations are OK
Attempts to solve the profiling problem:
1. Compute each combination: scan, sort, unique, count; repeat 230
times!
2. Sketches (HyperLogLog)
3. Sketches + parallelism + information theory (CALCITE-1616)
Sketches
HyperLogLog is an algorithm that computes
approximate distinct count. It can estimate
cardinalities of 109
with a typical error rate of
2%, using 1.5 kB of memory. [3][4]
With 16 MB memory per machine we can
compute 10,000 combinations of attributes
each pass.
So, we’re down from 109
to 105
passes.
[3] Flajolet, Fusy, Gandouet, Meunier (2007). "Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm"
[4] https://github.com/mrjgreen/HyperLogLog
Given Expected cardinality Actual cardinality Surprise
(gender): 2 (state): 50 (gender, state): 100.0 100 0.000
(month): 12 (zipcode): 43,000 (month, zipcode): 441,699.3 442,700 0.001
(state): 50 (zipcode): 43,000 (state, zipcode): 799,666.7 43,400 0.897
(state, zipcode): 43,400
(gender, state): 100
(gender, zipcode): 85,995
(gender, state, zipcode): 86,799
= min(86,799, 892,234, 892,228)
83,567 0.019
● Surprise = abs(actual - expected) / (actual + expected)
● E(card (x, y)) = n . (1 - ((n - 1) / n) ^ p) n = card (x) * card (y), p = row count
Combining probability & information theory
Algorithm
Three ways “surprise” can help:
● If a cardinality is not
surprising, we don’t need to
store it -- we can derive it
● If a combination’s cardinality
is not surprising, it is unlikely
to have surprising children
● If we’re not seeing surprising
results, it’s time to stop
surprise_threshold := 1
queue := {singleton combinations} // (a), (b), ...
while queue is not empty {
batch := remove first 10,000 entries in queue
compute cardinality of each combination in batch
for each actual (computed) cardinality a {
e := expected cardinality of combination
s := surprise(a, e)
if s > surprise_threshold {
store combination and its cardinality
add child combinations to queue // (x, a), (x, b), ...
}
increase surprise_threshold
}
}
Algorithm progress and “surprise” threshold
Progress of algorithm
Rejected as not
sufficiently
surprising
Surprise
threshold rises
as algorithm
progresses
Singleton
combinations
are have surprise
= 1
Surprise thresold
rises after we
hve completed
the first batch
Hierarchies considered
harmful
Hierarchies are a feature of most OLAP systems
Does it makes sense to store (year, quarter,
month, date) and roll up to (year, quarter)?
No -- algorithm can deduce hierarchies; less
configuration means fewer mistakes
Summary optimizer naturally includes attributes
that don’t increase summary cardinality by much
Feel free to specify a “drill path” in slice & dice UI
True hierarchy
(year)
↑
(year, quarter)
↑
(year, quarter, month)
↑
(year, quarter, month, date)
Almost a hierarchy
(nation)
↑
(nation, state)
↑
(nation, state, zipcode)
Other applications of data profiling
Query optimization:
● Planners are poor at estimating selectivity of conditions after N-way join
(especially on real data)
● New join-order benchmark: “Movies made by French directors tend to have
French actors”
● Predict number of reducers in MapReduce & Spark
“Grokking” a data set
Identifying problems in normalization, partitioning, quality
Applications in machine learning?
Further improvements
● Build sketches in parallel
● Run algorithm in a distributed framework (Spark or MapReduce)
● Compute histograms
○ For example, Median age for male/female customers
● Seek out functional dependencies
○ Once you know FDs, a lot of cardinalities are no longer “surprising”
○ FDs occur in denormalized tables, e.g. star schemas
● Smarter criteria for stopping algorithm
● Skew/heavy hitters. Are some values much more frequent than others?
● Conditional cardinalities and functional dependencies
○ Does one partition of the data behave differently from others? (e.g. year=2005, state=LA)
Thank you!
https://issues.apache.org/jira/browse/CALCITE-1788
https://calcite.apache.org
@ApacheCalcite
@julianhyde

Weitere ähnliche Inhalte

Was ist angesagt?

Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
DataWorks Summit
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 

Was ist angesagt? (20)

Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
 
Apache hive
Apache hiveApache hive
Apache hive
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Dynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationDynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisation
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
 
Managing your Hadoop Clusters with Apache Ambari
Managing your Hadoop Clusters with Apache AmbariManaging your Hadoop Clusters with Apache Ambari
Managing your Hadoop Clusters with Apache Ambari
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Incremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergIncremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and Iceberg
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its Community
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 

Ähnlich wie Data profiling with Apache Calcite

Scaling Up: How Switching to Apache Spark Improved Performance, Realizability...
Scaling Up: How Switching to Apache Spark Improved Performance, Realizability...Scaling Up: How Switching to Apache Spark Improved Performance, Realizability...
Scaling Up: How Switching to Apache Spark Improved Performance, Realizability...
Databricks
 

Ähnlich wie Data profiling with Apache Calcite (20)

Data Profiling in Apache Calcite
Data Profiling in Apache CalciteData Profiling in Apache Calcite
Data Profiling in Apache Calcite
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!
 
Lazy beats Smart and Fast
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and Fast
 
Scaling Up: How Switching to Apache Spark Improved Performance, Realizability...
Scaling Up: How Switching to Apache Spark Improved Performance, Realizability...Scaling Up: How Switching to Apache Spark Improved Performance, Realizability...
Scaling Up: How Switching to Apache Spark Improved Performance, Realizability...
 
Scaling up data science applications
Scaling up data science applicationsScaling up data science applications
Scaling up data science applications
 
Scaling up data science applications
Scaling up data science applicationsScaling up data science applications
Scaling up data science applications
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
Exploratory Analysis Part1 Coursera DataScience Specialisation
Exploratory Analysis Part1 Coursera DataScience SpecialisationExploratory Analysis Part1 Coursera DataScience Specialisation
Exploratory Analysis Part1 Coursera DataScience Specialisation
 
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
 
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
 
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of Indifference
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of IndifferenceRob Sullivan at Heroku's Waza 2013: Your Database -- A Story of Indifference
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of Indifference
 
SQL Windowing
SQL WindowingSQL Windowing
SQL Windowing
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
 
Scaling PostgreSQL With GridSQL
Scaling PostgreSQL With GridSQLScaling PostgreSQL With GridSQL
Scaling PostgreSQL With GridSQL
 
Spark_Documentation_Template1
Spark_Documentation_Template1Spark_Documentation_Template1
Spark_Documentation_Template1
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
 
Amazon SageMaker을 통한 손쉬운 Jupyter Notebook 활용하기 - 윤석찬 (AWS 테크에반젤리스트)
Amazon SageMaker을 통한 손쉬운 Jupyter Notebook 활용하기  - 윤석찬 (AWS 테크에반젤리스트)Amazon SageMaker을 통한 손쉬운 Jupyter Notebook 활용하기  - 윤석찬 (AWS 테크에반젤리스트)
Amazon SageMaker을 통한 손쉬운 Jupyter Notebook 활용하기 - 윤석찬 (AWS 테크에반젤리스트)
 
String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?
 
Spatially resolved pair correlation functions for point cloud data
Spatially resolved pair correlation functions for point cloud dataSpatially resolved pair correlation functions for point cloud data
Spatially resolved pair correlation functions for point cloud data
 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
 

Mehr von Julian Hyde

Mehr von Julian Hyde (20)

Building a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteBuilding a semantic/metrics layer using Calcite
Building a semantic/metrics layer using Calcite
 
Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!
 
Adding measures to Calcite SQL
Adding measures to Calcite SQLAdding measures to Calcite SQL
Adding measures to Calcite SQL
 
Morel, a data-parallel programming language
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming language
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
 
What to expect when you're Incubating
What to expect when you're IncubatingWhat to expect when you're Incubating
What to expect when you're Incubating
 
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
 
Efficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databasesEfficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databases
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
 
Tactical data engineering
Tactical data engineeringTactical data engineering
Tactical data engineering
 
Spatial query on vanilla databases
Spatial query on vanilla databasesSpatial query on vanilla databases
Spatial query on vanilla databases
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
 
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 

Kürzlich hochgeladen

%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
masabamasaba
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
masabamasaba
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
VictoriaMetrics
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
masabamasaba
 

Kürzlich hochgeladen (20)

WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
 
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open SourceWSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 

Data profiling with Apache Calcite

  • 1. Data profiling with Apache Calcite DataWorks Summit 2017 SAN JOSE, USA 2017/06/14 Julian Hyde
  • 2. @julianhyde SQL Query planning Query federation OLAP Streaming Hadoop ASF member Original author of Apache Calcite PMC Apache Arrow, Drill, Eagle, Kylin
  • 3. Overview Apache Calcite Motivating problem: Automatically designing summary tables What is data profiling? Naive profiling algorithm Improving the algorithm using sketches, parallelism, information theory Applying data profiling to other problems
  • 4. Apache Calcite Apache top-level project since October, 2015 Query planning framework ➢ Relational algebra, rewrite rules ➢ Cost model & statistics ➢ Federation via adapters ➢ Extensible Packaging ➢ Library ➢ Optional SQL parser, JDBC server ➢ Community-authored rules, adapters
  • 5. Planning queries MySQL Splunk join Key: productId group Key: productName Agg: count filter Condition: action = 'purchase' sort Key: c desc scan scan Table: products select p.productName, count(*) as c from splunk.splunk as s join mysql.products as p on s.productId = p.productId where s.action = 'purchase' group by p.productName order by c desc Table: splunk
  • 6. Optimized query MySQL Splunk join Key: productId group Key: productName Agg: count filter Condition: action = 'purchase' sort Key: c desc scan scan Table: splunk Table: products select p.productName, count(*) as c from splunk.splunk as s join mysql.products as p on s.productId = p.productId where s.action = 'purchase' group by p.productName order by c desc
  • 7. Optimizing queries Problem 10 TB database, disk with 1 GB/s throughput, and a query that reads 1 TB data. Solutions 1. Sequential scan Query takes 1,000s. 2. Parallelize Spread the data over 100 disks in 25 machines. Query takes 10s. 3. Cache Keep the data in memory. 2nd query: 10ms. 3rd query: 10s. 4. Materialize Summarize the data on disk. All queries: 100ms. 5. Materialize + cache + adapt As above, building summaries on demand.
  • 8. Optimizing data A materialized view (“materialization”) is a table that contains the result of a query. The DBMS maintains it, and uses it to answer queries on other tables. Challenges: ● Design Which materializations to create? ● Populate Load them with data ● Maintain Incrementally populate when data changes ● Rewrite Transparently rewrite queries to use materializations ● Adapt Design and populate new materializations, drop unused ones ● Express Need a rich algebra, to model how data is derived create materialized view EmpSummary as select deptno, COUNT(*) as c, SUM(sal) as s from Emp group by deptno
  • 9. Lattice Key z zipcode (43k) s state (50) g gender (2) y year (5) m month (12) () 1 (z, s, g, y, m) 912k (s, g, y, m) 6k (z) 43k (s) 50 (g) 2 (y) 5 (m) 12 raw 1m (y, m) 60 (g, y) 10 (z, s) 43.4k (g, y, m) 120 Fewer than you would expect, because 5m combinations cannot occur in 1m row table Fewer than you would expect, because state depends on zipcode
  • 10. Algorithm: Design summary tables Given a database with 30 columns, 10M rows. Find X summary tables with under Y rows that improve query response time the most. AdaptiveMonteCarlo algorithm [1]: ● Based on research [2] ● Greedy algorithm that takes a combination of summary tables and tries to find the table that yields the greatest cost/benefit improvement ● Models “benefit” of the table as query time saved over simulated query load ● The “cost” of a table is its size [1] org.pentaho.aggdes.algorithm.impl.AdaptiveMonteCarloAlgorithm [2] Harinarayan, Rajaraman, Ullman (1996). “Implementing data cubes efficiently”
  • 11. Lattice (optimized) () 1 (z, s, g, y, m) 912k (s, g, y, m) 6k (z) 43k (s) 50 (g) 2 (y) 5 (m) 12 (z, g, y, m) 909k (z, s, y, m) 831k raw 1m (z, s, g, m) 644k (z, s, g, y) 392k (y, m) 60 (z, s) 43.4k (z, s, g) 83.6k (g, y) 10 (g, y, m) 120 (g, m) 24 Key z zipcode (43k) s state (50) g gender (2) y year (5) m month (12)
  • 12. Lattice (optimized) () 1 (z, s, g, y, m) 912k (s, g, y, m) 6k (z) 43k (s) 50 (g) 2 (y) 5 (m) 12 (z, g, y, m) 909k (z, s, y, m) 831k raw 1m (z, s, g, m) 644k (z, s, g, y) 392k (y, m) 60 (z, s) 43.4k (z, s, g) 83.6k (g, y) 10 (g, y, m) 120 (g, m) 24 Key z zipcode (43k) s state (50) g gender (2) y year (5) m month (12) Aggregate Cost (rows) Benefit (query rows saved) % queries s, g, y, m 6k 497k 50% z, s, g 87k 304k 33% g, y 10 1.5k 25% g, m 24 1.5k 25% s, g 100 1.5k 25% y, m 60 1.5k 25%
  • 13. Data profiling Algorithm needs count(distinct a, b, ...) for each combination of attributes: ● Previous example had 25 = 32 possible tables ● Schema with 30 attributes has 230 (about 109 ) possible tables ● Algorithm considers a significant fraction of these ● Approximations are OK Attempts to solve the profiling problem: 1. Compute each combination: scan, sort, unique, count; repeat 230 times! 2. Sketches (HyperLogLog) 3. Sketches + parallelism + information theory (CALCITE-1616)
  • 14. Sketches HyperLogLog is an algorithm that computes approximate distinct count. It can estimate cardinalities of 109 with a typical error rate of 2%, using 1.5 kB of memory. [3][4] With 16 MB memory per machine we can compute 10,000 combinations of attributes each pass. So, we’re down from 109 to 105 passes. [3] Flajolet, Fusy, Gandouet, Meunier (2007). "Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm" [4] https://github.com/mrjgreen/HyperLogLog
  • 15. Given Expected cardinality Actual cardinality Surprise (gender): 2 (state): 50 (gender, state): 100.0 100 0.000 (month): 12 (zipcode): 43,000 (month, zipcode): 441,699.3 442,700 0.001 (state): 50 (zipcode): 43,000 (state, zipcode): 799,666.7 43,400 0.897 (state, zipcode): 43,400 (gender, state): 100 (gender, zipcode): 85,995 (gender, state, zipcode): 86,799 = min(86,799, 892,234, 892,228) 83,567 0.019 ● Surprise = abs(actual - expected) / (actual + expected) ● E(card (x, y)) = n . (1 - ((n - 1) / n) ^ p) n = card (x) * card (y), p = row count Combining probability & information theory
  • 16. Algorithm Three ways “surprise” can help: ● If a cardinality is not surprising, we don’t need to store it -- we can derive it ● If a combination’s cardinality is not surprising, it is unlikely to have surprising children ● If we’re not seeing surprising results, it’s time to stop surprise_threshold := 1 queue := {singleton combinations} // (a), (b), ... while queue is not empty { batch := remove first 10,000 entries in queue compute cardinality of each combination in batch for each actual (computed) cardinality a { e := expected cardinality of combination s := surprise(a, e) if s > surprise_threshold { store combination and its cardinality add child combinations to queue // (x, a), (x, b), ... } increase surprise_threshold } }
  • 17. Algorithm progress and “surprise” threshold Progress of algorithm Rejected as not sufficiently surprising Surprise threshold rises as algorithm progresses Singleton combinations are have surprise = 1 Surprise thresold rises after we hve completed the first batch
  • 18. Hierarchies considered harmful Hierarchies are a feature of most OLAP systems Does it makes sense to store (year, quarter, month, date) and roll up to (year, quarter)? No -- algorithm can deduce hierarchies; less configuration means fewer mistakes Summary optimizer naturally includes attributes that don’t increase summary cardinality by much Feel free to specify a “drill path” in slice & dice UI True hierarchy (year) ↑ (year, quarter) ↑ (year, quarter, month) ↑ (year, quarter, month, date) Almost a hierarchy (nation) ↑ (nation, state) ↑ (nation, state, zipcode)
  • 19. Other applications of data profiling Query optimization: ● Planners are poor at estimating selectivity of conditions after N-way join (especially on real data) ● New join-order benchmark: “Movies made by French directors tend to have French actors” ● Predict number of reducers in MapReduce & Spark “Grokking” a data set Identifying problems in normalization, partitioning, quality Applications in machine learning?
  • 20. Further improvements ● Build sketches in parallel ● Run algorithm in a distributed framework (Spark or MapReduce) ● Compute histograms ○ For example, Median age for male/female customers ● Seek out functional dependencies ○ Once you know FDs, a lot of cardinalities are no longer “surprising” ○ FDs occur in denormalized tables, e.g. star schemas ● Smarter criteria for stopping algorithm ● Skew/heavy hitters. Are some values much more frequent than others? ● Conditional cardinalities and functional dependencies ○ Does one partition of the data behave differently from others? (e.g. year=2005, state=LA)