Bucketing is a popular data partitioning technique to pre-shuffle and (optionally) pre-sort data during writes. This is ideal for a variety of write-once and read-many datasets at Facebook, where Spark can automatically avoid expensive shuffles/sorts (when the underlying data is joined/aggregated on its bucketed keys) resulting in substantial savings in both CPU and IO.
Over the last year, we’ve added a series of optimizations in Apache Spark as a means towards achieving feature parity between Hive and Spark. These include avoiding shuffle/sort when joining/aggregating/inserting on tables with mismatching buckets, allowing users to skip shuffle/sort when writing to bucketed tables, and adding data validators before writing bucketed data, among many others. As a direct consequence of these efforts, we’ve witnessed over 10x growth (spanning 40% of total compute) in queries that read one or more bucketed tables across the entire data warehouse at Facebook.
In this talk, we’ll take a deep dive into the internals of bucketing support in SparkSQL, describe use-cases where bucketing is useful, touch upon some of the ongoing work to automatically suggest bucketing tables based on query column lineage, and summarize the lessons learned from developing bucketing support in Spark at Facebook over the last 2 years.
3. About me
Cheng Su
• Software Engineer at Facebook (Data Infrastructure Organization)
• Working in Spark team
• Previously worked in Hive/Corona team
3#UnifiedDataAnalytics #SparkAISummit
4. Agenda
• Spark at Facebook
• What is Bucketing
• Spark Bucketing Optimizations (JIRA: SPARK-19256)
• Bucketing Compatibility across SQL Engines
• The Road Ahead
7. What is Bucketing (query plan)
SQL query to create bucketed table:
CREATE TABLE user
(id INT, info STRING)
CLUSTERED BY (id)
SORTED BY (id)
INTO 8 BUCKETS
SQL query to write bucketed table:
INSERT OVERWRITE
TABLE user
SELECT id, info
FROM . . .
WHERE . . .
Query plan to write bucketed table:
InsertIntoTable
Sort(id)
ShuffleExchange(id, 8, HashFunc)
. . .
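The write plan above can be sketched in a few lines of plain Python: rows are first routed to one of 8 buckets by a hash of id (the shuffle step), and each bucket is then sorted by id before it is written. This is only an illustration under stated assumptions: `bucket_id` uses a simple modulo as a stand-in for the real hash function, and `write_bucketed` is a hypothetical helper, not a Spark API.

```python
from collections import defaultdict

NUM_BUCKETS = 8  # matches "INTO 8 BUCKETS" in the CREATE TABLE statement


def bucket_id(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    # Stand-in hash (assumption): a real engine would apply Hive hash or murmur3.
    return key % num_buckets


def write_bucketed(rows, num_buckets: int = NUM_BUCKETS):
    """Route rows to buckets by id, then sort every bucket by id."""
    buckets = defaultdict(list)
    for row_id, info in rows:  # corresponds to the shuffle step of the plan
        buckets[bucket_id(row_id, num_buckets)].append((row_id, info))
    # Corresponds to Sort(id): order rows inside each bucket before writing.
    return {b: sorted(buckets[b]) for b in range(num_buckets)}


rows = [(9, "i"), (1, "a"), (17, "q"), (4, "d"), (8, "h"), (0, "z")]
table = write_bucketed(rows)
print(table[1])  # rows whose id maps to bucket 1, in id order
```

Because every row with the same id lands in the same bucket, later reads that join or group by id can work bucket by bucket.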
9. Spark Bucketing Optimizations (join)
Avoid shuffle and sort when sort-merge-joining bucketed tables
SQL query to join tables:
SELECT . . .
FROM left L
JOIN right R
ON L.id = R.id
Query plan without the optimization:
SortMergeJoin
Left: Sort(id) <- Shuffle(id) <- TableScan(L)
Right: Sort(id) <- Shuffle(id) <- TableScan(R)
Query plan to sort-merge-join two bucketed tables with same buckets:
SortMergeJoin
Left: TableScan(L)
Right: TableScan(R)
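Why the shuffle and sort can be dropped is easiest to see in a small sketch. When both tables use the same hash function and the same number of buckets, bucket b of L can only match bucket b of R, and since each bucket is already sorted by id, a plain merge join per bucket pair is enough. The code below is an illustration only; `merge_join` and `bucketed_join` are hypothetical helpers, and the sample data assumes bucket id = id % 2.

```python
def merge_join(left, right):
    """Inner-join two lists of (id, value) rows that are already sorted by id."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            # Find the run of equal ids on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == key:
                j_end += 1
            # Pair that run with every left row carrying the same id.
            while i < len(left) and left[i][0] == key:
                out.extend((key, left[i][1], r[1]) for r in right[j:j_end])
                i += 1
            j = j_end
    return out


def bucketed_join(left_buckets, right_buckets):
    # Same bucket count on both sides: join bucket b with bucket b, no shuffle.
    assert len(left_buckets) == len(right_buckets)
    result = []
    for lb, rb in zip(left_buckets, right_buckets):
        result.extend(merge_join(lb, rb))
    return result


# Two hypothetical 2-bucket tables, each bucket sorted by id.
L = [[(0, "l0"), (2, "l2")], [(1, "l1"), (3, "l3")]]
R = [[(2, "r2"), (4, "r4")], [(1, "r1a"), (1, "r1b")]]
joined = bucketed_join(L, R)
print(joined)
```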
12. Spark Bucketing Optimizations (join)
Avoid shuffle when shuffled-hash-joining bucketed tables
SQL query to join tables:
SELECT . . .
FROM left L
JOIN right R
ON L.id = R.id
Query plan without the optimization:
ShuffledHashJoin
Left: Shuffle(id) <- TableScan(L)
Right: Shuffle(id) <- TableScan(R)
Query plan to shuffled-hash-join two bucketed tables with same buckets:
ShuffledHashJoin
Left: TableScan(L)
Right: TableScan(R)
15. Spark Bucketing Optimizations (join)
Avoid shuffle and sort of the bucketed table when joining non-bucketed and bucketed tables
SQL query to join tables:
SELECT . . .
FROM left L
JOIN right R
ON L.id = R.id
Query plan to sort-merge-join non-bucketed table (L) with bucketed table (R):
SortMergeJoin
Left: Sort(id) <- Shuffle(id) <- TableScan(L)
Right: TableScan(R)
17. Spark Bucketing Optimizations (join)
Avoid shuffle and sort when joining bucketed tables with different buckets
SQL query to join tables:
SELECT . . .
FROM left L
JOIN right R
ON L.id = R.id
Query plan to join 4-buckets-table (L) with 16-buckets-table (R):
SortMergeJoin
Left: TableScan(L)
Right: SortedCoalesce(4) <- TableScan(R)
SortedCoalesceExec: physical plan operator that inherits child ordering
SortedCoalescedRDD: extends CoalescedRDD to read children RDDs in a sort-merge way (priority queue)
18. Sort merge join of bucketed sorted tables with different buckets
- Coalesce the bigger one in a sort-merge way (Sorted-Coalesce)
- Join by buffering one side and streaming the bigger one
[Figure: Table Scan L and Table Scan R feed a Join; the sorted buckets of (id, value) rows of the bigger table pass through Sorted-Coalesce before the join.]
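A sorted coalesce can be sketched with a priority-queue merge. The sketch assumes bucket id = hash % number of buckets and that the target bucket count divides the source bucket count, so source buckets b, b + target, b + 2*target, ... all belong to target bucket b. `sorted_coalesce` is a hypothetical helper for illustration; it is not the SortedCoalescedRDD implementation.

```python
import heapq


def sorted_coalesce(buckets, target):
    """Coalesce len(buckets) sorted buckets into `target` sorted buckets."""
    assert len(buckets) % target == 0
    return [
        # heapq.merge keeps a small heap holding the head of each input,
        # so the merged output stays sorted without a full re-sort.
        list(heapq.merge(*buckets[b::target]))
        for b in range(target)
    ]


# Hypothetical 4-bucket table (ids only), bucket id = id % 4, each bucket sorted.
four = [[0, 4, 8], [1, 5], [2, 6], [3, 7]]
two = sorted_coalesce(four, 2)
print(two)  # [[0, 2, 4, 6, 8], [1, 3, 5, 7]]
```

A plain concatenation of the source buckets would keep the bucketing but lose the ordering, which is why the merge step matters for a sort-merge join.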
19. Spark Bucketing Optimizations (join)
Avoid shuffle and sort when joining bucketed tables with different buckets
SQL query to join tables:
SELECT . . .
FROM left L
JOIN right R
ON L.id = R.id
Query plan to join 4-buckets-table (L) with 16-buckets-table (R):
SortMergeJoin
Left: Repartition(16) <- TableScan(L)
Right: TableScan(R)
RepartitionWithoutShuffleExec: physical plan operator that inherits child ordering
RepartitionWithoutShuffleRDD: divide-read-filter children RDD partitions
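The divide-read-filter idea can be illustrated as follows: each output partition reads the single source bucket that could contain its keys and keeps only the rows whose bucket id under the larger bucket count matches. The sketch assumes bucket id = id % number of buckets (a stand-in hash) and that the target count is a multiple of the source count; `repartition_without_shuffle` is a hypothetical helper, not the operator from the talk.

```python
def bucket_id(key: int, num_buckets: int) -> int:
    # Stand-in hash (assumption); a real table would use its bucketing hash.
    return key % num_buckets


def repartition_without_shuffle(buckets, target):
    """Split len(buckets) sorted buckets into `target` sorted buckets."""
    source = len(buckets)
    assert target % source == 0
    out = []
    for t in range(target):
        # Only source bucket t % source can hold keys of target bucket t.
        parent = buckets[t % source]
        # Filtering keeps relative order, so the output bucket stays sorted.
        out.append([k for k in parent if bucket_id(k, target) == t])
    return out


# Hypothetical 2-bucket table (ids only), each bucket sorted by id.
two = [[0, 2, 4, 6, 8], [1, 3, 5, 7]]
four = repartition_without_shuffle(two, 4)
print(four)  # [[0, 4, 8], [1, 5], [2, 6], [3, 7]]
```

The trade-off in this sketch is that each source bucket is read target / source times, in exchange for moving no data between partitions.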
21. Spark Bucketing Optimizations (group-by)
Avoid shuffle and sort when sort-aggregating bucketed tables
SQL query to group-by table:
SELECT . . .
FROM t
GROUP BY id
Query plan without the optimization:
SortAggregate
Sort(id)
Shuffle(id)
TableScan(t)
Query plan to sort-aggregate bucketed table:
SortAggregate
TableScan(t)
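A sort-aggregate over a bucketed, sorted table only needs one streaming pass per bucket, because all rows with the same id sit next to each other in a single bucket. The sketch below counts rows per id; `sort_aggregate_count` is a hypothetical helper and the sample buckets assume bucket id = id % 2.

```python
from itertools import groupby


def sort_aggregate_count(bucket):
    """Count rows per id in one bucket that is already sorted by id."""
    return [
        (key, sum(1 for _ in rows))
        for key, rows in groupby(bucket, key=lambda row: row[0])
    ]


# Hypothetical 2-bucket table of (id, info) rows, each bucket sorted by id.
buckets = [
    [(0, "a"), (0, "b"), (2, "c")],
    [(1, "d"), (3, "e"), (3, "f")],
]
# Buckets are independent, so each one could be aggregated by a separate task.
counts = [pair for bucket in buckets for pair in sort_aggregate_count(bucket)]
print(counts)  # [(0, 2), (2, 1), (1, 1), (3, 2)]
```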
24. Spark Bucketing Optimizations (group-by)
Avoid shuffle when hash-aggregating bucketed tables
SQL query to group-by table:
SELECT . . .
FROM t
GROUP BY id
Query plan without the optimization:
HashAggregate
Shuffle(id)
TableScan(t)
Query plan to hash-aggregate bucketed table:
HashAggregate
TableScan(t)
27. Spark Bucketing Optimizations (union all)
Avoid shuffle and sort for join/group-by on union-all of bucketed tables
SQL query to group-by on union-all of tables:
SELECT . . .
FROM (
SELECT … FROM L
UNION ALL
SELECT … FROM R
)
GROUP BY id
Query plan to aggregate union-all of bucketed tables:
SortAggregate
Union
TableScan(L) TableScan(R)
Change UnionExec to produce SortedCoalescedRDD instead of CoalescedRDD
30. Spark Bucketing Optimizations (filter)
Filter pushdown for bucketed table
SQL queries to read bucketed table with filter on bucketed column (id):
SELECT … FROM t
WHERE id = 1
SELECT … FROM t
WHERE id IN (1, 2, 3)
Query plan to read bucketed table with filter pushdown:
Filter
TableScan(t)
PushDownBucketFilter: physical plan rule that extracts the bucketed column filter from FilterExec, then filters out unnecessary buckets from e.g. HiveTableScanExec (i.e. unrelated buckets are not read at all)
31. Bucket Filter Push Down
SELECT … FROM t
WHERE id = 1
- Only read required bucket files
[Figure: two panels, "Normal Filter" and "Bucket Filter Push Down", showing buckets of (id, value) rows; with pushdown only the bucket that can contain id = 1 is read.]
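Bucket pruning can be sketched like this: the ids in the filter are hashed to bucket ids, and only those buckets are opened; the row-level filter still runs, since a bucket holds many different ids. The sketch assumes bucket id = id % number of buckets as a stand-in hash, and `scan_with_bucket_filter` is a hypothetical helper rather than the PushDownBucketFilter rule itself.

```python
def bucket_id(key: int, num_buckets: int) -> int:
    # Stand-in hash (assumption); must match the hash used when writing.
    return key % num_buckets


def scan_with_bucket_filter(buckets, wanted_ids):
    """Return matching rows and the number of buckets that had to be read."""
    num_buckets = len(buckets)
    needed = {bucket_id(k, num_buckets) for k in wanted_ids}
    rows, buckets_read = [], 0
    for b in sorted(needed):
        buckets_read += 1  # every other bucket file is skipped entirely
        rows.extend(r for r in buckets[b] if r[0] in wanted_ids)
    return rows, buckets_read


# Hypothetical 4-bucket table of (id, info) rows.
table = [
    [(0, "a"), (4, "b")],
    [(1, "c"), (9, "d")],
    [(2, "e")],
    [(3, "f"), (7, "g")],
]
rows_eq, read_eq = scan_with_bucket_filter(table, {1})        # WHERE id = 1
rows_in, read_in = scan_with_bucket_filter(table, {1, 2, 3})  # WHERE id IN (1, 2, 3)
print(rows_eq, read_eq)
print(rows_in, read_in)
```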
32. Spark Bucketing Optimizations (validation)
Validate bucketing and sorting before writing bucketed tables
SQL query to write bucketed table:
INSERT OVERWRITE
TABLE t
SELECT …
FROM …
Query plan to validate bucketing and sorting before writing table:
InsertIntoTable(t)
SortVerifier
ShuffleVerifier
ShuffleVerifierExec: computes the bucket-id for each row on-the-fly and compares it with the RDD-partition-id
SortVerifierExec: compares ordering between current and previous rows
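The two checks described above are simple to express. In the sketch below, `verify_shuffle` recomputes the bucket id of every row and compares it with the id of the partition the row arrived in, and `verify_sort` compares each row with the previous one. Both function names and the modulo hash are assumptions made for illustration; they are not the operators from the talk.

```python
def bucket_id(key: int, num_buckets: int) -> int:
    # Stand-in hash (assumption); the real check uses the table's bucketing hash.
    return key % num_buckets


def verify_shuffle(partition_id: int, rows, num_buckets: int) -> None:
    """Fail if any row sits in a partition that is not its bucket."""
    for key, _ in rows:
        expected = bucket_id(key, num_buckets)
        if expected != partition_id:
            raise ValueError(
                f"id {key} belongs to bucket {expected}, found in partition {partition_id}"
            )


def verify_sort(rows) -> None:
    """Fail if a row's id is smaller than the previous row's id."""
    previous = None
    for key, _ in rows:
        if previous is not None and key < previous:
            raise ValueError(f"id {key} follows {previous}: partition is not sorted")
        previous = key


good = [(1, "a"), (5, "b"), (9, "c")]  # all ids map to bucket 1 of 4, ascending
verify_shuffle(1, good, 4)
verify_sort(good)

try:
    verify_sort([(5, "b"), (1, "a")])
except ValueError as err:
    print("rejected:", err)
```

Both checks are single streaming passes with constant memory, which is what makes it practical to run them on every write.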
34. Spark Bucketing Optimizations (others)
• Sorted-coalesced read of multiple partitions of a bucketed table
• Prefer sort-merge-join for bucketed sorted tables
• Prefer sort-aggregate for bucketed sorted tables
• Avoid shuffle for NULL-safe-equal join (<=>) on bucketed tables
• Allow users to skip shuffle and sort before writing a bucketed table
• Automatically align the maximum number of dynamic-allocation executors with the number of buckets
• Efficient Hive table sampling support
35. Bucketing Compatibility across SQL Engines
• Hive hash is different from murmur3 hash! (bitwise-and with 2^31-1 in org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getBucketNumber)
• The same bucketing hash function (e.g. Hive hash) should be used across SQL engines (Spark/Presto/Hive)
• The numbers of buckets of all tables should be divisible by each other (e.g. powers of two)
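Two of the points above can be checked with a little arithmetic. The sketch mirrors the bitwise-and with 2^31-1 followed by a modulo, which keeps the bucket number non-negative for negative 32-bit hash codes, and then shows why divisible bucket counts matter: a row's bucket among 16 determines its bucket among 4, while no such mapping exists between 6 and 4. The function name `hive_style_bucket_number` is hypothetical, and the input is a hash code that is assumed to be computed elsewhere.

```python
INT_MAX = 2**31 - 1


def hive_style_bucket_number(hash_code: int, num_buckets: int) -> int:
    # Clear the sign bit first so negative hash codes map to a valid bucket.
    return (hash_code & INT_MAX) % num_buckets


print(hive_style_bucket_number(-1, 8))  # 7

hash_codes = range(-1000, 1000)

# 4 divides 16: bucket-of-16 modulo 4 always equals bucket-of-4.
divisible_ok = all(
    hive_style_bucket_number(h, 16) % 4 == hive_style_bucket_number(h, 4)
    for h in hash_codes
)

# 4 does not divide 6: the same relation breaks (for example at hash code 7).
non_divisible_ok = all(
    hive_style_bucket_number(h, 6) % 4 == hive_style_bucket_number(h, 4)
    for h in hash_codes
)
print(divisible_ok, non_divisible_ok)  # True False
```

This relation is what lets an engine coalesce or split buckets locally instead of reshuffling when two tables have different bucket counts.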
36. Bucketing Compatibility across SQL Engines
• Changing the number of buckets should be easy and painless across compute engines for SQL users
• When and what to bucket?
• When more than one query joins or groups by the same columns
37. The Road Ahead
• Bucketing should be user-transparent
• Auto-bucketing project
• Audit join/group-by columns information for all warehouse queries
• Recommend bucketed columns and number of buckets based on
computational cost models
• What are the problems with bucketing? Can we have better data placement, besides bucketing and partitioning?