Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.

Hive Bucketing in Apache Spark with Tejas Patil

4.834 Aufrufe

Veröffentlicht am

Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), while making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (i.e. single stage sort merge join), the ability to short circuit in FILTER operation if the file is pre-sorted over the column in a filter predicate, and it supports quick data sampling.

In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to improve Spark performance from 3-5x faster when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark have resulted in a 2-3x savings when compared to Hive. You’ll also hear about real-world applications of bucketing, like loading of cumulative tables with daily delta, and the characteristics that can help identify suitable candidate jobs that can benefit from bucketing.

Veröffentlicht in: Daten & Analysen
  • Hello! High Quality And Affordable Essays For You. Starting at $4.99 per page - Check our website! https://vk.cc/82gJD2
       Antworten 
    Sind Sie sicher, dass Sie …  Ja  Nein
    Ihre Nachricht erscheint hier
  • Hi, just small question.... I've created a bucketed table with 64 buckets, I'm executing an aggregation function "select t1.ifa,count(*) from $tblName t1 where t1.date_ = '2018-01-01' group by ifa" . I can see that 64 tasks in Spark UI , which utilize just 4 executors (each executor has 16 cores) out of 20 . Is there a way I can scale out the task or that's how bucketed queries should run (number of running cores as the number of buckets)?
       Antworten 
    Sind Sie sicher, dass Sie …  Ja  Nein
    Ihre Nachricht erscheint hier

Hive Bucketing in Apache Spark with Tejas Patil

  1. 1. Hive Bucketing in Apache Spark Tejas Patil Facebook
  2. 2. Agenda • Why bucketing ? • Why is shuffle bad ? • How to avoid shuffle ? • When to use bucketing ? • Spark’s bucketing support • Bucketing semantics of Spark vs Hive • Hive bucketing support in Spark • SQL Planner improvements
  3. 3. Why bucketing ?
  4. 4. Table A Table B . . . . . . . . . . . . JOIN
  5. 5. Table A Table B . . . . . . . . . . . . JOIN Broadcast hash join - Ship smaller table to all nodes - Stream the other table
  6. 6. Table A Table B . . . . . . . . . . . . JOIN Broadcast hash join Shuffle hash join - Ship smaller table to all nodes - Stream the other table - Shuffle both tables, - Hash smaller one, stream the bigger one
  7. 7. Table A Table B . . . . . . . . . . . . JOIN Broadcast hash join Shuffle hash join - Ship smaller table to all nodes - Stream the other table - Shuffle both tables, - Hash smaller one, stream the bigger one Sort merge join - Shuffle both tables, - Sort both tables, buffer one, stream the bigger one
  8. 8. Table A Table B . . . . . . . . . . . . JOIN Broadcast hash join Shuffle hash join Sort merge join - Ship smaller table to all nodes - Stream the other table - Shuffle both tables, - Hash smaller one, stream the bigger one - Shuffle both tables, - Sort both tables, buffer one, stream the bigger one
  9. 9. Table A Table B . . . . . . . . . . . . Sort Merge Join
  10. 10. Table A Table B . . . . . . . . . . . . Shuffle ShuffleShuffleShuffle Sort Merge Join
  11. 11. Table A Table B . . . . . . . . . . . . Sort Sort Sort Sort Shuffle ShuffleShuffleShuffle Sort Merge Join
  12. 12. Table A Table B . . . . . . . . . . . . Sort Join Sort Join Sort Join Sort Join Shuffle ShuffleShuffleShuffle Sort Merge Join
  13. 13. Sort Merge Join
  14. 14. Why is shuffle bad ?
  15. 15. Table A Table B . . . . . . . . . . . . Disk Disk Disk IO for shuffle outputs Why is shuffle bad ?
  16. 16. Table A Table B . . . . . . . . . . . . Shuffle Shuffle Shuffle Shuffle M * R network connections Why is shuffle bad ?
  17. 17. How to avoid shuffle ?
  18. 18. student s JOIN attendance a ON s.id = a.student_id student s JOIN results r ON s.id = r.student_id student s JOIN course_registrationc ON s.id = c.student_id …….. ……..
  19. 19. student s JOIN attendance a ON s.id = a.student_id student s JOIN results r ON s.id = r.student_id
  20. 20. student s JOIN attendance a ON s.id = a.student_id student s JOIN results r ON s.id = r.student_id Pre-compute at table creation
  21. 21. Bucketing = pre-(shuffle + sort) inputs on join keys
  22. 22. Without bucketing Job(s) populating input table
  23. 23. Without bucketing With bucketing Job(s) populating input table
  24. 24. Performance comparison 0 100 200 300 400 500 600 700 800 900 Before After Bucketing Reserved CPU time (days) 0 1 2 3 4 5 6 7 8 Before After Bucketing Latency (hours)
  25. 25. When to use bucketing ? • Tables used frequently in JOINs with same key • Student -> student_id • Employee -> employee_id • Users -> user_id
  26. 26. When to use bucketing ? • Tables used frequently in JOINs with same key • Student -> student_id • Employee -> employee_id • Users -> user_id …. => Dimension tables
  27. 27. When to use bucketing ? • Tables used frequently in JOINs with same key • Student -> student_id • Employee -> employee_id • Users -> user_id • Loading daily cumulative tables • Both base and delta tables could be bucketed on a common column
  28. 28. When to use bucketing ? • Tables used frequently in JOINs with same key • Student -> student_id • Employee -> employee_id • Users -> user_id • Loading daily cumulative tables • Both base and delta tables could be bucketed on a common column • Indexing capability
  29. 29. Spark’s bucketing support
  30. 30. Creation of bucketed tables df.write .bucketBy(numBuckets, "col1", ...) .sortBy("col1", ...) .saveAsTable("bucketed_table") via Dataframe API
  31. 31. Creation of bucketed tables CREATE TABLE bucketed_table( column1 INT, ... ) USING parquet CLUSTERED BY (column1, …) SORTED BY (column1, …) INTO `n` BUCKETS via SQL statement
  32. 32. Check bucketing spec scala> sparkContext.sql("DESC FORMATTED student").collect.foreach(println) [# col_name,data_type,comment] [student_id,int,null] [name,int,null] [# Detailed Table Information,,] [Database,default,] [Table,table1,] [Owner,tejas,] [Created,Fri May 12 08:06:33 PDT 2017,] [Type,MANAGED,] [Num Buckets,64,] [Bucket Columns,[`student_id`],] [Sort Columns,[`student_id`],] [Properties,[serialization.format=1],] [Serde Library,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,] [InputFormat,org.apache.hadoop.mapred.SequenceFileInputFormat,]
  33. 33. Bucketing config SET spark.sql.sources.bucketing.enabled=true
  34. 34. [SPARK-15453] Extract bucketing info in FileSourceScanExec SELECT * FROM tableA JOIN tableB ON tableA.i= tableB.i AND tableA.j= tableB.j
  35. 35. [SPARK-15453] Extract bucketing info in FileSourceScanExec SELECT * FROM tableA JOIN tableB ON tableA.i= tableB.i AND tableA.j= tableB.j Sort Merge Join Sort Exchange TableScan Sort Exchange TableScan Already bucketed on columns (i, j) Shuffle on columns (i, j) Sort on columns (i, j) Join on [i,j], [i,j] respectively of relations
  36. 36. [SPARK-15453] Extract bucketing info in FileSourceScanExec SELECT * FROM tableA JOIN tableB ON tableA.i= tableB.i AND tableA.j= tableB.j Sort Merge Join Sort Exchange TableScan Sort Exchange TableScan Already bucketed on columns (i, j) Shuffle on columns (i, j) Sort on columns (i, j) Join on [i,j], [i,j] respectively of relations
  37. 37. [SPARK-15453] Extract bucketing info in FileSourceScanExec SELECT * FROM tableA JOIN tableB ON tableA.i= tableB.i AND tableA.j= tableB.j Sort Merge Join TableScan TableScan Sort Merge Join TableScan TableScanSort Exchange Sort Exchange
  38. 38. Bucketing semantics of Spark vs Hive
  39. 39. Bucketing semantics of Spark vs Hive bucket 0 bucket 1 bucket (n-1) ….. Hive my_bucketed_table
  40. 40. Bucketing semantics of Spark vs Hive bucket 0 bucket 1 bucket (n-1) ….. Hive Spark bucket 0 bucket 1 bucket (n-1) ….. bucket 0 bucket 0 bucket 1 bucket (n-1)bucket (n-1)bucket (n-1) my_bucketed_tablemy_bucketed_table
  41. 41. Bucketing semantics of Spark vs Hive bucket 0 bucket 1 bucket (n-1) ….. Hive M M M M M R R R …..
  42. 42. Bucketing semantics of Spark vs Hive bucket 0 bucket 1 bucket (n-1) ….. Hive Spark M M M M M R R R ….. bucket 0 bucket 1 bucket (n-1) ….. M M M M M….. bucket 0 bucket 0 bucket 1 bucket (n-1)bucket (n-1)bucket (n-1)
  43. 43. Bucketing semantics of Spark vs Hive Hive Spark Model Optimizes reads, writes are costly Writes are cheaper, reads are costlier
  44. 44. Bucketing semantics of Spark vs Hive Hive Spark Model Optimizes reads, writes are costly Writes are cheaper, reads are costlier Unit of bucketing A single file represents a single bucket A collection of files together comprise a single bucket
  45. 45. Bucketing semantics of Spark vs Hive Hive Spark Model Optimizes reads, writes are costly Writes are cheaper, reads are costlier Unit of bucketing A single file represents a single bucket A collection of files together comprise a single bucket Sort ordering Each bucket is sorted globally. This makes it easy to optimize queries reading this data Each file is individually sorted but the “bucket” as a whole is not globally sorted
  46. 46. Bucketing semantics of Spark vs Hive Hive Spark Model Optimizes reads, writes are costly Writes are cheaper, reads are costlier Unit of bucketing A single file represents a single bucket A collection of files together comprise a single bucket Sort ordering Each bucket is sorted globally. This makes it easy to optimize queries reading this data Each file is individually sorted but the “bucket” as a whole is not globally sorted Hashing function Hive’s inbuilt hash Murmur3Hash
  47. 47. [SPARK-19256] Hive bucketing support • Introduce Hive’s hashing function [SPARK-17495] • Enable creating hive bucketed tables [SPARK-17729] • Support Hive’s `Bucketed file format` • Propagate Hive bucketing information to planner [SPARK-17654] • Expressing outputPartitioning and requiredChildDistribution • Creating empty bucket files when no data for a bucket • Allow Sort Merge join over tables with number of buckets multiple of each other • Support N-way Sort Merge Join
  48. 48. [SPARK-19256] Hive bucketing support • Introduce Hive’s hashing function [SPARK-17495] • Enable creating hive bucketed tables [SPARK-17729] • Support Hive’s `Bucketed file format` • Propagate Hive bucketing information to planner [SPARK-17654] • Expressing outputPartitioning and requiredChildDistribution • Creating empty bucket files when no data for a bucket • Allow Sort Merge join over tables with number of buckets multiple of each other • Support N-way Sort Merge Join Merged in upstream
  49. 49. [SPARK-19256] Hive bucketing support • Introduce Hive’s hashing function [SPARK-17495] • Enable creating hive bucketed tables [SPARK-17729] • Support Hive’s `Bucketed file format` • Propagate Hive bucketing information to planner [SPARK-17654] • Expressing outputPartitioning and requiredChildDistribution • Creating empty bucket files when no data for a bucket • Allow Sort Merge join over tables with number of buckets multiple of each other • Support N-way Sort Merge Join FB-prod (6 months)
  50. 50. SQL Planner improvements
  51. 51. [SPARK-19122] Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order SELECT * FROM a JOIN b ON a.x=b.xAND a.y=b.y
  52. 52. [SPARK-19122] Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order Sort Merge Join TableScan TableScan SELECT * FROM a JOIN b ON a.x=b.xAND a.y=b.y
  53. 53. [SPARK-19122] Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order SELECT * FROM a JOIN b ON a.x=b.xAND a.y=b.y Sort Merge Join TableScan TableScan SELECT * FROM a JOIN b ON a.y=b.y AND a.x=b.x
  54. 54. [SPARK-19122] Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order Sort Merge Join Exchange TableScan Sort Exchange TableScan Sort Sort Merge Join TableScan TableScan SELECT * FROM a JOIN b ON a.x=b.xAND a.y=b.y SELECT * FROM a JOIN b ON a.y=b.y AND a.x=b.x
  55. 55. [SPARK-19122] Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order Sort Merge Join Exchange TableScan Sort Exchange TableScan Sort Sort Merge Join TableScan TableScan SELECT * FROM a JOIN b ON a.x=b.xAND a.y=b.y SELECT * FROM a JOIN b ON a.y=b.y AND a.x=b.x
  56. 56. [SPARK-19122] Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order Sort Merge Join TableScan TableScan SELECT * FROM a JOIN b ON a.x=b.xAND a.y=b.y SELECT * FROM a JOIN b ON a.y=b.y AND a.x=b.x Sort Merge Join TableScan TableScan
  57. 57. [SPARK-17271] Planner adds un-necessary Sort even if child ordering is semantically same as required ordering SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1
  58. 58. [SPARK-17271] Planner adds un-necessary Sort even if child ordering is semantically same as required ordering SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1 Sort Merge Join Sort TableScan Sort TableScan Already bucketed on column `col1` Sort on columns (a.col1 and b.col1) Join on [col1], [col1] respectively of relations
  59. 59. [SPARK-17271] Planner adds un-necessary Sort even if child ordering is semantically same as required ordering SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1 Sort Merge Join TableScan TableScan Already bucketed on column `col1` Sort on columns (a.col1 and b.col1)Sort Sort Join on [col1], [col1] respectively of relations
  60. 60. [SPARK-17271] Planner adds un-necessary Sort even if child ordering is semantically same as required ordering SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1 Sort Merge Join TableScan TableScan Sort Sort Sort Merge Join TableScan TableScan
  61. 61. [SPARK-17698] Join predicates should not contain filter clauses SELECT a.id, b.id FROM table1 a FULL OUTER JOIN table2 b ON a.id= b.id AND a.id='1' AND b.id='1'
  62. 62. [SPARK-17698] Join predicates should not contain filter clauses SELECT a.id, b.id FROM table1 a FULL OUTER JOIN table2 b ON a.id= b.id AND a.id='1' AND b.id='1' Sort Merge Join TableScan TableScan Sort Exchange Sort Exchange Already bucketed on column [id] Shuffle on columns [id, 1, id] and [id, id, 1] Sort on columns [id, id, 1] and [id, 1, id] Join on [id, id, 1], [id, 1, id] respectively of relations
  63. 63. [SPARK-17698] Join predicates should not contain filter clauses SELECT a.id, b.id FROM table1 a FULL OUTER JOIN table2 b ON a.id = b.id AND a.id='1' AND b.id='1' Sort Merge Join TableScan TableScan Sort Exchange Sort Exchange Already bucketed on column [id] Shuffle on columns [id, 1, id] and [id, id, 1] Sort on columns [id, id, 1] and [id, 1, id] Join on [id, id, 1], [id, 1, id] respectively of relations
  64. 64. [SPARK-17698] Join predicates should not contain filter clauses SELECT a.id, b.id FROM table1 a FULL OUTER JOIN table2 b ON a.id = b.id AND a.id='1' AND b.id='1' Sort Merge Join TableScan TableScan Sort Exchange Sort Exchange Already bucketed on column [id] Shuffle on columns [id, 1, id] and [id, id, 1] Sort on columns [id, id, 1] and [id, 1, id] Join on [id, id, 1], [id, 1, id] respectively of relations
  65. 65. [SPARK-17698] Join predicates should not contain filter clauses SELECT a.id, b.id FROM table1 a FULL OUTER JOIN table2 b ON a.id = b.id AND a.id='1' AND b.id='1' Sort Merge Join TableScan TableScan Sort Merge Join TableScan TableScanSort Exchange Sort Exchange
  66. 66. [SPARK-13450] Introduce ExternalAppendOnlyUnsafeRowArray
  67. 67. [SPARK-13450] Introduce ExternalAppendOnlyUnsafeRowArray Buffered relation Streamed relation If there are a lot of rows to be buffered (eg. Skew), the query would OOM row with key=`foo` rows with key=`foo`
  68. 68. [SPARK-13450] Introduce ExternalAppendOnlyUnsafeRowArray Buffered relation Streamed relation row with key=`foo` rows with key=`foo` Disk buffer would spill to disk after reaching a threshold spill
  69. 69. Summary • Shuffle and sort is costly due to disk and network IO • Bucketing will pre-(shuffle and sort) the inputs • Expect at least 2-5x gains after bucketing input tables for joins • Candidates for bucketing: • Tables used frequently in JOINs with same key • Loading of data in cumulative tables • Spark’s bucketing is not compatible with Hive’s bucketing
  70. 70. Questions ?

×