
Supporting Over a Thousand Custom Hive User Defined Functions


Over the years, Facebook has used Hive as the primary query engine for our data engineers. Since Hive uses a SQL-like query language called HQL, the list of built-in User Defined Functions (UDFs) did not always satisfy our customers' requirements, and as a result an extensive list of custom UDFs was developed over time. As we started migrating pipelines from Hive to Spark SQL, a number of custom UDFs turned out to be incompatible with Spark, and many others showed poor performance. In this talk we will first take a deep dive into how Hive UDFs work with Spark. We will then share what challenges we overcame on the way to supporting 99.99% of the custom UDFs in Spark.

Speakers: Sergey Makagonov, Xin Yao

Published in: Data & Analytics

Supporting Over a Thousand Custom Hive User Defined Functions

  1. 1. Supporting Over a Thousand Custom Hive User Defined Functions By Sergey Makagonov and Xin Yao Facebook
  2. 2. • Introduction to User Defined Functions • Hive UDFs at Facebook • Major challenges and improvements • Partial aggregations Agenda
  3. 3. What Are “User Defined Functions”?
  4. 4. • UDFs are used to add custom code logic when built-in functions cannot achieve the desired result User Defined Functions SELECT substr(description, 1, 100) AS first_100, count(*) AS cnt FROM tmp_table GROUP BY 1;
  5. 5. • Regular user-defined functions (UDFs): work on a single row in a table and for one or more inputs produce a single output • User-defined table functions (UDTFs): for every row in a table can return multiple values as output • Aggregate functions (UDAFs): work on one or more rows in a table and produce a single output Types of Hive functions
  6. 6. Types of Hive functions. Regular UDFs SELECT FB_ARRAY_CONCAT( arr1, arr2 ) AS zipped FROM dim_two_rows; Output: ["a","b","c","d","e","f"] ["foo","bar","baz","spam"] arr1 arr2 ['a', 'b', 'c'] ['d', 'e', 'f'] ['foo', 'bar'] ['baz', 'spam']
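
  For readers following along, here is a minimal, hedged sketch of what a regular Hive UDF of this kind might look like. FB_ARRAY_CONCAT itself is internal to Facebook, so the class below (ArrayConcatUDF) is purely illustrative and not the actual implementation.

      import java.util.{ArrayList => JArrayList}
      import org.apache.hadoop.hive.ql.exec.UDFArgumentException
      import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
      import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
      import org.apache.hadoop.hive.serde2.objectinspector.{ListObjectInspector, ObjectInspector, ObjectInspectorFactory}
      import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

      class ArrayConcatUDF extends GenericUDF {
        @transient private var inspectors: Array[ListObjectInspector] = _

        // Called once to validate argument types and declare the return type (array<string>).
        override def initialize(arguments: Array[ObjectInspector]): ObjectInspector = {
          if (arguments.length != 2 || !arguments.forall(_.isInstanceOf[ListObjectInspector]))
            throw new UDFArgumentException("ArrayConcatUDF expects exactly two array arguments")
          inspectors = arguments.map(_.asInstanceOf[ListObjectInspector])
          ObjectInspectorFactory.getStandardListObjectInspector(
            PrimitiveObjectInspectorFactory.javaStringObjectInspector)
        }

        // Called once per row: read both arrays through their ObjectInspectors and concatenate them.
        override def evaluate(arguments: Array[DeferredObject]): AnyRef = {
          val out = new JArrayList[String]()
          arguments.zip(inspectors).foreach { case (arg, oi) =>
            val list = oi.getList(arg.get())
            if (list != null) {
              val it = list.iterator()
              while (it.hasNext) {
                val elem = it.next()
                out.add(if (elem == null) null else elem.toString)
              }
            }
          }
          out
        }

        override def getDisplayString(children: Array[String]): String =
          s"array_concat(${children.mkString(", ")})"
      }
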
  7. 7. Types of Hive functions. UDTFs SELECT id, idx FROM dim_one_row LATERAL VIEW STACK(3, 1, 2, 3) tmp AS idx; Output: 123 1 123 2 123 3 id 123
  8. 8. Types of Hive functions. UDAFs SELECT COLLECT_SET(id) AS all_ids FROM dim_three_rows; Output: [123, 124, 125] id 123 124 125
  9. 9. How Hive UDFs work in Spark • Most Hive data types (Java types and derivatives of the ObjectInspector class) can be converted to Spark's data types, and vice versa • Instances of Hive's GenericUDF, SimpleGenericUDAF and GenericUDTF are called via wrapper classes extending Spark's Expression, ImperativeAggregate and Generator classes respectively
  10. 10. How Hive UDFs work in Spark
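
  As a concrete illustration of the flow above, the following hedged sketch shows how a Hive UDF class is typically registered and invoked from Spark SQL; the class name com.example.udf.ArrayConcatUDF and the table dim_two_rows are placeholders.

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .appName("hive-udf-demo")
        .enableHiveSupport()   // needed so Hive UDF classes can be resolved and wrapped
        .getOrCreate()

      // Register the Hive UDF class with the session catalog (its jar must be on the
      // classpath, e.g. added via --jars or ADD JAR).
      spark.sql("CREATE TEMPORARY FUNCTION fb_array_concat AS 'com.example.udf.ArrayConcatUDF'")

      // On every call Spark converts its internal row values into Hive ObjectInspectors/Java
      // types, invokes the UDF's evaluate(), and converts the result back into Spark's types.
      spark.sql("SELECT fb_array_concat(arr1, arr2) AS zipped FROM dim_two_rows").show()
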
  11. 11. UDFs at Facebook
  12. 12. • Hive was the primary query engine until we started to migrate jobs to Spark and Presto • Over the course of several years, over a thousand custom User Defined Functions were built • Hive queries that used UDFs accounted for over 70% of CPU time • Supporting Hive UDFs in Spark is important for migration UDFs at Facebook
  13. 13. • At the beginning of the Hive-to-Spark migration, the level of UDF support was unclear Identifying Baseline
  14. 14. • Most UDFs were already covered by basic tests from the Hive days • We also had a testing framework built for running those tests in Hive UDFs testing framework
  15. 15. • The framework was extended further to allow running queries against Spark • A temporary scala file is created for each UDF class, containing code to run SQL queries using the DataFrame API (see the sketch below) • A spark-shell subprocess is spawned to run the scala file: spark-shell --conf spark.ui.enabled=false … -i /tmp/spark-hive-udf-1139336654093084343.scala • Output is parsed and compared to the expected result UDFs testing framework
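
  A hedged sketch of what such a generated scala file might contain; the UDF class, function name, marker string and test table are illustrative rather than Facebook's actual framework code. Inside spark-shell (which runs the file via -i), the spark session object is already defined.

      // Register the UDF under test with the Spark session provided by spark-shell.
      spark.sql("CREATE TEMPORARY FUNCTION udf_under_test AS 'com.example.udf.UdfUnderTest'")

      // Run the test query through the DataFrame/SQL API and print each row with a marker
      // the framework can parse out of the spark-shell output and compare to the Hive result.
      spark.sql("SELECT udf_under_test(col1, col2) AS actual FROM test_input_table")
        .collect()
        .foreach(row => println(s"UDF_TEST_OUTPUT:${row.mkString("\t")}"))

      System.exit(0)   // terminate the spark-shell subprocess once the output has been printed
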
  16. 16. • With test coverage in place, the baseline support of UDFs, weighted by query count and CPU days, was identified: 58% • Failed tests helped to identify the common issues UDFs testing framework
  17. 17. Major challenges
  18. 18. • getRequiredJars and getRequiredFiles - functions to automatically include additional resources required by this UDF. • initialize(StructObjectInspector) in GenericUDTF - Spark SQL uses a deprecated interface initialize(ObjectInspector[]) only. • configure (GenericUDF, GenericUDTF, and GenericUDAFEvaluator) - a function to initialize functions with MapredContext, which is inapplicable to Spark. • close (GenericUDF and GenericUDAFEvaluator) is a function to release associated resources. Spark SQL does not call this function when tasks finish. • reset (GenericUDAFEvaluator) - a function to re-initialize aggregation for reusing the same aggregation. Spark SQL currently does not support the reuse of aggregation. • getWindowingEvaluator (GenericUDAFEvaluator) - a function to optimize aggregation by evaluating an aggregate over a fixed window. Unsupported APIs
  20. 20. getRequiredFiles and getRequiredJars • Functions to automatically include additional resources required by the UDF • UDF code can assume that the file is present in the executor working directory
  21. 21. Supporting required files/jars (SPARK-27543) • Driver: during initialization, for each UDF, identify the required files and jars and register them for distribution with SparkContext.addFile(…) and SparkContext.addJar(…) • Executor: fetches the files added to SparkContext from the Driver; then, for each UDF, if the required file is already in the working dir, do nothing (it was distributed); if the file is missing, try to create a symlink to the absolute path
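
  A hedged sketch of the two halves of this approach (SPARK-27543); the helper names below are illustrative, not the actual patch.

      import java.nio.file.{Files, Paths}
      import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
      import org.apache.spark.SparkContext

      // Driver side: while wrapping a Hive UDF, register its declared resources for distribution.
      def registerUdfResources(sc: SparkContext, udf: GenericUDF): Unit = {
        Option(udf.getRequiredFiles).foreach(_.foreach(f => sc.addFile(f)))
        Option(udf.getRequiredJars).foreach(_.foreach(j => sc.addJar(j)))
      }

      // Executor side, per UDF: the UDF expects its file in the working directory. If the
      // distributed copy is already there, do nothing; otherwise try to symlink the
      // working-directory name to the file's original absolute path.
      def ensureRequiredFile(requiredPath: String): Unit = {
        val fileName     = Paths.get(requiredPath).getFileName.toString
        val inWorkingDir = Paths.get(fileName)
        if (!Files.exists(inWorkingDir) && Files.exists(Paths.get(requiredPath))) {
          Files.createSymbolicLink(inWorkingDir, Paths.get(requiredPath))
        }
      }
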
  22. 22. • The majority of Hive UDFs are written without concurrency in mind • Hive runs each task in a separate JVM process • Spark runs a separate JVM process for each Executor, and an Executor can run multiple tasks concurrently UDFs and Thread Safety Executor Task 1 UDF instance 1 Task 2 UDF instance 2
  23. 23. Thread-unsafe UDF Example (see the sketch below) • Consider that we have 2 tasks and hence 2 instances of the UDF: "instance 1" and "instance 2" • The evaluate method is called for each row; both instances could pass the null check inside evaluate at the same time • Once "instance 1" finishes initialization, it will call evaluate for the next row • If "instance 2" is still in the middle of initializing the mapping, it could overwrite the data that "instance 1" relied on, which could lead to data corruption or an exception
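
  A hedged sketch of this pattern; the UDF and the mapping it loads are hypothetical, but the shape (a lazily initialized static field read and written from evaluate) matches the race described above.

      import java.util.{HashMap => JHashMap, Map => JMap}
      import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
      import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
      import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector
      import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

      object CountryNameUDF {
        // Shared by every task in the executor JVM: the root of the race.
        var codeToName: JMap[String, String] = null
      }

      class CountryNameUDF extends GenericUDF {
        override def initialize(arguments: Array[ObjectInspector]): ObjectInspector =
          PrimitiveObjectInspectorFactory.javaStringObjectInspector

        override def evaluate(arguments: Array[DeferredObject]): AnyRef = {
          if (CountryNameUDF.codeToName == null) {      // two tasks can both pass this check...
            val m = new JHashMap[String, String]()
            m.put("US", "United States")                // stand-in for an expensive mapping load
            CountryNameUDF.codeToName = m               // ...and the later write clobbers the earlier one
          }
          CountryNameUDF.codeToName.get(String.valueOf(arguments(0).get()))
        }

        override def getDisplayString(children: Array[String]): String = "country_name(...)"
      }
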
  24. 24. Approach 1: Introduce Synchronization • Introduce locking (synchronization) on the UDF class when initializing the mapping Cons: • Synchronization is computationally expensive • Requires manual and accurate refactoring of code, which does not scale for hundreds of UDFs
  25. 25. Approach 2: Make Field Non-static • Turn the static variable into an instance variable (see the sketch below) Cons: • Adds more pressure on memory (instances cannot share complex data) Pros: • Minimal changes to the code, which can also be codemodded for all other UDFs that use static fields of non-primitive types
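
  The same hypothetical UDF after the codemod: the mapping becomes a per-instance field, so each task builds and reads its own copy at the cost of some extra memory, while the evaluate logic stays unchanged.

      import java.util.{HashMap => JHashMap, Map => JMap}
      import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
      import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
      import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector
      import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

      class CountryNameUDF extends GenericUDF {
        // Instance field instead of shared static state; also transient so Kryo does not
        // try to ship the map from the driver (it is rebuilt lazily on the executor).
        @transient private var codeToName: JMap[String, String] = null

        override def initialize(arguments: Array[ObjectInspector]): ObjectInspector =
          PrimitiveObjectInspectorFactory.javaStringObjectInspector

        override def evaluate(arguments: Array[DeferredObject]): AnyRef = {
          if (codeToName == null) {
            val m = new JHashMap[String, String]()
            m.put("US", "United States")
            codeToName = m
          }
          codeToName.get(String.valueOf(arguments(0).get()))
        }

        override def getDisplayString(children: Array[String]): String = "country_name(...)"
      }
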
  26. 26. • In Spark, UDF objects are initialized on the Driver, serialized, and later deserialized on executors • Some classes cannot be deserialized out of the box • Example: guava's ImmutableSet. Kryo can successfully serialize the objects on the driver, but fails to deserialize them on executors Kryo serialization/deserialization
  27. 27. • Catch serde issues by running Hive UDF tests in cluster mode • For commonly used classes, write custom serializers or import existing ones (see the sketch below) • Mark problematic instance variables as transient Solving the Kryo serde problem
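
  A hedged sketch of the "custom or imported serializer" route, assuming the third-party kryo-serializers library (de.javakaffee) is on the classpath; the registrator class name is a placeholder.

      import com.esotericsoftware.kryo.Kryo
      import de.javakaffee.kryoserializers.guava.ImmutableSetSerializer
      import org.apache.spark.serializer.KryoRegistrator

      class UdfKryoRegistrator extends KryoRegistrator {
        override def registerClasses(kryo: Kryo): Unit = {
          // Registers a serializer for guava's ImmutableSet and its internal subclasses,
          // so UDF instances holding such fields can be deserialized on executors.
          ImmutableSetSerializer.registerSerializers(kryo)
        }
      }

      // Enabled with:
      //   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
      //   --conf spark.kryo.registrator=com.example.UdfKryoRegistrator
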
  28. 28. • Hive UDFs don't support Spark's data types out of the box • Similarly, Spark cannot work with Hive's object inspectors • For each UDF call, Spark's data types are wrapped into Hive's inspectors and Java types • Likewise for the results: Java types are converted back into Spark's data types Hive UDFs performance
  29. 29. • This wrapping/unwrapping overhead can lead to up to 2x the CPU time spent in a UDF compared to a Spark-native implementation • UDFs that work with complex types suffer the most Hive UDFs performance
  30. 30. • UDFs account for 15% of CPU time spent on Spark queries • The most computationally expensive UDFs can be converted to Spark-native UDFs (see the sketch below) Hive UDFs performance
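
  A hedged illustration of the conversion idea: a Spark-side UDF operates directly on Spark's types and avoids the per-row ObjectInspector wrapping. (Facebook's hottest UDFs were ported as native expressions; the simpler Scala UDF below only sketches the direction, with the function and table names made up.)

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

      // Spark-side equivalent of the hypothetical array-concat Hive UDF shown earlier.
      spark.udf.register("native_array_concat", (a: Seq[String], b: Seq[String]) =>
        Option(a).getOrElse(Seq.empty) ++ Option(b).getOrElse(Seq.empty))

      // No Hive ObjectInspector conversion happens on this code path.
      spark.sql("SELECT native_array_concat(arr1, arr2) AS zipped FROM dim_two_rows").show()
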
  31. 31. Partial Aggregation
  32. 32. SELECT id, max(value) FROM table GROUP BY id Aggregation
  33. 33. Aggregation id value 1 100 1 200 2 400 3 100 id value 1 300 2 200 2 300 id value 1 100 1 200 1 300 3 100 id value 2 400 2 200 2 300 id max(value) 1 300 3 100 id max(value) 2 400 Mapper Reducer Shuffle Aggregation Mapper 1 Mapper 2 Reducer 1 Reducer 2
  34. 34. 1. Every row needs to be shuffled over the network, which is a heavy operation. 2. Data skew. One reducer needs to process more data than the others if one key has more rows. 1. For example: key1 has 1 million rows, while other keys each have 10 rows on average What's the problem
  35. 35. Partial aggregation is a technique in which the system partially aggregates the data on the mapper side before the shuffle, in order to reduce the shuffle size. Partial Aggregation
  36. 36. SELECT id, max(value) FROM table GROUP BY id Partial Aggregation
  37. 37. Aggregation id value 1 100 1 200 2 400 3 100 id value 1 300 2 200 2 300 id value 1 100 1 200 1 300 3 100 id value 2 400 2 200 2 300 id max(value) 1 300 3 100 id max(value) 2 400 Mapper Reducer Shuffle Aggregation Mapper 1 Mapper 2 Reducer 1 Reducer 2
  38. 38. Partial Aggregation id partial_max (value) 1 200 2 400 3 100 id partial_max (value) 1 300 2 300 id partial_max (value) 1 100 1 300 3 100 id partial_max (value) 2 400 2 300 id max(value) 1 300 3 100 id max(value) 2 400 Mapper Reducer Shuffle Final Aggregationid value 1 100 1 200 2 400 3 100 id value 1 300 2 200 2 300 Partial Aggregation Mapper 1 Mapper 2 Reducer 1 Reducer 2
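
  Partial aggregation can be seen directly in Spark's physical plan. A hedged sketch follows; the sample data is made up and the plan output is abridged.

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.max

      val spark = SparkSession.builder().getOrCreate()
      import spark.implicits._

      val table = Seq((1, 100), (1, 200), (2, 400), (3, 100), (1, 300), (2, 200), (2, 300))
        .toDF("id", "value")

      table.groupBy("id").agg(max($"value")).explain()
      // == Physical Plan == (abridged)
      // HashAggregate(keys=[id], functions=[max(value)])                 <- final aggregation after the shuffle
      // +- Exchange hashpartitioning(id, ...)                            <- shuffles the partially aggregated rows
      //    +- HashAggregate(keys=[id], functions=[partial_max(value)])   <- map-side partial aggregation
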
  39. 39. Aggregation vs Partial Aggregation • Shuffle data (# rows): all rows vs. a reduced number of rows • Computation: all aggregation happens on the reducer side vs. extra CPU for partial aggregation, distributed across Mappers and Reducers
  40. 40. • Why partial aggregation is important • It impacts CPU and shuffle size • It can help with data skew Partial Aggregation is important
  41. 41. 1. Partial aggregation support is already in Spark 2. We fixed some issues to make it work with FB UDAFs What we did
  42. 42. Partial Aggregation Production Result
  43. 43. 1. Partial aggregation improved CPU by 20% and shuffle data size by 17% 2. However, we also observed some heavy pipelines regress by as much as 300% FB Production Result
  44. 44. 1. Query shape 2. Data distribution What could go wrong?
  45. 45. • Column Expansion • Partial aggregation expands the number of columns on the Mapper side, resulting in a larger shuffle data size SELECT key, max(value), min(value), count(value), avg(value) FROM table GROUP BY key When partial aggregation doesn't work
  46. 46. Column Expansion id p_max p_min P_count P_avg 1 200 100 2 (300, 2) 2 400 400 1 (400, 1) 3 100 100 1 (100, 1) id value 1 100 1 200 2 400 3 100 Partial Aggregation id value 1 100 1 200 2 400 3 100 Shuffle Shuffle 2 columns Mapper Reducer Shuffle 5 columns Aggregation Partial Aggregation
  47. 47. • Query Shape • Column Expansion • Data distribution • No rows to aggregate on the mapper side SELECT key, max(value) FROM table GROUP BY key When partial aggregation doesn't work
  48. 48. Data Distribution id value 1 100 2 200 3 400 4 100 Partial Aggregation id value 1 100 2 200 3 400 4 100 Shuffle Shuffle 4 rows Mapper Reducer Shuffle 4 rows Extra CPU with NO Row Reduction id Partial_max(value) 1 100 2 200 3 400 4 100 Aggregation Partial Aggregation
  49. 49. Partial Aggregation Computation Cost-based optimization
  50. 50. 1. Each UDAF function partial aggregation performance 2. Column Expansion 3. Row Reduction Partial Aggregation Computation Cost Factors
  51. 51. • Computation cost-based optimizer for partial aggregation (see the sketch below) 1. Use multiple features to calculate the computation cost of partial aggregation 1. input column number 2. output column number 3. computation cost of the UDAF's partial aggregation function 4. … 2. Use the calculated computation cost to decide the configuration for partial aggregation. How we solved the problem
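
  A heavily simplified, hedged sketch of how such a computation-cost heuristic could be structured; the feature names, weights, and threshold are illustrative and not Facebook's actual optimizer.

      // Features the optimizer might consider for one aggregation.
      case class AggSignature(
          inputColumns: Int,             // columns consumed by the aggregation
          outputColumns: Int,            // columns produced on the mapper side (column expansion)
          udafPartialCostPerRow: Double  // relative cost of the UDAF's partial-aggregate function
      )

      // Decide whether map-side partial aggregation is likely to pay off.
      def keepPartialAggregation(sig: AggSignature): Boolean = {
        val expansion = sig.outputColumns.toDouble / math.max(sig.inputColumns, 1)
        val estimatedCost = expansion * sig.udafPartialCostPerRow
        estimatedCost < 2.0   // illustrative threshold; tuned from production measurements in practice
      }
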
  52. 52. 1. It improves efficiency across the board 2. However, there are still queries that don't have the most optimized partial aggregation configuration. Result
  53. 53. 1. Each UDAF function partial aggregation performance 2. Column Expansion 3. Row Reduction Partial Aggregation Computation Cost Factors
  54. 54. • It's hard to know the row reduction in advance • It depends on the data distribution, which might be different on different days • For different GROUP BY keys, the row reduction is different Row Reduction
  55. 55. • History-based tuning • Use historical data from the query to predict the best configuration for future runs • Well suited to partial aggregation because it operates at the query level: it can try different configurations and use the results to direct the configuration of future runs Future work
  56. 56. Recap
  57. 57. Questions?
