Supporting Over a Thousand
Custom Hive User Defined
Functions
By Sergey Makagonov and Xin Yao
Facebook
• Introduction to User Defined Functions
• Hive UDFs at Facebook
• Major challenges and improvements
• Partial aggregations
Agenda
What Are “User Defined
Functions”?
• UDFs are used to add custom code logic when built-in
functions cannot achieve the desired result
User Defined Functions
SELECT
substr(description, 1, 100) AS first_100,
count(*) AS cnt
FROM tmp_table
GROUP BY 1;
• Regular user-defined functions (UDFs): operate on a
single row in a table and, for one or more inputs,
produce a single output
• User-defined table functions (UDTFs): for every row
in a table, can return multiple rows as output
• User-defined aggregate functions (UDAFs): operate on
one or more rows in a table and produce a single output
Types of Hive functions
Types of Hive functions. Regular UDFs
SELECT FB_ARRAY_CONCAT(
arr1, arr2
) AS zipped
FROM dim_two_rows;
Output:
["a","b","c","d","e","f"]
["foo","bar","baz","spam"]

Input table dim_two_rows:
arr1: ['a', 'b', 'c']    arr2: ['d', 'e', 'f']
arr1: ['foo', 'bar']     arr2: ['baz', 'spam']
Types of Hive functions. UDTFs
SELECT id, idx
FROM dim_one_row
LATERAL VIEW
STACK(3, 1, 2, 3) tmp AS idx;
Output:
123 1
123 2
123 3
Input table dim_one_row: id = 123
Types of Hive functions. UDAFs
SELECT
COLLECT_SET(id) AS all_ids
FROM dim_three_rows;
Output:
[123, 124, 125]
Input table dim_three_rows: id = 123, 124, 125
How Hive UDFs work in Spark
• Most Hive data types (Java types and derivatives of the
ObjectInspector class) can be converted to
Spark’s data types, and vice versa
• Instances of Hive’s GenericUDF,
SimpleGenericUDAF and GenericUDTF are
called via wrapper classes extending Spark’s
Expression, ImperativeAggregate and
Generator classes respectively
How Hive UDFs work in Spark
UDFs at Facebook
• Hive was the primary query engine until we started
migrating jobs to Spark and Presto
• Over the course of several years, over a thousand
custom User Defined Functions were built
• Hive queries that used UDFs accounted for over 70%
of CPU time
• Supporting Hive UDFs in Spark is important for
migration
UDFs at Facebook
• At the beginning of the Hive-to-Spark migration, the level
of UDF support in Spark was unclear
Identifying Baseline
• Most UDFs were already covered by basic tests from the Hive days
• We also had a testing framework for running those tests against Hive
UDFs testing framework
• The framework was extended further to allow running queries against Spark
• A temporary Scala file is created for each UDF class, containing code to run SQL
queries using the DataFrame API
• A spark-shell subprocess is spawned to run the Scala file:
spark-shell --conf spark.ui.enabled=false … -i /tmp/spark-
hive-udf-1139336654093084343.scala
• Output is parsed and compared to the expected result
UDFs testing framework
• With test coverage in place, the baseline level of UDF support was measured
at 58%, both by query count and by CPU days
• Failed tests helped to identify the common issues
UDFs testing framework
Major challenges
• getRequiredJars and getRequiredFiles - functions to automatically include
additional resources required by this UDF.
• initialize(StructObjectInspector) in GenericUDTF - Spark SQL uses a
deprecated interface initialize(ObjectInspector[]) only.
• configure (GenericUDF, GenericUDTF, and GenericUDAFEvaluator) - a function to initialize
functions with MapredContext, which is inapplicable to Spark.
• close (GenericUDF and GenericUDAFEvaluator) is a function to release
associated resources. Spark SQL does not call this function when tasks finish.
• reset (GenericUDAFEvaluator) - a function to re-initialize aggregation for reusing the
same aggregation. Spark SQL currently does not support the reuse of aggregation.
• getWindowingEvaluator (GenericUDAFEvaluator) - a function to optimize
aggregation by evaluating an aggregate over a fixed window.
Unsupported APIs
getRequiredFiles and getRequiredJars
• functions to automatically include additional resources required by this UDF
• UDF code can assume that the file is present in the executor’s working directory
Supporting required files/jars (SPARK-27543)
Driver, during initialization, for each UDF:
- Identify required files and jars
- Register them for distribution:
  SparkContext.addFile(…)
  SparkContext.addJar(…)

Executor fetches the files added to SparkContext from the Driver, then for each UDF:
- If the required file is already in the working dir – do nothing (it was distributed)
- If the file is missing – try to create a symlink to its absolute path
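The executor-side logic can be sketched with stdlib file APIs only; `RequiredFileResolver` and `ensureFileInWorkingDir` are hypothetical names for illustration, not the actual SPARK-27543 code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class RequiredFileResolver {
    // Make sure a file required by a UDF is visible in the executor's
    // working directory: do nothing if it was already distributed there,
    // otherwise fall back to a symlink to its absolute path.
    public static Path ensureFileInWorkingDir(Path workingDir, Path requiredFile)
            throws IOException {
        Path target = workingDir.resolve(requiredFile.getFileName());
        if (Files.exists(target)) {
            return target;                      // already distributed (e.g. via addFile)
        }
        if (Files.exists(requiredFile)) {
            // file exists at its absolute path: expose it through a symlink
            return Files.createSymbolicLink(target, requiredFile.toAbsolutePath());
        }
        throw new IOException("required file not found: " + requiredFile);
    }

    // Self-contained demo: create a "required" file outside the working
    // directory and check that it can be read from the working dir.
    public static boolean demo() {
        try {
            Path workingDir = Files.createTempDirectory("executor-wd");
            Path required = Files.createTempDirectory("udf-resources").resolve("lookup.txt");
            Files.write(required, "some mapping data".getBytes());
            Path visible = ensureFileInWorkingDir(workingDir, required);
            return new String(Files.readAllBytes(visible)).equals("some mapping data");
        } catch (IOException | UnsupportedOperationException e) {
            return false;
        }
    }
}
```

After this step the UDF can open the file by name from the working directory, regardless of whether it arrived via distribution or via the symlink fallback.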
• The majority of Hive UDFs are written without concurrency in mind
• Hive runs each task in a separate JVM process
• Spark runs a separate JVM process per Executor, and an Executor can run multiple tasks
concurrently
UDFs and Thread Safety
Within one Executor JVM, Task 1 (UDF instance 1) and Task 2 (UDF instance 2) run concurrently.
Thread-unsafe UDF Example
• Consider 2 concurrent tasks and hence 2 instances of the
UDF: “instance 1” and “instance 2”
• evaluate is called for each row, so both instances
could pass the null check inside evaluate at the
same time
• Once “instance 1” finishes initialization first, it calls
evaluate for the next row
• If “instance 2” is still in the middle of initializing the
mapping, it can overwrite the data that “instance 1”
relied on, which can lead to data corruption or an
exception
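The race described above can be sketched without any Hive dependencies; `ThreadUnsafeUdf` is a hypothetical stand-in for a real GenericUDF with a lazily built static mapping:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a Hive UDF whose evaluate() lazily builds a
// shared static mapping. Two tasks in the same Executor JVM share this
// static field, so both can pass the null check concurrently, and one
// initialization can clobber data the other task is already reading.
public class ThreadUnsafeUdf {
    private static Map<String, String> mapping;   // shared across all tasks!

    public String evaluate(String key) {
        if (mapping == null) {                    // race: both threads may enter
            Map<String, String> m = new HashMap<>();
            m.put("a", "alpha");
            m.put("b", "beta");
            mapping = m;                          // last writer wins
        }
        return mapping.get(key);
    }
}
```

Single-threaded the code is correct, which is exactly why the bug only surfaced once Spark ran multiple tasks in one JVM.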
Approach 1: Introduce Synchronization
• Introduce locking (synchronization) on the UDF
class when initializing the mapping
Cons:
• Synchronization is computationally expensive
• Requires manual, careful refactoring of code,
which does not scale to hundreds of UDFs
Approach 2: Make Field Non-static
• Turn static variable into an instance variable
Cons:
• Adds more pressure on memory (instances cannot
share complex data)
Pros:
• Minimal code changes, which can also be codemodded
across all other UDFs that use static fields of
non-primitive types
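A sketch of the non-static fix, again with a hypothetical stand-in class rather than real Hive code: the only change from the unsafe version is that the mapping becomes an instance field, so each task's UDF instance builds its own copy:

```java
import java.util.HashMap;
import java.util.Map;

// Same UDF with the static field turned into an instance field: each
// task gets its own UDF instance, so each builds a private copy of the
// mapping and cross-task interference is impossible. The cost is one
// copy of the data per task instead of one per JVM.
public class ThreadSafeUdf {
    private Map<String, String> mapping;          // per-instance, no sharing

    public String evaluate(String key) {
        if (mapping == null) {                    // only this task's thread runs here
            Map<String, String> m = new HashMap<>();
            m.put("a", "alpha");
            m.put("b", "beta");
            mapping = m;
        }
        return mapping.get(key);
    }
}
```

Because the transformation is purely mechanical (`static` removed, nothing else), it is exactly the kind of change a codemod can apply across hundreds of UDFs.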
• In Spark, UDF objects are initialized on the Driver, serialized, and
later deserialized on executors
• Some classes cannot be deserialized out of the box
• Example: Guava’s ImmutableSet – Kryo can successfully
serialize the objects on the driver, but fails to deserialize them
on executors
Kryo serialization/deserialization
• Catch serde issues by running Hive UDF tests in cluster mode
• For commonly used classes, write custom serializers or import
existing ones
• Mark problematic instance variables as transient
Solving Kryo serde problem
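The transient-field fix can be illustrated with plain java.io serialization (Spark uses Kryo, but the principle is the same); `TransientFieldUdf` and its fields are hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashSet;
import java.util.Set;

// Illustrates the "mark problematic fields transient" fix: the
// hard-to-serialize field is skipped during serde and rebuilt lazily
// on the executor side after deserialization.
public class TransientFieldUdf implements Serializable {
    private transient Set<String> stopWords;      // not serialized

    private Set<String> stopWords() {
        if (stopWords == null) {                  // rebuilt after deserialization
            stopWords = new HashSet<>();
            stopWords.add("the");
            stopWords.add("a");
        }
        return stopWords;
    }

    public boolean isStopWord(String w) {
        return stopWords().contains(w);
    }

    // round-trip helper used to demonstrate the behavior
    public static TransientFieldUdf roundTrip(TransientFieldUdf udf) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(udf);
            oos.flush();
            ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
            return (TransientFieldUdf) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The deserialized copy arrives with `stopWords == null` and silently rebuilds it on first use, which is why lazy initialization must accompany the `transient` marker.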
• Hive UDFs don’t support Spark’s data types out of the box
• Similarly, Spark cannot work with Hive’s object inspectors
• For each UDF call, Spark’s data types are wrapped into Hive’s
inspectors and Java types
• The same applies to results: Java types are converted back into Spark’s
data types
Hive UDFs performance
• This wrapping/unwrapping overhead can add up to 2x the CPU time
spent in a UDF compared to a Spark-native implementation
• UDFs that work with complex types suffer the most
Hive UDFs performance
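A stdlib-only sketch of the per-row conversion overhead; `HiveText` is a hypothetical stand-in for a Hive writable, and the point is only that the wrapper path performs two extra conversions (and allocations) per row while producing the same answer:

```java
import java.util.ArrayList;
import java.util.List;

// Simulates the per-row conversion described above: for every input row
// the engine-native value is wrapped into a "Hive-style" boxed object,
// the UDF runs on the wrapper, and the result is unwrapped back.
public class WrappingOverheadDemo {
    static final class HiveText {                 // stand-in for a Hive writable
        final String value;
        HiveText(String value) { this.value = value; }
    }

    static HiveText wrap(String nativeValue) { return new HiveText(nativeValue); }
    static String unwrap(HiveText t) { return t.value; }

    static HiveText udfUpper(HiveText in) {       // the "Hive UDF" sees wrappers
        return new HiveText(in.value.toUpperCase());
    }

    public static List<String> runViaWrappers(List<String> rows) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            out.add(unwrap(udfUpper(wrap(row)))); // wrap -> evaluate -> unwrap
        }
        return out;
    }

    public static List<String> runNative(List<String> rows) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            out.add(row.toUpperCase());           // no per-row conversions
        }
        return out;
    }
}
```

Both paths return identical results; the wrapper path simply burns extra CPU on conversions, which is the overhead a Spark-native rewrite eliminates.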
• UDFs account for 15% of the CPU time spent on Spark queries
• The most computationally expensive UDFs can be converted to
Spark-native UDFs
Hive UDFs performance
Partial Aggregation
SELECT id, max(value)
FROM table
GROUP BY id
Aggregation

Mapper 1 input (id, value): (1, 100), (1, 200), (2, 400), (3, 100)
Mapper 2 input (id, value): (1, 300), (2, 200), (2, 300)

Shuffle: every raw row is sent to the reducer that owns its key.
Reducer 1 receives: (1, 100), (1, 200), (1, 300), (3, 100)
Reducer 2 receives: (2, 400), (2, 200), (2, 300)

Reducer 1 output (id, max(value)): (1, 300), (3, 100)
Reducer 2 output (id, max(value)): (2, 400)

Shuffle Aggregation
1. Every row needs to be shuffled over the network, which
is a heavy operation.
2. Data skew: one reducer needs to process more data than
the others if one key has more rows.
   For example: key1 has 1 million rows, while other keys
   each have 10 rows on average.
What’s the problem
Partial aggregation is a technique in which the system partially
aggregates the data on the mapper side, before the shuffle, in order
to reduce the shuffle size.
Partial Aggregation
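The max example from this section can be simulated in a few lines of stdlib Java; `PartialAggDemo` is illustrative, not Spark's implementation:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simulates map-side partial aggregation for "SELECT id, max(value)":
// each mapper collapses its rows to one partial max per key before the
// shuffle, and the reducer merges the partials. The final result is the
// same as shuffling every raw row; only the shuffled row count shrinks.
public class PartialAggDemo {
    // one mapper's rows -> (key -> partial max); row = {id, value}
    public static Map<Integer, Integer> partialMax(List<int[]> rows) {
        Map<Integer, Integer> partial = new HashMap<>();
        for (int[] row : rows) {
            partial.merge(row[0], row[1], Math::max);
        }
        return partial;
    }

    // reducer side: merge the partial results from all mappers
    public static Map<Integer, Integer> mergePartials(List<Map<Integer, Integer>> partials) {
        Map<Integer, Integer> result = new HashMap<>();
        for (Map<Integer, Integer> p : partials) {
            p.forEach((k, v) -> result.merge(k, v, Math::max));
        }
        return result;
    }
}
```

With the data from the slides, Mapper 1 shuffles 3 partial rows instead of 4 raw rows, and Mapper 2 shuffles 2 instead of 3, yet the merged maxima are identical to full aggregation.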
SELECT id, max(value)
FROM table
GROUP BY id
Partial Aggregation

Mapper 1 input (id, value): (1, 100), (1, 200), (2, 400), (3, 100)
Mapper 2 input (id, value): (1, 300), (2, 200), (2, 300)

Partial aggregation on the mappers:
Mapper 1 output (id, partial_max(value)): (1, 200), (2, 400), (3, 100)
Mapper 2 output (id, partial_max(value)): (1, 300), (2, 300)

Shuffle:
Reducer 1 receives: (1, 200), (1, 300), (3, 100)
Reducer 2 receives: (2, 400), (2, 300)

Final aggregation:
Reducer 1 output (id, max(value)): (1, 300), (3, 100)
Reducer 2 output (id, max(value)): (2, 400)
Aggregation vs Partial Aggregation
• Shuffle data (# rows): plain aggregation shuffles all rows; partial
aggregation shuffles a reduced number of rows
• Computation: with plain aggregation all work happens on the reducer
side; partial aggregation spends extra CPU on the partials, distributed
across mappers and reducers
• Why partial aggregation is important:
• It impacts CPU and shuffle size
• It can help mitigate data skew
Partial Aggregation is important
1. Partial aggregation support already exists in Spark
2. We fixed some issues to make it work with FB UDAFs
What we did
Partial Aggregation
Production Result
1. Partial aggregation improved CPU by 20% and shuffle data
size by 17%
2. However, we also observed some heavy pipelines regress
by as much as 300%
FB Production Result
1. Query shape
2. Data distribution
What could go wrong?
• Column Expansion
• Partial aggregation expands the number of columns on the
Mapper side, resulting in a larger shuffle data size
SELECT
key, max(value), min(value), count(value), avg(value)
FROM table
GROUP BY key
When partial aggregation doesn’t work
Column Expansion

Mapper input (id, value): (1, 100), (1, 200), (2, 400), (3, 100)

Without partial aggregation, the mapper shuffles 2 columns: (id, value).

With partial aggregation, the mapper shuffles 5 columns:
id  p_max  p_min  p_count  p_avg
1   200    100    2        (300, 2)
2   400    400    1        (400, 1)
3   100    100    1        (100, 1)
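The effect can be quantified with a hypothetical helper (not Spark's planner): counting shuffled cells for the 4-row, 3-key example shows the partial path shipping more data despite shuffling fewer rows:

```java
import java.util.List;

// Quantifies column expansion: without partial aggregation the mapper
// shuffles the raw (key, value) pairs (2 columns per row); with partial
// aggregation it shuffles one row per distinct key, but each row carries
// the key plus one slot per aggregate.
public class ColumnExpansionDemo {
    // cells shuffled without partial aggregation: rows x 2 columns
    public static int rawShuffleCells(int rowCount) {
        return rowCount * 2;
    }

    // cells shuffled with partial aggregation: one row per distinct key,
    // each carrying the group-by key plus one slot per aggregate column
    public static int partialShuffleCells(int distinctKeys, List<Integer> aggSlotWidths) {
        int width = 1;                         // the group-by key
        for (int w : aggSlotWidths) width += w;
        return distinctKeys * width;
    }
}
```

For the example above (4 rows, 3 distinct keys, 4 aggregate columns), the raw path ships 8 cells while the partial path ships 15, so partial aggregation loses despite the row reduction.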
• Query Shape
• Column Expansion
• Data distribution
• No rows to aggregate on the mapper side
SELECT
key, max(value)
FROM table
GROUP BY key
When partial aggregation doesn’t work
Data Distribution

Mapper input (id, value): (1, 100), (2, 200), (3, 400), (4, 100)

Without partial aggregation: shuffle 4 rows.

With partial aggregation, every key is unique, so there are no rows to combine:
(id, partial_max(value)): (1, 100), (2, 200), (3, 400), (4, 100)
Still 4 rows shuffled – extra CPU with NO row reduction.
Partial Aggregation Computation Cost-based Optimization
1. The partial aggregation performance of each UDAF function
2. Column Expansion
3. Row Reduction
Partial Aggregation Computation Cost Factors
• Computation cost-based optimizer for partial aggregation
1. Use multiple features to calculate the computation cost of
partial aggregation:
   • input column number
   • output column number
   • computation cost of the UDAF partial aggregation function
   • …
2. Use the calculated computation cost to decide the
configuration for partial aggregation.
How we solved the problem
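A hypothetical sketch of such a cost heuristic; the formula and weights are illustrative, not Facebook's production model:

```java
// Estimates shuffle-plus-CPU cost with and without partial aggregation
// from a few features (row reduction, input/output column counts,
// per-UDAF partial-aggregation cost) and enables partial aggregation
// only when it is predicted to be cheaper.
public class PartialAggCostModel {
    public static boolean shouldEnablePartialAgg(
            long inputRows,
            long estimatedDistinctKeys,   // predicted rows after partial agg
            int inputColumns,
            int outputColumns,            // columns after expansion
            double udafPartialCostPerRow) {
        // cost without partial agg: shuffle every raw cell
        double rawCost = (double) inputRows * inputColumns;
        // cost with partial agg: smaller shuffle, wider rows, plus the
        // CPU spent computing the partials on the mappers
        double partialCost = (double) estimatedDistinctKeys * outputColumns
                + inputRows * udafPartialCostPerRow;
        return partialCost < rawCost;
    }
}
```

A heavy row reduction easily pays for the wider rows and the mapper-side CPU, while a query with no row reduction (every key unique) is predicted to regress and keeps partial aggregation off.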
1. It improves efficiency across the board
2. However, there are still queries that don’t get the most
optimized partial aggregation configuration
Result
• It’s hard to know the row reduction in advance
• It depends on the data distribution, which may differ
from day to day
• The row reduction is different for different group-by keys
Row Reduction
• History-based tuning
• Use historical data for the query to predict the best
configuration for future runs
• A perfect fit for partial aggregation, because it operates at the
query level: different configs can be tried and the results used
to direct the config of future runs
Future work
Recap
Questions?
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 

Recently uploaded (20)

{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 

Supporting Over a Thousand Custom Hive User Defined Functions

  • 1. Supporting Over a Thousand Custom Hive User Defined Functions. By Sergey Makagonov and Xin Yao, Facebook
  • 2. • Introduction to User Defined Functions • Hive UDFs at Facebook • Major challenges and improvements • Partial aggregations Agenda
  • 3. What Are “User Defined Functions”?
  • 4. User Defined Functions. UDFs are used to add custom code logic when built-in functions cannot achieve the desired result:
      SELECT substr(description, 1, 100) AS first_100, count(*) AS cnt
      FROM tmp_table
      GROUP BY 1;
  • 5. • Regular user-defined functions (UDFs): work on a single row in a table and for one or more inputs produce a single output • User-defined table functions (UDTFs): for every row in a table can return multiple values as output • Aggregate functions (UDAFs): work on one or more rows in a table and produce a single output Types of Hive functions
  • 6. Types of Hive functions. Regular UDFs.
      Input table dim_two_rows (arr1, arr2): ([‘a’, ‘b’, ‘c’], [‘d’, ‘e’, ‘f’]) and ([‘foo’, ‘bar’], [‘baz’, ‘spam’]).
      SELECT FB_ARRAY_CONCAT(arr1, arr2) AS zipped FROM dim_two_rows;
      Output: [“a”,“b”,”c”,”d”,”e”,”f”] and [“foo”,”bar”,”baz”,”spam”]
  • 7. Types of Hive functions. UDTFs.
      Input table dim_one_row (id): 123.
      SELECT id, idx FROM dim_one_row LATERAL VIEW STACK(3, 1, 2, 3) tmp AS idx;
      Output: (123, 1), (123, 2), (123, 3)
  • 8. Types of Hive functions. UDAFs.
      Input table dim_three_rows (id): 123, 124, 125.
      SELECT COLLECT_SET(id) AS all_ids FROM dim_three_rows;
      Output: [123, 124, 125]
  • 9. How Hive UDFs work in Spark. Most Hive data types (Java types and derivatives of the ObjectInspector class) can be converted to Spark’s data types, and vice versa. Instances of Hive’s GenericUDF, SimpleGenericUDAF and GenericUDTF are called via wrapper classes extending Spark’s Expression, ImperativeAggregate and Generator classes respectively.
  • 10. How Hive UDFs work in Spark
  • 12. UDFs at Facebook. Hive was the primary query engine until we started to migrate jobs to Spark and Presto. Over the course of several years, over a thousand custom User Defined Functions were built. Hive queries that used UDFs accounted for over 70% of CPU time, so supporting Hive UDFs in Spark is important for the migration.
  • 13. Identifying Baseline. At the beginning of the Hive-to-Spark migration, the level of support for UDFs was unclear.
  • 14. UDFs testing framework. Most UDFs were already covered with basic tests from the Hive days, and we had a testing framework built for running those tests in Hive.
  • 15. UDFs testing framework. The framework was extended further to allow running queries against Spark. A temporary Scala file is created for each UDF class, containing code to run SQL queries using the DataFrame API. A spark-shell subprocess is spawned to run the Scala file: spark-shell --conf spark.ui.enabled=false … -i /tmp/spark-hive-udf-1139336654093084343.scala. The output is parsed and compared to the expected result.
  • 16. UDFs testing framework. With test coverage in place, baseline support of UDFs was identified: 58% by query count and CPU days. Failed tests helped to identify the common issues.
  • 18. Unsupported APIs.
      - getRequiredJars and getRequiredFiles: functions to automatically include additional resources required by the UDF.
      - initialize(StructObjectInspector) in GenericUDTF: Spark SQL uses only the deprecated interface initialize(ObjectInspector[]).
      - configure (GenericUDF, GenericUDTF, and GenericUDAFEvaluator): a function to initialize functions with MapredContext, which is inapplicable to Spark.
      - close (GenericUDF and GenericUDAFEvaluator): a function to release associated resources; Spark SQL does not call this function when tasks finish.
      - reset (GenericUDAFEvaluator): a function to re-initialize aggregation for reusing the same aggregation; Spark SQL currently does not support the reuse of aggregation.
      - getWindowingEvaluator (GenericUDAFEvaluator): a function to optimize aggregation by evaluating an aggregate over a fixed window.
  • 20. getRequiredFiles and getRequiredJars • functions to automatically include additional resources required by this UDF • UDF code can assume that file is present in the executor working directory
  • 21. Supporting required files/jars (SPARK-27543). On the Driver, during initialization, for each UDF: identify required files and jars, then register files for distribution via SparkContext.addFile(…) and SparkContext.addJar(…). The Executor fetches files added to the SparkContext from the Driver. On the Executor, for each UDF: if the required file is in the working directory, do nothing (it was distributed); if the file is missing, try to create a symlink to its absolute path.
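The executor-side fallback described above can be sketched as follows (a minimal Python analogue, not the actual Spark patch; the function name and the setup code are hypothetical):

```python
import os
import tempfile

def ensure_required_file(required_path, work_dir):
    # If the file was already distributed into the task's working directory
    # (via SparkContext.addFile), use it; otherwise fall back to a symlink
    # to the absolute path, so UDF code that expects a local file still works.
    name = os.path.basename(required_path)
    local = os.path.join(work_dir, name)
    if not os.path.exists(local):
        os.symlink(os.path.abspath(required_path), local)
    return local

# Usage sketch: a required file living outside the working directory.
src_dir, work_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
src = os.path.join(src_dir, "mapping.txt")
with open(src, "w") as f:
    f.write("a\t1\n")
local = ensure_required_file(src, work_dir)  # symlink created in work_dir
```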
  • 22. UDFs and Thread Safety. The majority of Hive UDFs are written without concurrency in mind. Hive runs each task in a separate JVM process, while Spark runs a separate JVM process per Executor, and an Executor can run multiple tasks concurrently (each task with its own UDF instance).
  • 23. Thread-unsafe UDF Example. Consider 2 tasks and hence 2 instances of a UDF: “instance 1” and “instance 2”. The evaluate method is called for each row, so both instances could pass the null check inside evaluate at the same time. Once “instance 1” finishes initialization first, it will call evaluate for the next row. If “instance 2” is still in the middle of initializing the mapping, it could overwrite the data that “instance 1” relied on, leading to data corruption or an exception.
  • 24. Approach 1: Introduce Synchronization • Introduce locking (synchronization) on the UDF class when initializing the mapping Cons: • Synchronization is computationally expensive • Requires manual and accurate refactoring of code, which does not scale for hundreds of UDFs
  • 25. Approach 2: Make Field Non-static • Turn static variable into an instance variable Cons: • Adds more pressure on memory (instances cannot share complex data) Pros: • Minimum changes in the code, which can also be codemoded for all other UDFs that use static fields of non-primitive types
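The two approaches can be contrasted with a small sketch (Python is used here as an illustrative analogue of the Java UDFs; the class and method names are hypothetical):

```python
class SharedMappingUDF:
    # Class-level ("static") cache, lazily built and shared by all UDF
    # instances in the process: two concurrent tasks can interleave inside
    # the initialization and corrupt the shared mapping.
    _mapping = None

    def evaluate(self, key):
        if SharedMappingUDF._mapping is None:  # both tasks may pass this check
            SharedMappingUDF._mapping = {"a": 1, "b": 2}
        return SharedMappingUDF._mapping.get(key)

class PerInstanceMappingUDF:
    # The codemod (Approach 2): make the field an instance variable, so each
    # task's UDF instance owns its own copy. No synchronization is needed,
    # at the cost of extra memory since instances cannot share the data.
    def __init__(self):
        self._mapping = None

    def evaluate(self, key):
        if self._mapping is None:
            self._mapping = {"a": 1, "b": 2}
        return self._mapping.get(key)
```

With the per-instance version, one task initializing its mapping can no longer clobber another task's state, which is why this refactoring could be applied mechanically across all UDFs with static non-primitive fields.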
  • 26. Kryo serialization/deserialization. In Spark, UDF objects are initialized on the Driver, serialized, and later deserialized on Executors. Some classes cannot be deserialized out of the box. Example: Guava’s ImmutableSet. Kryo can successfully serialize the objects on the driver, but fails to deserialize them on executors.
  • 27. • Catch serde issues by running Hive UDF tests in cluster mode • For commonly used classes, write custom or import existing serializers • Mark problematic instance variables as transient Solving Kryo serde problem
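The “mark the field transient and rebuild it lazily” fix can be illustrated with Python’s pickle standing in for Kryo (the class name is hypothetical; Java’s transient keyword plays the role of __getstate__ here):

```python
import pickle

class FrozenSetUDF:
    def __init__(self):
        self._lookup = None  # "transient": excluded from serialized state

    def __getstate__(self):
        # Skip the problematic member during serialization on the driver,
        # analogous to marking the field transient for Kryo.
        state = self.__dict__.copy()
        state["_lookup"] = None
        return state

    def _ensure(self):
        # Lazily rebuild the dropped member on the executor after
        # deserialization, on first use.
        if self._lookup is None:
            self._lookup = frozenset({"x", "y"})
        return self._lookup

    def evaluate(self, value):
        return value in self._ensure()
```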
  • 28. Hive UDFs performance. Hive UDFs don’t support data types from Spark out of the box; similarly, Spark cannot work with Hive’s object inspectors. For each UDF call, Spark’s data types are wrapped into Hive’s inspectors and Java types, and the results are converted back into Spark’s data types.
  • 29. Hive UDFs performance. This wrapping/unwrapping overhead can lead to up to 2x the CPU time spent in a UDF compared to a Spark-native implementation. UDFs that work with complex types suffer the most.
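A toy sketch (not Spark code; all names are hypothetical) of where the per-row overhead comes from: every call pays a conversion on the way in and on the way out, while a native implementation operates on the engine’s values directly.

```python
def hive_style_call(udf, rows):
    out = []
    for row in rows:
        wrapped = list(row)        # wrap: engine row -> the UDF's expected input types
        result = udf(wrapped)      # run the UDF logic
        out.append(tuple(result))  # unwrap: UDF output -> engine's native row
    return out

def native_style_call(udf, rows):
    # A Spark-native UDF skips both conversions and works on rows directly.
    return [udf(row) for row in rows]

reverse_udf = lambda row: tuple(row[::-1])  # sample UDF: reverse the row
```

Both paths compute the same result; the difference is the per-row conversion work, which grows with the complexity of the types being wrapped.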
  • 30. Hive UDFs performance. UDFs account for 15% of CPU time spent on Spark queries. The most computationally expensive UDFs can be converted to Spark-native UDFs.
  • 32. Aggregation. SELECT id, max(value) FROM table GROUP BY id
  • 33. Aggregation. Mapper 1 input (id, value): (1, 100), (1, 200), (2, 400), (3, 100); Mapper 2 input: (1, 300), (2, 200), (2, 300). After the shuffle, Reducer 1 receives all rows for ids 1 and 3, and Reducer 2 all rows for id 2. Reducer 1 output: (1, 300), (3, 100); Reducer 2 output: (2, 400).
  • 34. What’s the problem? 1. Every row needs to be shuffled over the network, which is a heavy operation. 2. Data skew: one reducer needs to process more data than others if one key has more rows. For example: key1 has 1 million rows, while other keys each have 10 rows on average.
  • 35. Partial Aggregation. Partial aggregation is a technique in which the system partially aggregates data on the mapper side, before the shuffle, in order to reduce the shuffle size.
  • 36. Partial Aggregation. SELECT id, max(value) FROM table GROUP BY id
  • 37. Aggregation (for comparison). Mapper 1 input (id, value): (1, 100), (1, 200), (2, 400), (3, 100); Mapper 2 input: (1, 300), (2, 200), (2, 300). All seven rows are shuffled. Reducer 1 output: (1, 300), (3, 100); Reducer 2 output: (2, 400).
  • 38. Partial Aggregation. Mapper 1 input (id, value): (1, 100), (1, 200), (2, 400), (3, 100) is partially aggregated to (id, partial_max): (1, 200), (2, 400), (3, 100); Mapper 2 input: (1, 300), (2, 200), (2, 300) becomes (1, 300), (2, 300). After the shuffle, final aggregation on Reducer 1 yields (1, 300), (3, 100), and on Reducer 2 yields (2, 400).
  • 39. Aggregation vs Partial Aggregation.
      Shuffle data (# rows): all rows vs a reduced number of rows.
      Computation: aggregation happens entirely on the reducer side vs extra CPU for partial aggregation, distributed across mappers and reducers.
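The max example from the slides can be sketched end to end in plain Python (no Spark involved), using the exact rows from the diagram:

```python
def partial_max(rows):
    # Mapper-side partial aggregation: at most one row per key leaves the mapper.
    acc = {}
    for key, value in rows:
        acc[key] = max(acc.get(key, value), value)
    return list(acc.items())

def final_max(partials):
    # Reducer-side final aggregation over the partially aggregated rows.
    acc = {}
    for key, value in partials:
        acc[key] = max(acc.get(key, value), value)
    return acc

mapper1 = [(1, 100), (1, 200), (2, 400), (3, 100)]
mapper2 = [(1, 300), (2, 200), (2, 300)]

shuffled = partial_max(mapper1) + partial_max(mapper2)  # 5 rows instead of 7
result = final_max(shuffled)
```

Here 7 input rows shrink to 5 shuffled rows; with more duplicate keys per mapper, the reduction (and the skew relief) grows accordingly.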
  • 40. Partial Aggregation is important: it impacts CPU and shuffle size, and it can help with data skew.
  • 41. What we did. 1. Partial aggregation support is already in Spark. 2. Fixed some issues to make it work with FB UDAFs.
  • 43. FB Production Result. 1. Partial aggregation improved CPU by 20% and shuffle data size by 17%. 2. However, we also observed some heavy pipelines regress by as much as 300%.
  • 44. What could go wrong? 1. Query shape. 2. Data distribution.
  • 45. When partial aggregation doesn’t work: Column Expansion. Partial aggregation expands the number of columns on the mapper side, resulting in a larger shuffle data size: SELECT key, max(value), min(value), count(value), avg(value) FROM table GROUP BY key
  • 46. Column Expansion. Input (id, value): (1, 100), (1, 200), (2, 400), (3, 100). Plain aggregation shuffles 2 columns per row; partial aggregation shuffles 5 columns per row: (id, p_max, p_min, p_count, p_avg) = (1, 200, 100, 2, (300, 2)), (2, 400, 400, 1, (400, 1)), (3, 100, 100, 1, (100, 1)).
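Using the rows from this slide, a quick back-of-the-envelope count shows the expansion (plain Python; "cells" is used here as a rough proxy for shuffle size):

```python
def shuffle_cells(n_rows, n_cols):
    # Rough proxy for shuffle volume: rows shuffled times columns per row.
    return n_rows * n_cols

# Without partial aggregation: all 4 input rows shuffle 2 columns (id, value).
plain = shuffle_cells(4, 2)

# With partial aggregation for max/min/count/avg: only 3 rows (one per key)
# survive the mapper, but each carries 5 columns (id + 4 partial states).
partial = shuffle_cells(3, 5)
```

For this query shape the row reduction (4 to 3) does not compensate for the column expansion (2 to 5), so partial aggregation shuffles more data, not less.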
  • 47. When partial aggregation doesn’t work. Query shape: column expansion. Data distribution: no rows to aggregate on the mapper side, even for a simple query: SELECT key, max(value) FROM table GROUP BY key
  • 48. Data Distribution. Input (id, value): (1, 100), (2, 200), (3, 400), (4, 100). Every key is unique, so partial aggregation still shuffles 4 rows: extra CPU with no row reduction.
  • 50. Partial Aggregation Computation Cost Factors. 1. Each UDAF function’s partial aggregation performance. 2. Column expansion. 3. Row reduction.
  • 51. How we solved the problem. A computation cost-based optimizer for partial aggregation: 1. Use multiple features to calculate the computation cost of partial aggregation: input column number, output column number, computation cost of the UDAF partial aggregation function, etc. 2. Use the calculated computation cost to decide the configuration for partial aggregation.
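As a sketch only (the actual Facebook cost model is not described in the deck; this scoring function and its weights are entirely hypothetical), a cost-based gate might combine the listed features like this:

```python
def partial_agg_benefit(n_input_cols, n_output_cols, udaf_cost, row_reduction):
    # Hypothetical score: weigh column expansion and per-row UDAF cost
    # against the expected row reduction from mapper-side aggregation.
    expansion = n_output_cols / n_input_cols        # column expansion factor
    shuffle_saving = row_reduction - expansion      # fewer rows vs wider rows
    # Extra mapper CPU matters most when few rows are actually merged.
    cpu_penalty = udaf_cost * (1.0 - 1.0 / max(row_reduction, 1e-9))
    return shuffle_saving - cpu_penalty

def should_enable_partial_agg(n_input_cols, n_output_cols, udaf_cost, row_reduction):
    # Enable partial aggregation only when the estimated benefit is positive.
    return partial_agg_benefit(n_input_cols, n_output_cols, udaf_cost, row_reduction) > 0
```

The point is the shape of the decision, not the formula: a query with strong row reduction and no column expansion should pass the gate, while the expanded five-column query over unique keys should not.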
  • 52. Result. 1. It improves efficiency across the board. 2. However, there are still queries that don’t have the most optimized partial aggregation configuration.
  • 53. Partial Aggregation Computation Cost Factors. 1. Each UDAF function’s partial aggregation performance. 2. Column expansion. 3. Row reduction.
  • 54. Row Reduction. It’s hard to know the row reduction: it depends on the data distribution, which might differ from day to day, and for different group-by keys the row reduction is different.
  • 55. Future work. History-based tuning: use a query’s historical data to predict the best configuration for future runs. This is a good fit for partial aggregation because it operates at the query level; the system can try different configurations and use the results to direct the configuration of future runs.
  • 56. Recap