Scaling ML Feature Engineering with
Apache Spark at Facebook
Cheng Su & Sameer Agarwal
Facebook Inc.
About Us
▪ Sameer Agarwal
▪ Software Engineer at Facebook (Data Platform Team)
▪ Apache Spark Committer (Spark Core/SQL)
▪ Previously at Databricks and UC Berkeley
▪ Cheng Su
▪ Software Engineer at Facebook (Data Platform Team)
▪ Apache Spark Contributor (Spark SQL)
▪ Previously worked on Hive & Hadoop at Facebook
Agenda
▪ Machine Learning at Facebook
▪ Data Layouts (Tables and Physical Encodings)
▪ Feature Reaping
▪ Feature Injection
▪ Future Work
Machine Learning at Facebook¹
Data → Features → Training → Model → Inference → Predictions
¹ Hazelwood et al., Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In HPCA 2018
This Talk
1. Data Layouts (Tables and Physical Encodings)
2. Feature Reaping
3. Feature Injection
Data Layouts (Tables and Physical Encodings)
Training Data Table
- Table to store data for ML training
- Huge volume (multiple PBs/day)
userId: BIGINT
adId: BIGINT
features: MAP<INT, DOUBLE>
…
Feature Tables
- Tables to store all possible features (many of them aren’t promoted to the training data table)
- Smaller volume (low 100s of TBs/day)
userId: BIGINT
features: MAP<INT, DOUBLE>
…
[Figure: example features: gender, likes, age, state, country, …]
Data Layouts (Tables and Physical Encodings)
1. Feature Injection: Extending base features with new/experimental features to
improve model performance. Think “adding new keys to a map”
2. Feature Reaping: Removing unnecessary features (id and value) from training
data. Think “deleting existing keys from a map”
[Figure: Training Data Table and Feature Tables (gender, likes, age, state, country, …), with arrows labeled Feature Injection and Feature Reaping]
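The two "think of it as a map operation" analogies above can be sketched in a few lines of Python. This is only an illustration of the semantics, not the pipeline itself; the function names `inject` and `reap` and the feature ids are hypothetical.

```python
from typing import Dict, Iterable

# Mirrors the slide's column type: features: MAP<INT, DOUBLE>
FeatureMap = Dict[int, float]


def inject(base: FeatureMap, new_features: FeatureMap) -> FeatureMap:
    """Feature injection: return a copy of `base` with new keys added."""
    merged = dict(base)
    merged.update(new_features)
    return merged


def reap(base: FeatureMap, reaped_ids: Iterable[int]) -> FeatureMap:
    """Feature reaping: return a copy of `base` without the reaped keys."""
    drop = set(reaped_ids)
    return {k: v for k, v in base.items() if k not in drop}


# Hypothetical feature ids, chosen only for illustration.
row = {101: 1.0, 102: 0.5}
injected = inject(row, {203: 7.0})
reaped = reap(injected, [102])
```

Both helpers return new maps and leave the input untouched, which matches the idea that the training data is rewritten rather than mutated in place.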
Background: Apache ORC
▪ Stripe (Row Group)
▪ Rows are divided into multiple groups
▪ Stream
▪ Columns are stored separately
▪ PRESENT, DATA, LENGTH stream for each column
▪ Different encoding and compression strategy for each column
How is a Feature Map Stored in ORC?
▪ Key and value are stored as separate streams/columns
- Raw Data
- Row 1: (k1, v1)
- Row 2: (k1, v2), (k2, v3)
- Row 3: (k1, v5), (k2, v4)
- Streams
- Key stream: k1, k1, k2, k1, k2
- Value stream: v1, v2, v3, v5, v4
▪ Each stream is individually encoded and compressed
▪ Reading or deleting specific keys (i.e., feature reaping) becomes a problem
- Need to read (decompress and decode) and re-write ALL keys and values
features: MAP<INT, DOUBLE>
[Figure: ORC type tree]
- STRUCT (col -1, node: 0)
- MAP (col 0, node: 1)
- INT (col 0, node: 2), key stream: k1, k1, k2, k1, k2
- DOUBLE (col 0, node: 3), value stream: v1, v2, v3, v5, v4
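To make the cost of reaping in the standard layout concrete, here is a minimal Python sketch of the key/value stream encoding from the slide's example. It models streams as plain lists and ignores ORC's actual encodings and compression; `encode_map_column` and `reap_key` are hypothetical helper names, and the per-row entry count stands in for the LENGTH stream.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]  # one feature map per row, e.g. {"k1": "v1"}


def encode_map_column(rows: List[Row]) -> Tuple[List[int], List[str], List[str]]:
    """Flatten all rows into a length stream, a key stream and a value stream."""
    lengths: List[int] = []
    keys: List[str] = []
    values: List[str] = []
    for row in rows:
        lengths.append(len(row))  # number of map entries in this row
        for k, v in row.items():
            keys.append(k)
            values.append(v)
    return lengths, keys, values


def reap_key(lengths: List[int], keys: List[str], values: List[str], reaped: str):
    """Deleting one key walks every entry and rebuilds all three streams."""
    new_lengths: List[int] = []
    new_keys: List[str] = []
    new_values: List[str] = []
    pos = 0
    for n in lengths:
        kept = 0
        for k, v in zip(keys[pos:pos + n], values[pos:pos + n]):
            if k != reaped:
                new_keys.append(k)
                new_values.append(v)
                kept += 1
        new_lengths.append(kept)
        pos += n
    return new_lengths, new_keys, new_values


rows = [{"k1": "v1"}, {"k1": "v2", "k2": "v3"}, {"k1": "v5", "k2": "v4"}]
lengths, keys, values = encode_map_column(rows)
reaped_streams = reap_key(lengths, keys, values, "k2")
```

Because entries for different keys are interleaved in the same streams, `reap_key` has to visit every entry even though only `k2` is removed, which is the problem the slide describes.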
Introducing: ORC Flattened Map
▪ Values that correspond to each key are stored as separate streams
- Raw Data
- Row 1: (k1, v1)
- Row 2: (k1, v2), (k2, v3)
- Row 3: (k1, v5), (k2, v4)
- Streams
- k1 stream: v1, v2, v5
- k2 stream: NULL, v3, v4
- Stores map like a struct
▪ Each key’s value stream is individually encoded and compressed
▪ Reading or deleting specific keys becomes very efficient!
features: MAP<INT, DOUBLE>
[Figure: ORC flattened map type tree]
- STRUCT (col -1, node: 0)
- MAP (col 0, node: 1)
- Value (k1) (col 0, node: 3, seq: 1), stream: v1, v2, v5
- Value (k2) (col 0, node: 3, seq: 2), stream: NULL, v3, v4
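The flattened layout can be sketched the same way. The following Python is a simplified model under the assumption that each key gets its own list of values with `None` standing in for NULL; it does not reflect the real file format, and `encode_flattened` and `reap_key` are hypothetical names.

```python
from typing import Dict, List, Optional

Row = Dict[str, str]  # one feature map per row, e.g. {"k1": "v1"}


def encode_flattened(rows: List[Row]) -> Dict[str, List[Optional[str]]]:
    """Build one value stream per key; rows lacking a key contribute None."""
    all_keys: List[str] = []
    for row in rows:
        for k in row:
            if k not in all_keys:
                all_keys.append(k)
    return {k: [row.get(k) for row in rows] for k in all_keys}


def reap_key(streams: Dict[str, List[Optional[str]]], reaped: str):
    """Reaping drops a whole stream; the remaining streams are reused as-is."""
    return {k: s for k, s in streams.items() if k != reaped}


rows = [{"k1": "v1"}, {"k1": "v2", "k2": "v3"}, {"k1": "v5", "k2": "v4"}]
streams = encode_flattened(rows)
after_reaping = reap_key(streams, "k2")
```

Here reaping `k2` never looks inside the `k1` stream, which is the intuition behind "stores map like a struct": a key behaves like a column that can be skipped or dropped independently.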
Feature Reaping
▪ Feature Reaping frameworks generate Spark SQL queries based on table name, partitions, and reaped feature ids
▪ For each reaping SQL query, Spark applies special customizations in the query planner, execution engine, and commit protocol
▪ Each Spark task launches a SQL transform process and uses a native C++ binary to do efficient flat map operations
[Figure: two SparkJavaExecutor boxes, each running a C++ reaper transform, between files training_data_v1_1.orc, training_data_v1_2.orc and training_data_v2_1.orc, training_data_v2_2.orc]
Performance
CPU cost for flat map vs naïve solution*, in CPU (days)
▪ 14x better on 20PB data
▪ 89x better on 300PB data
[Bar charts: Naïve vs Flat Map, one chart for 20PB and one for 300PB; y-axis: CPU (days)]
▪ Case 1
▪ Input data size: 20PB
▪ # of reaped features: 200
▪ # total features: ~1k
▪ Case 2
▪ Input data size: 300PB
▪ # of reaped features: 200
▪ # total features: ~10k
*Naïve solution: a Spark SQL query that re-writes all data, removing the reaped features from the map column with a UDF/lambda.
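As a rough illustration of what such a naïve query could look like, the Python sketch below builds a Spark SQL string from a table name, a partition, and a list of reaped feature ids. The talk does not show the actual generated queries, so everything here is an assumption: the helper name `naive_reaping_query`, the table names, the column names `userId` and `adId`, and the choice of Spark's built-in `map_filter` with a lambda as the filtering mechanism.

```python
from typing import Iterable, Sequence


def naive_reaping_query(
    source_table: str,
    target_table: str,
    partition_col: str,
    partition_value: str,
    other_columns: Sequence[str],
    reaped_ids: Iterable[int],
) -> str:
    """Build a query that rewrites a whole partition without the reaped keys.

    Assumes the map column is called `features` and is the last
    non-partition column of the target table.
    """
    ids = ", ".join(str(i) for i in sorted(set(reaped_ids)))
    cols = ", ".join(other_columns)
    return (
        f"INSERT OVERWRITE TABLE {target_table} "
        f"PARTITION ({partition_col} = '{partition_value}')\n"
        f"SELECT {cols}, "
        f"map_filter(features, (k, v) -> k NOT IN ({ids})) AS features\n"
        f"FROM {source_table}\n"
        f"WHERE {partition_col} = '{partition_value}'"
    )


# Hypothetical table names and feature ids.
query = naive_reaping_query(
    "training_data_v1", "training_data_v2", "ds", "2020-10-28",
    ["userId", "adId"], [7, 3, 7],
)
```

A query of this shape reads and rewrites every column of every row, which is why its cost grows with the full data size rather than with the number of reaped features.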
Data Layouts (Tables and Physical Encodings)
Feature Injection: Extending base features with new/experimental features to
improve model performance. Think “adding new keys to a map”
Requirements:
1. Allow fast ML training experimentation
2. Save storage space
[Figure: Training Data Table and Feature Tables (gender, likes, age, state, country, …), connected by an arrow labeled Feature Injection]
Introducing: Aligned Tables!
Introducing: Aligned Table
▪ Intuition: Store the output of the join between the training table and the feature table in 2 separate row-by-row aligned tables
▪ An aligned table is a table that has the same layout as the original table
- Same number of files
- Same file names
- Same number of rows (and their order) in each file
training table
  file_1.orc          file_2.orc
  id  features        id  features
  1   ...             3   ...
  2   ...             4   ...
  5   ...             6   ...

feature table
  file_1.orc
  id  feature
  1   f1
  2   f2
  4   f4
  6   f6

aligned table
  file_1.orc          file_2.orc
  id  feature         id  feature
  1   f1              3   NULL
  2   f2              4   f4
  5   NULL            6   f6
Query Plan for Aligned Table
[Figure: query plan, from inputs to output]
- Scan (training table), then Project (…, file_name, row_order)
- Scan (feature table)
- Join (LEFT OUTER)
- Shuffle (file_name)
- Sort (file_name, row_order)
- InsertIntoHadoopFsRelationCommand (Aligned Table)
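The plan steps can be mimicked on the slide's toy data with plain Python. This is a sketch of the idea only, not Spark code: files are modeled as lists of ids in row order, the feature table as a dict with at most one feature per id (an assumption), and `build_aligned_table` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

TrainingFiles = Dict[str, List[int]]  # file name -> ids in row order
AlignedFiles = Dict[str, List[Tuple[int, Optional[str]]]]


def build_aligned_table(training: TrainingFiles, features: Dict[int, str]) -> AlignedFiles:
    # Project: tag every training row with its file name and row order.
    projected = [
        (row_id, file_name, row_order)
        for file_name, ids in training.items()
        for row_order, row_id in enumerate(ids)
    ]
    # Left outer join on id: rows without a feature get None (NULL).
    joined = [
        (file_name, row_order, row_id, features.get(row_id))
        for row_id, file_name, row_order in projected
    ]
    # Shuffle: group rows by the file they came from.
    by_file = defaultdict(list)
    for file_name, row_order, row_id, feature in joined:
        by_file[file_name].append((row_order, row_id, feature))
    # Sort by row order within each file, then "write" one output per file.
    return {
        file_name: [(rid, feat) for _, rid, feat in sorted(rows, key=lambda r: r[0])]
        for file_name, rows in by_file.items()
    }


training = {"file_1.orc": [1, 2, 5], "file_2.orc": [3, 4, 6]}
features = {1: "f1", 2: "f2", 4: "f4", 6: "f6"}
aligned = build_aligned_table(training, features)
```

Carrying `file_name` and `row_order` through the join is what lets the output keep the same files, file names, and row order as the training table.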
Reading Aligned Tables
▪ FB-ORC aligned table row-by-row merge reader
▪ Read each aligned table file with the corresponding original table file in one task
▪ Read row-by-row according to row order
▪ Merge aligned table columns per row with corresponding original table columns per row
[Figure: reader task 1 reads file_1.orc of the training table together with file_1.orc of the aligned table; reader task 2 does the same for file_2.orc]
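A row-by-row merge read can be sketched as a simple zip over two files with identical row order. The Python below is an illustration under assumptions: rows are modeled as dicts of columns, `read_merged` is a hypothetical name rather than the FB-ORC reader's API, and the id check is an added sanity check that the talk does not mention.

```python
from typing import Any, Dict, Iterator, List

RowDict = Dict[str, Any]  # column name -> value


def read_merged(training_file: List[RowDict], aligned_file: List[RowDict]) -> Iterator[RowDict]:
    """Merge one training file with its aligned file, row by row."""
    if len(training_file) != len(aligned_file):
        raise ValueError("aligned file must have the same number of rows")
    for base_row, extra_row in zip(training_file, aligned_file):
        if base_row["id"] != extra_row["id"]:
            raise ValueError("rows are not aligned")
        merged = dict(base_row)    # columns of the original table
        merged.update(extra_row)   # plus columns of the aligned table
        yield merged


# One reader task handles one pair of files with the same name.
training_file_1 = [
    {"id": 1, "features": {10: 0.1}},
    {"id": 2, "features": {10: 0.2}},
    {"id": 5, "features": {10: 0.5}},
]
aligned_file_1 = [
    {"id": 1, "feature": "f1"},
    {"id": 2, "feature": "f2"},
    {"id": 5, "feature": None},
]
merged_rows = list(read_merged(training_file_1, aligned_file_1))
```

No join, shuffle, or lookup is needed at read time because the alignment was established once when the aligned table was written.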
End to End Performance
1. Baseline 1: Left Outer Join
▪ LEFT OUTER join that materializes new columns/sub-fields into training table
▪ Cons: Reads and overwrites ALL columns of training table every time
Aligned Tables vs Left Outer Join
Compute Savings: 15x
Storage Savings: 30x
2. Baseline 2: Lookup Hash Join
▪ Load feature table(s) into a distributed hash table (Laser¹)
▪ Lookup hash join while reading training table
▪ Cons:
  - Adds an external dependency on a distributed hash table; impacts latency, reliability & efficiency
  - Needs a lookup hash join each time the training table is read
¹ Laser: a distributed hash table service built on top of RocksDB, see https://research.fb.com/wp-content/uploads/2016/11/realtime_data_processing_at_facebook.pdf for details
Aligned Tables vs Lookup Hash Join
Compute Savings: 1.5x
Storage Savings: 2.1x
Future Work
▪ Better Spark SQL interface for ML primitives (e.g., UPSERTs)
▪ Onboarding more ML use cases to Spark
▪ Batch Inference
▪ Training
MERGE training_table
PARTITION(ds='2020-10-28', pipeline='...', ts)
USING (SELECT ...) AS f
ON features[0][0] = f.key
WHEN MATCHED THEN UPDATE
  SET float_features = MAP_CONCAT(float_features, f.densefeatures)
Thank you!
Your feedback is important to us.
Don’t forget to rate
and review the sessions.

Weitere ähnliche Inhalte

Was ist angesagt?

Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Flash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with CoscoFlash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with CoscoDatabricks
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Chris Fregly
 
Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...
Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...
Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...inwin stack
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Dynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationDynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationOri Reshef
 
Couchbase presentation
Couchbase presentationCouchbase presentation
Couchbase presentationsharonyb
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013larsgeorge
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkKazuaki Ishizaki
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDatabricks
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
Tuning PostgreSQL for High Write Throughput
Tuning PostgreSQL for High Write Throughput Tuning PostgreSQL for High Write Throughput
Tuning PostgreSQL for High Write Throughput Grant McAlister
 
Under the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database ArchitectureUnder the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database ArchitectureScyllaDB
 

Was ist angesagt? (20)

Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Flash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with CoscoFlash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with Cosco
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
 
Druid
DruidDruid
Druid
 
Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...
Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...
Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Dynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationDynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisation
 
Couchbase presentation
Couchbase presentationCouchbase presentation
Couchbase presentation
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Tuning PostgreSQL for High Write Throughput
Tuning PostgreSQL for High Write Throughput Tuning PostgreSQL for High Write Throughput
Tuning PostgreSQL for High Write Throughput
 
Druid deep dive
Druid deep diveDruid deep dive
Druid deep dive
 
Under the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database ArchitectureUnder the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database Architecture
 

Ähnlich wie Scaling Machine Learning Feature Engineering in Apache Spark at Facebook

Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Databricks
 
Spark SQL Beyond Official Documentation
Spark SQL Beyond Official DocumentationSpark SQL Beyond Official Documentation
Spark SQL Beyond Official DocumentationDatabricks
 
Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1Databricks
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizationsGal Marder
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkDatabricks
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkDatabricks
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...Databricks
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Holden Karau
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSyed Hadoop
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit
 
10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQLSatoshi Nagayasu
 

Ähnlich wie Scaling Machine Learning Feature Engineering in Apache Spark at Facebook (20)

Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
 
Spark SQL Beyond Official Documentation
Spark SQL Beyond Official DocumentationSpark SQL Beyond Official Documentation
Spark SQL Beyond Official Documentation
 
Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
 
Meetup talk
Meetup talkMeetup talk
Meetup talk
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with Spark
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.com
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 
10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL
 

Mehr von Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 


Scaling Machine Learning Feature Engineering in Apache Spark at Facebook

  • 1. Scaling ML Feature Engineering with Apache Spark at Facebook Cheng Su & Sameer Agarwal Facebook Inc.
  • 2. About Us ▪ Sameer Agarwal ▪ Software Engineer at Facebook (Data Platform Team) ▪ Apache Spark Committer (Spark Core/SQL) ▪ Previously at Databricks and UC Berkeley ▪ Cheng Su ▪ Software Engineer at Facebook (Data Platform Team) ▪ Apache Spark Contributor (Spark SQL) ▪ Previously worked on Hive & Hadoop at Facebook
  • 3. Agenda ▪ Machine Learning at Facebook ▪ Data Layouts (Tables and Physical Encodings) ▪ Feature Reaping ▪ Feature Injection ▪ Future Work
  • 4. Machine Learning at Facebook¹
      Pipeline diagram: Data, Features, Training, Inference (with Model and Predictions as diagram labels)
      ¹ Hazelwood et al., Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In HPCA 2018
  • 5. Machine Learning at Facebook¹
      Pipeline diagram: Data, Features, Training, Inference (with Model and Predictions as diagram labels)
      This Talk
      1. Data Layouts (Tables and Physical Encodings)
      2. Feature Reaping
      3. Feature Injection
      ¹ Hazelwood et al., Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In HPCA 2018
  • 6. Agenda ▪ Machine Learning at Facebook ▪ Data Layouts (Tables and Physical Encodings) ▪ Feature Reaping ▪ Feature Injection ▪ Future Work
  • 7. Data Layouts (Tables and Physical Encodings)
      Training Data Table
      - Table to store data for ML training
      - Huge volume (multiple PBs/day)
      - Schema: userId: BIGINT, adId: BIGINT, features: MAP<INT, DOUBLE>, …
      Feature Tables
      - Tables to store all possible features (many of them are not promoted to the training data table)
      - Smaller volume (low 100s of TBs/day)
      - Schema: userId: BIGINT, features: MAP<INT, DOUBLE>, …
      - Example features shown in the diagram: gender, likes, age, state, country, …
  • 8. Data Layouts (Tables and Physical Encodings)
      1. Feature Injection: Extending base features with new/experimental features to improve model performance. Think “adding new keys to a map”
      Diagram: Feature Injection moves features (gender, likes, age, state, country, …) from the Feature Tables into the Training Data Table
  • 9. Data Layouts (Tables and Physical Encodings)
      1. Feature Injection: Extending base features with new/experimental features to improve model performance. Think “adding new keys to a map”
      2. Feature Reaping: Removing unnecessary features (id and value) from training data. Think “deleting existing keys from a map”
      Diagram: Feature Injection moves features (gender, likes, age, state, country, …) from the Feature Tables into the Training Data Table; Feature Reaping removes features from the Training Data Table
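
The "adding keys" and "deleting keys" framing can be sketched on a single row's feature map. This is a minimal illustration only, assuming a plain Python dict stands in for the `features: MAP<INT, DOUBLE>` column; the function names `inject` and `reap`, the sample values, and the rule that an injected value overrides an existing key are assumptions, not part of the talk.

```python
from typing import Dict, Iterable

# Stand-in for one row of features: MAP<INT, DOUBLE>
FeatureMap = Dict[int, float]


def inject(base: FeatureMap, new_features: FeatureMap) -> FeatureMap:
    """Feature injection: add new keys to the map.

    Assumption: on a key conflict the injected value wins.
    """
    merged = dict(base)
    merged.update(new_features)
    return merged


def reap(base: FeatureMap, reaped_ids: Iterable[int]) -> FeatureMap:
    """Feature reaping: delete existing keys (id and value) from the map."""
    drop = set(reaped_ids)
    return {k: v for k, v in base.items() if k not in drop}


row = {1: 0.5, 2: 3.0}              # hypothetical feature ids and values
injected = inject(row, {7: 1.25})   # experimental feature 7 added
reaped = reap(injected, [2])        # feature 2 removed
```

Both helpers return a new map and leave the input untouched, which mirrors the fact that the tables are rewritten rather than edited in place.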
  • 10. Background: Apache ORC
      ▪ Stripe (Row Group)
        ▪ Rows are divided into multiple groups
      ▪ Stream
        ▪ Columns are stored separately
        ▪ PRESENT, DATA, LENGTH streams for each column
        ▪ Different encoding and compression strategy for each column
  • 11. How is a Feature Map Stored in ORC?
      ▪ Key and value are stored as separate streams/columns
        - Raw Data
          - Row 1: (k1, v1)
          - Row 2: (k1, v2), (k2, v3)
          - Row 3: (k1, v5), (k2, v4)
        - Streams
          - Key stream: k1, k1, k2, k1, k2
          - Value stream: v1, v2, v3, v5, v4
      ▪ Each stream is individually encoded and compressed
      ▪ Reading or deleting specific keys (i.e., feature reaping) becomes a problem
        - Need to read (decompress and decode) and re-write ALL keys and values
      Column tree (diagram) for features: MAP<INT, DOUBLE>
        - STRUCT (col -1, node: 0)
        - MAP (col 0, node: 1)
        - INT keys (col 0, node: 2): k1, k1, k2, k1, k2
        - DOUBLE values (col 0, node: 3): v1, v2, v3, v5, v4
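
To see why reaping is expensive in this layout, the slide's three rows can be encoded into streams by hand. The sketch below is a simplified model, not the ORC writer: it keeps one key stream, one value stream, and a per-row length list (standing in for ORC's LENGTH stream), and it ignores encoding and compression. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

Streams = Tuple[List[int], List[str], List[str]]  # (lengths, keys, values)


def encode_map_column(rows: List[Dict[str, str]]) -> Streams:
    """Store all keys in one stream and all values in another."""
    lengths, keys, values = [], [], []
    for row in rows:
        lengths.append(len(row))  # number of map entries in this row
        for k, v in row.items():
            keys.append(k)
            values.append(v)
    return lengths, keys, values


def reap_key(lengths: List[int], keys: List[str], values: List[str],
             reaped: str) -> Streams:
    """Deleting one key forces a pass over every entry of every stream."""
    new_lengths, new_keys, new_values = [], [], []
    pos = 0
    for n in lengths:
        kept = 0
        for k, v in zip(keys[pos:pos + n], values[pos:pos + n]):
            if k != reaped:
                new_keys.append(k)
                new_values.append(v)
                kept += 1
        new_lengths.append(kept)
        pos += n
    return new_lengths, new_keys, new_values


rows = [{"k1": "v1"}, {"k1": "v2", "k2": "v3"}, {"k1": "v5", "k2": "v4"}]
lengths, keys, values = encode_map_column(rows)
reaped_streams = reap_key(lengths, keys, values, "k2")
```

Note that `reap_key` rebuilds all three streams even though only the `k2` entries change, which is the cost the slide describes.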
  • 12. Introducing: ORC Flattened Map
      ▪ Values that correspond to each key are stored as separate streams
        - Raw Data
          - Row 1: (k1, v1)
          - Row 2: (k1, v2), (k2, v3)
          - Row 3: (k1, v5), (k2, v4)
        - Streams
          - k1 stream: v1, v2, v5
          - k2 stream: NULL, v3, v4
        - Stores map like a struct
      ▪ Each key’s value stream is individually encoded and compressed
      ▪ Reading or deleting specific keys becomes very efficient!
      Column tree (diagram) for features: MAP<INT, DOUBLE>
        - STRUCT (col -1, node: 0)
        - MAP (col 0, node: 1)
        - Value (k1) (col 0, node: 3, seq: 1): v1, v2, v5
        - Value (k2) (col 0, node: 3, seq: 2): NULL, v3, v4
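
The flattened layout can be modelled the same way, with one value stream per key. This is again a simplified sketch with hypothetical function names; `None` plays the role of NULL and no encoding or compression is modelled.

```python
from typing import Dict, List, Optional

FlatStreams = Dict[str, List[Optional[str]]]


def flatten_map_column(rows: List[Dict[str, str]]) -> FlatStreams:
    """One value stream per key; rows that lack the key store None (NULL)."""
    all_keys = sorted({k for row in rows for k in row})
    return {k: [row.get(k) for row in rows] for k in all_keys}


def reap_key_flat(streams: FlatStreams, reaped: str) -> FlatStreams:
    """Reaping drops one stream; the other streams are reused untouched."""
    return {k: s for k, s in streams.items() if k != reaped}


rows = [{"k1": "v1"}, {"k1": "v2", "k2": "v3"}, {"k1": "v5", "k2": "v4"}]
streams = flatten_map_column(rows)
after_reap = reap_key_flat(streams, "k2")
```

Here the surviving `k1` stream is the same object before and after reaping, the in-memory analogue of not having to decode and re-encode data for features that are kept.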
  • 13. Agenda ▪ Machine Learning at Facebook ▪ Data Layouts (Tables and Physical Encodings) ▪ Feature Reaping ▪ Feature Injection ▪ Future Work
  • 14. Feature Reaping
      ▪ Feature reaping frameworks generate Spark SQL queries based on table name, partitions, and reaped feature ids
      ▪ For each reaping SQL query, Spark has special customization in the query planner, execution engine, and commit protocol
      ▪ Each Spark task launches a SQL transform process and uses a native C++ binary to do efficient flat map operations
      Diagram: each Spark Java executor runs a C++ reaper transform; files shown: training_data_v1_1.orc, training_data_v1_2.orc, training_data_v2_1.orc, training_data_v2_2.orc
  • 15. Performance
      Charts: CPU cost (CPU days) for flat map vs naïve solution*
      ▪ Case 1 (14x better on 20PB data)
        ▪ Input data size: 20PB
        ▪ # of reaped features: 200
        ▪ # total features: ~1k
      ▪ Case 2 (89x better on 300PB data)
        ▪ Input data size: 300PB
        ▪ # of reaped features: 200
        ▪ # total features: ~10k
      *Naïve solution: a Spark SQL query that re-writes all data, removing the reaped features from the map column with a UDF/lambda.
  • 16. Agenda ▪ Machine Learning at Facebook ▪ Data Layouts (Tables and Physical Encodings) ▪ Feature Reaping ▪ Feature Injection ▪ Future Work
  • 17. Data Layouts (Tables and Physical Encodings)
      Feature Injection: Extending base features with new/experimental features to improve model performance. Think “adding new keys to a map”
      Requirements:
      1. Allow fast ML training experimentation
      2. Save storage space
      Diagram: Feature Injection moves features (gender, likes, age, state, country, …) from the Feature Tables into the Training Data Table
  • 18. Data Layouts (Tables and Physical Encodings)
      Feature Injection: Extending base features with new/experimental features to improve model performance. Think “adding new keys to a map”
      Requirements:
      1. Allow fast ML training experimentation
      2. Save storage space
      Introducing: Aligned Tables!
      Diagram: Feature Injection moves features (gender, likes, age, state, country, …) from the Feature Tables into the Training Data Table
  • 19. Introducing: Aligned Table
      ▪ Intuition: Store the output of the join between the training table and the feature table in 2 separate row-by-row aligned tables
      ▪ An aligned table is a table that has the same layout as the original table
        - Same number of files
        - Same file names
        - Same number of rows (and their order) in each file
      Example (diagram):
        training table
          file_1.orc: (id, features) = (1, ...), (2, ...), (5, ...)
          file_2.orc: (id, features) = (3, ...), (4, ...), (6, ...)
        feature table
          file_1.orc: (id, feature) = (1, f1), (2, f2), (4, f4), (6, f6)
        aligned table
          file_1.orc: (id, feature) = (1, f1), (2, f2), (5, NULL)
          file_2.orc: (id, feature) = (3, NULL), (4, f4), (6, f6)
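
The three layout conditions can be written down as a small check. This is an illustrative sketch only: tables are modelled as dicts from file name to a list of rows, the function name `is_aligned` is hypothetical, and comparing the id column position by position is used here as a stand-in for "same number of rows, in the same order".

```python
from typing import Dict, List, Optional, Tuple

OriginalTable = Dict[str, List[int]]                        # file -> row ids
AlignedTable = Dict[str, List[Tuple[int, Optional[str]]]]   # file -> (id, feature)


def is_aligned(original: OriginalTable, aligned: AlignedTable) -> bool:
    # Same number of files and same file names
    if set(original) != set(aligned):
        return False
    # Same number of rows, in the same order, in each file
    for name, ids in original.items():
        aligned_ids = [row_id for row_id, _ in aligned[name]]
        if aligned_ids != ids:
            return False
    return True


training = {"file_1.orc": [1, 2, 5], "file_2.orc": [3, 4, 6]}
aligned = {
    "file_1.orc": [(1, "f1"), (2, "f2"), (5, None)],
    "file_2.orc": [(3, None), (4, "f4"), (6, "f6")],
}
# Same rows as above but with the first file out of order
misordered = {
    "file_1.orc": [(2, "f2"), (1, "f1"), (5, None)],
    "file_2.orc": [(3, None), (4, "f4"), (6, "f6")],
}
```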
  • 20. Query Plan for Aligned Table
      Example (diagram):
        training table
          file_1.orc: (id, features) = (1, ...), (2, ...), (5, ...)
          file_2.orc: (id, features) = (3, ...), (4, ...), (6, ...)
        feature table
          file_1.orc: (id, feature) = (1, f1), (2, f2), (4, f4), (6, f6)
        aligned table
          file_1.orc: (id, feature) = (1, f1), (2, f2), (5, NULL)
          file_2.orc: (id, feature) = (3, NULL), (4, f4), (6, f6)
      Plan operators shown on the slide, in order:
        - Scan (training table)
        - Scan (feature table)
        - Project (…, file_name, row_order)
        - Join (LEFT OUTER)
        - Shuffle (file_name)
        - Sort (file_name, row_order)
        - InsertIntoHadoopFsRelationCommand (Aligned Table)
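
The plan's operators can be walked through on the slide's example data in plain Python. This is a sketch of the idea, not Spark code: the function name `build_aligned_table` is hypothetical, the feature table is modelled as an id-to-feature dict, and reversing the joined rows is only a way to simulate that a distributed join gives no ordering guarantee.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def build_aligned_table(
    training_files: Dict[str, List[int]],
    feature_table: Dict[int, str],
) -> Dict[str, List[Tuple[int, Optional[str]]]]:
    # Scan + Project: tag every training row with (file_name, row_order)
    projected = [
        (row_id, file_name, row_order)
        for file_name, ids in training_files.items()
        for row_order, row_id in enumerate(ids)
    ]
    # Join (LEFT OUTER): training rows without a feature get None (NULL)
    joined = [
        (file_name, row_order, row_id, feature_table.get(row_id))
        for row_id, file_name, row_order in projected
    ]
    joined.reverse()  # simulate the join scrambling the row order
    # Shuffle (file_name): bring all rows of one file together
    by_file = defaultdict(list)
    for file_name, row_order, row_id, feature in joined:
        by_file[file_name].append((row_order, row_id, feature))
    # Sort (file_name, row_order), then write one output file per input file
    return {
        file_name: [
            (row_id, feature)
            for _, row_id, feature in sorted(rows, key=lambda r: r[0])
        ]
        for file_name, rows in sorted(by_file.items())
    }


training_files = {"file_1.orc": [1, 2, 5], "file_2.orc": [3, 4, 6]}
feature_table = {1: "f1", 2: "f2", 4: "f4", 6: "f6"}
aligned = build_aligned_table(training_files, feature_table)
```

Carrying `file_name` and `row_order` through the join is what lets the final sort restore the original layout, regardless of how the join reorders rows.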
  • 21. Reading Aligned Tables
      ▪ FB-ORC aligned table row-by-row merge reader
        ▪ Read each aligned table file with the corresponding original table file in one task
        ▪ Read row-by-row according to row order
        ▪ Merge aligned table columns per row with corresponding original table columns per row
      Example (diagram):
        training table
          file_1.orc: (id, features) = (1, ...), (2, ...), (5, ...)
          file_2.orc: (id, features) = (3, ...), (4, ...), (6, ...)
        aligned table
          file_1.orc: (id, feature) = (1, f1), (2, f2), (5, NULL)
          file_2.orc: (id, feature) = (3, NULL), (4, f4), (6, f6)
        reader task 1 and reader task 2 each read one training table file together with its aligned table file
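
A row-by-row merge read of one file pair can be sketched as a positional zip. This is an assumption-laden illustration rather than the FB-ORC reader: the function name `merge_read`, the output field names, and the placeholder feature maps are hypothetical, and the id comparison is only a sanity check added for the sketch, since the rows are matched by position.

```python
from typing import Dict, Iterator, List, Optional, Tuple

TrainingRow = Tuple[int, Dict[int, float]]   # (id, features)
AlignedRow = Tuple[int, Optional[str]]       # (id, feature)


def merge_read(training_rows: List[TrainingRow],
               aligned_rows: List[AlignedRow]) -> Iterator[dict]:
    """Merge one training file with its aligned file, row by row."""
    if len(training_rows) != len(aligned_rows):
        raise ValueError("files are not aligned: row counts differ")
    for (train_id, features), (aligned_id, feature) in zip(training_rows,
                                                           aligned_rows):
        if train_id != aligned_id:
            raise ValueError("files are not aligned: row order differs")
        yield {"id": train_id, "features": features, "feature": feature}


# file_1.orc of both tables; the feature maps are placeholder values
training_file_1 = [(1, {10: 0.1}), (2, {10: 0.2}), (5, {10: 0.5})]
aligned_file_1 = [(1, "f1"), (2, "f2"), (5, None)]
merged = list(merge_read(training_file_1, aligned_file_1))
```

No join or lookup happens at read time; the alignment established when the table was written is what makes the positional merge valid.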
  • 22. End to End Performance
      1. Baseline 1: Left Outer Join
        ▪ LEFT OUTER join that materializes new columns/sub-fields into training table
        ▪ Cons: Reads and overwrites ALL columns of training table every time
  • 23. End to End Performance
      1. Baseline 1: Left Outer Join
        ▪ LEFT OUTER join that materializes new columns/sub-fields into training table
        ▪ Cons: Reads and overwrites ALL columns of training table every time
      Aligned Tables vs Left Outer Join: Compute Savings: 15x, Storage Savings: 30x
  • 24. End to End Performance
      1. Baseline 1: Left Outer Join
        ▪ LEFT OUTER join that materializes new columns/sub-fields into training table
        ▪ Cons: Reads and overwrites ALL columns of training table every time
      2. Baseline 2: Lookup Hash Join
        ▪ Load feature table(s) into a distributed hash table (Laser¹)
        ▪ Lookup hash join while reading training table
        ▪ Cons:
          ▪ Adds an external dependency on a distributed hash table; impacts latency, reliability & efficiency
          ▪ Needs a lookup hash join each time the training table is read
      Aligned Tables vs Left Outer Join: Compute Savings: 15x, Storage Savings: 30x
      ¹ Laser: a distributed hash table service built on top of RocksDB, see https://research.fb.com/wp-content/uploads/2016/11/realtime_data_processing_at_facebook.pdf for details
  • 25. End to End Performance
      1. Baseline 1: Left Outer Join
        ▪ LEFT OUTER join that materializes new columns/sub-fields into training table
        ▪ Cons: Reads and overwrites ALL columns of training table every time
      2. Baseline 2: Lookup Hash Join
        ▪ Load feature table(s) into a distributed hash table (Laser¹)
        ▪ Lookup hash join while reading training table
        ▪ Cons:
          ▪ Adds an external dependency on a distributed hash table; impacts latency, reliability & efficiency
          ▪ Needs a lookup hash join each time the training table is read
      Aligned Tables vs Left Outer Join: Compute Savings: 15x, Storage Savings: 30x
      Aligned Tables vs Lookup Hash Join: Compute Savings: 1.5x, Storage Savings: 2.1x
      ¹ Laser: a distributed hash table service built on top of RocksDB, see https://research.fb.com/wp-content/uploads/2016/11/realtime_data_processing_at_facebook.pdf for details
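
The second drawback of the lookup hash join baseline (a lookup on every read) can be made concrete with a counter. This is a toy sketch only: `LookupTable` is a hypothetical in-process stand-in for a distributed hash table such as Laser, and it models neither network latency nor reliability.

```python
from typing import Dict, List, Optional, Tuple


class LookupTable:
    """In-process stand-in for a distributed hash table (assumption)."""

    def __init__(self, data: Dict[int, str]) -> None:
        self._data = dict(data)
        self.lookups = 0  # counts point lookups issued by readers

    def get(self, key: int) -> Optional[str]:
        self.lookups += 1
        return self._data.get(key)


def read_with_lookup_join(training_ids: List[int],
                          table: LookupTable) -> List[Tuple[int, Optional[str]]]:
    """Every read of the training table performs one lookup per row."""
    return [(row_id, table.get(row_id)) for row_id in training_ids]


table = LookupTable({1: "f1", 2: "f2", 4: "f4", 6: "f6"})
training_ids = [1, 2, 5, 3, 4, 6]
first_read = read_with_lookup_join(training_ids, table)
second_read = read_with_lookup_join(training_ids, table)
```

After two reads the counter stands at twice the row count, whereas an aligned table pays the join cost once, when it is written.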
  • 26. Agenda ▪ Machine Learning at Facebook ▪ Data Layouts (Tables and Physical Encodings) ▪ Feature Reaping ▪ Feature Injection ▪ Future Work
  • 27. Future Work
      ▪ Better Spark SQL interface for ML primitives (e.g., UPSERTs)
      ▪ Onboarding more ML use cases to Spark
        ▪ Batch Inference
        ▪ Training
      Example:
        MERGE training_table
        PARTITION(ds='2020-10-28', pipeline='...', ts)
        USING ( SELECT ...) AS f
        ON features[0][0] = f.key
        WHEN MATCHED THEN
          UPDATE SET float_features = MAP_CONCAT(float_features, f.densefeatures)
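
The intended effect of the MERGE example can be sketched on in-memory rows. This is a loose illustration under stated assumptions: the function name `merge_upsert` is hypothetical, each row carries a plain `key` field in place of the slide's `features[0][0] = f.key` condition, and duplicate map keys are resolved last-wins here, whereas how MAP_CONCAT treats duplicates depends on the engine and its configuration.

```python
from typing import Dict, List

Row = Dict[str, object]


def merge_upsert(training_rows: List[Row],
                 updates: Dict[int, Dict[int, float]]) -> List[Row]:
    """WHEN MATCHED: float_features = MAP_CONCAT(float_features, densefeatures)."""
    result = []
    for row in training_rows:
        dense = updates.get(row["key"])
        if dense is None:
            # No match: the row is carried over unchanged
            result.append(dict(row))
        else:
            # Match: concatenate the two maps (last-wins is an assumption)
            merged = {**row["float_features"], **dense}
            result.append({**row, "float_features": merged})
    return result


rows = [
    {"key": 1, "float_features": {10: 0.5}},
    {"key": 2, "float_features": {10: 0.7}},
]
updates = {1: {20: 1.5}}  # hypothetical densefeatures for key 1
merged_rows = merge_upsert(rows, updates)
```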
  • 28. Thank you! Your feedback is important to us. Don’t forget to rate and review the sessions.