How We Optimize Spark SQL Jobs With Parallel and Asynchronous I/O


Although NVMe has become more and more popular in recent years, large numbers of HDDs are still widely used in super-large-scale big data clusters. On an EB-level data platform, I/O cost (including decompression and decoding) accounts for a large proportion of the cost of Spark jobs. In other words, I/O is worth optimizing. At ByteDance, we made a series of I/O optimizations to improve performance, including parallel read and asynchronous shuffle. First, we implemented file-level parallel read to improve performance when there are many small files. Second, we designed row-group-level parallel read to accelerate queries in big-file scenarios. Third, we implemented asynchronous spill to improve job performance. In addition, we designed Parquet column families, which split a table into a few column families, with each column family stored in a separate Parquet file. Different column families can be read in parallel, so read performance is much higher than with the existing approach. In our practice, end-to-end performance improved by 5% to 30%. In this talk, I will illustrate how we implemented these features and how they accelerate Apache Spark jobs.


How we optimize Spark SQL jobs with parallel and asynchronous I/O
Guo, Jun (jason.guo.vip@gmail.com)
Lead of Data Engine Team, ByteDance

Who we are
▪ Data Engine team at ByteDance
▪ Build a one-stop OLAP platform on which users can analyze PB-level data by writing SQL, without caring about the underlying execution engine

What we do
▪ Manage Spark SQL / Presto / Hive workloads
▪ Offer an open API and a self-service platform
▪ Optimize the Spark SQL / Presto / Hive engines
▪ Design the data architecture for most business lines at ByteDance

Agenda
• Spark SQL at ByteDance
• Why does I/O matter for Spark SQL
• How we boost Spark SQL jobs by parallel and asynchronous I/O
• Prospects

Spark SQL at ByteDance
2016: Small-scale experiments
2017: Ad-hoc workload
2018: Few ETL pipelines in production
2019: Full-production deployment
2020: Main engine in DW area
2021: Totally replace Hive for ETL

Why does I/O matter for Spark SQL
I/O performance has been improved
▪ NVMe SSDs perform better than HDDs by two orders of magnitude
▪ More and more new hardware has been introduced in recent years, such as AEP
▪ Many papers show that 'I/O is faster than CPU'
I/O is still the bottleneck for big data processing
▪ TCO is one of the most important factors for huge data storage
▪ Most servers have a lot of HDDs, especially in Hadoop clusters
▪ I/O cost contributes more than 30% of the total latency of Spark ETL jobs

How we boost Spark SQL jobs by parallel and asynchronous I/O

Parallel I/O
• Spark SQL splits a large Parquet file into a group of splits, each of which contains one or a few row groups
• Each task reads its row groups sequentially

Parallel I/O
• Spark SQL can combine a group of small Parquet files into a single split
• Each task reads the files in its split sequentially
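The combining of small files into splits can be illustrated with a minimal sketch. This is not Spark's actual implementation: `pack_files`, `Split`, `max_split_bytes`, and `open_cost_bytes` are hypothetical names chosen for illustration, loosely analogous in role to Spark's `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes` settings. It assumes a simple greedy packing, largest file first.

```python
from dataclasses import dataclass, field


@dataclass
class Split:
    """One unit of work handed to a single task (hypothetical type)."""
    files: list = field(default_factory=list)
    size: int = 0


def pack_files(file_sizes, max_split_bytes, open_cost_bytes=0):
    """Greedily pack (name, size) pairs into splits of at most max_split_bytes.

    open_cost_bytes is an assumed per-file padding that models the fixed
    cost of opening a file, so that thousands of tiny files do not all
    land in one split.
    """
    splits = []
    current = Split()
    # Largest files first: a simple greedy heuristic, assumed for this sketch.
    for name, size in sorted(file_sizes, key=lambda f: f[1], reverse=True):
        cost = size + open_cost_bytes
        if current.files and current.size + cost > max_split_bytes:
            splits.append(current)
            current = Split()
        current.files.append(name)
        current.size += cost
    if current.files:
        splits.append(current)
    return splits


files = [("a.parquet", 40), ("b.parquet", 30), ("c.parquet", 20), ("d.parquet", 10)]
splits = pack_files(files, max_split_bytes=64)
for s in splits:
    print(s.files, s.size)
```

With this baseline, the task that receives the split holding three files still reads them one after another, which is the behavior the following slides set out to improve.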

Parallel I/O
I/O and computation in a single thread
▪ I/O and computation are handled sequentially by the same thread
▪ Tuples in a single task are computed sequentially
▪ I/O for different files or row groups is handled sequentially
I/O and computation in separate threads
▪ Introduce a buffer to separate I/O and computation
▪ I/O and computation are handled in separate threads
▪ I/O for different files or row groups can be done in parallel
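The idea of a buffer that decouples I/O from computation can be sketched as a producer/consumer pair. This is only an illustration under assumptions: the slides do not describe the buffer's type or size, so a bounded `queue.Queue` is used here, and `read_blocks`, `compute`, and `run` are hypothetical names. In-memory lists stand in for data read from disk.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the input


def read_blocks(blocks, buffer):
    """I/O thread: fetch blocks and push them into the bounded buffer."""
    for block in blocks:
        buffer.put(block)  # blocks when the buffer is full (back-pressure)
    buffer.put(_DONE)


def compute(buffer):
    """Compute side: consume blocks from the buffer as they arrive."""
    total = 0
    while True:
        block = buffer.get()
        if block is _DONE:
            return total
        total += sum(block)


def run(blocks, buffer_size=2):
    """Read on a background thread while the calling thread computes."""
    buffer = queue.Queue(maxsize=buffer_size)
    reader = threading.Thread(target=read_blocks, args=(blocks, buffer))
    reader.start()
    result = compute(buffer)
    reader.join()
    return result


result = run([[1, 2, 3], [4, 5], [6]])
print(result)
```

The bounded buffer keeps memory use in check: when computation falls behind, the reader blocks instead of loading the whole input.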

Parallel I/O
File level parallel I/O
Row Group level parallel I/O
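Both levels share the same pattern: several independent read units (whole files, or row groups inside one file) are fetched concurrently instead of one after another. A minimal sketch, assuming a thread pool and using a hypothetical `read_unit` that sleeps to mimic I/O latency in place of a real Parquet reader:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_unit(unit):
    """Stand-in for reading one file or one row group (hypothetical)."""
    name, rows = unit
    time.sleep(0.05)  # mimic I/O latency
    return rows


def read_sequential(units):
    """Baseline: read every unit one after another."""
    return [row for unit in units for row in read_unit(unit)]


def read_parallel(units, max_workers=4):
    """Overlap the reads of several units with a thread pool.

    Executor.map yields results in input order, so the row order matches
    the sequential version even though the reads overlap in time.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [row for rows in pool.map(read_unit, units) for row in rows]


units = [("part-0", [1, 2]), ("part-1", [3]), ("part-2", [4, 5, 6]), ("part-3", [7])]
print(read_parallel(units))
```

Threads are a reasonable fit here because the work is dominated by waiting on storage. How many reads to run at once per task is a tuning choice that the slides do not specify.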

Parallel I/O
• Column level parallel I/O
o Split a logical Parquet file into a group of column families, each of which is a physical Parquet file
o Each column family contains a few columns
o Spark SQL reads different column families in parallel
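The column family layout can be sketched with plain Python dictionaries standing in for Parquet files. This is an illustration under assumptions, not the authors' format: the slides do not say how rows are aligned across family files, so the sketch assumes every family stores rows in the same order and merges them by position. `split_into_families`, `read_columns`, and the family names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_families(rows, families):
    """Split row dicts into one list of narrower row dicts per column family."""
    return {
        name: [{col: row[col] for col in cols} for row in rows]
        for name, cols in families.items()
    }


def read_columns(stored, families, wanted):
    """Read only the families holding a wanted column, in parallel.

    Rows are merged by position, which assumes all families keep the
    same row order (an assumption of this sketch).
    """
    needed = [name for name, cols in families.items() if set(cols) & set(wanted)]
    with ThreadPoolExecutor(max_workers=max(1, len(needed))) as pool:
        parts = list(pool.map(lambda name: stored[name], needed))
    merged = []
    for row_parts in zip(*parts):
        row = {}
        for part in row_parts:
            row.update({c: v for c, v in part.items() if c in wanted})
        merged.append(row)
    return merged


rows = [
    {"id": 1, "name": "a", "clicks": 10, "cost": 0.5},
    {"id": 2, "name": "b", "clicks": 20, "cost": 1.5},
]
families = {"cf_dim": ["id", "name"], "cf_metric": ["clicks", "cost"]}
stored = split_into_families(rows, families)
result = read_columns(stored, families, wanted=["id", "clicks"])
print(result)
```

A query that touches columns from a single family reads only that family, while a query spanning families can fetch them concurrently and stitch the rows back together.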

Future work
• I/O
o Adaptive column family
o Smart cache
• Computation
o Vectorized computation
o Native engine

