Using Apache Spark as ETL engine. Pros and Cons
1. Using Apache Spark as ETL engine
Pros and cons
Maksym Doroshenko
Big Data Software Engineer
LeadGenius, Provectus
2. Agenda
1. What is Spark
2. Spark components
3. Spark pillars
4. What is ETL pipeline
5. Using Spark SQL for ETL
6. Customer use case
7. Demo
3. Spark, who are you?
I am a fast and general engine
for large-scale data processing.
4. Prove it
              Hadoop MR              Spark             Spark
Data Size     102.5 TB               100 TB            1000 TB
Elapsed Time  72 mins                23 mins           234 mins
# Nodes       2100                   206               190
# Cores       50400                  6592              6080
# Reducers    10,000                 29,000            250,000
Rate          1.42 TB/min            4.27 TB/min       4.27 TB/min
Rate/node     0.67 GB/min            20.7 GB/min       22.5 GB/min
Environment   dedicated data center  EC2 (i2.8xlarge)  EC2 (i2.8xlarge)
Apache Spark has an advanced DAG execution engine
that supports acyclic data flow and in-memory computing.
5. Spark use cases
- Simplify the challenging and compute-intensive task of processing high volumes of data
- Real-time data processing
- Seamless integration of complex capabilities such as machine learning and graph algorithms
- Bringing Big Data processing to the masses
6. Survey: Why do companies use Spark?
91% use Apache Spark because of its performance gains
77% use Apache Spark because it is easy to use
71% use Apache Spark due to its ease of deployment
64% use Apache Spark to leverage advanced analytics
52% use Apache Spark for real-time streaming
8. What is RDD
Resilient Distributed Dataset - a large collection of data with the following properties:
- Immutable
- Distributed
- Lazily evaluated
- Fault tolerant
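A minimal sketch of these properties in Scala, assuming a spark-shell session where `spark` is predefined:

  val rdd = spark.sparkContext.parallelize(1 to 1000000)  // distributed across the cluster
  val squared = rdd.map(x => x.toLong * x)                // lazy: builds a lineage, computes nothing yet
  val sum = squared.reduce(_ + _)                         // action: triggers the actual computation
  // A lost partition is recomputed from its lineage (fault tolerance);
  // `rdd` itself is never modified (immutability).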
13. Spark DataFrames
A DataFrame is a distributed collection of data grouped into named columns (an RDD with a schema), with more efficient storage options, an advanced optimizer, and direct operations on serialized data. These components are essential for getting the best performance out of Spark.
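A small illustration, again in spark-shell; the case class and values are made up:

  import spark.implicits._
  case class Contact(name: String, email: String)
  val df = Seq(Contact("Ann", "ann@example.com"), Contact("Bob", "bob@example.com")).toDF()
  df.printSchema()            // named, typed columns instead of opaque objects
  df.select("email").show()   // planned through the Catalyst optimizer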
14. What is ETL?
1. A sequence of transformations on data
2. Source data is typically semi-structured/unstructured (text, JSON, CSV, etc.) or structured (JDBC, Parquet, ORC, Avro, etc.)
3. Output data is clean, structured, integrated, and ready for further data processing, analysis and reporting.
16. Why is ETL Hard?
1. Various sources/formats
2. Schema mismatch
3. Different data representations
4. Corrupted files and data
5. Scalability
6. Schema evolution
17. This is why ETL is important
Consumers of this data do not want to deal
with this messiness and complexity
18. Spark SQL
Spark SQL's flexible APIs, support for a wide variety of data sources, built-in support for structured streaming, the state-of-the-art Catalyst optimizer, and the Tungsten execution engine make it a great framework for building end-to-end ETL pipelines.
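As an illustration, a minimal end-to-end pipeline in Scala for spark-shell; the paths and column names are hypothetical:

  import org.apache.spark.sql.functions._
  import spark.implicits._
  val raw = spark.read.json("/data/raw/events")                 // Extract: semi-structured input
  val clean = raw
    .filter($"email".isNotNull)                                 // Transform: drop incomplete records
    .withColumn("email", lower(trim($"email")))                 // normalize a field
    .dropDuplicates("email")                                    // integrate/deduplicate
  clean.write.mode("overwrite").parquet("/data/clean/events")   // Load: structured, query-ready output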
22. User-specified schema
- Faster: no scan to infer the schema
- More flexible: easily handles schema evolution
- More robust: type errors are caught as early as possible
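A sketch of supplying the schema up front; the field names are hypothetical:

  import org.apache.spark.sql.types._
  val schema = StructType(Seq(
    StructField("id", LongType, nullable = false),
    StructField("email", StringType),
    StructField("age", IntegerType)))
  val users = spark.read.schema(schema).json("/data/raw/users")  // no inference pass over the data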
23. Deal with bad data
java.io.IOException: org.apache.hadoop.io.compress.DecompressorStream.decompress
java.io.EOFException: Unexpected end of input stream
java.lang.RuntimeException: file:/temp/path/c000.json is not a Parquet file (too small)
[SPARK-17850] If true, Spark jobs will continue to run even when they encounter corrupt files. The contents that have been read will still be returned.
spark.sql.files.ignoreCorruptFiles = true
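The flag can be set on a running session or passed at submit time:

  spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
  // or: spark-submit --conf spark.sql.files.ignoreCorruptFiles=true ...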
24. Deal with bad data
[SPARK-12833] [SPARK-13764]
Text file formats (JSON and CSV) support three parse modes while reading data:
PERMISSIVE
DROPMALFORMED
FAILFAST
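For example, with a hypothetical input file:

  val strict = spark.read
    .option("mode", "FAILFAST")      // throw on the first malformed record
    .option("header", "true")
    .csv("/data/raw/contacts.csv")
  // PERMISSIVE (the default) keeps malformed records, nulling fields it cannot parse;
  // DROPMALFORMED silently drops them.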
25. Better JSON and CSV support
[SPARK-18352] [SPARK-19610]
Multiline JSON and CSV support
Spark SQL reads JSON/CSV one line at a time
Before Spark 2.2, multiline records required a custom ETL step
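Since Spark 2.2 the readers accept a multiLine option; the paths below are hypothetical:

  val people = spark.read.option("multiLine", "true").json("/data/raw/people.json")
  val notes = spark.read
    .option("multiLine", "true")     // quoted fields may contain newlines
    .option("header", "true")
    .csv("/data/raw/notes.csv")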
26. Transformations: Higher-order functions in SQL
Transformations on complex objects like arrays, maps and structs inside of columns.
1. Check for element existence
SELECT EXISTS(values, e -> e > 30) AS v FROM tbl_nested;
2. Transform an array
SELECT TRANSFORM(values, e -> e * e) AS v FROM tbl_nested;
27. Transformations: Higher-order functions in SQL
3. Filter an array
SELECT FILTER(values, e -> e > 30) AS v FROM tbl_nested;
4. Aggregate an array
SELECT REDUCE(values, 0, (acc, value) -> acc + value) AS v FROM tbl_nested;
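These functions are built into Apache Spark since 2.4, where the aggregation function is named AGGREGATE (REDUCE was added later as an alias). A runnable sketch in spark-shell; tbl_nested is created here purely for illustration:

  import spark.implicits._
  Seq((1, Seq(10, 20, 40)), (2, Seq(1, 2, 3))).toDF("id", "values").createOrReplaceTempView("tbl_nested")
  spark.sql("SELECT id, EXISTS(values, e -> e > 30) AS v FROM tbl_nested").show()
  spark.sql("SELECT id, TRANSFORM(values, e -> e * e) AS v FROM tbl_nested").show()
  spark.sql("SELECT id, FILTER(values, e -> e > 30) AS v FROM tbl_nested").show()
  spark.sql("SELECT id, AGGREGATE(values, 0, (acc, e) -> acc + e) AS v FROM tbl_nested").show()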
29. Customer Use Case
- Data sources in different formats
- Mapping data to a golden customer schema
- Normalizing all data (e.g. email, phone)
- Linking and merging records that refer to the same entity
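The normalization step might be sketched as follows; the `contacts` DataFrame and its column names are hypothetical:

  import org.apache.spark.sql.functions._
  val normalized = contacts
    .withColumn("email", lower(trim(col("email"))))                    // canonical email form
    .withColumn("phone", regexp_replace(col("phone"), "[^0-9+]", ""))  // keep digits and '+' only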
30. Spark ETL Pros & Cons
- Pros:
Open source
Great community
Easy to scale
Strong transformation engine
Supports multiple languages
Unified API across components
- Cons:
No file management system
Resource-intensive
Manual configuration tuning
No ETL UI
31. QA & DEMO TIME
mdoroshenko@provectus.com
skype: maxdor3