Using Apache Spark as ETL engine. Pros and Cons
1. Using Apache Spark as ETL engine
Pros and cons
Maksym Doroshenko
Big Data Software Engineer
LeadGenius, Provectus
2. Agenda
1. What is Spark
2. Spark components
3. Spark pillars
4. What is ETL pipeline
5. Using Spark SQL for ETL
6. Customer use case
7. Demo
3. Spark, who are you?
I am a fast and general engine
for large-scale data processing.
4. Prove it
              Hadoop MR              Spark             Spark
Data Size     102.5 TB               100 TB            1000 TB
Elapsed Time  72 mins                23 mins           234 mins
# Nodes       2100                   206               190
# Cores       50400                  6592              6080
# Reducers    10,000                 29,000            250,000
Rate          1.42 TB/min            4.27 TB/min       4.27 TB/min
Rate/node     0.67 GB/min            20.7 GB/min       22.5 GB/min
Environment   dedicated data center  EC2 (i2.8xlarge)  EC2 (i2.8xlarge)
Apache Spark has an advanced DAG execution engine
that supports acyclic data flow and in-memory computing.
5. Spark use cases
- Simplify the challenging and compute-intensive task of processing high volumes of data
- Real-time data processing
- Seamless integration of complex capabilities such as machine learning and graph algorithms
- Bringing Big Data processing to the masses
6. Survey: Why do companies use Spark?
91% use Apache Spark because of its performance gains
77% use Apache Spark because it is easy to use
71% use Apache Spark due to its ease of deployment
64% use Apache Spark to leverage advanced analytics
52% use Apache Spark for real-time streaming
8. What is RDD
Resilient Distributed Dataset - a large collection of data with the following properties:
- Immutable
- Distributed
- Lazily evaluated
- Fault tolerant
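A minimal sketch of these properties in Scala, assuming a spark-shell session where `spark` is predefined:

  val rdd = spark.sparkContext.parallelize(1 to 1000000)  // distributed across the cluster
  val squared = rdd.map(x => x.toLong * x)                // lazy: builds a lineage, computes nothing yet
  val sum = squared.reduce(_ + _)                         // action: triggers the actual computation
  // A lost partition is recomputed from its lineage (fault tolerance);
  // `rdd` itself is never modified (immutability).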
13. Spark DataFrames
A DataFrame is a distributed collection of data grouped into named columns (an RDD with a schema), with more efficient storage options, an advanced optimizer, and direct operations on serialized data. These components are essential for getting the best performance out of Spark.
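A small illustration, again in spark-shell; the case class and values are made up:

  import spark.implicits._
  case class Contact(name: String, email: String)
  val df = Seq(Contact("Ann", "ann@example.com"), Contact("Bob", "bob@example.com")).toDF()
  df.printSchema()            // named, typed columns instead of opaque objects
  df.select("email").show()   // planned through the Catalyst optimizer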
14. What is ETL?
1. A sequence of transformations on data
2. Source data is typically semi-structured/unstructured (text, JSON, CSV, etc.) or structured (JDBC, Parquet, ORC, Avro, etc.)
3. Output data is clean, structured, integrated, and ready for further data processing, analysis and reporting.
16. Why is ETL Hard?
1. Various sources/formats
2. Schema mismatch
3. Different data representations
4. Corrupted files and data
5. Scalability
6. Schema evolution
17. This is why ETL is important
Consumers of this data do not want to deal
with this messiness and complexity
18. Spark SQL
Spark SQL's flexible APIs, support for a wide variety of data sources, built-in support for structured streaming, the state-of-the-art Catalyst optimizer, and the Tungsten execution engine make it a great framework for building end-to-end ETL pipelines.
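As an illustration, a minimal end-to-end pipeline in Scala for spark-shell; the paths and column names are hypothetical:

  import org.apache.spark.sql.functions._
  import spark.implicits._
  val raw = spark.read.json("/data/raw/events")                 // Extract: semi-structured input
  val clean = raw
    .filter($"email".isNotNull)                                 // Transform: drop incomplete records
    .withColumn("email", lower(trim($"email")))                 // normalize a field
    .dropDuplicates("email")                                    // integrate/deduplicate
  clean.write.mode("overwrite").parquet("/data/clean/events")   // Load: structured, query-ready output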
22. User-specified schema
- Faster: no scan to infer the schema
- More flexible: easily handles schema evolution
- More robust: type errors are caught as early as possible
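A sketch of supplying the schema up front; the field names are hypothetical:

  import org.apache.spark.sql.types._
  val schema = StructType(Seq(
    StructField("id", LongType, nullable = false),
    StructField("email", StringType),
    StructField("age", IntegerType)))
  val users = spark.read.schema(schema).json("/data/raw/users")  // no inference pass over the data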
23. Deal with bad data
java.io.IOException: org.apache.hadoop.io.compress.DecompressorStream.decompress
java.io.EOFException: Unexpected end of input stream
java.lang.RuntimeException: file:/temp/path/c000.json is not a Parquet file (too small)
[SPARK-17850] If true, Spark jobs will continue to run even when they encounter corrupt files. The contents that have been read will still be returned.
spark.sql.files.ignoreCorruptFiles = true
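The flag can be set on a running session or passed at submit time:

  spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
  // or: spark-submit --conf spark.sql.files.ignoreCorruptFiles=true ...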
24. Deal with bad data
[SPARK-12833] [SPARK-13764]
Text file formats (JSON and CSV) support three parse modes while reading data:
PERMISSIVE
DROPMALFORMED
FAILFAST
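For example, with a hypothetical input file:

  val strict = spark.read
    .option("mode", "FAILFAST")      // throw on the first malformed record
    .option("header", "true")
    .csv("/data/raw/contacts.csv")
  // PERMISSIVE (the default) keeps malformed records, nulling fields it cannot parse;
  // DROPMALFORMED silently drops them.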
25. Better JSON and CSV support
[SPARK-18352] [SPARK-19610]
Multiline JSON and CSV support
Spark SQL reads JSON/CSV one line at a time
Before Spark 2.2, multiline records required a custom ETL step
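Since Spark 2.2 the readers accept a multiLine option; the paths below are hypothetical:

  val people = spark.read.option("multiLine", "true").json("/data/raw/people.json")
  val notes = spark.read
    .option("multiLine", "true")     // quoted fields may contain newlines
    .option("header", "true")
    .csv("/data/raw/notes.csv")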
26. Transformations: Higher-order functions in SQL
Transformations on complex objects like arrays, maps and structs inside of columns.
1. Check for element existence
SELECT EXISTS(values, e -> e > 30) AS v FROM tbl_nested;
2. Transform an array
SELECT TRANSFORM(values, e -> e * e) AS v FROM tbl_nested;
27. Transformations: Higher-order functions in SQL
3. Filter an array
SELECT FILTER(values, e -> e > 30) AS v FROM tbl_nested;
4. Aggregate an array
SELECT REDUCE(values, 0, (acc, value) -> acc + value) AS v FROM tbl_nested;
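These functions are built into Apache Spark since 2.4, where the aggregation function is named AGGREGATE (REDUCE was added later as an alias). A runnable sketch in spark-shell; tbl_nested is created here purely for illustration:

  import spark.implicits._
  Seq((1, Seq(10, 20, 40)), (2, Seq(1, 2, 3))).toDF("id", "values").createOrReplaceTempView("tbl_nested")
  spark.sql("SELECT id, EXISTS(values, e -> e > 30) AS v FROM tbl_nested").show()
  spark.sql("SELECT id, TRANSFORM(values, e -> e * e) AS v FROM tbl_nested").show()
  spark.sql("SELECT id, FILTER(values, e -> e > 30) AS v FROM tbl_nested").show()
  spark.sql("SELECT id, AGGREGATE(values, 0, (acc, e) -> acc + e) AS v FROM tbl_nested").show()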
29. Customer Use Case
- Data sources in different formats
- Mapping data to a golden customer schema
- Normalizing all data (e.g. email, phone)
- Linking and merging records that refer to the same entity
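The normalization step might be sketched as follows; the `contacts` DataFrame and its column names are hypothetical:

  import org.apache.spark.sql.functions._
  val normalized = contacts
    .withColumn("email", lower(trim(col("email"))))                    // canonical email form
    .withColumn("phone", regexp_replace(col("phone"), "[^0-9+]", ""))  // keep digits and '+' only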
30. Spark ETL Pros & Cons
- Pros:
Open source
Great community
Easy to scale
Strong transformation engine
Supports multiple languages
Unified API across components
- Cons:
No file management system
Resource-intensive
Manual configuration tuning
No ETL UI
31. QA & DEMO TIME
mdoroshenko@provectus.com
skype: maxdor3