SlideShare ist ein Scribd-Unternehmen logo
1 von 31
Downloaden Sie, um offline zu lesen
Using Apache Spark as ETL engine
Pros and cons
Maksym Doroshenko
Big Data Software Engineer
LeadGenius, Provectus
Agenda
1. What is Spark
2. Spark components
3. Spark pillars
4. What is ETL pipeline
5. Using Spark SQL for ETL
6. Customer use case
7. Demo
Spark, who are you?
I am is a fast and
general engine
for large-scale data
processing.
Prove it
Hadoop MR Spark Spark
Data Size 102.5 TB 100 TB 1000 TB
Elapsed Time 72 mins 23 mins 234 mins
# Nodes 2100 206 190
# Cores 50400 6592 6080
# Reducers 10,000 29,000 250,000
Rate 1.42 TB/min 4.27 TB/min 4.27 TB/min
Rate/node 0.67 GB/min 20.7 GB/min 22.5 GB/min
Environment dedicated data
center
EC2 (i2.8xlarge) EC2 (i2.8xlarge)
Apache Spark has an advanced DAG execution engine

that supports acyclic data flow and in-memory computing.
Spark use cases
- Simplify the challenging and compute-intensive task of
processing high volumes of data

- Real time data processing

- Seamlessly integrating complex capabilities such as machine
learning and graph algorithms

- Spark brings Big Data processing to the masses
Survey: Why companies use Spark ?
91% use Apache Spark because of its performance gains

77% use Apache Spark as it is easy to use

71% use Apache Spark due to the ease of deployment

64% use Apache Spark to leverage advanced analytics

52% use Apache Spark for real-time streaming.
Spark components
What is RDD
Resilient Distributed Dataset - a big collection of data with following
properties:
- Immutable
- Distributed
- Lazily evaluated
- Fault tolerante
Operations
Narrow transformations
- map
- flatMap
- filter
- etc.
Wide transformations
- reduceByKey
- groupByKey
- sortByKey
- etc.
Spark application tree
Spark Dataframes
Dataframes is distributed collection of data grouped into named columns
(RDD with schema) with more efficient storage options, advanced
optimizer, and direct operations on serialized data. These components are
super important for getting the best of Spark performance
What is ETL?
1. Sequence of transformation on data

2. Source data is typically semi-structured/unstructured
(Text, JSON, CSV etc.) and structured (JDBC, Parquet,
ORC, AVRO, etc.)

3. Output data is clean, structured, integrated and ready
for further data processing, analysis and reporting.
ETL query in Spark
Why is ETL Hard?
1. Various sources/formats

2. Schema mismatch

3. Different representation

4. Corrupted files and data

5. Scalability

6. Schema evolution
This is why ETL is important
Consumers of this data do not want to deal
with this messiness and complexity
Spark SQL's flexible APIs,
support for a wide variety of
datasources,
build-in support for structured
streaming,
state of art catalyst optimizer
and tungsten execution engine
make it a great framework for
building end-to-end ETL
pipelines.
Spark SQL
Data sources
https://spark-packages.org/
Schema inference: semi structured data
Schema inference: semi structured data
User specified schema
Faster
No scan to infer schema
More flexible
Easily handle schema evolving
More robust
Handle type errors ASAP
Deal with bad data
java.io.IOException: org.apache.hadoop.io.compress.DecompressorStream.decompress
java.io.EOFException: Unexpected end of input stream
java.lang.RuntimeException: file:/temp/path/c000.json is not a Parquet file (too small)
[SPARK-17850] If true, the Spark jobs
will continue to run even when it
encounters corrupt files. The contents
that have been read will still be
returned.
spark.sql.files.ignoreCorruptFiles = true
Deal with bad data
[SPARK-12833] [SPARK-13764]
TextFile formats (JSON and CSV)
supports 3 Parse modes while reading data:
PERMISSIVE
DROPMALFORMED
FAILFAST
Better JSON and CSV support
[SPARK-18352] [SPARK-19610]
Multiline JSON and CSV support
Spark SQL reads JSON/CSV one line at time
Before Spark 2.2 it requires custom ETL
Transformations: Higher order functions in SQL
Transformations on complex objects like arrays, maps and
structures inside of columns.

1. Check for element existence
SELECT EXIST(values, e->e>30) AS v
FROM tbl_nested;
2. Transform an array
SELECT TRANSFORM(values, e->e*e) AS v FROM tbl_nested;
Transformations: Higher order functions in SQL
3. Filter an array
SELECT FILTER(values, e->e>30) AS v FROM tbl_nested;
4. Aggregate an array
SELECT REDUCE(values, 0, (value, acc)->acc+value) AS v FROM
tbl_nested;
Load
Different modes:
Error
Append
Overwrite
Ignore
Wide functionality:
df.write
.partitionBy(“favorite_color”)
.bucketBy(42, “name")
.sortBy(“age”)
.saveAsTable("people_partitioned_bucketed”))
Customer Use Case
- Data sources in different formats

- Mapping data to golden customer schema

- Normalize all data (e.g. email, phone)

- Link and merge same entities
Spark ETL Pros & Cons
- Pros:

Open source

Great community

Easy to scale 

Strong transformation engine

Support different languages

Unified API for different components
- Cons:

No File management system

Resource consuming

Manual configuration tuning

No ETL UI
QA & DEMO TIME
mdoroshenko@provectus.com
skype: maxdor3

Weitere ähnliche Inhalte

Was ist angesagt?

Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
 

Was ist angesagt? (20)

Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 

Ähnlich wie Using Apache Spark as ETL engine. Pros and Cons

Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Databricks
 

Ähnlich wie Using Apache Spark as ETL engine. Pros and Cons (20)

Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, DatabricksSpark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
 
Exceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETLExceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETL
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network Analytics
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
SPL_ALL_EN.pptx
SPL_ALL_EN.pptxSPL_ALL_EN.pptx
SPL_ALL_EN.pptx
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.com
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Let's start with Spark
Let's start with SparkLet's start with Spark
Let's start with Spark
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7
 

Mehr von Provectus

AI Stack on AWS: Amazon SageMaker and Beyond
AI Stack on AWS: Amazon SageMaker and BeyondAI Stack on AWS: Amazon SageMaker and Beyond
AI Stack on AWS: Amazon SageMaker and Beyond
Provectus
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
Provectus
 
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerMLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
Provectus
 

Mehr von Provectus (20)

Choosing the right IDP Solution
Choosing the right IDP SolutionChoosing the right IDP Solution
Choosing the right IDP Solution
 
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.Intelligent Document Processing in Healthcare. Choosing the Right Solutions.
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.
 
Choosing the Right Document Processing Solution for Healthcare Organizations
Choosing the Right Document Processing Solution for Healthcare OrganizationsChoosing the Right Document Processing Solution for Healthcare Organizations
Choosing the Right Document Processing Solution for Healthcare Organizations
 
MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in Production
 
AI Stack on AWS: Amazon SageMaker and Beyond
AI Stack on AWS: Amazon SageMaker and BeyondAI Stack on AWS: Amazon SageMaker and Beyond
AI Stack on AWS: Amazon SageMaker and Beyond
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
 
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerMLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
 
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMRCost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR
 
ODSC webinar "Kubeflow, MLFlow and Beyond — augmenting ML delivery" Stepan Pu...
ODSC webinar "Kubeflow, MLFlow and Beyond — augmenting ML delivery" Stepan Pu...ODSC webinar "Kubeflow, MLFlow and Beyond — augmenting ML delivery" Stepan Pu...
ODSC webinar "Kubeflow, MLFlow and Beyond — augmenting ML delivery" Stepan Pu...
 
"Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K...
"Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K..."Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K...
"Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K...
 
"How to build a global serverless service", Alex Casalboni, AWS Dev Day Kyiv ...
"How to build a global serverless service", Alex Casalboni, AWS Dev Day Kyiv ..."How to build a global serverless service", Alex Casalboni, AWS Dev Day Kyiv ...
"How to build a global serverless service", Alex Casalboni, AWS Dev Day Kyiv ...
 
"Automating AWS Infrastructure with PowerShell", Martin Beeby, AWS Dev Day Ky...
"Automating AWS Infrastructure with PowerShell", Martin Beeby, AWS Dev Day Ky..."Automating AWS Infrastructure with PowerShell", Martin Beeby, AWS Dev Day Ky...
"Automating AWS Infrastructure with PowerShell", Martin Beeby, AWS Dev Day Ky...
 
"Analyzing your web and application logs", Javier Ramirez, AWS Dev Day Kyiv 2...
"Analyzing your web and application logs", Javier Ramirez, AWS Dev Day Kyiv 2..."Analyzing your web and application logs", Javier Ramirez, AWS Dev Day Kyiv 2...
"Analyzing your web and application logs", Javier Ramirez, AWS Dev Day Kyiv 2...
 
"Resiliency and Availability Design Patterns for the Cloud", Sebastien Storma...
"Resiliency and Availability Design Patterns for the Cloud", Sebastien Storma..."Resiliency and Availability Design Patterns for the Cloud", Sebastien Storma...
"Resiliency and Availability Design Patterns for the Cloud", Sebastien Storma...
 
"Architecting SaaS solutions on AWS", Oleksandr Mykhalchuk, AWS Dev Day Kyiv ...
"Architecting SaaS solutions on AWS", Oleksandr Mykhalchuk, AWS Dev Day Kyiv ..."Architecting SaaS solutions on AWS", Oleksandr Mykhalchuk, AWS Dev Day Kyiv ...
"Architecting SaaS solutions on AWS", Oleksandr Mykhalchuk, AWS Dev Day Kyiv ...
 
"Developing with .NET Core on AWS", Martin Beeby, AWS Dev Day Kyiv 2019
"Developing with .NET Core on AWS", Martin Beeby, AWS Dev Day Kyiv 2019"Developing with .NET Core on AWS", Martin Beeby, AWS Dev Day Kyiv 2019
"Developing with .NET Core on AWS", Martin Beeby, AWS Dev Day Kyiv 2019
 
"How to build real-time backends", Martin Beeby, AWS Dev Day Kyiv 2019
"How to build real-time backends", Martin Beeby, AWS Dev Day Kyiv 2019"How to build real-time backends", Martin Beeby, AWS Dev Day Kyiv 2019
"How to build real-time backends", Martin Beeby, AWS Dev Day Kyiv 2019
 
"Integrate your front end apps with serverless backend in the cloud", Sebasti...
"Integrate your front end apps with serverless backend in the cloud", Sebasti..."Integrate your front end apps with serverless backend in the cloud", Sebasti...
"Integrate your front end apps with serverless backend in the cloud", Sebasti...
 
"Scaling ML from 0 to millions of users", Julien Simon, AWS Dev Day Kyiv 2019
"Scaling ML from 0 to millions of users", Julien Simon, AWS Dev Day Kyiv 2019"Scaling ML from 0 to millions of users", Julien Simon, AWS Dev Day Kyiv 2019
"Scaling ML from 0 to millions of users", Julien Simon, AWS Dev Day Kyiv 2019
 
How to implement authorization in your backend with AWS IAM
How to implement authorization in your backend with AWS IAMHow to implement authorization in your backend with AWS IAM
How to implement authorization in your backend with AWS IAM
 

Kürzlich hochgeladen

result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
Tonystark477637
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
rknatarajan
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Christo Ananth
 

Kürzlich hochgeladen (20)

result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 

Using Apache Spark as ETL engine. Pros and Cons

  • 1. Using Apache Spark as ETL engine Pros and cons Maksym Doroshenko Big Data Software Engineer LeadGenius, Provectus
  • 2. Agenda 1. What is Spark 2. Spark components 3. Spark pillars 4. What is ETL pipeline 5. Using Spark SQL for ETL 6. Customer use case 7. Demo
  • 3. Spark, who are you? I am is a fast and general engine for large-scale data processing.
  • 4. Prove it Hadoop MR Spark Spark Data Size 102.5 TB 100 TB 1000 TB Elapsed Time 72 mins 23 mins 234 mins # Nodes 2100 206 190 # Cores 50400 6592 6080 # Reducers 10,000 29,000 250,000 Rate 1.42 TB/min 4.27 TB/min 4.27 TB/min Rate/node 0.67 GB/min 20.7 GB/min 22.5 GB/min Environment dedicated data center EC2 (i2.8xlarge) EC2 (i2.8xlarge) Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
  • 5. Spark use cases - Simplify the challenging and compute-intensive task of processing high volumes of data - Real time data processing - Seamlessly integrating complex capabilities such as machine learning and graph algorithms - Spark brings Big Data processing to the masses
  • 6. Survey: Why companies use Spark ? 91% use Apache Spark because of its performance gains 77% use Apache Spark as it is easy to use 71% use Apache Spark due to the ease of deployment 64% use Apache Spark to leverage advanced analytics 52% use Apache Spark for real-time streaming.
  • 8. What is RDD Resilient Distributed Dataset - a big collection of data with following properties: - Immutable - Distributed - Lazily evaluated - Fault tolerante
  • 10. Narrow transformations - map - flatMap - filter - etc.
  • 11. Wide transformations - reduceByKey - groupByKey - sortByKey - etc.
  • 13. Spark Dataframes Dataframes is distributed collection of data grouped into named columns (RDD with schema) with more efficient storage options, advanced optimizer, and direct operations on serialized data. These components are super important for getting the best of Spark performance
  • 14. What is ETL? 1. Sequence of transformation on data 2. Source data is typically semi-structured/unstructured (Text, JSON, CSV etc.) and structured (JDBC, Parquet, ORC, AVRO, etc.) 3. Output data is clean, structured, integrated and ready for further data processing, analysis and reporting.
  • 15. ETL query in Spark
  • 16. Why is ETL Hard? 1. Various sources/formats 2. Schema mismatch 3. Different representation 4. Corrupted files and data 5. Scalability 6. Schema evolution
  • 17. This is why ETL is important Consumers of this data do not want to deal with this messiness and complexity
  • 18. Spark SQL's flexible APIs, support for a wide variety of datasources, build-in support for structured streaming, state of art catalyst optimizer and tungsten execution engine make it a great framework for building end-to-end ETL pipelines. Spark SQL
  • 20. Schema inference: semi structured data
  • 21. Schema inference: semi structured data
  • 22. User specified schema Faster No scan to infer schema More flexible Easily handle schema evolving More robust Handle type errors ASAP
  • 23. Deal with bad data java.io.IOException: org.apache.hadoop.io.compress.DecompressorStream.decompress java.io.EOFException: Unexpected end of input stream java.lang.RuntimeException: file:/temp/path/c000.json is not a Parquet file (too small) [SPARK-17850] If true, the Spark jobs will continue to run even when it encounters corrupt files. The contents that have been read will still be returned. spark.sql.files.ignoreCorruptFiles = true
  • 24. Deal with bad data [SPARK-12833] [SPARK-13764] TextFile formats (JSON and CSV) supports 3 Parse modes while reading data: PERMISSIVE DROPMALFORMED FAILFAST
  • 25. Better JSON and CSV support [SPARK-18352] [SPARK-19610] Multiline JSON and CSV support Spark SQL reads JSON/CSV one line at time Before Spark 2.2 it requires custom ETL
  • 26. Transformations: Higher order functions in SQL Transformations on complex objects like arrays, maps and structures inside of columns. 1. Check for element existence SELECT EXIST(values, e->e>30) AS v FROM tbl_nested; 2. Transform an array SELECT TRANSFORM(values, e->e*e) AS v FROM tbl_nested;
  • 27. Transformations: Higher order functions in SQL 3. Filter an array SELECT FILTER(values, e->e>30) AS v FROM tbl_nested; 4. Aggregate an array SELECT REDUCE(values, 0, (value, acc)->acc+value) AS v FROM tbl_nested;
  • 29. Customer Use Case - Data sources in different formats - Mapping data to golden customer schema - Normalize all data (e.g. email, phone) - Link and merge same entities
  • 30. Spark ETL Pros & Cons - Pros: Open source Great community Easy to scale Strong transformation engine Support different languages Unified API for different components - Cons: No File management system Resource consuming Manual configuration tuning No ETL UI
  • 31. QA & DEMO TIME mdoroshenko@provectus.com skype: maxdor3