1© Cloudera, Inc. All rights reserved.
Tips for Writing ETL Pipelines with Spark
Imran Rashid | Cloudera, Apache Spark PMC
Outline
• Quick Refresher
• Tips for Pipelines
• Spark Performance
• Using the UI
• Understanding Stage Boundaries
• Baby photos
About Me
• Member of the Spark PMC
• User of Spark from v0.5 at Quantifind
• Built ETL pipelines, prototype to production
• Supported Data Scientists
• Now work on Spark full time at Cloudera
RDDs: Resilient Distributed Dataset
• Data is distributed into partitions spread across a cluster
• Each partition is processed independently and in parallel
• Logical view of the data – not materialized
Image from Dean Wampler, Typesafe
Expressive API
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save
• ...
Cheap!
• No serialization
• No IO
• Pipelined
Expensive!
• Serialize Data
• Write to disk
• Transfer over network
• Deserialize Data
Compare to MapReduce Word Count
Hadoop MapReduce
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // emit (word, 1) for every token in the line
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  // sum the counts collected for each word
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark
// [sparkHome] and [jars] are optional arguments
val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Useful Patterns
Pipelines get complicated
• Pipelines get messy
• Input data is messy
• Things go wrong
• Never fast enough
• Need stability for months to years
• Need Forecasting / Capacity Planning
[Figure labels: Alice one year ago; Bob 6 months ago; Connie 3 months ago; Derrick last month; Alice last week]
Design Goals
• Modularity
• Error Handling
• Understand where and how
Catching Errors (1)
sc.textFile(…).map { line =>
  // blows up with parse exception
  parse(line)
}
sc.textFile(…).flatMap { line =>
  // now we’re safe, right?
  Try(parse(line)).toOption
}
How many errors? 1 record? 100 records? 90% of our data?
Catching Errors (2)
import scala.util.{Try, Success, Failure}

val parseErrors = sc.accumulator(0L)
val parsed = sc.textFile(…).flatMap { line =>
  Try(parse(line)) match {
    case Success(s) => Some(s)
    case Failure(f) =>
      parseErrors += 1
      None
  }
}
// parseErrors is always 0 here: flatMap is lazy and no action has run yet
if (parseErrors.value > 500) fail(…)
// and what if we want to see those errors?
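The counter stays at zero because transformations are lazy. A minimal sketch of the same effect without a cluster, assuming a plain Scala `Iterator` as a stand-in for an RDD and a hypothetical `parse` that accepts only integers (neither is from the talk):

```scala
import scala.util.{Try, Success, Failure}

object LazyCounterDemo {
  // Hypothetical parser for illustration: accepts only integer strings.
  def parse(line: String): Int = line.trim.toInt

  // Returns (errors seen before forcing, errors seen after forcing, parsed values).
  def run(lines: Seq[String]): (Long, Long, List[Int]) = {
    var parseErrors = 0L
    // Iterator.flatMap is lazy, much like an RDD transformation.
    val parsed = lines.iterator.flatMap { line =>
      Try(parse(line)) match {
        case Success(v) => Some(v)
        case Failure(_) =>
          parseErrors += 1
          None
      }
    }
    val before = parseErrors   // nothing has been evaluated yet
    val values = parsed.toList // forcing the iterator plays the role of an action
    val after = parseErrors
    (before, after, values)
  }
}
```

Reading the counter only after the data has been fully consumed is the local analogue of checking an accumulator after an action has run.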
Catching Errors (3)
• Accumulators break the RDD abstraction
• You care about when an action has taken place
• Force action, or pass error handling on
• SparkListener to deal w/ failures
• https://gist.github.com/squito/2f7cc02c313e4c9e7df4#file-accumulatorlistener-scala
case class ParsedWithErrorCounts(parsed: RDD[LogLine], errors: Accumulator[Long])

def parseCountErrors(path: String, sc: SparkContext): ParsedWithErrorCounts = {
  // named accumulator, so the count is labelled "parseErrors"
  val parseErrorCounter = sc.accumulator(0L, "parseErrors")
  val parsed = sc.textFile(path).flatMap { line =>
    line match {
      case LogPattern(date, thread, level, source, msg) =>
        Some(LogLine(date, thread, level, source, msg))
      case _ =>
        parseErrorCounter += 1
        None
    }
  }
  // the counter is only meaningful after an action has run on `parsed`
  ParsedWithErrorCounts(parsed, parseErrorCounter)
}
Catching Errors (4)
• Accumulators can give you “multiple output”
• Create sample of error records
• You can look at them for debugging
• WARNING: accumulators are not scalable
class ReservoirSample[T] {...}
class ReservoirSampleAccumulableParam[T]
  extends AccumulableParam[ReservoirSample[T], T] {...}

def parseCountErrors(path: String, sc: SparkContext): ParsedWithErrorCounts = {
  val parseErrors = sc.accumulable(
    new ReservoirSample[String](100))(…)
  val parsed = sc.textFile(path).flatMap { line =>
    line match {
      case LogPattern(date, thread, level, source, msg) =>
        Some(LogLine(date, thread, level, source, msg))
      case _ =>
        parseErrors += line
        None
    }
  }
  ParsedWithErrorCounts(parsed, parseErrors)
}
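The slide elides the body of `ReservoirSample`. One possible shape for it, assuming Algorithm R with a seeded `scala.util.Random`; the method names `add`, `sample`, and `count` are hypothetical, and the merge logic that an `AccumulableParam` would also need is left out:

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Hypothetical minimal reservoir sample (Algorithm R); not the class from the talk.
final class ReservoirSample[T](val capacity: Int, rng: Random = new Random(42))
    extends Serializable {
  require(capacity > 0, "capacity must be positive")

  private val buf = ArrayBuffer.empty[T]
  private var seen = 0L

  // Offer one item; every item seen so far ends up in the sample
  // with (approximately) equal probability capacity / seen.
  def add(item: T): this.type = {
    seen += 1
    if (buf.size < capacity) {
      buf += item
    } else {
      val j = (rng.nextDouble() * seen).toLong // roughly uniform in [0, seen)
      if (j < capacity) buf(j.toInt) = item
    }
    this
  }

  def count: Long = seen
  def sample: List[T] = buf.toList
}
```

Because the sample is bounded by `capacity`, the amount of data sent back to the driver stays fixed no matter how many bad records appear, which is what keeps this use of an accumulator tolerable.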
Catching Errors (5)
• What if instead, we just filter out each condition?
• Beware deep pipelines
• E.g. RDD.randomSplit
[Diagram: Huge Raw Data passes through Filter and FlatMap into …parsed, Error 1, and Error 2 outputs]
Modularity with RDDs
• Who is caching what?
• What resources should each component use?
• What assumptions are made on inputs?
Win By Cheating
• Fastest way to shuffle a lot of data:
• Don’t shuffle
• Second fastest way to shuffle a lot of data:
• Shuffle a small amount of data
• ReduceByKey
• Approximate Algorithms
• Same as MapReduce
• Bloom filters, HyperLogLog, t-digest
• Joins with Narrow Dependencies
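As an illustration of "shuffle a small amount of data", a Bloom filter built from the keys of a small data set can be used to drop most non-matching records from a large data set before any join. The sketch below is a hand-rolled filter for illustration only, assuming string keys and `MurmurHash3` with different seeds as the hash family; the class name `BloomFilter` and its methods are hypothetical, and a tested library implementation would be the safer choice in production.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical minimal Bloom filter: no false negatives, some false positives.
final class BloomFilter(numBits: Int, numHashes: Int) extends Serializable {
  require(numBits > 0 && numHashes > 0)

  private val bits = new java.util.BitSet(numBits)

  // One bit position per hash function, using the seed to vary the hash.
  private def positions(key: String): Seq[Int] =
    (0 until numHashes).map { seed =>
      java.lang.Math.floorMod(MurmurHash3.stringHash(key, seed), numBits)
    }

  def add(key: String): Unit = positions(key).foreach(i => bits.set(i))

  // false means "definitely absent"; true means "possibly present".
  def mightContain(key: String): Boolean = positions(key).forall(i => bits.get(i))
}
```

In a Spark job the filter would typically be built on the small side, shipped to the executors, and applied in a `filter` on the large side, so that only records that might match are shuffled.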
ReduceByKey when Possible
• ReduceByKey allows a map-side combine
• Data is merged together before it’s serialized & sent over network
• GroupByKey transfers all the data
• Higher serialization and network transfer costs
// reduceByKey: counts are combined per partition before the shuffle
parsed
  .map { line => (line.level, 1) }
  .reduceByKey { (a, b) => a + b }
  .collect()
// groupByKey: every (level, 1) pair is shuffled, then summed
parsed
  .map { line => (line.level, 1) }
  .groupByKey
  .map { case (level, counts) => (level, counts.sum) }
  .collect()
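The saving from a map-side combine can be seen by counting how many records would cross the shuffle in each case. A local sketch, assuming partitions are modelled as plain Scala sequences of `(level, 1)` pairs; `MapSideCombineDemo` and its method names are hypothetical:

```scala
object MapSideCombineDemo {
  type Partition = Seq[(String, Int)]

  // groupByKey-style: every record crosses the shuffle.
  def shuffledWithoutCombine(parts: Seq[Partition]): Int =
    parts.map(_.size).sum

  // Merge values per key inside a single partition (the map-side combine).
  def combine(part: Partition): Map[String, Int] =
    part.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // reduceByKey-style: at most one record per key per partition crosses the shuffle.
  def shuffledWithCombine(parts: Seq[Partition]): Int =
    parts.map(p => combine(p).size).sum

  // Final reduce over the pre-combined records.
  def reduceAll(parts: Seq[Partition]): Map[String, Int] =
    parts.flatMap(p => combine(p).toSeq)
      .groupBy(_._1)
      .map { case (k, vs) => k -> vs.map(_._2).sum }
}
```

With only a handful of distinct log levels, the combined version shuffles a few records per partition regardless of how many log lines each partition holds.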
But I need groupBy
• E.g., incoming transaction logs from users
• 10 TB of historical data
• 50 GB of new data each day
[Diagram: Historical Logs plus Day 1, Day 2, and Day 3 logs feed into Grouped Logs]
Using Partitioners for Narrow Joins
• Sort the Historical Logs once
• Each day, sort the small new data
• Join – narrow dependency
• Write data to HDFS
• Day 2 – now what?
• SPARK-1061
• Read from HDFS
• “Remember” data was written with a partitioner
[Figure: Wide Join vs. Narrow Join]
Assume Partitioned
• Day 2 – now what?
• SPARK-1061
• Read from HDFS
• “Remember” data was written with a partitioner
// Day 1
val myPartitioner = …
val historical =
  sc.hadoopFile("…/mergedLogs/2015/05/19", …)
    .partitionBy(myPartitioner)
val newData =
  sc.hadoopFile("…/newData/2015/05/20", …)
    .partitionBy(myPartitioner)
// both sides share myPartitioner, so the cogroup is a narrow dependency
val grouped = historical.cogroup(newData)
grouped.saveAsHadoopFile("…/mergedLogs/2015/05/20")

// Day 2 – new spark context
// assumePartitionedBy is the helper discussed in SPARK-1061, not a stock RDD method
val historical =
  sc.hadoopFile("…/mergedLogs/2015/05/20", …)
    .assumePartitionedBy(myPartitioner)
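Why a shared partitioner makes the join narrow can be shown locally: if both data sets place each key with the same rule, partition i of one side only ever needs partition i of the other. A sketch assuming hash partitioning by non-negative `hashCode` modulo the partition count; `CoPartitionDemo` and its method names are hypothetical:

```scala
object CoPartitionDemo {
  // Hash partitioning rule: non-negative key.hashCode modulo partition count.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Split a keyed data set into numPartitions buckets with the shared rule.
  def partition[K, V](data: Seq[(K, V)], numPartitions: Int): Vector[Seq[(K, V)]] = {
    val grouped = data.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty))
  }

  // Narrow cogroup: partition i of `left` is only ever paired with partition i of `right`.
  def narrowCogroup[K, A, B](
      left: Vector[Seq[(K, A)]],
      right: Vector[Seq[(K, B)]]): Map[K, (Seq[A], Seq[B])] = {
    require(left.size == right.size, "both sides must use the same partitioner")
    left.zip(right).flatMap { case (l, r) =>
      val keys = (l.map(_._1) ++ r.map(_._1)).distinct
      keys.map { k =>
        k -> ((l.filter(_._1 == k).map(_._2), r.filter(_._1 == k).map(_._2)))
      }
    }.toMap
  }
}
```

No record moves between partition indices in `narrowCogroup`, which is the property that lets the large historical side stay where it is each day.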
Recovering from Errors
• I write bugs
• You write bugs
• Spark has bugs
• The bugs might appear after 17 hours in stage 78 of your application
• Spark’s failure recovery might not help you
HDFS: It’s not so bad
• DiskCachedRDD
• Before doing any work, check if it exists on disk
• If so, just load it
• If not, create it and write it to disk
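The check-then-load-or-create steps above can be sketched with local files standing in for HDFS. This is not the `DiskCachedRDD` from the talk, whose code is not shown; `DiskCached.loadOrCompute` is a hypothetical helper that assumes results are lines of text without embedded newlines:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object DiskCached {
  // If `path` exists, load it; otherwise compute the result, write it, and return it.
  def loadOrCompute(path: Path)(compute: => List[String]): List[String] =
    if (Files.exists(path)) {
      val text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8)
      if (text.isEmpty) Nil else text.split("\n").toList
    } else {
      val result = compute // only evaluated when nothing is cached
      Files.write(path, result.mkString("\n").getBytes(StandardCharsets.UTF_8))
      result
    }
}
```

A rerun after a failure in a late stage then skips every step whose output already exists. A more careful version would write to a temporary location and rename on success, so that a half-written output is never mistaken for a finished one.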
Partitions, Partitions, Partitions …
• Partitions should be small
• Max partition size is 2GB*
• Small partitions help deal w/ stragglers
• Small partitions avoid overhead – take a closer look at internals …
• Partitions should be big
• “For ML applications, the best setting to set the number of partitions to match the number of cores to reduce shuffle size.” Xiangrui Meng on user@
• Why? Take a closer look at internals …
Parameterize Partition Numbers
• Many transformations take a second parameter
• reduceByKey(…, nPartitions)
• sc.textFile(…, nPartitions)
• Both sides of shuffle matter!
• Shuffle read (aka “reduce”)
• Shuffle write (aka “map”) – controlled by previous stage
• As datasets change, you might need to change the numbers
• Make this a parameter to your application
• Yes, you may need to expose a LOT of parameters
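One way to expose partition counts as application parameters is a small settings class filled from command-line arguments. A sketch under stated assumptions: the flag names, the defaults of 200, and the names `PartitionParams` and `PartitionParamsParser` are all hypothetical.

```scala
// Hypothetical settings holder; the defaults are placeholders, not recommendations.
final case class PartitionParams(inputPartitions: Int = 200, reducePartitions: Int = 200)

object PartitionParamsParser {
  // Reads flags of the form --input-partitions=N and --reduce-partitions=N;
  // anything else is ignored.
  def fromArgs(args: Seq[String]): PartitionParams =
    args.foldLeft(PartitionParams()) { (params, arg) =>
      arg.split("=", 2) match {
        case Array("--input-partitions", v)  => params.copy(inputPartitions = v.toInt)
        case Array("--reduce-partitions", v) => params.copy(reducePartitions = v.toInt)
        case _                               => params
      }
    }
}
```

The parsed values would then be passed where the slide shows `nPartitions`, for example as the second argument to `sc.textFile` and to `reduceByKey`, so a change in data volume needs a new flag value rather than a code change.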
Using the UI
Some Demos
• Collect a lot of data
• Slow tasks
• DAG visualization
• RDD names
Understanding Performance
What data and where is it going?
• Narrow Dependencies (aka “OneToOneDependency”)
• cheap
• Wide Dependencies (aka shuffles)
• how much is shuffled
• Is it skewed
• Driver bottleneck
Driver can be a bottleneck
Credit: Sandy Ryza, Cloudera
Driver can be a bottleneck
Operation | GOOD | BAD
rdd.collect() | Exploratory data analysis; merging a small set of results. | Sequentially scan entire data set on driver. No parallelism, OOM on driver.
rdd.reduce() | Summarize the results from a small dataset. | Big data structures, from lots of partitions.
sc.accumulator() | Small data types, e.g., counters. | Big data structures, from lots of partitions. Set of a million “most interesting” user ids from each partition.
Stage Boundaries
Stages are not MapReduce Steps!
[Diagram labels: the MapReduce side shows repeated Map / Shuffle / Reduce steps, each marked as one MapReduce Step. The Spark side shows Map, ReduceByKey (map-side combine), Shuffle, Filter, ReduceByKey, FlatMap, GroupByKey, Shuffle, Collect.]
I still get confused
(discussion in a code review, testing a large sortByKey)
WP: … then we wait for completion of stage 3 …
ME: hang on, stage 3? Why are there 3 stages? SortByKey does one extra pass to find the range of the keys, but that’s two stages
WP: The other stage is data generation
ME: That can’t be right. Data generation is pipelined, it’s just part of the first stage
…
ME: duh – the final sort is two stages – shuffle write then shuffle read
[Diagram: InputRDD (NB: computed twice!) feeds Stage 1 (sample data to find range of keys) and Stage 2 (ShuffleMap for sort); Stage 3 is the ShuffleRead for sort.]
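The extra pass in the exchange above exists because a range sort has to learn the key distribution before it can shuffle. A local sketch of the three steps, assuming integer keys and using the full data set as the sample for simplicity (Spark samples a subset); `RangeSortDemo` and its method names are hypothetical:

```scala
object RangeSortDemo {
  // Pass 1: look at a sample of keys and pick numPartitions - 1 split points.
  def rangeBounds(sample: Seq[Int], numPartitions: Int): Vector[Int] = {
    require(sample.nonEmpty && numPartitions > 0)
    val sorted = sample.sorted.toVector
    (1 until numPartitions).map(i => sorted(i * sorted.size / numPartitions)).toVector
  }

  // "Shuffle write": route each key to the first range whose upper bound covers it.
  def partitionOf(key: Int, bounds: Vector[Int]): Int = {
    val idx = bounds.indexWhere(b => key <= b)
    if (idx == -1) bounds.size else idx
  }

  // "Shuffle read": each partition sorts its own keys; partitions are already ordered
  // relative to each other, so concatenating them gives a global sort.
  def sortByRange(data: Seq[Int], numPartitions: Int): Vector[Vector[Int]] = {
    val bounds = rangeBounds(data, numPartitions)
    val buckets = data.groupBy(k => partitionOf(k, bounds))
    Vector.tabulate(numPartitions)(i => buckets.getOrElse(i, Seq.empty).sorted.toVector)
  }
}
```

Note that `data` is traversed twice, once for the bounds and once for the bucketing, which mirrors the "computed twice" remark on the input in the diagram.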
Tip grab bag
• Minimize data volume
• Compact formats: Avro, Parquet
• Kryo Serialization
• require registration in development, but not in production
• Look at data skew, key cardinality
• Tune your cluster
• Use the UI to tune your job
• Set names on all cached RDDs
More Resources
• Very active and friendly community
• http://spark.apache.org/community.html
• Dean Wampler’s self-paced spark workshop
• https://github.com/deanwampler/spark-workshop
• Tips for Better Spark Jobs
• http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
• Tuning & Debugging Spark (with another explanation of internals)
• http://www.slideshare.net/pwendell/tuning-and-debugging-in-apache-spark
• Tuning Spark On Yarn
• http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
Thank you
Cleaning Up Resources (Try 1)
Cleaning Up Resources (Try 2)
Cleaning Up Resources (Success)

Weitere ähnliche Inhalte

Was ist angesagt?

Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code GenerationDatabricks
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderDatabricks
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemDatabricks
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Photon Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedPhoton Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedDatabricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 

Was ist angesagt? (20)

Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks Autoloader
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Photon Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedPhoton Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think Vectorized
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 

Andere mochten auch

ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkDataWorks Summit
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterDon Drake
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Spark Summit
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Databricks
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesAlfredo Abate
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAiougVizagChapter
 
Oracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA TechnologiesOracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA TechnologiesNAYATech
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsJen Aman
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Bryan Yang
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingKristian Alexander
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5Yan Zhou
 
Pandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data SciencePandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data ScienceKrishna Sankar
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with SparkKrishna Sankar
 

Andere mochten auch (20)

ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache Spark
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
 
Oracle's BigData solutions
Oracle's BigData solutionsOracle's BigData solutions
Oracle's BigData solutions
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_Features
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_features
 
Oracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA TechnologiesOracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA Technologies
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
 
AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...
AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...
AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data Benchmarking
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5
 
Pandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data SciencePandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data Science
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 

Ähnlich wie Spark etl

Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Cloudera, Inc.
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform WebinarCloudera, Inc.
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Cloudera, Inc.
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisFelicia Haggarty
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014NoSQLmatters
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemCloudera, Inc.
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache KuduJeff Holoman
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezMapR Technologies
 
ASHviz - Dats visualization research experiments using ASH data
ASHviz - Dats visualization research experiments using ASH dataASHviz - Dats visualization research experiments using ASH data
ASHviz - Dats visualization research experiments using ASH dataJohn Beresniewicz
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkCloudera, Inc.
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataCloudera, Inc.
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Cloudera, Inc.
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupMike Percy
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoopclairvoyantllc
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...Databricks
 

Spark etl

  • 1. Tips for Writing ETL Pipelines with Spark
      Imran Rashid | Cloudera, Apache Spark PMC
      © Cloudera, Inc. All rights reserved.
  • 2. Outline
      • Quick Refresher
      • Tips for Pipelines
      • Spark Performance
      • Using the UI
      • Understanding Stage Boundaries
      • Baby photos
  • 3. About Me
      • Member of the Spark PMC
      • User of Spark from v0.5 at Quantifind
      • Built ETL pipelines, prototype to production
      • Supported Data Scientists
      • Now work on Spark full time at Cloudera
  • 4. RDDs: Resilient Distributed Dataset
      • Data is distributed into partitions spread across a cluster
      • Each partition is processed independently and in parallel
      • Logical view of the data – not materialized
      (Image from Dean Wampler, Typesafe)
  • 5. Expressive API
      • map • filter • groupBy • sort • union • join • leftOuterJoin • rightOuterJoin
      • reduce • count • fold • reduceByKey • groupByKey • cogroup • cross • zip
      • sample • take • first • partitionBy • mapWith • pipe • save • ...
  • 6. Cheap!
      • No serialization
      • No IO
      • Pipelined
      Expensive!
      • Serialize Data
      • Write to disk
      • Transfer over network
      • Deserialize Data
  • 7. Compare to MapReduce Word Count
      Hadoop MapReduce:
      public static class WordCountMapClass extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer itr = new StringTokenizer(line);
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
          }
        }
      }

      public static class WordCountReduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      Spark:
      val spark = new SparkContext(master, appName, [sparkHome], [jars])
      val file = spark.textFile("hdfs://...")
      val counts = file.flatMap(line => line.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      counts.saveAsTextFile("hdfs://...")
  • 8. Useful Patterns
  • 9. Pipelines get complicated
      • Pipelines get messy
      • Input data is messy
      • Things go wrong
      • Never fast enough
      • Need stability for months to years
      • Need Forecasting / Capacity Planning
      [Figure labels: Alice one year ago, Bob 6 months ago, Connie 3 months ago, Derrick last month, Alice last week]
  • 10. Design Goals
      • Modularity
      • Error Handling
      • Understand where and how
  • 11. Catching Errors (1)
      sc.textFile(…).map { line =>
        // blows up with parse exception
        parse(line)
      }

      sc.textFile(…).flatMap { line =>
        // now we're safe, right?
        Try(parse(line)).toOption
      }

      How many errors? 1 record? 100 records? 90% of our data?
  • 12. Catching Errors (2)
      val parseErrors = sc.accumulator(0L)
      val parsed = sc.textFile(…).flatMap { line =>
        Try(parse(line)) match {
          case Success(s) => Some(s)
          case Failure(f) =>
            parseErrors += 1
            None
        }
      }
      // parse errors is always 0 here, because no action has run yet
      if (parseErrors.value > 500) fail(…)
      // and what if we want to see those errors?
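The "always 0" surprise on slide 12 comes from lazy evaluation: the counter is only updated once something actually consumes the data. The sketch below reproduces that effect without Spark, using a plain Scala `Iterator` as a loose stand-in for an RDD and a local `var` as a stand-in for the accumulator. `LazyErrorCount`, `parse`, and `run` are hypothetical names made up for this illustration, not part of the talk.

```scala
import scala.util.{Failure, Success, Try}

object LazyErrorCount {
  // Hypothetical parser used only for this illustration.
  def parse(line: String): Int = line.trim.toInt

  // Returns (errors seen before consuming, errors seen after, parsed values).
  def run(lines: Seq[String]): (Long, Long, List[Int]) = {
    var parseErrors = 0L // stand-in for the accumulator

    // Iterator.flatMap is lazy, loosely like an RDD transformation.
    val parsed: Iterator[Int] = lines.iterator.flatMap { line =>
      Try(parse(line)) match {
        case Success(value) => Some(value)
        case Failure(_) =>
          parseErrors += 1
          None
      }
    }

    val before = parseErrors   // nothing has been parsed yet
    val result = parsed.toList // plays the role of the action
    val after = parseErrors    // only now is the counter meaningful
    (before, after, result)
  }
}
```

Checking the error threshold is therefore only meaningful after the point where `result` is materialized, which is the same ordering constraint the slide describes for accumulators and actions.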
  • 13. Catching Errors (3)
      • Accumulators break the RDD abstraction
      • You care about when an action has taken place
      • Force action, or pass error handling on
      • SparkListener to deal w/ failures
      • https://gist.github.com/squito/2f7cc02c313e4c9e7df4#file-accumulatorlistener-scala

      case class ParsedWithErrorCounts(val parsed: RDD[LogLine],
                                       errors: Accumulator[Long])

      def parseCountErrors(path: String, sc: SparkContext): ParsedWithErrorCounts = {
        val parseErrorCounter = sc.accumulator(0L).setName("parseErrors")
        val parsed = sc.textFile(path).flatMap { line =>
          line match {
            case LogPattern(date, thread, level, source, msg) =>
              Some(LogLine(date, thread, level, source, msg))
            case _ =>
              parseErrorCounter += 1
              None
          }
        }
        ParsedWithErrorCounts(parsed, parseErrorCounter)
      }
  • 14. Catching Errors (4)
      • Accumulators can give you “multiple output”
      • Create sample of error records
      • You can look at them for debugging
      • WARNING: accumulators are not scalable

      class ReservoirSample[T] {...}
      class ReservoirSampleAccumulableParam[T]
        extends AccumulableParam[ReservoirSample[T], T] {...}

      def parseCountErrors(path: String, sc: SparkContext): ParsedWithErrorCounts = {
        val parseErrors = sc.accumulable(
          new ReservoirSample[String](100))(…)
        val parsed = sc.textFile(path).flatMap { line =>
          line match {
            case LogPattern(date, thread, level, source, msg) =>
              Some(LogLine(date, thread, level, source, msg))
            case _ =>
              parseErrors += line
              None
          }
        }
        ParsedWithErrorCounts(parsed, parseErrors)
      }
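Slide 14 leaves the body of `ReservoirSample` elided. As a rough idea of what a bounded sample of error records could look like, here is a minimal sketch of classic reservoir sampling (Algorithm R) in plain Scala. It is an assumption about the general technique, not the author's class: the constructor arguments and the `add`, `sample`, and `count` members are hypothetical, and the merge step that an accumulator would also need is deliberately left out.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Hypothetical sketch: keeps a uniform random sample of at most `capacity` items.
final class ReservoirSample[T](capacity: Int, rng: Random = new Random()) {
  require(capacity > 0, "capacity must be positive")

  private val buffer = ArrayBuffer.empty[T]
  private var seen = 0L

  def add(item: T): Unit = {
    seen += 1
    if (buffer.size < capacity) {
      // Still filling the reservoir: keep everything.
      buffer += item
    } else {
      // Pick a slot uniformly in [0, seen); the item survives only if
      // that slot falls inside the reservoir (probability capacity / seen).
      val slot = (rng.nextDouble() * seen).toLong
      if (slot < capacity) buffer(slot.toInt) = item
    }
  }

  def sample: Vector[T] = buffer.toVector
  def count: Long = seen
}
```

Because the sample is capped at `capacity` items, the amount of data sent back to the driver stays bounded no matter how many bad records there are, which is the property that matters given the slide's warning that accumulators are not scalable.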
  • 15. Catching Errors (5)
      • What if instead, we just filter out each condition?
      • Beware deep pipelines
      • Eg. RDD.randomSplit
      [Diagram labels: Huge Raw Data, Filter, FlatMap, …parsed, Error 1, Error 2]
  • 16. Modularity with RDDs
      • Who is caching what?
      • What resources should each component use?
      • What assumptions are made on inputs?
  • 17. Win By Cheating
      • Fastest way to shuffle a lot of data:
          • Don’t shuffle
      • Second fastest way to shuffle a lot of data:
          • Shuffle a small amount of data
          • ReduceByKey
          • Approximate Algorithms
              • Same as MapReduce
              • BloomFilters, HyperLogLog, Tdigest
      • Joins with Narrow Dependencies
  • 18. ReduceByKey when Possible
      • ReduceByKey allows a map-side-combine
          • Data is merged together before it's serialized & sent over network
      • GroupByKey transfers all the data
          • Higher serialization and network transfer costs

      parsed
        .map { line => (line.level, 1) }
        .reduceByKey { (a, b) => a + b }
        .collect()

      parsed
        .map { line => (line.level, 1) }
        .groupByKey
        .map { case (word, counts) => (word, counts.sum) }
        .collect()
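To make the map-side combine on slide 18 concrete, the following plain-Scala sketch models partitions as in-memory sequences and counts how many records would leave each map task with and without combining. It is a simplified model for intuition only, not how Spark implements its shuffle, and `MapSideCombine` and its methods are hypothetical names.

```scala
object MapSideCombine {
  type Record = (String, Int)

  // groupByKey-like: every record leaves its map task unchanged.
  def shuffledWithoutCombine(partitions: Seq[Seq[Record]]): Int =
    partitions.map(_.size).sum

  // reduceByKey-like: sum values per key inside one partition first.
  def combineWithinPartition(partition: Seq[Record]): Map[String, Int] =
    partition.groupBy(_._1).map { case (key, recs) => (key, recs.map(_._2).sum) }

  // After combining, each partition emits at most one record per key.
  def shuffledWithCombine(partitions: Seq[Seq[Record]]): Int =
    partitions.map(p => combineWithinPartition(p).size).sum

  // Reduce side: merge the per-partition partial sums into final counts.
  def reduceSide(partials: Seq[Map[String, Int]]): Map[String, Int] =
    partials.flatten.groupBy(_._1).map { case (key, recs) => (key, recs.map(_._2).sum) }
}
```

With a low-cardinality key such as a log level, the combined output per partition is bounded by the number of distinct keys rather than the number of input lines, which is where the savings in serialization and network transfer come from.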
  • 19. But I need groupBy
      • Eg., incoming transaction logs from user
      • 10 TB of historical data
      • 50 GB of new data each day
      [Diagram labels: Historical Logs, Day 1 logs, Day 2 Logs, Day 3 Logs, Grouped Logs]
  • 20. Using Partitioners for Narrow Joins
      • Sort the Historical Logs once
      • Each day, sort the small new data
      • Join – narrow dependency
      • Write data to hdfs
      • Day 2 – now what?
          • SPARK-1061
          • Read from hdfs
          • “Remember” data was written with a partitioner
      [Diagram labels: Wide Join, Narrow Join]
  • 21. Assume Partitioned
      • Day 2 – now what?
          • SPARK-1061
          • Read from hdfs
          • “Remember” data was written with a partitioner

      // Day 1
      val myPartitioner = …
      val historical = sc.hadoopFile("…/mergedLogs/2015/05/19", …)
        .partitionBy(myPartitioner)
      val newData = sc.hadoopFile("…/newData/2015/05/20", …)
        .partitionBy(myPartitioner)
      val grouped = historical.cogroup(newData)
      grouped.saveAsHadoopFile("…/mergedLogs/2015/05/20")

      // Day 2 – new spark context
      val historical = sc.hadoopFile("…/mergedLogs/2015/05/20", …)
        .assumePartitionedBy(myPartitioner)
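The reason a join between two datasets that share a partitioner needs no shuffle can be shown with a small plain-Scala model: if both sides place a key using the same function, partition i of one side only ever needs partition i of the other. The sketch below is an illustration under that assumption, not Spark code, and `CoPartitionedJoin`, `partitionOf`, `partitionBy`, and `narrowCogroup` are hypothetical names.

```scala
object CoPartitionedJoin {
  // Non-negative hash modulo the partition count, the usual hash-partitioning idea.
  def partitionOf(key: String, numPartitions: Int): Int = {
    val m = key.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Split a dataset into numPartitions buckets using partitionOf.
  def partitionBy[V](data: Seq[(String, V)], numPartitions: Int): Vector[Seq[(String, V)]] = {
    val grouped = data.groupBy { case (key, _) => partitionOf(key, numPartitions) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty))
  }

  // Cogroup partition i of the left side with partition i of the right side only.
  def narrowCogroup[A, B](
      left: Vector[Seq[(String, A)]],
      right: Vector[Seq[(String, B)]]): Map[String, (Seq[A], Seq[B])] = {
    require(left.size == right.size, "both sides must use the same partition count")
    left.zip(right).flatMap { case (l, r) =>
      val keys = (l.map(_._1) ++ r.map(_._1)).distinct
      keys.map { k =>
        (k, (l.filter(_._1 == k).map(_._2), r.filter(_._1 == k).map(_._2)))
      }
    }.toMap
  }
}
```

In this model only the small daily dataset has to be bucketed each day, while the large historical side keeps the layout it was written with. That is the saving the slides are after when they ask for a way to "remember" the partitioner after reading back from HDFS.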
  • 22. Recovering from Errors
      • I write bugs
      • You write bugs
      • Spark has bugs
      • The bugs might appear after 17 hours in stage 78 of your application
      • Spark’s failure recovery might not help you
  • 23. HDFS: It's not so bad
      • DiskCachedRDD
      • Before doing any work, check if it exists on disk
      • If so, just load it
      • If not, create it and write it to disk
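The check-then-load-or-compute pattern on slide 23 can be sketched without Spark using the local filesystem. This is a minimal illustration of the control flow only: `DiskCached` and `linesCachedAt` are hypothetical names, the data is a small sequence of lines rather than an RDD, and a production version would also need to guard against partially written output.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object DiskCached {
  // Loads the lines stored at `path` if present; otherwise evaluates `compute`,
  // writes the result to `path`, and returns it. Assumes lines contain no newlines.
  def linesCachedAt(path: Path)(compute: => Seq[String]): Seq[String] =
    if (Files.exists(path)) {
      val text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8)
      if (text.isEmpty) Seq.empty else text.split("\n").toSeq
    } else {
      val lines = compute // the expensive step runs only on a cache miss
      Files.write(path, lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
      lines
    }
}
```

Because `compute` is a by-name parameter, rerunning the application after a crash in a later stage skips every step whose output already exists on disk.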
  • 24. Partitions, Partitions, Partitions …
      • Partitions should be small
          • Max partition size is 2GB*
          • Small partitions help deal w/ stragglers
          • Small partitions avoid overhead – take a closer look at internals …
      • Partitions should be big
          • “For ML applications, the best setting to set the number of partitions to match the number of cores to reduce shuffle size.” Xiangrui Meng on user@
          • Why? Take a closer look at internals …
  • 25. Parameterize Partition Numbers
      • Many transformations take a second parameter
          • reduceByKey(…, nPartitions)
          • sc.textFile(…, nPartitions)
      • Both sides of shuffle matter!
          • Shuffle read (aka “reduce”)
          • Shuffle write (aka “map”) – controlled by previous stage
      • As datasets change, you might need to change the numbers
          • Make this a parameter to your application
          • Yes, you may need to expose a LOT of parameters
  • 26. Using the UI
  • 27. Some Demos
      • Collect a lot of data
      • Slow tasks
      • DAG visualization
      • RDD names
  • 28. Understanding Performance
  • 29. What data and where is it going?
      • Narrow Dependencies (aka “OneToOneDependency”)
          • cheap
      • Wide Dependencies (aka shuffles)
          • how much is shuffled
          • Is it skewed
      • Driver bottleneck
  • 30. Driver can be a bottleneck
      (Credit: Sandy Ryza, Cloudera)
  • 31. Driver can be a bottleneck
      rdd.collect()
          GOOD: Exploratory data analysis; merging a small set of results.
          BAD: Sequentially scan entire data set on driver. No parallelism, OOM on driver.
      rdd.reduce()
          GOOD: Summarize the results from a small dataset.
          BAD: Big Data Structures, from lots of partitions.
      sc.accumulator()
          GOOD: Small data types, eg., counters.
          BAD: Big Data Structures, from lots of partitions. Set of a million “most interesting” user ids from each partition.
  • 32. Stage Boundaries
  • 33. Stages are not MapReduce Steps!
      [Diagram labels: a chain of MapReduce steps (Map, Shuffle, Reduce, repeated); Map, ReduceByKey (mapside combine), Shuffle, Filter, MapReduce Step, ReduceByKey, FlatMap, GroupByKey, Collect, Shuffle]
  • 34. I still get confused
      (discussion in a code review, testing a large sortByKey)
      WP: … then we wait for completion of stage 3 …
      ME: hang on, stage 3? Why are there 3 stages? SortByKey does one extra pass to find the range of the keys, but that’s two stages
      WP: The other stage is data generation
      ME: That can’t be right. Data Generation is pipelined, it's just part of the first stage …
      ME: duh – the final sort is two stages – shuffle write then shuffle read
      [Diagram labels: InputRDD; Stage 1: Sample data to find range of keys; Stage 2: ShuffleMap for Sort; Stage 3: ShuffleRead for Sort. NB: computed twice!]
  • 35. Tip grab bag
      • Minimize data volume
          • Compact formats: avro, parquet
      • Kryo Serialization
          • require registration in development, but not in production
      • Look at data skew, key cardinality
      • Tune your cluster
      • Use the UI to tune your job
      • Set names on all cached RDDs
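The Kryo tip on slide 35 can be expressed as a small configuration helper. The sketch below assumes Spark is on the classpath and uses the `spark.serializer` and `spark.kryo.registrationRequired` settings together with `SparkConf.registerKryoClasses`. `KryoSettings`, the `strict` flag, and the `LogLine` fields and their types are hypothetical and only stand in for your own application classes.

```scala
import org.apache.spark.SparkConf

// Hypothetical record type standing in for your own classes.
case class LogLine(date: String, thread: String, level: String, source: String, msg: String)

object KryoSettings {
  // strict = true makes serialization fail for unregistered classes (useful in development);
  // strict = false lets Kryo fall back to writing full class names (more forgiving in production).
  def conf(strict: Boolean): SparkConf =
    new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", strict.toString)
      .registerKryoClasses(Array[Class[_]](classOf[LogLine]))
}
```

Turning strict registration on in development surfaces any class that was forgotten, since unregistered classes are otherwise serialized with their full class name attached, while leaving it off in production avoids failing a long job over a missing registration.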
  • 36. More Resources
      • Very active and friendly community
          • http://spark.apache.org/community.html
      • Dean Wampler’s self-paced spark workshop
          • https://github.com/deanwampler/spark-workshop
      • Tips for Better Spark Jobs
          • http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
      • Tuning & Debugging Spark (with another explanation of internals)
          • http://www.slideshare.net/pwendell/tuning-and-debugging-in-apache-spark
      • Tuning Spark On Yarn
          • http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
  • 37. Thank you
  • 38. Cleaning Up Resources (Try 1)
  • 39. Cleaning Up Resources (Try 2)
  • 40. Cleaning Up Resources (Success)