- 1. 1© Cloudera, Inc. All rights reserved.
Intro to Apache Spark
Anand Iyer
Senior Product Manager, Cloudera
- 2. 2© Cloudera, Inc. All rights reserved.
Target Audience
• New to Spark, or have very rudimentary knowledge of Spark.
• Have basic knowledge of MapReduce
If you are an advanced Spark developer, you are unlikely to get much out of this
talk.
• No performance tuning or debugging tips
- 3. 3© Cloudera, Inc. All rights reserved.
Spark: Easy and Fast Big Data
• Easy to Develop
• Rich APIs in Java, Scala, Python
• Interactive shell
• Fast to Run
• General execution graphs
• In-memory Caching
- 5. 5© Cloudera, Inc. All rights reserved.
RDD: Resilient Distributed Datasets
An abstraction representing the large, distributed datasets being processed.
RDDs are:
• Broken up into partitions, which are distributed across nodes
• In practice, RDDs usually have between 100 and 10K partitions
• Partitions operated upon in parallel
• Immutable
• Fault-Tolerant via concept of lineage
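The partitioning idea can be sketched in a few lines of plain Python (a toy illustration, not Spark code): the dataset is split into partitions, and a function is applied to each partition independently, just as tasks would run in parallel across nodes, while the original data stays untouched.

```python
# Toy illustration of RDD-style partitioning (plain Python, not Spark).
def make_partitions(data, num_partitions):
    """Split a dataset into roughly equal partitions."""
    size = (len(data) + num_partitions - 1) // num_partitions
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_partitions(partitions, fn):
    """Apply fn to every element of every partition; the input
    partitions are left unmodified (RDDs are immutable)."""
    return [[fn(x) for x in part] for part in partitions]

data = list(range(10))
partitions = make_partitions(data, 4)               # 4 partitions
squared = map_partitions(partitions, lambda x: x * x)
```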
- 6. 6© Cloudera, Inc. All rights reserved.
Spark jobs are DAGs of operations on RDDs
Operations on RDDs
• Transformations: Create a new RDD from existing RDDs
• Actions: Run computation on RDD, return values to the driver
[Figure: example execution DAG of RDDs A through G, connected by map, join, groupBy, and filter transformations and ending in a take action]
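A key detail behind this split: transformations are lazy, so they only describe a computation, while an action forces it to run and returns a value to the driver. A toy sketch in plain Python (not Spark code), using a generator as the lazy pipeline:

```python
# Toy sketch of lazy transformations vs. actions (plain Python, not Spark).
processed = []

def trace(x):
    """Record which elements have actually been computed."""
    processed.append(x)
    return x

def lazy_map(data, fn):
    # Transformation: returns a lazy pipeline; nothing runs yet.
    return (fn(trace(x)) for x in data)

mapped = lazy_map(range(5), lambda x: x * 2)
assert processed == []            # lazy: no element touched yet
result = sum(mapped)              # "action": forces the pipeline to run
assert processed == [0, 1, 2, 3, 4]
```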
- 7. 7© Cloudera, Inc. All rights reserved.
Rich Expressive API
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save ...
- 8. 8© Cloudera, Inc. All rights reserved.
Example: Logistic Regression
sc = SparkContext(…)
rawData = sc.textFile("hdfs://…")
data = rawData.map(parserFunc).cache()
w = numpy.random.rand(D)
for i in range(iterations):
    gradient = data \
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1)
             * p.y * p.x) \
        .reduce(lambda a, b: a + b)
    w -= gradient
print("Final w: %s" % w)
- 10. 10© Cloudera, Inc. All rights reserved.
Driver & Executors
• Driver: Master node
• One Driver per Spark App
• Runs the main(…) function of your app
• Executors: Worker nodes
- 11. 11© Cloudera, Inc. All rights reserved.
Logical graph to physical execution plan
[Figure: the execution DAG of RDDs A through G, now grouped into Stages; legend distinguishes cached partitions from RDDs]
• Execution graph is broken into Stages
• Each Stage consists of multiple Tasks
• Task is unit of computation that is scheduled on an Executor
• A Stage consists of multiple operations that can be pipelined
• Stages are split when data needs to be “shuffled”
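Pipelining can be sketched in plain Python (a toy illustration of the idea, not Spark internals): a task streams each element of its partition through all of a stage's operations in one pass, without materializing any intermediate collection.

```python
# Toy sketch of pipelining narrow operations within one stage
# (plain Python, not Spark internals).
def pipelined_task(partition, ops):
    """Run one task: stream each element through all pipelined ops,
    producing output without building intermediate datasets."""
    out = []
    for x in partition:
        for op in ops:
            x = op(x)
        out.append(x)
    return out

partition = [1, 2, 3, 4]
ops = [lambda x: x + 1, lambda x: x * 10]   # two maps fused into one task
result = pipelined_task(partition, ops)
```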
- 12. 12© Cloudera, Inc. All rights reserved.
Shuffle
• Redistributes data among partitions
• Required by operations like reduceByKey, groupBy, join
• Hash keys to buckets
• Identical to MapReduce Shuffle
• Shuffle entails writes to disk
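The "hash keys to buckets" step can be sketched in plain Python (not Spark internals): each (key, value) record is assigned to a bucket via hash(key) modulo the number of partitions, so all records sharing a key land in the same reduce-side partition.

```python
# Toy sketch of the hash-partitioning step of a shuffle
# (plain Python, not Spark internals).
def shuffle_partition(records, num_partitions):
    """Assign each (key, value) record to a bucket by hashing the key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
buckets = shuffle_partition(records, 2)
# All records that share a key end up in the same bucket.
```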
- 14. 14© Cloudera, Inc. All rights reserved.
Drivers & Executors revisited
• Driver
• One Driver per Spark App
• Runs the main(…) function of your app
• Creates logical DAG and physical execution plan
• Schedules Tasks
• Driver receives and collects the results of Actions
• Executors
• Hold RDD partitions
• Execute Tasks as scheduled by Driver
- 15. 15© Cloudera, Inc. All rights reserved.
Spark runs on Cluster Managers
• Spark does not manage a cluster of machines
• Runs on YARN, Mesos or Standalone (cluster manager built specifically for Spark)
- 17. 17© Cloudera, Inc. All rights reserved.
Memory management leads to greater performance
Trends in RAM:
• ½ price every 18 months
• 2× bandwidth every 3 years
A typical server today: 128–384 GB RAM, 12–24 cores, ~50 GB per sec memory bandwidth
Memory can be an enabler for high-performance big data applications
- 18. 18© Cloudera, Inc. All rights reserved.
Persisting or Caching RDDs
• If an RDD will be re-used, persist it to prevent re-computation
• Very common in iterative algorithms
• By default, cached RDDs held in memory
• But memory may not suffice
• MEMORY_AND_DISK persistence: Spill the partitions that don’t fit in memory
to disk
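Why caching matters can be shown with a toy sketch in plain Python (not Spark code): without caching, every action re-runs the parse step of the lineage; with caching, parsing happens once.

```python
# Toy sketch of why caching a re-used dataset matters
# (plain Python, not Spark).
parse_calls = 0

def parse(line):
    global parse_calls
    parse_calls += 1
    return int(line)

raw = ["1", "2", "3"]

# Without caching: each "action" re-runs the parse step (the lineage).
total = sum(parse(l) for l in raw)      # action 1
count = len([parse(l) for l in raw])    # action 2
assert parse_calls == 6                 # every line parsed twice

# With caching: parse once, then both actions read the cached result.
parse_calls = 0
cached = [parse(l) for l in raw]        # like .cache() + first action
total = sum(cached)
count = len(cached)
assert parse_calls == 3                 # every line parsed only once
```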
- 19. 19© Cloudera, Inc. All rights reserved.
Lineage for Fault-Tolerance
[Figure: execution DAG of RDDs A through G; the lineage records how each RDD was derived from its parents via map, join, groupBy, and filter]
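The recovery idea can be sketched in plain Python (a toy illustration, not Spark code): if a partition is lost, it is rebuilt by reapplying the transformation recorded in the lineage to the surviving parent partition.

```python
# Toy sketch of lineage-based recovery (plain Python, not Spark).
parent = [[1, 2], [3, 4]]                 # parent RDD's partitions
transform = lambda x: x * 10              # transformation recorded in lineage

child = [[transform(x) for x in p] for p in parent]
child[1] = None                           # simulate losing one partition

# Recovery: reapply the lineage to the surviving parent partition.
if child[1] is None:
    child[1] = [transform(x) for x in parent[1]]

assert child == [[10, 20], [30, 40]]
```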
- 24. 24© Cloudera, Inc. All rights reserved.
Lineage Truncation
[Figure: DAG of RDDs A through H; the lineage is truncated at RDD G, which has been persisted or materialized]
Lineage gets truncated at an RDD when:
• RDD is persisted to memory or disk
• RDD is already materialized on disk due to a shuffle
- 29. 29© Cloudera, Inc. All rights reserved.
Summary of what makes Spark fast
• Maximize use of memory
• Re-used RDDs can be explicitly cached to prevent re-computation
• Leverage Lineage & Pipelining to minimize writing intermediate data to disk
• Efficient Task Scheduler
• Ensure worker nodes are kept busy via quick scheduling of Tasks
• More optimizations coming in Spark SQL
• Compact binary in-memory data representation, etc
• More details in subsequent slides
- 30. 30© Cloudera, Inc. All rights reserved.
Spark will replace MapReduce
To become the standard execution engine for Hadoop
- 32. 32© Cloudera, Inc. All rights reserved.
Spark Streaming
• Incoming data represented as DStreams (Discretized Streams)
• Data commonly read from streaming data channels like Kafka or Flume
• A Spark Streaming application is a DAG of Transformations and Actions on
DStreams (and RDDs)
- 33. 33© Cloudera, Inc. All rights reserved.
Discretized Stream
• Incoming data stream is broken down into micro-batches
• Micro-batch size is user defined, usually 0.3 to 1 second
• Micro-batches are disjoint
• Each micro-batch is an RDD
• Effectively, a DStream is a sequence of RDDs, one per micro-batch
• Spark Streaming is known for high throughput
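The discretization step can be sketched in plain Python (a toy illustration, not Spark Streaming code): timestamped events are grouped into disjoint micro-batches, each of which plays the role of one RDD in the DStream.

```python
# Toy sketch of discretizing a stream into micro-batches
# (plain Python, not Spark Streaming).
def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into disjoint micro-batches;
    each batch corresponds to one RDD in the DStream."""
    batches = {}
    for t, value in events:
        batches.setdefault(int(t // batch_seconds), []).append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.7, "d")]
stream = micro_batches(events, 1.0)   # one list per 1-second batch
```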
- 34. 34© Cloudera, Inc. All rights reserved.
Windowed DStreams
• Defined by specifying a window size and a step size
• Both are multiples of micro-batch size
• Operations invoked on each window’s data
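The window mechanics can be sketched in plain Python (not Spark Streaming code): with both sizes counted in micro-batches, each window is the union of the most recent window-length batches, advanced by the step size.

```python
# Toy sketch of windowed DStreams (plain Python, not Spark Streaming).
def windows(batches, window_len, step):
    """Yield the union of each window's micro-batches; window_len and
    step are both counted in micro-batches."""
    out = []
    for end in range(window_len, len(batches) + 1, step):
        merged = [x for b in batches[end - window_len:end] for x in b]
        out.append(merged)
    return out

batches = [[1], [2], [3], [4]]          # one list per micro-batch
result = windows(batches, window_len=2, step=1)
```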
- 35. 35© Cloudera, Inc. All rights reserved.
Maintain and update arbitrary state
updateStateByKey(...)
• Define initial state
• Provide state update function
• Continuously update with new information
• State maintained as RDD, updated via Transformation
Examples:
• Running count of words seen in text stream
• Per user session state from activity stream
Note: Requires periodic check-pointing to fault-tolerant storage, every N (~10-15)
micro-batches
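The update mechanism can be sketched in plain Python (a toy illustration, not the Spark Streaming API): a state update function merges each micro-batch's per-key counts into a new state object, here a running word count.

```python
# Toy sketch of updateStateByKey-style stateful streaming
# (plain Python, not Spark Streaming).
def update_state(state, batch_counts):
    """State update function: merge one micro-batch's per-key counts
    into the running state, returning a NEW state (state is held as
    an immutable RDD, so it is replaced rather than mutated)."""
    new_state = dict(state)
    for key, count in batch_counts.items():
        new_state[key] = new_state.get(key, 0) + count
    return new_state

state = {}                                   # initial state
for batch in [{"spark": 2, "rdd": 1}, {"spark": 1}]:
    state = update_state(state, batch)       # one update per micro-batch
```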
- 37. 37© Cloudera, Inc. All rights reserved.
Dataframes
• Distributed collection of data organized into named, typed columns
• Like RDDs, they consist of partitions, can be cached, and have fault-tolerance via
lineage
• Can be constructed from:
• Structured data files: JSON, Avro, Parquet, etc.
• Tables in Hive
• Tables in an RDBMS
• Existing RDDs by programmatically applying schema
- 38. 38© Cloudera, Inc. All rights reserved.
Spark SQL
• SQL statements to process Dataframes
• Embed SQL statements in your Scala, Java, or Python Spark application
• Queries can also be issued via JDBC/ODBC
- 39. 39© Cloudera, Inc. All rights reserved.
Why Spark SQL? Ease of programming
• Easy to code against records with a known schema
• For simple operations on relational data, SQL is often an easier alternative to code
• Embed SQL in your Scala, Java, or Python applications to seamlessly mix SQL with
"regular" Spark code for complex operations
- 40. 40© Cloudera, Inc. All rights reserved.
Why Spark SQL? Performance
SQL is processed by a query optimizer, enabling automatic optimizations:
• Compressed in-memory format (versus Java-serialized objects in RDDs)
• Predicate pushdown (read less data to reduce IO)
• Optimal pipelining of operations
• Cost based optimizer
• …
- 41. 41© Cloudera, Inc. All rights reserved.
MLlib
Collection of popular machine learning algorithms:
Classifiers: logistic regression, boosted trees, random forests, etc.
Clustering: k-means, LDA
Recommender Systems: ALS
Dimensionality Reduction: PCA and SVD
Feature Engineering: TF-IDF, Word2Vec, etc.
Statistical Functions: Chi-Squared Test, Pearson Correlation, etc.
Pipelines API: Chain together feature engineering, training, model validation into
one pipeline
Editor's Notes
- Compared to 10c/GB, 100 MBps for disk storage
Hot data often a small fraction of total data
- Show example usage of lineage with caching.
Then show example usage of lineage where it goes to the shuffle files
- Create a logical execution plan for DAG
- DStream is the abstraction, and each DStream has transformations and actions like RDDs do (a subset of the RDD transformations and actions).