- 1. 1© Cloudera, Inc. All rights reserved.
Intro to Apache Spark
Anand Iyer
Senior Product Manager, Cloudera
- 2. 2© Cloudera, Inc. All rights reserved.
Target Audience
• New to Spark, or have very rudimentary knowledge of Spark.
• Have basic knowledge of MapReduce
If you are an advanced Spark developer, you are unlikely to get much out of this
talk.
• No performance tuning or debugging tips
- 3. 3© Cloudera, Inc. All rights reserved.
Spark: Easy and Fast Big Data
• Easy to Develop
• Rich APIs in Java, Scala, Python
• Interactive shell
• Fast to Run
• General execution graphs
• In-memory Caching
- 5. 5© Cloudera, Inc. All rights reserved.
RDD: Resilient Distributed Datasets
An abstraction representing the large, distributed datasets being processed.
RDDs are:
• Broken up into partitions, which are distributed across nodes
• In practice, RDDs usually have between 100 and 10K partitions
• Partitions operated upon in parallel
• Immutable
• Fault-Tolerant via concept of lineage
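The partitioning idea can be sketched in a few lines of plain Python (a toy illustration, not Spark code): the dataset is split into partitions, and a function is applied to each partition independently, just as tasks would run in parallel across nodes, while the original data stays untouched.

```python
# Toy illustration of RDD-style partitioning (plain Python, not Spark).
def make_partitions(data, num_partitions):
    """Split a dataset into roughly equal partitions."""
    size = (len(data) + num_partitions - 1) // num_partitions
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_partitions(partitions, fn):
    """Apply fn to every element of every partition; the input
    partitions are left unmodified (RDDs are immutable)."""
    return [[fn(x) for x in part] for part in partitions]

data = list(range(10))
partitions = make_partitions(data, 4)               # 4 partitions
squared = map_partitions(partitions, lambda x: x * x)
```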
- 6. 6© Cloudera, Inc. All rights reserved.
Spark jobs are DAGs of operations on RDDs
Operations on RDDs
• Transformations: Create a new RDD from existing RDDs
• Actions: Run computation on RDD, return values to the driver
[Figure: example execution DAG of RDDs A through G, connected by map, join, groupBy, and filter transformations and ending in a take action]
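A key detail behind this split: transformations are lazy, so they only describe a computation, while an action forces it to run and returns a value to the driver. A toy sketch in plain Python (not Spark code), using a generator as the lazy pipeline:

```python
# Toy sketch of lazy transformations vs. actions (plain Python, not Spark).
processed = []

def trace(x):
    """Record which elements have actually been computed."""
    processed.append(x)
    return x

def lazy_map(data, fn):
    # Transformation: returns a lazy pipeline; nothing runs yet.
    return (fn(trace(x)) for x in data)

mapped = lazy_map(range(5), lambda x: x * 2)
assert processed == []            # lazy: no element touched yet
result = sum(mapped)              # "action": forces the pipeline to run
assert processed == [0, 1, 2, 3, 4]
```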
- 7. 7© Cloudera, Inc. All rights reserved.
Rich Expressive API
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save ...
- 8. 8© Cloudera, Inc. All rights reserved.
Example: Logistic Regression
sc = SparkContext(…)
rawData = sc.textFile("hdfs://…")
data = rawData.map(parserFunc).cache()
w = numpy.random.rand(D)
for i in range(iterations):
    gradient = data \
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1)
             * p.y * p.x) \
        .reduce(lambda a, b: a + b)
    w -= gradient
print("Final w: %s" % w)
- 10. 10© Cloudera, Inc. All rights reserved.
Driver & Executors
• Driver: Master node
• One Driver per Spark App
• Runs the main(…) function of your app
• Executors: Worker nodes
- 11. 11© Cloudera, Inc. All rights reserved.
Logical graph to physical execution plan
[Figure: the execution DAG of RDDs A through G, now grouped into Stages; legend distinguishes cached partitions from RDDs]
• Execution graph is broken into Stages
• Each Stage consists of multiple Tasks
• Task is unit of computation that is scheduled on an Executor
• A Stage consists of multiple operations that can be pipelined
• Stages are split when data needs to be “shuffled”
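Pipelining can be sketched in plain Python (a toy illustration of the idea, not Spark internals): a task streams each element of its partition through all of a stage's operations in one pass, without materializing any intermediate collection.

```python
# Toy sketch of pipelining narrow operations within one stage
# (plain Python, not Spark internals).
def pipelined_task(partition, ops):
    """Run one task: stream each element through all pipelined ops,
    producing output without building intermediate datasets."""
    out = []
    for x in partition:
        for op in ops:
            x = op(x)
        out.append(x)
    return out

partition = [1, 2, 3, 4]
ops = [lambda x: x + 1, lambda x: x * 10]   # two maps fused into one task
result = pipelined_task(partition, ops)
```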
- 12. 12© Cloudera, Inc. All rights reserved.
Shuffle
• Redistributes data among partitions
• Required by operations like reduceByKey, groupBy, join
• Hash keys to buckets
• Identical to MapReduce Shuffle
• Shuffle entails writes to disk
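The "hash keys to buckets" step can be sketched in plain Python (not Spark internals): each (key, value) record is assigned to a bucket via hash(key) modulo the number of partitions, so all records sharing a key land in the same reduce-side partition.

```python
# Toy sketch of the hash-partitioning step of a shuffle
# (plain Python, not Spark internals).
def shuffle_partition(records, num_partitions):
    """Assign each (key, value) record to a bucket by hashing the key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
buckets = shuffle_partition(records, 2)
# All records that share a key end up in the same bucket.
```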
- 14. 14© Cloudera, Inc. All rights reserved.
Drivers & Executors revisited
• Driver
• One Driver per Spark App
• Runs the main(…) function of your app
• Creates logical DAG and physical execution plan
• Schedules Tasks
• Driver receives and collects the results of Actions
• Executors
• Hold RDD partitions
• Execute Tasks as scheduled by Driver
- 15. 15© Cloudera, Inc. All rights reserved.
Spark runs on Cluster Managers
• Spark does not manage a cluster of machines
• Runs on YARN, Mesos or Standalone (cluster manager built specifically for Spark)
- 17. 17© Cloudera, Inc. All rights reserved.
Memory management leads to greater performance
Trends in RAM:
• ½ price every 18 months
• 2× bandwidth every 3 years
A typical server today: 128–384 GB RAM, 12–24 cores, ~50 GB per sec memory bandwidth
Memory can be an enabler for high-performance big data applications
- 18. 18© Cloudera, Inc. All rights reserved.
Persisting or Caching RDDs
• If an RDD will be re-used, persist it to prevent re-computation
• Very common in iterative algorithms
• By default, cached RDDs held in memory
• But memory may not suffice
• MEMORY_AND_DISK persistence: Spill the partitions that don’t fit in memory
to disk
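Why caching matters can be shown with a toy sketch in plain Python (not Spark code): without caching, every action re-runs the parse step of the lineage; with caching, parsing happens once.

```python
# Toy sketch of why caching a re-used dataset matters
# (plain Python, not Spark).
parse_calls = 0

def parse(line):
    global parse_calls
    parse_calls += 1
    return int(line)

raw = ["1", "2", "3"]

# Without caching: each "action" re-runs the parse step (the lineage).
total = sum(parse(l) for l in raw)      # action 1
count = len([parse(l) for l in raw])    # action 2
assert parse_calls == 6                 # every line parsed twice

# With caching: parse once, then both actions read the cached result.
parse_calls = 0
cached = [parse(l) for l in raw]        # like .cache() + first action
total = sum(cached)
count = len(cached)
assert parse_calls == 3                 # every line parsed only once
```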
- 19. 19© Cloudera, Inc. All rights reserved.
Lineage for Fault-Tolerance
[Figure: execution DAG of RDDs A through G; the lineage records how each RDD was derived from its parents via map, join, groupBy, and filter]
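The recovery idea can be sketched in plain Python (a toy illustration, not Spark code): if a partition is lost, it is rebuilt by reapplying the transformation recorded in the lineage to the surviving parent partition.

```python
# Toy sketch of lineage-based recovery (plain Python, not Spark).
parent = [[1, 2], [3, 4]]                 # parent RDD's partitions
transform = lambda x: x * 10              # transformation recorded in lineage

child = [[transform(x) for x in p] for p in parent]
child[1] = None                           # simulate losing one partition

# Recovery: reapply the lineage to the surviving parent partition.
if child[1] is None:
    child[1] = [transform(x) for x in parent[1]]

assert child == [[10, 20], [30, 40]]
```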
- 24. 24© Cloudera, Inc. All rights reserved.
Lineage Truncation
[Figure: DAG of RDDs A through H; the lineage is truncated at RDD G, which has been persisted or materialized]
Lineage gets truncated at an RDD when:
• RDD is persisted to memory or disk
• RDD is already materialized on disk due to a shuffle
- 29. 29© Cloudera, Inc. All rights reserved.
Summary of what makes Spark fast
• Maximize use of memory
• Re-used RDDs can be explicitly cached to prevent re-computation
• Leverage Lineage & Pipelining to minimize writing intermediate data to disk
• Efficient Task Scheduler
• Ensure worker nodes are kept busy via quick scheduling of Tasks
• More optimizations coming in Spark SQL
• Compact binary in-memory data representation, etc
• More details in subsequent slides
- 30. 30© Cloudera, Inc. All rights reserved.
Spark will replace MapReduce
To become the standard execution engine for Hadoop
- 32. 32© Cloudera, Inc. All rights reserved.
Spark Streaming
• Incoming data represented as DStreams (Discretized Streams)
• Data commonly read from streaming data channels like Kafka or Flume
• A Spark Streaming application is a DAG of Transformations and Actions on
DStreams (and RDDs)
- 33. 33© Cloudera, Inc. All rights reserved.
Discretized Stream
• Incoming data stream is broken down into micro-batches
• Micro-batch size is user defined, usually 0.3 to 1 second
• Micro-batches are disjoint
• Each micro-batch is an RDD
• Effectively, a DStream is a sequence of RDDs, one per micro-batch
• Spark Streaming is known for high throughput
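The discretization step can be sketched in plain Python (a toy illustration, not Spark Streaming code): timestamped events are grouped into disjoint micro-batches, each of which plays the role of one RDD in the DStream.

```python
# Toy sketch of discretizing a stream into micro-batches
# (plain Python, not Spark Streaming).
def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into disjoint micro-batches;
    each batch corresponds to one RDD in the DStream."""
    batches = {}
    for t, value in events:
        batches.setdefault(int(t // batch_seconds), []).append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.7, "d")]
stream = micro_batches(events, 1.0)   # one list per 1-second batch
```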
- 34. 34© Cloudera, Inc. All rights reserved.
Windowed DStreams
• Defined by specifying a window size and a step size
• Both are multiples of micro-batch size
• Operations invoked on each window’s data
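The window mechanics can be sketched in plain Python (not Spark Streaming code): with both sizes counted in micro-batches, each window is the union of the most recent window-length batches, advanced by the step size.

```python
# Toy sketch of windowed DStreams (plain Python, not Spark Streaming).
def windows(batches, window_len, step):
    """Yield the union of each window's micro-batches; window_len and
    step are both counted in micro-batches."""
    out = []
    for end in range(window_len, len(batches) + 1, step):
        merged = [x for b in batches[end - window_len:end] for x in b]
        out.append(merged)
    return out

batches = [[1], [2], [3], [4]]          # one list per micro-batch
result = windows(batches, window_len=2, step=1)
```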
- 35. 35© Cloudera, Inc. All rights reserved.
Maintain and update arbitrary state
updateStateByKey(...)
• Define initial state
• Provide state update function
• Continuously update with new information
• State maintained as RDD, updated via Transformation
Examples:
• Running count of words seen in text stream
• Per user session state from activity stream
Note: Requires periodic check-pointing to fault-tolerant storage, every N (~10-15)
micro-batches
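The update mechanism can be sketched in plain Python (a toy illustration, not the Spark Streaming API): a state update function merges each micro-batch's per-key counts into a new state object, here a running word count.

```python
# Toy sketch of updateStateByKey-style stateful streaming
# (plain Python, not Spark Streaming).
def update_state(state, batch_counts):
    """State update function: merge one micro-batch's per-key counts
    into the running state, returning a NEW state (state is held as
    an immutable RDD, so it is replaced rather than mutated)."""
    new_state = dict(state)
    for key, count in batch_counts.items():
        new_state[key] = new_state.get(key, 0) + count
    return new_state

state = {}                                   # initial state
for batch in [{"spark": 2, "rdd": 1}, {"spark": 1}]:
    state = update_state(state, batch)       # one update per micro-batch
```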
- 37. 37© Cloudera, Inc. All rights reserved.
Dataframes
• Distributed collection of data organized into named, typed columns
• Like RDDs, they consist of partitions, can be cached, and have fault-tolerance via
lineage
• Can be constructed from:
• Structured data files: JSON, Avro, Parquet, etc.
• Tables in Hive
• Tables in an RDBMS
• Existing RDDs by programmatically applying schema
- 38. 38© Cloudera, Inc. All rights reserved.
Spark SQL
• SQL statements to process Dataframes
• Embed SQL statements in your Scala, Java, or Python Spark application
• Queries can also be issued via JDBC/ODBC
- 39. 39© Cloudera, Inc. All rights reserved.
Why Spark SQL? Ease of programming
• Easy to code against records with a known schema
• For simple operations on relational data, SQL is often an easier alternative to code
• Embed SQL in your Scala, Java, or Python applications to seamlessly mix SQL with
"regular" Spark code for complex operations
- 40. 40© Cloudera, Inc. All rights reserved.
Why Spark SQL? Performance
SQL is processed by a query optimizer, enabling automatic optimizations:
• Compressed in-memory format (versus Java-serialized objects in RDDs)
• Predicate pushdown (read less data to reduce IO)
• Optimal pipelining of operations
• Cost based optimizer
• …
- 41. 41© Cloudera, Inc. All rights reserved.
MLlib
Collection of popular machine learning algorithms:
Classifiers: logistic regression, boosted trees, random forests, etc.
Clustering: k-means, LDA
Recommender Systems: ALS
Dimensionality Reduction: PCA and SVD
Feature Engineering: TF-IDF, Word2Vec, etc.
Statistical Functions: Chi-Squared Test, Pearson Correlation, etc.
Pipelines API: Chain together feature engineering, training, model validation into
one pipeline
Editor's Notes
- Compared to 10c/GB, 100 MBps for disk storage
Hot data often a small fraction of total data
- Show example usage of lineage with caching.
Then show example usage of lineage where it goes to the shuffle files
- Create a logical execution plan for DAG
- DStream is the abstraction, and each DStream has transformations and actions like RDDs do (a subset of the RDD transformations and actions).