3. Introduction
Cluster computing frameworks like MapReduce do not perform
well on iterative machine learning and graph algorithms
because of data replication, disk I/O, and serialization overhead.
4. Introduction
Pregel is a system for iterative graph computations that
keeps intermediate data in memory, while HaLoop
offers an iterative MapReduce interface,
but these systems only support specific computation patterns.
They do not provide abstractions for more general
reuse.
5. Introduction
RDDs define a programming interface that provides
fault tolerance efficiently.
RDDs vs. distributed shared memory:
coarse-grained transformations
(e.g., map, filter, and join)
vs. fine-grained updates to mutable state;
recovery through lineage
6. Resilient Distributed Datasets (RDDs)
RDD transformations are lazy operations that define a
new RDD, while actions launch a computation to
return a value to the program or write data to external
storage.
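A minimal sketch in plain Scala (no Spark, hypothetical names): a lazy collection view stands in for a transformation, and forcing it stands in for an action.

```scala
object LazyDemo {
  val lines = Seq("ERROR disk failure", "INFO startup ok", "ERROR timeout")
  // "Transformation": lazily defined, nothing is computed yet.
  val errors = lines.view.filter(_.startsWith("ERROR"))
  // "Action": forces the computation and returns a value to the program.
  def count: Int = errors.size
  def main(args: Array[String]): Unit = println(count)
}
```

As with RDDs, defining `errors` costs nothing; work happens only when `count` demands a result.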
8. Resilient Distributed Datasets (RDDs)
An RDD is a read-only, partitioned collection of records that
can only be created from (1) data in stable storage or (2) other
RDDs.
lines = spark.textFile("hdfs://...")            // RDD backed by stable storage
errors = lines.filter(_.startsWith("ERROR"))    // lazy transformation
errors.count()                                  // action: triggers the computation
9. Resilient Distributed Datasets (RDDs)
RDD1: lines = spark.textFile("hdfs://...")
RDD2: errors = lines.filter(_.startsWith("ERROR"))
Long: number = errors.count()
RDD1 → RDD2 is a transformation; RDD2 → Long is an action.
11. Resilient Distributed Datasets (RDDs)
RDD1: lines = spark.textFile("hdfs://...")
RDD2: errors = lines.filter(_.startsWith("ERROR"))
RDD3: errors.persist() (or errors.cache())
The persisted RDD3 stays in memory after it is first computed.
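A rough analogy in plain Scala (no Spark, hypothetical names): persisting is like memoizing, so repeated actions reuse the in-memory result instead of recomputing it.

```scala
object PersistDemo {
  def run(): Int = {
    var computations = 0
    def compute(): Seq[String] = { computations += 1; Seq("ERROR a", "ERROR b") }
    // Without persist: each action recomputes the dataset from scratch.
    compute().size
    compute().size
    // "Persisted": the first use materializes it; later uses reuse it.
    lazy val persisted = compute()
    persisted.size
    persisted.size
    computations // 2 unpersisted recomputations + 1 persisted computation
  }
  def main(args: Array[String]): Unit = println(run())
}
```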
12. Resilient Distributed Datasets (RDDs)
Lineage: fault tolerance
transformation → action: RDD1 → RDD2 → Long
If RDD2 is lost, Spark reapplies the transformation
to RDD1 to produce a new RDD2.
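The recovery idea can be sketched in plain Scala (no Spark, hypothetical names): lineage is the recorded transformation, so a lost dataset is rebuilt from its parent rather than restored from a replica.

```scala
object LineageDemo {
  val rdd1 = Seq("ERROR a", "INFO b", "ERROR c")   // parent data (stable storage)
  val lineage: Seq[String] => Seq[String] =
    _.filter(_.startsWith("ERROR"))                // recorded transformation
  // If the derived dataset is lost (None), recompute it via lineage.
  def recover(rdd2: Option[Seq[String]]): Seq[String] =
    rdd2.getOrElse(lineage(rdd1))
  def main(args: Array[String]): Unit =
    println(recover(None).size)                    // simulate losing RDD2
}
```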
13. Resilient Distributed Datasets (RDDs)
Spark provides the RDD abstraction through a
language-integrated API in Scala,
a functional programming language for the Java VM.
14. Representing RDDs
dependencies between RDDs
narrow dependencies: each parent partition is used by at most
one child partition, allowing pipelined execution on
one cluster node
wide dependencies: require data from all parent
partitions to be available and to be shuffled across the
nodes using a MapReduce-like operation
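The two dependency shapes can be sketched in plain Scala (no Spark, hypothetical names): partitions are modeled as nested sequences, per-partition map/filter stands in for narrow dependencies, and a groupBy over all records stands in for the shuffle.

```scala
object DependencyDemo {
  // Two "partitions" of a parent dataset.
  val partitions = Seq(Seq(1, 2, 3), Seq(4, 5, 6))
  // Narrow: each child partition comes from exactly one parent partition,
  // so map and filter can be pipelined on a single node.
  val narrow: Seq[Seq[Int]] = partitions.map(p => p.map(_ * 2).filter(_ > 4))
  // Wide: grouping by key needs records from ALL parent partitions,
  // so the data must first be gathered (shuffled) across partitions.
  val wide: Map[Int, Seq[Int]] = partitions.flatten.groupBy(_ % 2)
  def main(args: Array[String]): Unit = {
    println(narrow.map(_.size).sum)  // records surviving the narrow pipeline
    println(wide.size)               // groups produced after the shuffle
  }
}
```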
17. Resilient Distributed Datasets (RDDs)
Each stage contains as many pipelined transformations
with narrow dependencies as possible,
because this avoids shuffling data across the nodes within a stage.
19. Evaluation
Workloads: logistic regression and k-means, 10 iterations
on 100 GB datasets using 25–100 machines.
Logistic regression is less compute-intensive and thus more
sensitive to time spent in deserialization and I/O.
27. Conclusion
RDDs: an efficient, general-purpose, and fault-tolerant
abstraction for sharing data in cluster applications.
RDDs offer an API based on coarse-grained
transformations that lets them recover data efficiently
using lineage.
Spark vs. Hadoop: up to 20× faster in iterative applications, and
Spark can be used interactively to query hundreds of gigabytes
of data.