Spark is a general-purpose engine for large-scale data processing. It introduces Resilient Distributed Datasets (RDDs): read-only collections partitioned across a cluster that can be cached in memory for speed and rebuilt from their lineage if a partition is lost, all behind an API that feels like familiar Scala collections. RDDs provide a programming model of lazy transformations such as map and flatMap, plus actions such as reduce and collect that compute results. Spark also supports streaming, SQL, machine learning, and graph processing workloads.
6. RDDs
• Resilient distributed datasets
• "read-only collection of objects partitioned
across a set of machines that can be rebuilt if a
partition is lost"
• Familiar Scala collections API for distributed data
and computation
• Monadic expression of lazy transformations, but
not monads
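The "familiar Scala collections API" point can be seen without a cluster at all: the same word-count chain used on RDDs later in the talk works, almost verbatim, on a plain Scala List. This is not Spark code, just standard collections used as an analogy.

```scala
// Not Spark: the same chain on a plain Scala List, to show how closely the
// RDD API mirrors the standard collections API. Swap List for an RDD (and
// use reduceByKey instead of the groupBy stand-in) and the shape is unchanged.
val titles = List("Programming in Scala", "Scala in Depth", "Learning Spark")

val counts = titles
  .flatMap(_.split(" "))             // tokenize
  .map(word => (word.toLowerCase, 1))
  .groupBy(_._1)                     // local stand-in for reduceByKey
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

// counts is a Map[String, Int]; e.g. counts("scala") == 2
```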
7. Spark Shell
• Interactive queries and prototyping
• Local, YARN, Mesos
• Static type checking and auto complete
• Lambdas
9. val titles = sc.textFile("titles.txt")
val countsRdd = titles
  .flatMap(tokenize)               // tokenize: String => Seq[String], defined elsewhere in the talk
  .map(word => (cleanse(word), 1)) // cleanse: String => String, defined elsewhere in the talk
  .reduceByKey(_ + _)
val counts = countsRdd
  .filter { case (_, total) => total > 10000 }
  .sortBy { case (_, total) => total }
  .filter { case (word, _) => word.length >= 5 }
  .collect
13. val sqlContext =
new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Count(word: String, total: Int)
val schemaRdd =
countsRdd.map(c => Count(c._1, c._2))
val count = schemaRdd
.where('word === "scala")
.select('total)
.collect
15. registerFunction("LEN", (_: String).length)
val queryRdd = sql("""
  SELECT * FROM counts
  WHERE LEN(word) = 10
  ORDER BY total DESC
  LIMIT 10
""")
queryRdd
  .map(c => s"word: ${c(0)}\t| total: ${c(1)}")
  .collect()
  .foreach(println)
17. Spark Streaming
• Real-time computation, similar to Storm
• Received input replicated in memory for fault tolerance
• Streams input into sliding windows of RDDs
• Kafka, Flume, Kinesis, HDFS
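The sliding-window idea above can be sketched locally with plain Scala, no Spark required: Spark Streaming chops a live stream into micro-batch RDDs and computes over sliding windows of them, and Scala's Iterator#sliding shows the same shape over an in-memory sequence. The batch counts here are made up for illustration.

```scala
// Not Spark Streaming itself: a plain-Scala analogy for windowed computation
// over a stream of micro-batches (here, a hypothetical count per batch).
val perBatchCounts = Iterator(3, 1, 4, 1, 5, 9)

// Window of 3 batches, sliding forward by 1 batch, summing each window,
// just as a windowed DStream would aggregate its underlying RDDs.
val windowedTotals = perBatchCounts
  .sliding(3, 1)
  .map(_.sum)
  .toList
// windowedTotals == List(8, 6, 10, 15)
```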
21. GraphX
• Optimally partitions and indexes vertices and
edges represented as RDDs
• APIs to join and traverse graphs
• PageRank, connected components, triangle
counting
22. val graph = Graph(userIdRDD, assocRDD)
val ranks = graph.pageRank(0.0001).vertices
val userRDD = sc.textFile("graphx/data/users.txt")
val users = userRDD.map { line =>
val fields = line.split(",")
(fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map {
case (id, (username, rank)) => (username, rank)
}
24. MLlib
• Machine learning library similar to Mahout
• Statistics, regression, decision trees, clustering,
PCA, gradient descent
• Iterative algorithms much faster due to in-memory caching
25. import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("data.txt")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(
parts(0).toDouble,
Vectors.dense(parts(1).split(' ').map(_.toDouble))
)
}
val model = LinearRegressionWithSGD.train(
parsedData, 100
)
val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds
  .map { case (v, p) => math.pow(v - p, 2) }.mean()
26. RDDs
• Resilient distributed datasets
• Familiar Scala collections API
• Distributed data and computation
• Monadic expression of transformations
• But not monads
27. Pseudo Monad
• Wraps an iterator plus partition distribution
• Keeps track of lineage history for fault tolerance
• Lazily evaluated; expressions chain without computing
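The lazy chaining described above has a direct analogue in plain Scala views: like RDD transformations, operations on a view are only recorded, and nothing runs until a terminal step (the analogue of a Spark action) forces evaluation. A minimal local sketch:

```scala
// Plain-Scala sketch of lazy chaining: the side-effect counter shows that
// the map is recorded but not executed until the chain is forced.
var evaluations = 0
val pipeline = (1 to 5).view
  .map { n => evaluations += 1; n * 2 }  // transformation: recorded, not run
  .filter(_ > 4)

assert(evaluations == 0)       // nothing has been computed yet
val result = pipeline.toList   // the "action": forces the whole chain
// result == List(6, 8, 10); evaluations is now 5
```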
29. RDD Interface
• compute: transformation applied to iterable(s)
• getPartitions: partition data for parallel
computation
• getDependencies: lineage of parent RDDs and if
shuffle is required
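The three members above can be sketched as a heavily simplified Scala trait. This is an illustration of the contract, not Spark's real abstract class (org.apache.spark.rdd.RDD carries much more machinery); the toy SeqRDD and MapRDD names are invented here, though MapRDD mirrors the MappedRDD described on the next slides.

```scala
// Simplified sketch of the RDD contract: each concrete RDD says how to
// compute a partition, how the data is partitioned, and what it depends on.
trait SimpleRDD[T] {
  def compute(partition: Int): Iterator[T]   // transformation applied per partition
  def getPartitions: Seq[Int]                // partition ids for parallel computation
  def getDependencies: Seq[SimpleRDD[_]]     // lineage: parent RDDs, if any
}

// Toy source RDD over an in-memory sequence, split into two partitions
// (a stand-in for HadoopRDD's "read an HDFS block", with no dependencies).
class SeqRDD(data: Seq[Int]) extends SimpleRDD[Int] {
  def getPartitions = Seq(0, 1)
  def getDependencies = Seq.empty
  def compute(partition: Int) = {
    val half = (data.length + 1) / 2
    if (partition == 0) data.take(half).iterator else data.drop(half).iterator
  }
}

// Toy mapped RDD: computes its parent's partition, then maps the result;
// same partitioning, single dependency on the parent.
class MapRDD(parent: SimpleRDD[Int], f: Int => Int) extends SimpleRDD[Int] {
  def getPartitions = parent.getPartitions
  def getDependencies = Seq(parent)
  def compute(partition: Int) = parent.compute(partition).map(f)
}
```

Chaining `new MapRDD(new SeqRDD(Seq(1, 2, 3, 4)), _ * 2)` and computing every partition yields the doubled data, with the lineage (one parent) recoverable from getDependencies.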
30. HadoopRDD
• compute: read HDFS block or file split
• getPartitions: HDFS block or file split
• getDependencies: None
31. MappedRDD
• compute: compute parent and map result
• getPartitions: parent partition
• getDependencies: single dependency on parent
32. CoGroupedRDD
• compute: compute parent RDDs, shuffle, then group
the results
• getPartitions: one per reduce task
• getDependencies: shuffle each parent RDD
33. Summary
• Simple Unified API through RDDs
• Interactive Analysis
• Hadoop Integration
• Performance