This document discusses Spark, an open-source cluster computing framework. It notes that while Hadoop is useful for batch processing, it has limitations for interactive and iterative algorithms. Spark addresses these issues through its resilient distributed datasets (RDDs), which can be operated on in parallel and rebuilt if lost. RDDs support transformations like map and filter as well as actions that return values. The document provides examples of using Spark from Scala and discusses its architecture, which involves a DAG scheduler and a task scheduler.
2. HTTP://ABOUT.ME/PRZEMEK.MACIOLEK/
• Data Scientist, Hadoop user since 2009
• Did research in academia, mined data for the oil & gas exploration industry, co-founded a Data Science startup, built the Big Data team at Base CRM, …
• Used a lot of different tools along the way (Mahout, HBase, Cassandra, Redis, Pig, Storm, …)
• Dreaming about something powerful and concise for Big Data…
• AD 2014: Head of Analytics & Data @ Toptal - researching new ways of doing Big Data Analytics, rediscovered Storm.
P.S. Ever considered doing Analytics & Data Science for a very cool startup? Drop me a note at: prze@toptal.com
4. HADOOP IS COOL
(BUT SOMETIMES IT'S NOT)
• High latency (interactive, anyone?)
• Expressing business logic is challenging
• Iterative algorithms? (think: PageRank; a sketch follows below)
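To make the iterative pain concrete, here is a minimal PageRank-style sketch in Spark's Scala API, run from the Spark shell where sc is provided. The toy links graph and the 0.85 damping factor are illustrative assumptions; the point is that the cached RDD stays in memory across iterations, while a chain of MapReduce jobs would reread HDFS on every pass.

val links = sc.parallelize(Seq(
  ("a", Seq("b", "c")),
  ("b", Seq("c")),
  ("c", Seq("a"))
)).cache()                              // reused every iteration, so keep it in memory

var ranks = links.mapValues(_ => 1.0)   // start each page at rank 1.0

for (_ <- 1 to 10) {
  val contribs = links.join(ranks).values.flatMap {
    case (dests, rank) => dests.map(d => (d, rank / dests.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(v => 0.15 + 0.85 * v)
}

ranks.collect().foreach(println)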
10. RESILIENT DISTRIBUTED DATASET (RDD)
• A collection of elements that can be operated on in parallel
• Parallel Collection, e.g. sc.parallelize(Array(1,2,3))
• Hadoop Dataset (both creation paths are sketched after this list)
• Lazily evaluated, able to rebuild lost data at any time
• Can be stored in memory without replication
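A minimal sketch of both creation paths listed above, run from the Spark shell; the HDFS path is a placeholder:

val nums  = sc.parallelize(Array(1, 2, 3))          // Parallel Collection
val lines = sc.textFile("hdfs:///data/input.txt")   // Hadoop Dataset; any Hadoop-supported URI works

nums.cache()                    // keep in memory; lost partitions are rebuilt from lineage
println(nums.reduce(_ + _))     // 6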
11. TRANSFORMATIONS vs. ACTIONS
TRANSFORMATIONS
• Create a new dataset from an existing one
• Lazily evaluated
• Recomputed each time an action runs on them, but may be persisted (in memory or on disk)
ACTIONS
• Return a value to the driver after the computation finishes
• Run all required transformations
Broadcast Variables and Accumulators provide cluster-level sharing (sketched after this list).
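A short shell sketch of the distinction, plus the two sharing primitives; the data here is made up:

val words = sc.parallelize(Seq("spark", "hadoop", "spark"))

val pairs  = words.map(w => (w, 1))     // transformation: nothing runs yet
val counts = pairs.reduceByKey(_ + _)   // transformation: still lazy

counts.persist()                        // keep the result once computed
println(counts.count())                 // action: runs the whole chain
counts.collect().foreach(println)       // action: reuses the persisted result

val lookup = sc.broadcast(Map("spark" -> "fast"))   // read-only copy shipped to workers
val seen   = sc.accumulator(0)                      // workers can only add to it
words.foreach(w => if (lookup.value.contains(w)) seen += 1)
println(seen.value)                                 // 2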
13. HOW TO USE IT?
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> textFile.count() // Number of items in this RDD
res0: Long = 74

scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b)) // How many words are in the longest line
res2: Int = 16

scala> textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b).collect
res3: Array[(java.lang.String, Int)] = Array((need,2), ("",43), (Extra,3), (using,1), (passed,1), (etc.,1), (its,1), (`/usr/local/lib/libmesos.so`,1), (`SCALA_HOME`,1), (option,1), (these,1), (#,1), (`PATH`,,2), (200,1), (To,3),...
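The same word count sketched as a standalone application rather than shell input; the app name and the local master URL are assumptions, and a real deployment would pass a cluster URL instead:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // In an application we build the SparkContext ourselves;
    // the shell provides it as sc.
    val sc = new SparkContext(
      new SparkConf().setAppName("WordCount").setMaster("local[*]"))

    sc.textFile("README.md")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)

    sc.stop()
  }
}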
15. RDD Objects → DAG Scheduler → Task Scheduler → Worker
• RDD Objects: rdd.filter(…).map(…).groupBy(…).filter(…) builds the operator graph
• DAG Scheduler: split the graph into stages of tasks; submit each one when ready (emits a TaskSet)
• Task Scheduler: launch tasks via the cluster manager; retry failed tasks
• Worker: execute tasks; store and serve blocks
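A sketch of how a job shaped like the one on this slide splits into stages; the log file and the choice of key are made up. filter and map are narrow transformations, so they pipeline into one stage; groupByKey needs a shuffle, so the trailing filter lands in a second stage.

val grouped = sc.textFile("events.log")             // placeholder input
  .filter(_.nonEmpty)                               // stage 1 (narrow)
  .map(line => (line.split(" ")(0), line))          // stage 1 (narrow)
  .groupByKey()                                     // shuffle: stage boundary
  .filter { case (_, ls) => ls.size > 10 }          // stage 2

grouped.count()   // action: the DAG Scheduler submits stage 1, then stage 2, as TaskSets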