This document discusses Spark, an open-source cluster computing framework. It notes that while Hadoop is useful for batch processing, it has limitations for interactive and iterative algorithms. Spark addresses these issues through its resilient distributed datasets (RDDs), which can be operated on in parallel and rebuilt if lost. RDDs support transformations like map and filter as well as actions that return values. The document provides examples of using Spark from Scala and discusses its architecture, which involves a DAG scheduler and a task scheduler.
2. HTTP://ABOUT.ME/PRZEMEK.MACIOLEK/
• Data Scientist, Hadoop user since 2009
• Did research in academia, mined data for the oil & gas exploration industry, co-founded a Data Science startup, built the Big Data team at Base CRM, …
• Used a lot of different tools along the way (Mahout, HBase, Cassandra, Redis, Pig, Storm, …)
• Dreaming about something powerful and concise for Big Data…
• AD 2014: Head of Analytics & Data @ Toptal - researching new ways of doing Big Data Analytics, rediscovered Storm.
P.S. Ever considered doing Analytics & Data Science for a very cool startup? Drop me a note at: prze@toptal.com
4. HADOOP IS COOL
(BUT SOMETIMES IT'S NOT)
• High latency (interactive, anyone?)
• Expressing business logic is challenging
• Iterative algorithms? (think: PageRank; a sketch follows below)
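To make the iterative pain concrete, here is a minimal PageRank-style sketch in Spark's Scala API, run from the Spark shell where sc is provided. The toy links graph and the 0.85 damping factor are illustrative assumptions; the point is that the cached RDD stays in memory across iterations, while a chain of MapReduce jobs would reread HDFS on every pass.

val links = sc.parallelize(Seq(
  ("a", Seq("b", "c")),
  ("b", Seq("c")),
  ("c", Seq("a"))
)).cache()                              // reused every iteration, so keep it in memory

var ranks = links.mapValues(_ => 1.0)   // start each page at rank 1.0

for (_ <- 1 to 10) {
  val contribs = links.join(ranks).values.flatMap {
    case (dests, rank) => dests.map(d => (d, rank / dests.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(v => 0.15 + 0.85 * v)
}

ranks.collect().foreach(println)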
10. RESILIENT DISTRIBUTED DATASET (RDD)
• A collection of elements that can be operated on in parallel
• Parallel Collection, e.g. sc.parallelize(Array(1,2,3))
• Hadoop Dataset (both creation paths are sketched after this list)
• Lazily evaluated, able to rebuild lost data at any time
• Can be stored in memory without replication
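A minimal sketch of both creation paths listed above, run from the Spark shell; the HDFS path is a placeholder:

val nums  = sc.parallelize(Array(1, 2, 3))          // Parallel Collection
val lines = sc.textFile("hdfs:///data/input.txt")   // Hadoop Dataset; any Hadoop-supported URI works

nums.cache()                    // keep in memory; lost partitions are rebuilt from lineage
println(nums.reduce(_ + _))     // 6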
11. TRANSFORMATIONS vs. ACTIONS
TRANSFORMATIONS
• Create a new dataset from an existing one
• Lazily evaluated
• Recomputed each time an action runs on them, but may be persisted (in memory or on disk)
ACTIONS
• Return a value to the driver after the computation finishes
• Run all required transformations
Broadcast Variables and Accumulators provide cluster-level sharing (sketched after this list).
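A short shell sketch of the distinction, plus the two sharing primitives; the data here is made up:

val words = sc.parallelize(Seq("spark", "hadoop", "spark"))

val pairs  = words.map(w => (w, 1))     // transformation: nothing runs yet
val counts = pairs.reduceByKey(_ + _)   // transformation: still lazy

counts.persist()                        // keep the result once computed
println(counts.count())                 // action: runs the whole chain
counts.collect().foreach(println)       // action: reuses the persisted result

val lookup = sc.broadcast(Map("spark" -> "fast"))   // read-only copy shipped to workers
val seen   = sc.accumulator(0)                      // workers can only add to it
words.foreach(w => if (lookup.value.contains(w)) seen += 1)
println(seen.value)                                 // 2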
13. HOW TO USE IT?
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> textFile.count() // Number of items in this RDD
res0: Long = 74

scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b)) // How many words are in the longest line
res2: Int = 16

scala> textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b).collect
res3: Array[(java.lang.String, Int)] = Array((need,2), ("",43), (Extra,3), (using,1), (passed,1), (etc.,1), (its,1), (`/usr/local/lib/libmesos.so`,1), (`SCALA_HOME`,1), (option,1), (these,1), (#,1), (`PATH`,,2), (200,1), (To,3),...
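The same word count sketched as a standalone application rather than shell input; the app name and the local master URL are assumptions, and a real deployment would pass a cluster URL instead:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // In an application we build the SparkContext ourselves;
    // the shell provides it as sc.
    val sc = new SparkContext(
      new SparkConf().setAppName("WordCount").setMaster("local[*]"))

    sc.textFile("README.md")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)

    sc.stop()
  }
}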
15. RDD Objects → DAG Scheduler → Task Scheduler → Worker
• RDD Objects: rdd.filter(…).map(…).groupBy(…).filter(…) builds the operator graph
• DAG Scheduler: split the graph into stages of tasks; submit each one when ready (emits a TaskSet)
• Task Scheduler: launch tasks via the cluster manager; retry failed tasks
• Worker: execute tasks; store and serve blocks
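A sketch of how a job shaped like the one on this slide splits into stages; the log file and the choice of key are made up. filter and map are narrow transformations, so they pipeline into one stage; groupByKey needs a shuffle, so the trailing filter lands in a second stage.

val grouped = sc.textFile("events.log")             // placeholder input
  .filter(_.nonEmpty)                               // stage 1 (narrow)
  .map(line => (line.split(" ")(0), line))          // stage 1 (narrow)
  .groupByKey()                                     // shuffle: stage boundary
  .filter { case (_, ls) => ls.size > 10 }          // stage 2

grouped.count()   // action: the DAG Scheduler submits stage 1, then stage 2, as TaskSets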