Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15 July 2014
1. Boston Apache Spark
User Group
(the Spahk group)
Microsoft NERD Center - Horace Mann
Tuesday, 15 July 2014
2. Intro to Apache Spark
Matthew Farrellee, @spinningmatt
Updated: July 2014
3. Background - MapReduce / Hadoop
● Map & reduce around for 5+ decades (McCarthy, 1960)
● Dean and Ghemawat demonstrate map and reduce for distributed data processing (Google, 2004)
● MapReduce paper timed well with commodity hardware capabilities of the early 2000s
● Open source implementation in 2006
● Years of innovation improving, simplifying, expanding
4. MapReduce / Hadoop difficulties
● Hardware evolved
  ● Networks became fast
  ● Memory became cheap
● Programming model proved non-trivial
  ● Gave birth to multiple attempts to simplify, e.g. Pig, Hive, ...
● Primarily batch execution mode
  ● Begat specialized (non-batch) modes, e.g. Storm, Drill, Giraph, ...
5. Some history - Spark
● Started in UC Berkeley AMPLab by Matei Zaharia, 2009
  ● AMP = Algorithms Machines People
  ● AMPLab is integrating Algorithms, Machines, and People to make sense of Big Data
● Open sourced, 2010
● Donated to Apache Software Foundation, 2013
● Graduated to top level project, 2014
● 1.0 release, May 2014
6. What is Apache Spark?
An open source, efficient and productive cluster
computing system that is interoperable with
Hadoop
7. Open source
● Top level Apache project
● http://www.ohloh.net/p/apache-spark
  ● In a Nutshell, Apache Spark…
  ● has had 7,366 commits made by 299 contributors representing 117,823 lines of code
  ● is mostly written in Scala with well-commented source code
  ● has a codebase with a long source history maintained by a very large development team with increasing Y-O-Y commits
  ● took an estimated 30 years of effort (COCOMO model) starting with its first commit in March, 2010
8. Efficient
● In-memory primitives
  ● Use cluster memory and spill to disk only when necessary
● High performance
  ● https://amplab.cs.berkeley.edu/benchmark/
● General compute graphs, DAGs
  ● Not just: Load -> Map -> Reduce -> Store -> Load -> Map -> Reduce -> Store
  ● Rich and pipelined: Load -> Map -> Union -> Reduce -> Filter -> Group -> Sample -> Store
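To make the pipelining concrete, a minimal PySpark sketch (the hdfs:// paths and the filter/sample parameters are placeholders); every step below extends a single DAG, and nothing runs until the final store:

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("dag example"))

a = sc.textFile("hdfs://...")                    # Load
b = sc.textFile("hdfs://...")                    # Load a second dataset
recs = (a.union(b)                               # Union
         .map(lambda line: line.split(","))      # Map
         .filter(lambda rec: len(rec) > 1)       # Filter
         .sample(False, 0.1))                    # Sample ~10%, without replacement
grouped = recs.groupBy(lambda rec: rec[0])       # Group by first field
grouped.saveAsTextFile("hdfs://...")             # Store - only now does the DAG execute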
9. Interoperable
● Read and write data from HDFS (or any storage system with an HDFS-like API)
● Read and write Hadoop file formats
● Run on YARN
● Interact with Hive, HBase, etc
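A small sketch of the first two points, assuming the sc created in the example above and placeholder paths (sequenceFile support in the Python API depends on the Spark version):

lines = sc.textFile("hdfs://...")          # plain text stored in HDFS
pairs = sc.sequenceFile("hdfs://...")      # Hadoop SequenceFile; Writables become Python types
pairs.saveAsTextFile("hdfs://...")         # write results back to HDFS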
10. Productive
● Unified data model, the RDD
● Multiple execution modes
  ● Batch, interactive, streaming
● Multiple languages
  ● Scala, Java, Python, R, SQL
● Rich standard library
  ● Machine learning, streaming, graph processing, ETL
● Consistent API across languages
● Significant code reduction compared to MapReduce
11. Consistent API, less code...
MapReduce (Java) -

public static class WordCountMapClass extends MapReduceBase
  implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
  implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark (Scala) -

import org.apache.spark._

val sc = new SparkContext(new SparkConf().setAppName("word count"))
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Spark (Python) -

from operator import add
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("word count"))
file = sc.textFile("hdfs://...")
counts = (file.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(add))
counts.saveAsTextFile("hdfs://...")
13. The RDD
The resilient distributed dataset
A lazily evaluated, fault-tolerant collection of
elements that can be operated on in parallel
[Diagram: Load -> RDD -> Transform -> Action -> Value / Save]
14. RDDs technically
1. a set of partitions ("splits" in hadoop terms)
2. list of dependencies on parent RDDs
3. function to compute a partition given its parents
4. optional partitioner (hash, range)
5. optional preferred location(s) for each partition
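A hedged PySpark sketch of the pieces you can poke at directly (exact method availability varies a little across versions); partitions, the partitioner, and the lineage of parent RDDs are all visible from the API:

pairs = sc.parallelize([("a", 1), ("b", 2), ("c", 3)], 4)   # ask for 4 partitions

print(pairs.getNumPartitions())   # 4
print(pairs.glom().collect())     # elements grouped by partition
print(pairs.partitioner)          # None - no partitioner yet

byKey = pairs.partitionBy(2)      # hash partitioner, 2 partitions
print(byKey.partitioner)          # now set

print(byKey.toDebugString())      # lineage: dependencies on parent RDDs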
15. Load
Create an RDD.
● parallelize - convert a collection
● textFile - load a text file
● wholeTextFiles - load a dir of text files
● sequenceFile / hadoopFile - load using Hadoop file formats
● More, http://spark.apache.org/docs/latest/programming-guide.html#external-datasets
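A minimal sketch of the first three of these (the paths are placeholders):

nums  = sc.parallelize([1, 2, 3, 4, 5])   # parallelize - convert a collection
lines = sc.textFile("hdfs://...")         # textFile - one element per line
files = sc.wholeTextFiles("hdfs://...")   # wholeTextFiles - (filename, contents) pairs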
16. Transform
Lazy operations.
Build compute DAG.
Don't trigger computation.
● map(func) - elements passed through func
● flatMap(func) - func can return >=0 elements
● filter(func) - subset of elements
● sample(..., fraction, ...) - select fraction
● union(other) - union of two RDDs
● distinct - new RDD w/ distinct elements
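A minimal sketch chaining the transformations above; each call just returns a new RDD and extends the DAG, so no job runs here:

words = sc.parallelize(["to", "be", "or", "not", "to", "be"])

upper   = words.map(lambda w: w.upper())
letters = words.flatMap(lambda w: list(w))   # 0 or more output elements per input
shorts  = words.filter(lambda w: len(w) == 2)
some    = words.sample(False, 0.5)           # ~50%, without replacement
both    = upper.union(shorts)
uniq    = words.distinct()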
18. Transform (cont)
More available in documentation...
http://spark.apache.org/docs/latest/programming-guide.html#transformations
19. Action
Active operations.
Trigger execution of DAG.
Result in a value.
● reduce(func) - reduce elements w/ func
● collect - convert to native collection
● count - count elements
● foreach(func) - apply func to elements
● take(n) - return some elements
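A minimal sketch of the actions above; each one triggers execution of the DAG and returns a value to the driver:

nums = sc.parallelize([1, 2, 3, 4, 5])

total = nums.reduce(lambda a, b: a + b)   # 15
items = nums.collect()                    # [1, 2, 3, 4, 5] as a local list
n     = nums.count()                      # 5
first = nums.take(2)                      # [1, 2]
nums.foreach(lambda x: None)              # func runs on the executors, for side effects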
21. Save
Action that results in data stored to file system.
● saveAsTextFile
● saveAsSequenceFile
● saveAsObjectFile / saveAsPickleFile
● More, http://spark.apache.org/docs/latest/programming-guide.html#actions
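A small sketch of these from Python (output paths are placeholders; saveAsSequenceFile applies to pair RDDs whose keys and values convert to Hadoop Writables):

counts = sc.parallelize([("spark", 3), ("hadoop", 2)])

counts.saveAsTextFile("hdfs://...")        # one text part-file per partition
counts.saveAsPickleFile("hdfs://...")      # SequenceFile of pickled Python objects
counts.saveAsSequenceFile("hdfs://...")    # Hadoop SequenceFile of Writables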
24. Spark SQL
● Components
  ● Catalyst - generic optimization for relational algebra
  ● Core - RDD execution; formats: Parquet, JSON
  ● Hive support - run HiveQL and use Hive warehouse
● SchemaRDDs and SQL
[Diagram: Spark stack - SQL, Streaming, MLlib, GraphX on Spark core]
[Diagram: an RDD of User objects vs a SchemaRDD with Name, Age, Height columns]
sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
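A hedged sketch of the SchemaRDD workflow in the Spark 1.0-era Python API (the people.txt path and its name,age layout are assumptions, and the table-registration method was renamed in later releases):

from pyspark.sql import SQLContext

sqlCtx = SQLContext(sc)

people = (sc.textFile("hdfs://.../people.txt")
            .map(lambda line: line.split(","))
            .map(lambda p: {"name": p[0], "age": int(p[1])}))

schemaPeople = sqlCtx.inferSchema(people)    # RDD -> SchemaRDD
schemaPeople.registerAsTable("people")

teenagers = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
print(teenagers.map(lambda row: row.name).collect())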
25. Spark Streaming
● Run a streaming computation as a series of time bound, deterministic batch jobs
● Time bound used to break stream into RDDs
[Diagram: a stream is broken into RDDs X seconds wide; Spark processes each into results]
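A hedged sketch of the same idea in the Python streaming API (which landed after this talk, in Spark 1.2); the 5-second batch interval is the time bound that slices the stream into RDDs, and the socket source is a placeholder:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming word count")
ssc = StreamingContext(sc, 5)                    # 5-second batches

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                  # emit each batch's result

ssc.start()
ssc.awaitTermination()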
26. MLlib
● Machine learning algorithms over RDDs
  ● Classification - logistic regression, linear support vector machines, naive Bayes, decision trees
  ● Regression - linear regression, regression trees
  ● Collaborative filtering - alternating least squares
  ● Clustering - K-Means
  ● Optimization - stochastic gradient descent, limited-memory BFGS
  ● Dimensionality reduction - singular value decomposition, principal component analysis
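For example, a minimal K-Means sketch over an RDD (the toy points are made up for illustration):

from pyspark.mllib.clustering import KMeans

points = sc.parallelize([[0.0, 0.0], [1.0, 1.0],
                         [9.0, 8.0], [8.0, 9.0]])

model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)        # roughly (0.5, 0.5) and (8.5, 8.5)
print(model.predict([0.5, 0.5]))   # cluster index for a new point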
28. Deploying Spark
● Driver program - shell or standalone program that creates a SparkContext and works with RDDs
● Cluster Manager - standalone, Mesos or YARN
  ● Standalone - the default, simple setup, master + worker processes on nodes
  ● Mesos - a general purpose manager that runs Hadoop and other services. Two modes of operation, fine & coarse.
  ● YARN - Hadoop 2's resource manager
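As a sketch, the driver selects the cluster manager through the master URL its SparkContext (or spark-submit) is given; the host names below are placeholders:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("my app")
        .setMaster("spark://master-host:7077"))  # standalone (the default manager)
# .setMaster("mesos://mesos-host:5050")          # Mesos
# .setMaster("yarn-client")                      # YARN, Hadoop 2's resource manager
# .setMaster("local[*]")                         # local mode, handy for development

sc = SparkContext(conf=conf)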
29. Highlights from Spark Summit 2014
http://spark-summit.org/east/2015
New York in early 2015