Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15 July 2014
1. Boston Apache Spark
User Group
(the Spahk group)
Microsoft NERD Center - Horace Mann
Tuesday, 15 July 2014
2. Intro to Apache Spark
Matthew Farrellee, @spinningmatt
Updated: July 2014
3. Background - MapReduce / Hadoop
● Map & reduce around for 5+ decades (McCarthy, 1960)
● Dean and Ghemawat demonstrate map and reduce for distributed data processing (Google, 2004)
● MapReduce paper timed well with commodity hardware capabilities of the early 2000s
● Open source implementation in 2006
● Years of innovation improving, simplifying, expanding
4. MapReduce / Hadoop difficulties
● Hardware evolved
  ● Networks became fast
  ● Memory became cheap
● Programming model proved non-trivial
  ● Gave birth to multiple attempts to simplify, e.g. Pig, Hive, ...
● Primarily batch execution mode
  ● Begat specialized (non-batch) modes, e.g. Storm, Drill, Giraph, ...
5. Some history - Spark
● Started in UC Berkeley AMPLab by Matei Zaharia, 2009
  ● AMP = Algorithms Machines People
  ● AMPLab is integrating Algorithms, Machines, and People to make sense of Big Data
● Open sourced, 2010
● Donated to Apache Software Foundation, 2013
● Graduated to top level project, 2014
● 1.0 release, May 2014
6. What is Apache Spark?
An open source, efficient and productive cluster
computing system that is interoperable with
Hadoop
7. Open source
● Top level Apache project
● http://www.ohloh.net/p/apache-spark
  ● In a Nutshell, Apache Spark…
  ● has had 7,366 commits made by 299 contributors representing 117,823 lines of code
  ● is mostly written in Scala with well-commented source code
  ● has a codebase with a long source history maintained by a very large development team with increasing Y-O-Y commits
  ● took an estimated 30 years of effort (COCOMO model) starting with its first commit in March, 2010
8. Efficient
● In-memory primitives
  ● Use cluster memory and spill to disk only when necessary
● High performance
  ● https://amplab.cs.berkeley.edu/benchmark/
● General compute graphs, DAGs
  ● Not just: Load -> Map -> Reduce -> Store -> Load -> Map -> Reduce -> Store
  ● Rich and pipelined: Load -> Map -> Union -> Reduce -> Filter -> Group -> Sample -> Store
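To make the pipelining concrete, a minimal PySpark sketch (the hdfs:// paths and the filter/sample parameters are placeholders); every step below extends a single DAG, and nothing runs until the final store:

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("dag example"))

a = sc.textFile("hdfs://...")                    # Load
b = sc.textFile("hdfs://...")                    # Load a second dataset
recs = (a.union(b)                               # Union
         .map(lambda line: line.split(","))      # Map
         .filter(lambda rec: len(rec) > 1)       # Filter
         .sample(False, 0.1))                    # Sample ~10%, without replacement
grouped = recs.groupBy(lambda rec: rec[0])       # Group by first field
grouped.saveAsTextFile("hdfs://...")             # Store - only now does the DAG execute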
9. Interoperable
● Read and write data from HDFS (or any storage system with an HDFS-like API)
● Read and write Hadoop file formats
● Run on YARN
● Interact with Hive, HBase, etc
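A small sketch of the first two points, assuming the sc created in the example above and placeholder paths (sequenceFile support in the Python API depends on the Spark version):

lines = sc.textFile("hdfs://...")          # plain text stored in HDFS
pairs = sc.sequenceFile("hdfs://...")      # Hadoop SequenceFile; Writables become Python types
pairs.saveAsTextFile("hdfs://...")         # write results back to HDFS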
10. Productive
● Unified data model, the RDD
● Multiple execution modes
  ● Batch, interactive, streaming
● Multiple languages
  ● Scala, Java, Python, R, SQL
● Rich standard library
  ● Machine learning, streaming, graph processing, ETL
● Consistent API across languages
● Significant code reduction compared to MapReduce
11. Consistent API, less code...
MapReduce (Java) -

public static class WordCountMapClass extends MapReduceBase
  implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
  implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark (Scala) -

import org.apache.spark._

val sc = new SparkContext(new SparkConf().setAppName("word count"))
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Spark (Python) -

from operator import add
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("word count"))
file = sc.textFile("hdfs://...")
counts = (file.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(add))
counts.saveAsTextFile("hdfs://...")
13. The RDD
The resilient distributed dataset
A lazily evaluated, fault-tolerant collection of
elements that can be operated on in parallel
[Diagram: Load -> RDD -> Transform -> Action -> Value / Save]
14. RDDs technically
1. a set of partitions ("splits" in hadoop terms)
2. list of dependencies on parent RDDs
3. function to compute a partition given its parents
4. optional partitioner (hash, range)
5. optional preferred location(s) for each partition
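A hedged PySpark sketch of the pieces you can poke at directly (exact method availability varies a little across versions); partitions, the partitioner, and the lineage of parent RDDs are all visible from the API:

pairs = sc.parallelize([("a", 1), ("b", 2), ("c", 3)], 4)   # ask for 4 partitions

print(pairs.getNumPartitions())   # 4
print(pairs.glom().collect())     # elements grouped by partition
print(pairs.partitioner)          # None - no partitioner yet

byKey = pairs.partitionBy(2)      # hash partitioner, 2 partitions
print(byKey.partitioner)          # now set

print(byKey.toDebugString())      # lineage: dependencies on parent RDDs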
15. Load
Create an RDD.
● parallelize - convert a collection
● textFile - load a text file
● wholeTextFiles - load a dir of text files
● sequenceFile / hadoopFile - load using Hadoop file formats
● More, http://spark.apache.org/docs/latest/programming-guide.html#external-datasets
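A minimal sketch of the first three of these (the paths are placeholders):

nums  = sc.parallelize([1, 2, 3, 4, 5])   # parallelize - convert a collection
lines = sc.textFile("hdfs://...")         # textFile - one element per line
files = sc.wholeTextFiles("hdfs://...")   # wholeTextFiles - (filename, contents) pairs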
16. Transform
Lazy operations.
Build compute DAG.
Don't trigger computation.
● map(func) - elements passed through func
● flatMap(func) - func can return >=0 elements
● filter(func) - subset of elements
● sample(..., fraction, ...) - select fraction
● union(other) - union of two RDDs
● distinct - new RDD w/ distinct elements
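A minimal sketch chaining the transformations above; each call just returns a new RDD and extends the DAG, so no job runs here:

words = sc.parallelize(["to", "be", "or", "not", "to", "be"])

upper   = words.map(lambda w: w.upper())
letters = words.flatMap(lambda w: list(w))   # 0 or more output elements per input
shorts  = words.filter(lambda w: len(w) == 2)
some    = words.sample(False, 0.5)           # ~50%, without replacement
both    = upper.union(shorts)
uniq    = words.distinct()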
18. Transform (cont)
More available in documentation...
http://spark.apache.org/docs/latest/programming-guide.html#transformations
19. Action
Active operations.
Trigger execution of DAG.
Result in a value.
● reduce(func) - reduce elements w/ func
● collect - convert to native collection
● count - count elements
● foreach(func) - apply func to elements
● take(n) - return some elements
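A minimal sketch of the actions above; each one triggers execution of the DAG and returns a value to the driver:

nums = sc.parallelize([1, 2, 3, 4, 5])

total = nums.reduce(lambda a, b: a + b)   # 15
items = nums.collect()                    # [1, 2, 3, 4, 5] as a local list
n     = nums.count()                      # 5
first = nums.take(2)                      # [1, 2]
nums.foreach(lambda x: None)              # func runs on the executors, for side effects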
21. Save
Action that results in data stored to file system.
● saveAsTextFile
● saveAsSequenceFile
● saveAsObjectFile / saveAsPickleFile
● More, http://spark.apache.org/docs/latest/programming-guide.html#actions
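A small sketch of these from Python (output paths are placeholders; saveAsSequenceFile applies to pair RDDs whose keys and values convert to Hadoop Writables):

counts = sc.parallelize([("spark", 3), ("hadoop", 2)])

counts.saveAsTextFile("hdfs://...")        # one text part-file per partition
counts.saveAsPickleFile("hdfs://...")      # SequenceFile of pickled Python objects
counts.saveAsSequenceFile("hdfs://...")    # Hadoop SequenceFile of Writables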
24. Spark SQL
● Components
  ● Catalyst - generic optimization for relational algebra
  ● Core - RDD execution; formats: Parquet, JSON
  ● Hive support - run HiveQL and use Hive warehouse
● SchemaRDDs and SQL
[Diagram: Spark stack - SQL, Streaming, MLlib, GraphX on Spark core]
[Diagram: an RDD of User objects vs a SchemaRDD with Name, Age, Height columns]
sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
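A hedged sketch of the SchemaRDD workflow in the Spark 1.0-era Python API (the people.txt path and its name,age layout are assumptions, and the table-registration method was renamed in later releases):

from pyspark.sql import SQLContext

sqlCtx = SQLContext(sc)

people = (sc.textFile("hdfs://.../people.txt")
            .map(lambda line: line.split(","))
            .map(lambda p: {"name": p[0], "age": int(p[1])}))

schemaPeople = sqlCtx.inferSchema(people)    # RDD -> SchemaRDD
schemaPeople.registerAsTable("people")

teenagers = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
print(teenagers.map(lambda row: row.name).collect())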
25. Spark Streaming
● Run a streaming computation as a series of time bound, deterministic batch jobs
● Time bound used to break stream into RDDs
[Diagram: a stream is broken into RDDs X seconds wide; Spark processes each into results]
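A hedged sketch of the same idea in the Python streaming API (which landed after this talk, in Spark 1.2); the 5-second batch interval is the time bound that slices the stream into RDDs, and the socket source is a placeholder:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming word count")
ssc = StreamingContext(sc, 5)                    # 5-second batches

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                  # emit each batch's result

ssc.start()
ssc.awaitTermination()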
26. MLlib
● Machine learning algorithms over RDDs
  ● Classification - logistic regression, linear support vector machines, naive Bayes, decision trees
  ● Regression - linear regression, regression trees
  ● Collaborative filtering - alternating least squares
  ● Clustering - K-Means
  ● Optimization - stochastic gradient descent, limited-memory BFGS
  ● Dimensionality reduction - singular value decomposition, principal component analysis
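For example, a minimal K-Means sketch over an RDD (the toy points are made up for illustration):

from pyspark.mllib.clustering import KMeans

points = sc.parallelize([[0.0, 0.0], [1.0, 1.0],
                         [9.0, 8.0], [8.0, 9.0]])

model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)        # roughly (0.5, 0.5) and (8.5, 8.5)
print(model.predict([0.5, 0.5]))   # cluster index for a new point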
28. Deploying Spark
● Driver program - shell or standalone program that creates a SparkContext and works with RDDs
● Cluster Manager - standalone, Mesos or YARN
  ● Standalone - the default, simple setup, master + worker processes on nodes
  ● Mesos - a general purpose manager that runs Hadoop and other services. Two modes of operation, fine & coarse.
  ● YARN - Hadoop 2's resource manager
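As a sketch, the driver selects the cluster manager through the master URL its SparkContext (or spark-submit) is given; the host names below are placeholders:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("my app")
        .setMaster("spark://master-host:7077"))  # standalone (the default manager)
# .setMaster("mesos://mesos-host:5050")          # Mesos
# .setMaster("yarn-client")                      # YARN, Hadoop 2's resource manager
# .setMaster("local[*]")                         # local mode, handy for development

sc = SparkContext(conf=conf)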
29. Highlights from Spark Summit 2014
http://spark-summit.org/east/2015
New York in early 2015