Spark Intro @ analytics big data summit

2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Apache Spark: Fast and Easy Data Processing
Sujee Maniyam
Founder / Principal
Elephant Scale LLC
sujee@elephantscale.com
http://elephantscale.com

Outline
r  Background
r  Spark Architecture
r  Demo

Spark
r  Fast & Expressive Cluster computing engine
r  Compatible with Hadoop
r  Came out of Berkeley AMP Lab
r  Now Apache project
r  Version 1.2 just released (Dec 2014)

Timeline Hadoop & Spark

Hypo-meter :)

Spark Job Trends

Comparison With Hadoop
Hadoop                                    | Spark
Distributed Storage + Distributed Compute | Distributed Compute Only
MapReduce framework                       | Generalized computation
Usually data on disk (HDFS)               | On disk / in memory
Not ideal for iterative work              | Great at iterative workloads (machine learning, etc.)
Batch process                             | Up to 2x-10x faster for data on disk
                                          | Up to 100x faster for data in memory
                                          | Compact code
                                          | Java, Python, Scala supported
                                          | Shell for ad hoc exploration

Hadoop + YARN: Universal OS for Cluster Computing
Applications: Batch (mapreduce) | Streaming (storm, S4) | In-memory (spark)
Cluster Management: YARN
Storage: HDFS

Spark Vs Hadoop
r  Spark is ‘easier’ than Hadoop
r  ‘friendlier’ for data scientists / analysts
r  Interactive shell
r fast development cycles
r adhoc exploration
r  API supports multiple languages
r Java, Scala, Python
r  Great for small (Gigs) to medium (100s of Gigs)
data

Spark Vs. Hadoop
Fast!

Is Spark Replacing Hadoop?
r  Right now, Spark runs on Hadoop / YARN
r Complimentary
r  Can be seen as generic MapReduce
r  Spark is really great if data fits in memory (few
hundred gigs),
r  Spark can be used as compute platform with
various storage types (see next slide)

Spark & Pluggable Storage
Spark (compute engine)
HDFS | Amazon S3 | Cassandra | ???

Hadoop & Spark Future ???

Outline
r  Background
r  à Spark Architecture
r  Demo

Spark Eco-System
Spark Core
Spark SQL: Schema / SQL
Spark Streaming: Real Time
MLlib: Machine Learning

Spark Core
r  Distributed compute engine
r  Sort / shuffle algorithms
r  Handles node failures -> re-computes missing
pieces

Spark SQL
r  Adhoc / interactive analytics
r  ETL workflow
r  Reporting
r  Natively understands
r JSON : {“name”: “mark”, “age”: 40}
r Parquet : on disk columnar format, heavily
compressed

Spark SQL Quick Example
{"name":"mark", "age":50, "sex":"M"},
{"name":"mary", "age":45, "sex":"F"},
{"name":"brian", "age":15, "sex":"M"},
{"name":"nancy", "age":17, "sex":"F"}
// … setup …
val teenagers = sqlContext.sql(
  "SELECT name FROM people WHERE age >= 13 AND age <= 19")
// -> [brian, nancy]

Spark Streaming
r  Process data streams in real time, in micro-
batches
r  Low latency, high throughput (1000s events /
sec)
r  Stock ticks / sensor data (connected devices /
IoT – Internet of Things)
Streaming
Sources
Storage

Machine Learning (ML Lib)
r  Machine learning at scale
r  Out of the box ML capabilities !
r  Java / Scala / Python language support
r  Lots of common algorithms are supported
r  Classification / Regressions
r  Linear models (linear R, logistic regression, SVM)
r  Decision trees
r  Collaborative filtering (recommendations)
r  K-Means clustering
r  More to come

Spark Cluster Deployment
1) Standalone
2) YARN
3) Mesos
[Diagram labels: Client, Node]

Spark Cluster Architecture
r  Multiple ‘applications’ can run at the same time
r  Driver (or ‘main’) launches an application
r  Each application gets its own ‘executor’
r  Isolated (runs in different JVMs)
r  Also means data can not be shared across applications
r  Cluster Managers:
r  multiple cluster managers are supported
r  1) Standalone : simple to setup
r  2) YARN : on top of Hadoop
r  3) Mesos : General cluster manager (AMP lab)

Spark Data Model : RDD
r  Resilient Distributed Dataset (RDD)
r  Can live in
r Memory (best case scenario)
r Or on disk (FS, HDFS, S3 …etc)
r  Each RDD is split into multiple partitions
r  Partitions may live on different nodes
r  Partitions can be computed in parallel on
different nodes

Partitions Explained
1G file -> 64 M | 64 M | 64 M | 64 M
        -> Task | Task | Task | Task (parallelism)
        -> Result
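
A small sketch of the same idea on an in-memory collection, assuming a local master: the data is split into a fixed number of partitions, and each partition is handled by its own task. The object name `PartitionSketch` and the choice of 1000 records in 4 partitions are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionSketch {
  // Returns (number of partitions, records held by each partition).
  def run(): (Int, List[Int]) = {
    val conf = new SparkConf().setAppName("partition-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Ask for 4 partitions explicitly.
      val rdd = sc.parallelize(1 to 1000, 4)
      val numPartitions = rdd.partitions.length
      // glom() turns each partition into one array, so its length is
      // the number of records that partition holds.
      val sizes = rdd.glom().map(_.length).collect().toList
      (numPartitions, sizes)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```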

RDD : Loading
r  Use Spark context to load RDDs from disk /
external storage
val sc = new SparkContext(…)
val f = sc.textFile(“/data/input1.txt”) // single file
sc.textFile(“/data/”) // load all files under dir
sc.textFile(“/data/*.log”) // wild card matching

RDD Operations
r  Two kinds of operations on RDDs
r 1) Transformations
r Create a new RDD from existing ones (e.g. Map)
r 2) Actions
r E.g. Returns the results to clients (e.g. Reduce)
r  Transformations are lazy.. Actions force
transformations
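
One way to see the laziness, sketched under the assumption of a local master: a `map` over a record that cannot be parsed does not fail when the transformation is defined, only when an action forces it to run. The object name `LazySketch` and the sample records are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

object LazySketch {
  // Returns (defining the transformation succeeded, running the action failed).
  def run(): (Boolean, Boolean) = {
    val conf = new SparkConf().setAppName("lazy-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val raw = sc.parallelize(Seq("1", "2", "oops"))
      // Transformation: only recorded, nothing is parsed yet,
      // so the bad record "oops" causes no error here.
      val defined = Try(raw.map(_.toInt))
      // Action: forces the map to run; parsing "oops" now fails the job.
      val forced = Try(defined.get.collect())
      (defined.isSuccess, forced.isFailure)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```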

Transformations / Actions
[Diagram: RDD 1 -> Transformation 1 (map) -> RDD 2; RDD 3 -> Action1 (collect) -> Client]

RDD Transformations
Transformation | Description                            | Example
filter         | Filters through each record (aka grep) | f.filter(line => line.contains("ERROR"))
union          | Merges two RDDs                        | rdd1.union(rdd2)
… see docs …
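
The two transformations in the table can be combined as in the following sketch. It assumes a local master and two tiny in-memory "log" RDDs; the object name `TransformSketch` and the log lines are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformSketch {
  def run(): List[String] = {
    val conf = new SparkConf().setAppName("transform-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val appLog = sc.parallelize(Seq("INFO start", "ERROR disk full"))
      val dbLog = sc.parallelize(Seq("ERROR timeout", "WARN slow query"))
      // union merges the two RDDs; filter keeps only matching records.
      val errors = appLog.union(dbLog).filter(line => line.contains("ERROR"))
      // collect() is an action; it is safe here only because the data is tiny.
      errors.collect().toList.sorted
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```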

RDD Actions
Action    | Description                                                    | Example
count()   | Counts all records in an RDD                                   | f.count()
first()   | Extracts the first record                                      | f.first()
take(n)   | Takes the first N records                                      | f.take(10)
collect() | Gathers all records of the RDD. All data has to fit in the     | f.collect()
          | memory of ONE machine (don't use for big data sets)            |
… see documentation …
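
A short sketch of these four actions on a small in-memory RDD, assuming a local master. The object name `ActionSketch` and the sample data are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ActionSketch {
  // Returns the results of count, first, take(2) and collect.
  def run(): (Long, String, List[String], List[String]) = {
    val conf = new SparkConf().setAppName("action-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val f = sc.parallelize(Seq("a", "b", "c", "d"), 2)
      val n = f.count()               // number of records, as a Long
      val head = f.first()            // first record
      val firstTwo = f.take(2).toList // first N records
      val all = f.collect().toList    // every record, pulled to the driver
      (n, head, firstTwo, all)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```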

RDD : Saving
r  saveAsTextFile () and saveAsSequenceFile()
f.saveAsTextFile(“/output/directory”) // a directory
r  Output usually is a directory
r RDDs will be saved as multiple files in the dir
r Each partition à one output file

Caching of RDDs
r  RDDs can be loaded from disk and computed
r Hadoop mapreduce model
r  Also RDDs can be cached in memory
r  Subsequent operations are much faster
f.persist() // on disk or memory
f.cache() // memory only
r  In memory RDDs are
great for iterative workloads
r Machine learning algorithms

Demo Time !

Demo : Spark-shell
r  Invoke spark shell
r  Load a data set
r  Do basic operations (count / filter)

Demo: RDD Caching
r  From Spark shell
r  Load an RDD
r  Demonstrate the difference between cached and
non-cached

Demo : Map Reduce
r  Quick Word count
val input = sc.textFile(“…”)
val counts = input.flatMap(
line => line.split(“ “)).
map(word => (word, 1)).
reduceByKey(_+_)
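
For readers without an input file at hand, the same word count can be tried on a couple of in-memory lines. This is a sketch assuming a local master; the object name `WordCountSketch` and the sample sentences are made up, and the input path from the slide stays elided.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Brings reduceByKey into scope on Spark 1.2; harmless on later versions.
import org.apache.spark.SparkContext._

object WordCountSketch {
  def run(): Map[String, Int] = {
    val conf = new SparkConf().setAppName("wordcount-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for sc.textFile(...): two short lines of text.
      val input = sc.parallelize(Seq("to be or not to be", "to do"))
      val counts = input
        .flatMap(line => line.split(" ")) // "map" side: one record per word
        .map(word => (word, 1))           // pair each word with a count of 1
        .reduceByKey(_ + _)               // "reduce" side: sum counts per word
      counts.collect().toMap
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```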

Thanks !
Sujee Maniyam
sujee@elephantscale.com
http://elephantscale.com
Expert consulting & training in Big Data

Credits
r  http://spark.apache.org/
r  http://www.strategictechplanning.com
r  Tuningpp.com
r  Kidzworld.com
