© 2014 MapR Technologies
Apache Spark
Keys Botzum
Senior Principal Technologist, MapR Technologies
June 2014
Agenda
•  MapReduce
•  Apache Spark
•  How Spark Works
•  Fault Tolerance and Performance
•  Examples
•  Spark and More
MapR: Best Product, Best Business & Best Customers
•  Top Ranked
•  Exponential Growth
•  500+ Customers
•  Cloud Leaders
•  3X bookings Q1 ‘13 – Q1 ‘14
•  80% of accounts expand 3X
•  90% software licenses
•  <1% lifetime churn
•  >$1B in incremental revenue generated by 1 customer
Review: MapReduce
MapReduce: A Programming Model
•  MapReduce: Simplified Data Processing on Large Clusters (published 2004)
•  Parallel and Distributed Algorithm:
–  Data Locality
–  Fault Tolerance
–  Linear Scalability
MapReduce Basics
•  Assumes a scalable distributed file system that shards data
•  Map
–  Loads the data and defines a set of keys
•  Reduce
–  Collects the organized key-based data to process and output
•  Performance can be tweaked based on known details of your source files (size, total number) and cluster shape
MapReduce Processing Model
•  Define mappers
•  Shuffling is automatic
•  Define reducers
•  For complex work, chain jobs together
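The processing model above can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce steps on a single machine, not Hadoop code; the function names map_phase, shuffle, reduce_phase, and word_count are made up for this example.

```python
from collections import defaultdict


def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group values by key (the framework does this automatically)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reducer: collapse the list of values for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases, as a single MapReduce job would."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["to be or not to be", "to do"]))
```

For more complex work, the output of one such job would become the input of the next, which is the job chaining mentioned in the last bullet.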
MapReduce: The Good
•  Built-in fault tolerance
•  Optimized IO path
•  Scalable
•  Developer focuses on Map/Reduce, not infrastructure
•  Simple (?) API
MapReduce: The Bad
•  Optimized for disk IO
–  Doesn’t leverage memory well
–  Iterative algorithms go through the disk IO path again and again
•  Primitive API
–  Developers have to build on a very simple abstraction
–  Key/Value in/out
–  Even basic things like join require extensive code
•  Result is often many files that need to be combined appropriately
Apache Spark
Apache Spark
•  spark.apache.org
•  github.com/apache/spark
•  user@spark.apache.org
•  Originally developed in
2009 in UC Berkeley’s
AMP Lab
•  Fully open sourced in
2010 – now at Apache
Software Foundation
- Commercial Vendor Developing/Supporting
Spark: Easy and Fast Big Data
•  Easy to Develop
–  Rich APIs in
Java, Scala,
Python
–  Interactive shell
•  Fast to Run
–  General execution
graphs
–  In-memory storage
2-5× less code
Resilient Distributed Datasets (RDD)
•  Spark revolves around RDDs
•  Fault-tolerant, read-only collection of elements that can be operated on in parallel
•  Cached in memory or on disk
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
RDD Operations - Expressive
•  Transformations
–  Creation of a new RDD dataset from an existing one
•  map, filter, distinct, union, sample, groupByKey, join, reduceByKey, etc…
•  Actions
–  Return a value after running a computation
•  collect, count, first, takeSample, foreach, reduce, etc…
Check the documentation for a complete list
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
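The split between transformations and actions can be illustrated with a toy class in plain Python. ToyDataset is a hypothetical stand-in written for this example, not part of Spark: transformations only record a recipe and return a new dataset, and nothing is computed until an action asks for a value.

```python
class ToyDataset:
    """Hypothetical stand-in for an RDD: holds a recipe, not results."""

    def __init__(self, source):
        # source is a zero-argument function returning a fresh iterator
        self._source = source

    @classmethod
    def from_list(cls, items):
        return cls(lambda: iter(items))

    # Transformations: return a new ToyDataset and compute nothing yet
    def map(self, fn):
        return ToyDataset(lambda: (fn(x) for x in self._source()))

    def filter(self, pred):
        return ToyDataset(lambda: (x for x in self._source() if pred(x)))

    # Actions: run the recipe and return a plain value
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())

    def first(self):
        return next(self._source())


if __name__ == "__main__":
    evens = ToyDataset.from_list([1, 2, 3, 4]).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
    print(evens.collect())
```

Each action here re-runs the whole recipe from the source, which is roughly why caching matters for data that is reused.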
Easy: Clean API
•  Resilient Distributed
Datasets
•  Collections of objects spread
across a cluster, stored in
RAM or on Disk
•  Built through parallel
transformations
•  Automatically rebuilt on
failure
•  Operations
•  Transformations
(e.g. map, filter,
groupBy)
•  Actions
(e.g. count,
collect, save)
Write programs in terms of transformations on
distributed datasets
Easy: Expressive API
•  map
•  filter
•  groupBy
•  sort
•  union
•  join
•  leftOuterJoin
•  rightOuterJoin
•  reduce
•  count
•  fold
•  reduceByKey
•  groupByKey
•  cogroup
•  cross
•  zip
•  sample
•  take
•  first
•  partitionBy
•  mapWith
•  pipe
•  save
•  ...
Easy: Example – Word Count
•  Hadoop MapReduce
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  // Emit (word, 1) for every token in the input line
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  // Sum the counts collected for each word
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
•  Spark
// sparkHome and jars are optional arguments
val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Easy: Works Well With Hadoop
•  Data Compatibility
•  Access your existing
Hadoop Data
•  Use the same data
formats
•  Adheres to data
locality for efficient
processing
•  Deployment
Models
•  “Standalone”
deployment
•  YARN-based
deployment
•  Mesos-based
deployment
•  Deploy on existing
Hadoop cluster or
side-by-side
Easy: User-Driven Roadmap
•  Language support
–  Improved Python
support
–  SparkR
–  Java 8
–  Integrated Schema
and SQL support in
Spark’s APIs
•  Better ML
–  Sparse Data
Support
–  Model Evaluation
Framework
–  Performance Testing
Example: Logistic Regression
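from math import exp
import numpy

# Load the points once and keep them in memory across iterations
data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)
for i in range(iterations):
    # Gradient of the logistic loss, summed over all points
    gradient = (data
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x)
        .reduce(lambda x, y: x + y))
    w -= gradient
print("Final w: %s" % w)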
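The same loop can be tried without a cluster. The sketch below is a single-machine version in plain Python, under several assumptions that are not in the slides: the synthetic two-blob data from make_points, the step size, and averaging the gradient over the points. The per-point term uses the standard logistic-loss gradient, (1 / (1 + exp(-y w·x)) - 1) y x.

```python
import math
import random


def make_points(n=200, seed=0):
    """Synthetic data (an assumption): two Gaussian blobs labelled +1 and -1."""
    rng = random.Random(seed)
    points = []
    for i in range(n):
        y = 1.0 if i % 2 == 0 else -1.0
        x = [rng.gauss(2.0 * y, 1.0), rng.gauss(2.0 * y, 1.0)]
        points.append((x, y))
    return points


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def point_gradient(w, x, y):
    """Logistic-loss gradient contributed by one point."""
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def train(points, iterations=50, step=0.1, seed=1):
    rng = random.Random(seed)
    w = [rng.random() for _ in range(len(points[0][0]))]
    for _ in range(iterations):
        # "map": one gradient term per point; "reduce": element-wise sum
        terms = (point_gradient(w, x, y) for x, y in points)
        gradient = [sum(column) for column in zip(*terms)]
        w = [wi - step * gi / len(points) for wi, gi in zip(w, gradient)]
    return w


def accuracy(w, points):
    """Fraction of points whose label agrees with the sign of w.x."""
    return sum(1 for x, y in points if dot(w, x) * y > 0) / len(points)


if __name__ == "__main__":
    pts = make_points()
    print(accuracy(train(pts), pts))
```

Every iteration scans the same points again, which is the access pattern that benefits from keeping the data cached in memory.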
Fast: Logistic Regression Performance
[Figure: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark. Hadoop: 110 s / iteration. Spark: first iteration 80 s, further iterations 1 s.]
Easy: Multi-language Support
Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()
Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()
Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
Easy: Interactive Shell
Scala-based shell
% /opt/mapr/spark/spark-0.9.1/bin/spark-shell
scala> val logs = sc.textFile("hdfs:///user/keys/logdata")
scala> logs.count()
…
res0: Long = 232681
scala> logs.filter(l => l.contains("ERROR")).count()
…
res1: Long = 205

A Python-based shell, pyspark, is available as well
Fault Tolerance and Performance
Fast: Using RAM, Operator Graphs
•  In-memory Caching
•  Data Partitions read
from RAM instead of
disk
•  Operator Graphs
•  Scheduling
Optimizations
•  Fault Tolerance
[Figure: operator graph over RDDs A through F, with map, groupBy, filter, and join operations grouped into Stage 1, Stage 2, and Stage 3; the legend marks RDDs and cached partitions.]
Directed Acyclic Graph (DAG)
•  Directed
–  Edges run in a single direction only
•  Acyclic
–  No looping
•  This supports fault tolerance
Easy: Fault Recovery
RDDs track lineage information that can be used to
efficiently recompute lost data
msgs = (textFile.filter(lambda s: s.startswith("ERROR"))
                .map(lambda s: s.split("\t")[2]))
[Figure: lineage chain: HDFS File → filter (func = startsWith(…)) → Filtered RDD → map (func = split(...)) → Mapped RDD]
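Lineage-based recovery can be sketched with a toy class in plain Python. LineageDataset is hypothetical and written only for this illustration; it is not how Spark is implemented. Each dataset remembers its parent and the operation that produced it, so a lost partition can be rebuilt by re-running just that partition's recipe.

```python
class LineageDataset:
    """Hypothetical toy RDD: partitions plus the recipe that produces them."""

    def __init__(self, num_partitions, compute, parent=None, op="source"):
        self.num_partitions = num_partitions
        self._compute = compute        # function: partition index -> list of records
        self.parent = parent
        self.op = op
        self._materialized = {}        # partition index -> computed records

    @classmethod
    def from_blocks(cls, blocks):
        return cls(len(blocks), lambda i: list(blocks[i]))

    def filter(self, pred):
        return LineageDataset(
            self.num_partitions,
            lambda i: [r for r in self.partition(i) if pred(r)],
            parent=self, op="filter")

    def map(self, fn):
        return LineageDataset(
            self.num_partitions,
            lambda i: [fn(r) for r in self.partition(i)],
            parent=self, op="map")

    def partition(self, i):
        # Compute on first use, or again after the partition was lost
        if i not in self._materialized:
            self._materialized[i] = self._compute(i)
        return self._materialized[i]

    def lose_partition(self, i):
        """Simulate a failure that drops one computed partition."""
        self._materialized.pop(i, None)

    def lineage(self):
        """List the operations from the source down to this dataset."""
        node, ops = self, []
        while node is not None:
            ops.append(node.op)
            node = node.parent
        return list(reversed(ops))


if __name__ == "__main__":
    blocks = [["ERROR\tdb\tmysql down", "INFO\tweb\tok"], ["ERROR\tweb\tphp crash"]]
    msgs = (LineageDataset.from_blocks(blocks)
            .filter(lambda s: s.startswith("ERROR"))
            .map(lambda s: s.split("\t")[2]))
    print(msgs.lineage(), msgs.partition(1))
```

Only the lost partition is recomputed; the other partitions are left untouched.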
  
RDD Persistence / Caching
•  Variety of storage levels
–  MEMORY_ONLY (default), MEMORY_AND_DISK, etc…
•  API Calls
–  persist(StorageLevel)
–  cache() – shorthand for persist(StorageLevel.MEMORY_ONLY)
•  Considerations
–  Read from disk vs. recompute (MEMORY_AND_DISK)
–  Total memory storage size (MEMORY_ONLY_SER)
–  Replicate to second node for faster fault recovery (MEMORY_ONLY_2)
•  Think about this option if supporting a time-sensitive client
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
PageRank Performance
Iteration time (s) by number of machines
Machines    Hadoop    Spark
30          171       23
60          80        14
Other Iterative Algorithms
Time per Iteration (s)
Algorithm              Hadoop    Spark
Logistic Regression    110       0.96
K-Means Clustering     155       4.1
Fast: Scaling Down
Execution time (s) by % of working set in cache
Cache disabled    69
25%               58
50%               41
75%               30
Fully cached      12
Comparison to Storm
•  Higher throughput than Storm
–  Spark Streaming: 670k records/sec/node
–  Storm: 115k records/sec/node
–  Commercial systems: 100-500k records/sec/node
[Figure: two panels, WordCount and Grep, each showing throughput per node (MB/s) vs. record size (100 and 1000 bytes) for Spark and Storm.]
How Spark Works
Working With RDDs
textFile = sc.textFile("SomeFile.txt")    # creates an RDD
linesWithSpark = textFile.filter(lambda line: "Spark" in line)    # transformation: RDD → RDD
linesWithSpark.count()    # action: RDD → value
# 74
linesWithSpark.first()
# Apache Spark
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile("hdfs://...")    # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()    # action
messages.filter(lambda s: "php" in s).count()
[Figure: a driver and three workers; each worker holds one HDFS block (Block 1, 2, 3) and one cache (Cache 1, 2, 3).]
•  First action ("mysql"): the driver sends tasks to the workers; each worker reads its HDFS block, processes and caches the data, and returns results to the driver
•  Second action ("php"): the driver sends tasks again; each worker processes from its cache and returns results
Cache your data → Faster Results
Full-text search of Wikipedia
•  60GB on 20 EC2 machines
•  0.5 sec from cache vs. 20s for on-disk
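The driver and worker interaction in this example can be mimicked in plain Python. The Worker class and driver_count function below are hypothetical names used only for this sketch, and a list of strings stands in for an HDFS block. The first query reads each block and fills the cache; later queries are answered from the cache alone.

```python
class Worker:
    """Hypothetical worker: one block on disk plus an in-memory cache."""

    def __init__(self, block):
        self.block = block        # raw log lines, standing in for an HDFS block
        self.cache = None         # filled the first time messages() runs
        self.disk_reads = 0

    def messages(self):
        if self.cache is None:
            # Read the block, keep ERROR lines, extract the third tab-separated field
            self.disk_reads += 1
            errors = (s for s in self.block if s.startswith("ERROR"))
            self.cache = [s.split("\t")[2] for s in errors]
        return self.cache

    def count_matching(self, keyword):
        return sum(1 for m in self.messages() if keyword in m)


def driver_count(workers, keyword):
    """Driver: send one task per worker and add up the partial results."""
    return sum(w.count_matching(keyword) for w in workers)


if __name__ == "__main__":
    workers = [
        Worker(["ERROR\tdb\tmysql timeout", "INFO\tweb\tstarted"]),
        Worker(["ERROR\tweb\tphp fatal", "ERROR\tdb\tmysql refused"]),
        Worker(["WARN\tdb\tmysql slow"]),
    ]
    print(driver_count(workers, "mysql"), driver_count(workers, "php"))
    print([w.disk_reads for w in workers])
```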
Example: Page Rank
Example: PageRank
•  Good example of a more complex algorithm
–  Multiple stages of map & reduce
•  Benefits from Spark’s in-memory caching
–  Multiple iterations over the same data
Basic Idea
Give pages ranks (scores) based on links to them
•  Links from many pages → high rank
•  Link from a high-rank page → high rank
Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
Algorithm
1.  Start each page at a rank of 1
2.  On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3.  Set each page’s rank to 0.15 + 0.85 × contribs
[Figure: four-page example graph, ranks listed in the figure's reading order. Start: 1.0, 1.0, 1.0, 1.0. After one iteration: 0.58, 1.0, 1.85, 0.58. After two iterations: 0.39, 1.72, 1.31, 0.58. Final state: 0.46, 1.37, 1.44, 0.73.]
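The three steps of the algorithm can be checked on a small graph in plain Python. The edge list LINKS below is an assumption: the slides do not spell out the example graph, so this one was chosen because it is consistent with the rank values shown for the four-page example. The sketch also assumes every page has at least one outgoing link.

```python
def pagerank(links, iterations):
    """links maps each page to the list of pages it links to."""
    ranks = {page: 1.0 for page in links}                    # step 1
    for _ in range(iterations):
        contribs = {page: 0.0 for page in links}
        for page, neighbors in links.items():                # step 2
            share = ranks[page] / len(neighbors)
            for dest in neighbors:
                contribs[dest] += share
        ranks = {page: 0.15 + 0.85 * c                       # step 3
                 for page, c in contribs.items()}
    return ranks


# Assumed four-page graph (inferred, not taken from the slides)
LINKS = {
    "A": ["C", "D"],
    "B": ["A", "C"],
    "C": ["B"],
    "D": ["C"],
}

if __name__ == "__main__":
    for n in (1, 2, 100):
        print(n, {page: round(rank, 2) for page, rank in pagerank(LINKS, n).items()})
```

After one iteration page C, which receives a full contribution from D and half contributions from A and B, reaches 0.15 + 0.85 × 2 = 1.85.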
Scala Implementation
val links = // load RDD of (url, neighbors) pairs
var ranks = // give each url rank of 1.0
for (i <- 1 to ITERATIONS) {
  // Each page splits its rank evenly among the pages it links to
  val contribs = links.join(ranks).values.flatMap {
    case (urls, rank) =>
      urls.map(dest => (dest, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPageRank.scala
Spark and More
Easy: Unified Platform
Spark (General execution engine), with libraries built on top:
•  Spark SQL (SQL)
•  Spark Streaming (Streaming)
•  MLlib (Machine learning)
•  GraphX (Graph computation)
Continued innovation bringing new functionality, e.g.:
•  BlinkDB (Approximate Queries)
•  SparkR (R wrapper for Spark)
•  Tachyon (off-heap RDD caching)
Spark on MapR
•  Certified Spark Distribution
•  Fully supported and packaged by MapR in
partnership with Databricks
–  mapr-spark package with Spark, Shark, Spark Streaming today
–  Spark-python, GraphX and MLlib soon
•  YARN integration
–  Spark can then allocate resources from cluster
when needed
References
•  Based on slides from Pat McDonough at
•  Spark web site: http://spark.apache.org/
•  Spark on MapR:
–  http://www.mapr.com/products/apache-spark
–  http://doc.mapr.com/display/MapR/Installing+Spark+and+Shark
Q&A
@mapr maprtech
kbotzum@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies

Más contenido relacionado

Was ist angesagt?

Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Databricks
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingCloudera, Inc.
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark Aakashdata
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopjoelcrabb
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failingSandy Ryza
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Edureka!
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Vinoth Chandar
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 

Was ist angesagt? (20)

Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Sqoop
SqoopSqoop
Sqoop
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 

Andere mochten auch

Architectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop DistributionArchitectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop Distributionmcsrivas
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Hands on MapR -- Viadea
Hands on MapR -- ViadeaHands on MapR -- Viadea
Hands on MapR -- Viadeaviadea
 
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)Amazon Web Services
 
MapR Tutorial Series
MapR Tutorial SeriesMapR Tutorial Series
MapR Tutorial Seriesselvaraaju
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APImcsrivas
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 
MapR Data Analyst
MapR Data AnalystMapR Data Analyst
MapR Data Analystselvaraaju
 

Andere mochten auch (10)

Architectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop DistributionArchitectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop Distribution
 
Modern Data Architecture
Modern Data ArchitectureModern Data Architecture
Modern Data Architecture
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Hands on MapR -- Viadea
Hands on MapR -- ViadeaHands on MapR -- Viadea
Hands on MapR -- Viadea
 
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)
 
MapR Tutorial Series
MapR Tutorial SeriesMapR Tutorial Series
MapR Tutorial Series
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase API
 
Deep Learning for Fraud Detection
Deep Learning for Fraud DetectionDeep Learning for Fraud Detection
Deep Learning for Fraud Detection
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
MapR Data Analyst
MapR Data AnalystMapR Data Analyst
MapR Data Analyst
 

Apache Spark & Hadoop

  • 1. Apache Spark. Keys Botzum, Senior Principal Technologist, MapR Technologies. June 2014. © 2014 MapR Technologies
  • 2. Agenda
      • MapReduce
      • Apache Spark
      • How Spark Works
      • Fault Tolerance and Performance
      • Examples
      • Spark and More
  • 3. MapR: Best Product, Best Business & Best Customers
      • Top Ranked
      • Exponential Growth
      • 500+ Customers
      • Cloud Leaders
      • 3X bookings Q1 ’13 – Q1 ’14
      • 80% of accounts expand 3X
      • 90% software licenses
      • <1% lifetime churn
      • >$1B in incremental revenue generated by 1 customer
  • 4. Review: MapReduce
  • 5. MapReduce: A Programming Model
      • MapReduce: Simplified Data Processing on Large Clusters (published 2004)
      • Parallel and Distributed Algorithm:
        – Data Locality
        – Fault Tolerance
        – Linear Scalability
  • 6. MapReduce Basics
      • Assumes a scalable distributed file system that shards data
      • Map
        – Loads the data and defines a set of keys
      • Reduce
        – Collects the organized key-based data to process and output
      • Performance can be tweaked based on known details of your source files and cluster shape (size, total number)
  • 7. MapReduce Processing Model
      • Define mappers
      • Shuffling is automatic
      • Define reducers
      • For complex work, chain jobs together
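
The mapper, shuffle, and reducer steps listed on this slide can be sketched as a single-process word count in plain Python. This is only an illustration of the programming model, not Hadoop code; the names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group values by key. A MapReduce framework does this itself."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values collected for one key into a total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["to be or not to be", "to do"]))
```

Only `map_phase` and `reduce_phase` correspond to code a developer would write; the shuffle in the middle is the part the framework supplies.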
  • 8. MapReduce: The Good
      • Built-in fault tolerance
      • Optimized IO path
      • Scalable
      • Developer focuses on Map/Reduce, not infrastructure
      • Simple? API
  • 9. MapReduce: The Bad
      • Optimized for disk IO
        – Doesn’t leverage memory well
        – Iterative algorithms go through the disk IO path again and again
      • Primitive API
        – Developers have to build on a very simple abstraction
        – Key/Value in/out
        – Even basic things like join require extensive code
      • The result is often many files that need to be combined appropriately
  • 10. Apache Spark
  • 11. Apache Spark
      • spark.apache.org
      • github.com/apache/spark
      • user@spark.apache.org
      • Originally developed in 2009 in UC Berkeley’s AMP Lab
      • Fully open sourced in 2010; now at the Apache Software Foundation
        – Commercial Vendor Developing/Supporting
  • 12. Spark: Easy and Fast Big Data
      • Easy to Develop
        – Rich APIs in Java, Scala, Python
        – Interactive shell
      • Fast to Run
        – General execution graphs
        – In-memory storage
      • 2-5× less code
  • 13. Resilient Distributed Datasets (RDD)
      • Spark revolves around RDDs
      • Fault-tolerant, read-only collection of elements that can be operated on in parallel
      • Cached in memory or on disk
      • http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  • 14. RDD Operations - Expressive
      • Transformations
        – Create a new RDD from an existing one
        – map, filter, distinct, union, sample, groupByKey, join, reduceByKey, etc.
      • Actions
        – Return a value after running a computation
        – collect, count, first, takeSample, foreach, etc.
      • Check the documentation for a complete list: http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
  • 15. Easy: Clean API
      • Resilient Distributed Datasets
        – Collections of objects spread across a cluster, stored in RAM or on disk
        – Built through parallel transformations
        – Automatically rebuilt on failure
      • Operations
        – Transformations (e.g. map, filter, groupBy)
        – Actions (e.g. count, collect, save)
      • Write programs in terms of transformations on distributed datasets
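
The split between transformations and actions can be illustrated with a toy class in plain Python. It is a sketch of the idea only: `ToyRDD` is a hypothetical name, it runs in one process, and it does not reflect how Spark is implemented. Transformations return a new object holding a recipe; no function is applied until an action asks for a result.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it holds a recipe, not computed data."""

    def __init__(self, compute):
        # compute is a zero-argument function that returns a fresh iterator
        self._compute = compute

    @classmethod
    def parallelize(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: build a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._compute()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    # Actions: run the whole chain and return a plain Python value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


if __name__ == "__main__":
    seen = []

    def square(x):
        seen.append(x)  # record when the function really runs
        return x * x

    evens = ToyRDD.parallelize(range(5)).map(square).filter(lambda v: v % 2 == 0)
    print(seen)             # still empty: nothing has executed
    print(evens.collect())  # the action triggers the computation
```

Because this toy keeps no cache, every action re-runs the whole chain, which is the behaviour that explicit caching is meant to avoid.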
  • 16. Easy: Expressive API
      • map
      • reduce
  • 17. Easy: Expressive API
      • map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin
      • reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
      • sample, take, first, partitionBy, mapWith, pipe, save, ...
  • 18. Easy: Example – Word Count
      • Hadoop MapReduce

          public static class WordCountMapClass extends MapReduceBase
              implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
              String line = value.toString();
              StringTokenizer itr = new StringTokenizer(line);
              while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
              }
            }
          }

          public static class WordCountReduce extends MapReduceBase
              implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
              int sum = 0;
              while (values.hasNext()) {
                sum += values.next().get();
              }
              output.collect(key, new IntWritable(sum));
            }
          }

      • Spark

          val spark = new SparkContext(master, appName, [sparkHome], [jars])
          val file = spark.textFile("hdfs://...")
          val counts = file.flatMap(line => line.split(" "))
                           .map(word => (word, 1))
                           .reduceByKey(_ + _)
          counts.saveAsTextFile("hdfs://...")
  • 20. Easy: Works Well With Hadoop
      • Data Compatibility
        – Access your existing Hadoop data
        – Use the same data formats
        – Adheres to data locality for efficient processing
      • Deployment Models
        – “Standalone” deployment
        – YARN-based deployment
        – Mesos-based deployment
        – Deploy on an existing Hadoop cluster or side-by-side
  • 21. Easy: User-Driven Roadmap
      • Language support
        – Improved Python support
        – SparkR
        – Java 8
        – Integrated Schema and SQL support in Spark’s APIs
      • Better ML
        – Sparse Data Support
        – Model Evaluation Framework
        – Performance Testing
  • 22. Example: Logistic Regression

          from math import exp
          import numpy

          data = spark.textFile(...).map(readPoint).cache()
          w = numpy.random.rand(D)
          for i in range(iterations):
              # gradient of the logistic loss log(1 + exp(-y * w.x)), summed over all points
              gradient = data \
                  .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x) \
                  .reduce(lambda x, y: x + y)
              w -= gradient
          print("Final w: %s" % w)
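
The same loop can be tried without a cluster using only the Python standard library. The sketch below is an assumption-laden stand-in for the slide's code: the four data points, the step size, and the helper names `dot`, `point_gradient`, and `train` are invented for illustration. For the per-point loss log(1 + exp(-y w·x)), the gradient with respect to w is (1 / (1 + exp(-y w·x)) - 1) y x.

```python
import math


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def point_gradient(w, x, y):
    """Gradient of log(1 + exp(-y * w.x)) with respect to w for one point."""
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def train(points, dims, iterations=100, step=0.1):
    """Batch gradient descent: per-point gradients are computed, then summed."""
    w = [0.0] * dims
    for _ in range(iterations):
        total = [0.0] * dims
        for x, y in points:  # the "map" step: one gradient per point
            grad = point_gradient(w, x, y)
            total = [t + g for t, g in zip(total, grad)]  # the "reduce" step
        w = [wi - step * ti for wi, ti in zip(w, total)]
    return w


# Hypothetical toy data: the label is +1 when the first feature is positive.
POINTS = [
    ([2.0, 1.0], 1),
    ([1.5, -0.5], 1),
    ([-1.0, 0.5], -1),
    ([-2.0, -1.0], -1),
]

if __name__ == "__main__":
    weights = train(POINTS, dims=2)
    print("Final w:", weights)
```

Every iteration reads the full dataset again, which is why keeping `data` cached in memory matters so much for this kind of algorithm.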
  • 23. Fast: Logistic Regression Performance
      • [Chart: running time (s) versus number of iterations (1, 5, 10, 20, 30), Hadoop compared with Spark]
      • Hadoop: 110 s / iteration
      • Spark: first iteration 80 s, further iterations 1 s
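
As a rough worked example of what those per-iteration figures imply, the totals can be computed directly. This assumes each figure stays constant across iterations, which the slide does not state; the function names are made up.

```python
def hadoop_total(iterations, per_iteration=110):
    """Total seconds if every iteration costs the same (re-reads from disk)."""
    return iterations * per_iteration


def spark_total(iterations, first=80, later=1):
    """Total seconds with a slow first pass (loading) and fast cached passes."""
    if iterations == 0:
        return 0
    return first + (iterations - 1) * later


if __name__ == "__main__":
    for n in (1, 5, 10, 20, 30):
        print(n, hadoop_total(n), spark_total(n))
```

Under these assumptions, 30 iterations take 3300 s in the first case and 109 s in the second.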
  • 24. ® © 2014 MapR Technologies 24 Easy: Multi-language Support
    Python:
      lines = sc.textFile(...)
      lines.filter(lambda s: "ERROR" in s).count()
    Scala:
      val lines = sc.textFile(...)
      lines.filter(x => x.contains("ERROR")).count()
    Java:
      JavaRDD<String> lines = sc.textFile(...);
      lines.filter(new Function<String, Boolean>() {
        public Boolean call(String s) { return s.contains("ERROR"); }
      }).count();
  • 25. ® © 2014 MapR Technologies 25 Easy: Interactive Shell
    Scala-based shell:
      % /opt/mapr/spark/spark-0.9.1/bin/spark-shell
      scala> val logs = sc.textFile("hdfs:///user/keys/logdata")
      scala> logs.count()
      …
      res0: Long = 232681
      scala> logs.filter(l => l.contains("ERROR")).count()
      …
      res1: Long = 205
    A Python-based shell is available as well: pyspark
  • 26. ® © 2014 MapR Technologies 26© 2014 MapR Technologies ® Fault Tolerance and Performance
  • 27. ® © 2014 MapR Technologies 27 Fast: Using RAM, Operator Graphs •  In-memory Caching •  Data Partitions read from RAM instead of disk •  Operator Graphs •  Scheduling Optimizations •  Fault Tolerance [Figure: operator graph of RDDs A–F connected by map, filter, groupBy, and join, divided into Stages 1–3; the legend marks RDDs and cached partitions]
  • 28. ® © 2014 MapR Technologies 28 Directed Acyclic Graph (DAG) •  Directed –  Edges point in a single direction •  Acyclic –  No looping •  This supports fault tolerance: lost data can be recomputed by replaying the graph
  • 29. ® © 2014 MapR Technologies 29 Easy: Fault Recovery RDDs track lineage information that can be used to efficiently recompute lost data
      msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])
    Lineage: HDFS File → filter (func = startswith(…)) → Filtered RDD → map (func = split(...)) → Mapped RDD
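The idea of recomputing lost data from lineage can be illustrated with a toy class that remembers its parent and the function applied to it. This is a sketch of the concept only: `LineageDataset` and the sample log lines are hypothetical and do not reflect how Spark implements RDDs internally.

```python
class LineageDataset:
    """Toy stand-in for an RDD: remembers its parent and the function applied."""

    def __init__(self, source=None, parent=None, func=None):
        self.source = source      # base data (plays the role of the HDFS file)
        self.parent = parent      # dataset this one was derived from
        self.func = func          # how to derive it from the parent
        self.materialized = None  # computed result, which may be "lost"

    def filter(self, pred):
        return LineageDataset(parent=self, func=lambda rows: [r for r in rows if pred(r)])

    def map(self, fn):
        return LineageDataset(parent=self, func=lambda rows: [fn(r) for r in rows])

    def collect(self):
        if self.materialized is None:
            # Recompute from lineage: go back to the parent and reapply func
            rows = self.source if self.parent is None else self.func(self.parent.collect())
            self.materialized = list(rows)
        return self.materialized


log = LineageDataset(source=["ERROR\tdb\tdisk full", "INFO\tweb\tok", "ERROR\tweb\ttimeout"])
msgs = log.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])

first = list(msgs.collect())
msgs.materialized = None    # simulate losing the computed data
recovered = msgs.collect()  # rebuilt by replaying the recorded map step
```

Because only the recipe (parent plus function) has to survive, nothing needs to be replicated up front to make recovery possible.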
  • 30. ® © 2014 MapR Technologies 30 RDD Persistence / Caching •  Variety of storage levels –  memory_only (default), memory_and_disk, etc. •  API Calls –  persist(StorageLevel) –  cache() – shorthand for persist(StorageLevel.MEMORY_ONLY) •  Considerations –  Read from disk vs. recompute (memory_and_disk) –  Total memory storage size (memory_only_ser) –  Replicate to second node for faster fault recovery (memory_only_2) •  Think about this option if supporting a time sensitive client http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
  • 31. ® © 2014 MapR Technologies 31 PageRank Performance [Chart: iteration time (s) vs. number of machines, Hadoop vs. Spark. 30 machines: Hadoop 171, Spark 23. 60 machines: Hadoop 80, Spark 14.]
  • 32. ® © 2014 MapR Technologies 32 Other Iterative Algorithms [Charts: time per iteration (s), Hadoop vs. Spark. Logistic Regression: Hadoop 110, Spark 0.96. K-Means Clustering: Hadoop 155, Spark 4.1.]
  • 33. ® © 2014 MapR Technologies 33 Fast: Scaling Down [Chart: execution time (s) vs. % of working set in cache. Cache disabled: 69; 25%: 58; 50%: 41; 75%: 30; fully cached: 12.]
  • 34. ® © 2014 MapR Technologies 34 Comparison to Storm •  Higher throughput than Storm –  Spark Streaming: 670k records/sec/node –  Storm: 115k records/sec/node –  Commercial systems: 100-500k records/sec/node [Charts: throughput per node (MB/s) vs. record size (100 and 1000 bytes) for WordCount and Grep, Spark vs. Storm]
  • 35. ® © 2014 MapR Technologies 35© 2014 MapR Technologies ® How Spark Works
  • 36. ® © 2014 MapR Technologies 36 Working With RDDs
  • 37. ® © 2014 MapR Technologies 37 Working With RDDs
      textFile = sc.textFile("SomeFile.txt")
    [Diagram: RDD]
  • 38. ® © 2014 MapR Technologies 38 Working With RDDs
      textFile = sc.textFile("SomeFile.txt")
      linesWithSpark = textFile.filter(lambda line: "Spark" in line)
    [Diagram: RDD → Transformations → RDD]
  • 39. ® © 2014 MapR Technologies 39 Working With RDDs
      textFile = sc.textFile("SomeFile.txt")
      linesWithSpark = textFile.filter(lambda line: "Spark" in line)
      linesWithSpark.count()
      74
      linesWithSpark.first()
      # Apache Spark
    [Diagram: RDD → Transformations → RDD → Action → Value]
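The split between transformations (which only describe a new dataset) and actions (which actually produce a value) can be modelled in a few lines of plain Python. `LazyLines` is a hypothetical toy class, not the Spark API, and the three sample lines are made up for illustration.

```python
class LazyLines:
    """Toy model of lazy transformations; not the Spark API."""

    def __init__(self, make_iter, log):
        self._make_iter = make_iter  # callable producing a fresh iterator
        self._log = log              # records when data is actually scanned

    def filter(self, pred):
        # Transformation: returns a new dataset, touches no data yet
        return LazyLines(lambda: (x for x in self._make_iter() if pred(x)), self._log)

    def count(self):
        # Action: forces evaluation of the whole chain
        self._log.append("scan")
        return sum(1 for _ in self._make_iter())

    def first(self):
        # Action: evaluates only until the first element is found
        self._log.append("scan")
        return next(self._make_iter())


scans = []
text_file = LazyLines(lambda: iter(["# Apache Spark", "Spark is fast", "Hadoop"]), scans)
lines_with_spark = text_file.filter(lambda line: "Spark" in line)
scans_before_action = len(scans)  # nothing has been scanned yet

n = lines_with_spark.count()
head = lines_with_spark.first()
```

Each action here rescans the source, which is the behaviour that caching is meant to avoid.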
  • 40. ® © 2014 MapR Technologies 40© 2014 MapR Technologies ® Example: Log Mining
  • 41. ® © 2014 MapR Technologies 41 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns
  • 42. ® © 2014 MapR Technologies 42 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns [Diagram: Driver and three Workers]
  • 43. ® © 2014 MapR Technologies 43 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") [Diagram: Driver and three Workers]
  • 44. ® © 2014 MapR Technologies 44 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") # Base RDD [Diagram: Driver and three Workers]
  • 45. ® © 2014 MapR Technologies 45 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) [Diagram: Driver and three Workers]
  • 46. ® © 2014 MapR Technologies 46 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) # Transformed RDD [Diagram: Driver and three Workers]
  • 47. ® © 2014 MapR Technologies 47 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() messages.filter(lambda s: "mysql" in s).count() [Diagram: Driver and three Workers]
  • 48. ® © 2014 MapR Technologies 48 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() messages.filter(lambda s: "mysql" in s).count() # Action [Diagram: Driver and three Workers]
  • 49. ® © 2014 MapR Technologies 49 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() messages.filter(lambda s: "mysql" in s).count() [Diagram: Driver and three Workers holding Block 1, Block 2, Block 3]
  • 50. ® © 2014 MapR Technologies 50 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() messages.filter(lambda s: "mysql" in s).count() [Diagram: Driver sends tasks to the three Workers (Block 1, Block 2, Block 3)]
  • 51. ® © 2014 MapR Technologies 51 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() messages.filter(lambda s: "mysql" in s).count() [Diagram: each Worker reads its HDFS block (Block 1, Block 2, Block 3)]
  • 52. ® © 2014 MapR Technologies 52 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() messages.filter(lambda s: "mysql" in s).count() [Diagram: each Worker processes and caches its data (Block 1–3 into Cache 1–3)]
  • 53. ® © 2014 MapR Technologies 53 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() messages.filter(lambda s: "mysql" in s).count() [Diagram: the three Workers (Block 1–3, Cache 1–3) return results to the Driver]
  • 54. ® © 2014 MapR Technologies 54 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() messages.filter(lambda s: "mysql" in s).count() messages.filter(lambda s: "php" in s).count() [Diagram: Driver and three Workers (Block 1–3, Cache 1–3)]
  • 55. ® © 2014 MapR Technologies 55 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() messages.filter(lambda s: "mysql" in s).count() messages.filter(lambda s: "php" in s).count() [Diagram: Driver sends tasks to the three Workers (Block 1–3, Cache 1–3)]
  • 56. ® © 2014 MapR Technologies 56 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() messages.filter(lambda s: "mysql" in s).count() messages.filter(lambda s: "php" in s).count() [Diagram: each Worker processes from its cache (Cache 1, Cache 2, Cache 3)]
  • 57. ® © 2014 MapR Technologies 57 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() messages.filter(lambda s: "mysql" in s).count() messages.filter(lambda s: "php" in s).count() [Diagram: the three Workers (Block 1–3, Cache 1–3) return results to the Driver]
  • 58. ® © 2014 MapR Technologies 58 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() messages.filter(lambda s: "mysql" in s).count() messages.filter(lambda s: "php" in s).count() [Diagram: Driver and three Workers (Block 1–3, Cache 1–3)] Cache your data → Faster Results Full-text search of Wikipedia •  60GB on 20 EC2 machines •  0.5 sec from cache vs. 20s for on-disk
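The effect walked through on the preceding slides, where the first query pays for loading and later queries are served from memory, can be imitated locally. The sketch below is an assumption-laden toy: `CachedMessages`, `load_error_messages`, and the sample log lines are all hypothetical, and the cache is a single in-process list, not distributed worker memory.

```python
class CachedMessages:
    """Toy illustration of cache(): load once, then answer queries from memory."""

    def __init__(self, load):
        self._load = load   # expensive loader (stands in for reading HDFS blocks)
        self._cache = None
        self.loads = 0      # how many times the loader actually ran

    def _rows(self):
        if self._cache is None:
            self.loads += 1
            self._cache = list(self._load())
        return self._cache

    def count_containing(self, needle):
        return sum(1 for row in self._rows() if needle in row)


def load_error_messages():
    # Hypothetical tab-separated log lines; the third field is the message
    raw = [
        "ERROR\thost1\tmysql connection refused",
        "INFO\thost2\trequest served",
        "ERROR\thost3\tphp fatal error",
        "ERROR\thost1\tmysql timeout",
    ]
    errors = (line for line in raw if line.startswith("ERROR"))
    return (line.split("\t")[2] for line in errors)


messages = CachedMessages(load_error_messages)
mysql_hits = messages.count_containing("mysql")  # first query: loads and caches
php_hits = messages.count_containing("php")      # second query: served from cache
```

The `loads` counter stays at 1 after both queries, which mirrors the "Process from Cache" step on the slides.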
  • 59. ® © 2014 MapR Technologies 59© 2014 MapR Technologies ® Example: Page Rank
  • 60. ® © 2014 MapR Technologies 60 Example: PageRank •  Good example of a more complex algorithm –  Multiple stages of map & reduce •  Benefits from Spark’s in-memory caching –  Multiple iterations over the same data
  • 61. ® © 2014 MapR Technologies 61 Basic Idea Give pages ranks (scores) based on links to them •  Links from many pages → high rank •  Link from a high-rank page → high rank Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
  • 62. ® © 2014 MapR Technologies 62 Algorithm 1.  Start each page at a rank of 1 2.  On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors 3.  Set each page’s rank to 0.15 + 0.85 × contribs [Graph: four pages, ranks 1.0, 1.0, 1.0, 1.0]
  • 63. ® © 2014 MapR Technologies 63 Algorithm 1.  Start each page at a rank of 1 2.  On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors 3.  Set each page’s rank to 0.15 + 0.85 × contribs [Graph: ranks 1.0, 1.0, 1.0, 1.0; contributions along the links: 1, 0.5, 0.5, 0.5, 1, 0.5]
  • 64. ® © 2014 MapR Technologies 64 Algorithm 1.  Start each page at a rank of 1 2.  On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors 3.  Set each page’s rank to 0.15 + 0.85 × contribs [Graph: ranks 0.58, 1.0, 1.85, 0.58]
  • 65. ® © 2014 MapR Technologies 65 Algorithm 1.  Start each page at a rank of 1 2.  On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors 3.  Set each page’s rank to 0.15 + 0.85 × contribs [Graph: ranks 0.58, 1.0, 1.85, 0.58; contributions along the links: 0.29, 0.29, 0.5, 0.5, 1.85, 0.58]
  • 66. ® © 2014 MapR Technologies 66 Algorithm 1.  Start each page at a rank of 1 2.  On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors 3.  Set each page’s rank to 0.15 + 0.85 × contribs [Graph: ranks 0.39, 1.72, 1.31, 0.58 . . .]
  • 67. ® © 2014 MapR Technologies 67 Algorithm 1.  Start each page at a rank of 1 2.  On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors 3.  Set each page’s rank to 0.15 + 0.85 × contribs [Graph, final state: ranks 0.46, 1.37, 1.44, 0.73]
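The three steps above can be sketched in plain Python on a small graph. The link structure below is an assumption: the slides do not list the edges, so a four-page graph (hypothetical names A to D) was chosen whose iterates are consistent with the values shown in the walkthrough. "contribs" is taken to mean the sum of contributions a page receives.

```python
# Hypothetical four-page graph, chosen to be consistent with the slide walkthrough.
links = {
    "A": ["C", "D"],
    "B": ["A", "C"],
    "C": ["B"],
    "D": ["C"],
}


def pagerank_step(links, ranks):
    """One iteration: spread rank over out-links, then apply 0.15 + 0.85 * sum."""
    contribs = {page: 0.0 for page in links}
    for page, neighbors in links.items():
        share = ranks[page] / len(neighbors)  # step 2: rank_p / |neighbors_p|
        for dest in neighbors:
            contribs[dest] += share
    # step 3: new rank from the summed contributions
    return {page: 0.15 + 0.85 * total for page, total in contribs.items()}


def pagerank(links, iterations):
    ranks = {page: 1.0 for page in links}  # step 1: every page starts at 1
    for _ in range(iterations):
        ranks = pagerank_step(links, ranks)
    return ranks


after_one = pagerank(links, 1)
after_two = pagerank(links, 2)
final = pagerank(links, 100)
```

On this graph one iteration gives roughly 0.58, 1.0, 1.85, 0.58 for A, B, C, D, and the ranks settle near 0.73, 1.37, 1.44, 0.46, the same set of values as the final-state slide.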
  • 68. ® © 2014 MapR Technologies 68 Scala Implementation
      val links = // load RDD of (url, neighbors) pairs
      var ranks = // give each url rank of 1.0
      for (i <- 1 to ITERATIONS) {
        val contribs = links.join(ranks).values.flatMap {
          case (urls, rank) =>
            urls.map(dest => (dest, rank / urls.size))
        }
        ranks = contribs.reduceByKey(_ + _)
                        .mapValues(0.15 + 0.85 * _)
      }
      ranks.saveAsTextFile(...)
    https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPageRank.scala
  • 69. ® © 2014 MapR Technologies 69© 2014 MapR Technologies ® Spark and More
  • 70. ® © 2014 MapR Technologies 70 Easy: Unified Platform Spark SQL (SQL) Spark Streaming (Streaming) MLLib (Machine learning) Spark (General execution engine) GraphX (Graph computation) Continued innovation bringing new functionality, e.g.,: •  BlinkDB (Approximate Queries) •  SparkR (R wrapper for Spark) •  Tachyon (off-heap RDD caching)
  • 71. ® © 2014 MapR Technologies 71 Spark on MapR •  Certified Spark Distribution •  Fully supported and packaged by MapR in partnership with Databricks –  mapr-spark package with Spark, Shark, Spark Streaming today –  Spark-python, GraphX and MLLib soon •  YARN integration –  Spark can then allocate resources from cluster when needed
  • 72. ® © 2014 MapR Technologies 72 References •  Based on slides from Pat McDonough at •  Spark web site: http://spark.apache.org/ •  Spark on MapR: –  http://www.mapr.com/products/apache-spark –  http://doc.mapr.com/display/MapR/Installing+Spark +and+Shark
  • 73. ® © 2014 MapR Technologies 73 Q&A @mapr maprtech kbotzum@mapr.com Engage with us! MapR maprtech mapr-technologies