3. 3
o Big data analytics / machine learning
o Offices in Seattle and Timisoara
o 5+ years with Hadoop ecosystem
o 1 year with Spark
4. 4
• “fast and general engine for large-scale data
processing”
• open sourced
• API for Java/Scala/Python (80 operators)
• not bound to the map-reduce paradigm
• powers a stack of high-level tools including
Spark SQL, MLlib, Spark Streaming
Apache Spark
5. 5
• Main entry point to Spark
• SparkConf: spark.app.name, spark.master, spark.serializer,
spark.cores.max, spark.task.cpus
SparkContext
val sc = new SparkContext("url", "name", "sparkHome", Seq("app.jar"))
// "url":       cluster URL, or local / local[N]
// "name":      app name
// "sparkHome": Spark install path on cluster
// Seq(...):    list of JARs with app code (to ship)
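The same context can also be built from a SparkConf; a minimal sketch (the local[2] master and the config values are illustrative, not from the slides — each setting corresponds to a property named above):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("name")                    // spark.app.name
  .setMaster("local[2]")                 // spark.master: cluster URL, or local / local[N]
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.cores.max", "4")           // cap on total cores for the app
  .set("spark.task.cpus", "1")           // cores reserved per task
  .setJars(Seq("app.jar"))               // app code to ship to executors

val sc = new SparkContext(conf)
```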
6. 6
Resilient Distributed Dataset
• Immutable collection of elements partitioned
across the cluster, stored in RAM or on Disk
• Built through parallel transformations
• Automatically rebuilt on failure (lineage)
Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
Key Concept: RDDs
7. 7
Parallelize collection into an RDD
> sc.parallelize(List(1, 2, 3))
Load text file from local FS, HDFS, or S3
> sc.textFile("test.txt")
> sc.textFile("textDir/*.txt")
> sc.textFile("hdfs://...")
Use existing Hadoop InputFormat (Java/Scala only)
> sc.hadoopFile(keyClass, valClass, inputFmt, conf)
Creating RDDs
8. 8
> nums = sc.parallelize(List(1, 2, 3))
Pass each element through a function
> squares = nums.map(x => x * x) // {1, 4, 9}
Keep elements passing a predicate
> even = squares.filter(x => x % 2 == 0) // {4}
Retrieve RDD contents as a local collection
> nums.collect() // => [1, 2, 3]
Return first K elements
> nums.take(2) // => [1, 2]
Count number of elements
> nums.count() // => 3
Basic Transformations
Basic Actions
10. 10
RDD Fault Tolerance
• RDDs maintain lineage information that can be used
to reconstruct lost partitions
• Ex: cachedMsgs = textFile(...).filter(_.contains("error"))
.map(_.split("\t")(2))
.cache()
Lineage chain: HdfsRDD (path: hdfs://...) -> FilteredRDD (func: contains(...)) -> MappedRDD (func: split(...)) -> CachedRDD
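The chain above can be inspected at the REPL; a sketch (path and identifiers illustrative) using toDebugString, which prints the lineage Spark would replay to rebuild a lost partition:

```scala
val cachedMsgs = sc.textFile("hdfs://...")   // HdfsRDD
  .filter(_.contains("error"))               // FilteredRDD
  .map(_.split("\t")(2))                     // MappedRDD
  .cache()                                   // CachedRDD once materialized

// Prints the recursive parent chain of this RDD
println(cachedMsgs.toDebugString)
```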
11. 11
Example: Log Mining
• Load error messages from a log into memory,
then interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split("\t")(2))
cachedMsgs = messages.cache()
cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .
[Diagram: the driver ships tasks to three workers; each worker reads one HDFS block (Block 1-3), keeps a cache partition (Cache 1-3), and returns results to the driver. Base RDD -> transformed RDD -> cached RDD; each query runs as a parallel operation.]
15. 15
Spark tuning
• Level of Parallelism
o number of partitions
o for "reduce" operations, defaults to the largest parent RDD's number of
partitions
o spark.default.parallelism
• Memory Usage of Reduce Tasks
• Broadcasting Large Variables
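A sketch of the knobs above (the numbers, path, and lookup table are illustrative):

```scala
// Default number of partitions for shuffles when none is given explicitly
val conf = new SparkConf().set("spark.default.parallelism", "64")

val pairs = sc.textFile("hdfs://...").map(line => (line, 1))

// Level of parallelism: override the partition count inherited from the parent RDD
val counts = pairs.reduceByKey(_ + _, 128)

// Broadcasting a large variable: shipped to each node once, not once per task
val weights = sc.broadcast(Map("ERROR" -> 3, "WARN" -> 2))
val scored = counts.map { case (word, n) => (word, n * weights.value.getOrElse(word, 1)) }
```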
17. 17
Spark vs Hadoop (word count)
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
18. 18
val sc = new SparkContext("spark://...", "MyJob", home, jars)
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
LOC: 6 (11) vs 35
Spark vs Hadoop (word count)
19. 19
• Many more operators than map-reduce
• Hadoop has a bigger and older community
• They happily coexist
Spark vs Hadoop
20. 20
• Shark modified the Hive backend to run over
Spark, but had drawbacks:
Limited integration with Spark programs
Hive optimizer not designed for Spark
Spark SQL (alpha)
21. 21
• Spark SQL reuses the best parts of Shark
Hive data loading
In-memory column store
Spark SQL (alpha)
22. 22
• Adds
Support for multiple input formats
Rich language interfaces
RDD-aware optimizer
Spark SQL (alpha)
24. 24
Create SQL Context
val sqlContext = new SQLContext(sparkContext)
Create people RDD and register table
case class Person(name: String, age: Int, height: Double)
import sqlContext.createSchemaRDD // implicit RDD -> SchemaRDD conversion
val people = sparkContext
.textFile("examples/src/main/resources/people.txt").map(_.split(","))
.map(p => Person(p(0), p(1).trim.toInt, p(2).trim.toDouble))
people.registerTempTable("people")
Query table
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 23")
Spark SQL (alpha)
Radu,24,1.70
Andrei,23,1.88
25. 25
• Running time improvement: 4x - 30x
• Bucketing
• Bucket Joins
• Skew Joins
• Partial DAG Execution
SparkSQL vs Hive
26. 26
• 0.8.0 - first POC … lots of OOM
• 0.8.1 – first production deployment, still lots of OOM
20 billion healthcare records, 200 TB of compressed hdfs data
Hadoop MR: 100 m1.xlarge (4c x 15GB)
BDAS: 20 cc2.8xlarge (32c x 60.8 GB), still lots of OOM map & reducer side
Perf gains of 4x to 40x, required individual dataset and query fine-tuning
Mixed Hive & Shark workloads where it made sense
Daily processing reduced from 14 hours to 1.5 hours!
• 0.9.0 - fixed many of the problems, but still required patches! Spilling on the
reducer side fixed (less OOM)
• 1.0.2 – in production today
• 1.1 upgrade in progress
Spark 0.8.0 to 1.1
27. 27
• cluster resource manager
• Multi-resource scheduling (memory, CPU, disk, and
ports)
• Scalability to 10,000s of nodes
• Fault-tolerant replicated master and slaves using
ZooKeeper
Mesos (0.20)
28. 28
• memory-centric distributed file system enabling
reliable file sharing at memory-speed across cluster
frameworks
• Pluggable under-layer file system: HDFS, S3, local file
system, …
Tachyon (v0.5)
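From Spark, Tachyon could back the OFF_HEAP storage level; a hedged sketch (the spark.tachyonStore.url config key is from the Spark 1.x era and its pairing with this Tachyon version is an assumption, as are the host and path):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .set("spark.tachyonStore.url", "tachyon://master:19998") // assumed 1.x-era key
val sc = new SparkContext(conf)

// Blocks are stored serialized in Tachyon, outside the executor JVM heap
sc.textFile("hdfs://...").persist(StorageLevel.OFF_HEAP)
```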
29. 29
• Java-like File API / FileSystem API
• Configurable block size
• Memory management
Tachyon (v0.5)
30. 30
• Jaws, the xPatterns HTTP Spark SQL server!
http://github.com/Atigeo/http-spark-sql-server
Backward compatible with Shark
Backend in spray.io (REST on Akka)
• Spark Job Server
multiple Spark contexts in same JVM, job submission in Java + Scala
https://github.com/Atigeo/spark-job-rest
• Mesos framework starvation bug
• *SchedulerBackend update due to race conditions, Spark 0.9.0
patches
Community contribution
31. 31
• Read the papers
• Fine-tuning can really boost your running time
• When using Spark, don't think in map-reduce terms
Lessons learned
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Lineage: logging the transformations used to build a dataset
HDFS: one block per partition
Storage: serialized in memory (Tachyon) / deserialized in the JVM / on disk (HDFS)
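The storage options in the last line map onto RDD.persist with a StorageLevel; a minimal sketch (paths illustrative):

```scala
import org.apache.spark.storage.StorageLevel

val deser = sc.textFile("hdfs://...").persist(StorageLevel.MEMORY_ONLY)     // deserialized objects in the JVM
val ser   = sc.textFile("hdfs://...").persist(StorageLevel.MEMORY_ONLY_SER) // serialized in memory, more compact
val disk  = sc.textFile("hdfs://...").persist(StorageLevel.DISK_ONLY)       // spilled to disk
// OFF_HEAP (Tachyon-backed in this Spark era) keeps serialized blocks outside the JVM
```

Note that a given RDD's storage level can only be set once; hence the separate vals above.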