Why Spark Is
the Next Top
(Compute)
Model
@deanwampler
Scala Days Amsterdam 2015
Wednesday, June 10, 15
Copyright (c) 2014-2015, Dean Wampler, All Rights Reserved, unless otherwise noted.
Image: Detail of the London Eye
dean.wampler@typesafe.com
polyglotprogramming.com/talks
@deanwampler
“Trolling the
Hadoop community
since 2012...”
Wednesday, June 10, 15
About me. You can find this presentation and others on Big Data and Scala at polyglotprogramming.com.
Programming Scala, 2nd Edition is forthcoming.
photo: Dusk at 30,000 ft above the Central Plains of the U.S. on a Winter’s Day.
3
Wednesday, June 10, 15
This page provides more information, as well as results of a recent survey of Spark usage, blog posts and webinars about the world of Spark.
typesafe.com/reactive-big-data
4
Wednesday, June 10, 15
This page provides more information, as well as results of a recent survey of Spark usage, blog posts and webinars about the world of Spark.
Hadoop
circa 2013
Wednesday, June 10, 15
The state of Hadoop as of last year.
Image: Detail of the London Eye
Hadoop v2.X Cluster
[Cluster diagram: three worker nodes, each with several disks, a Node Manager, and a Data Node; one master node running the Resource Manager and a Name Node; a second master node running another Name Node]
Wednesday, June 10, 15
Schematic view of a Hadoop 2 cluster. For a more precise definition of the services and what they do, see e.g., http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html. We aren't interested in great details at this point, but we'll call out a few useful things to know.
Hadoop v2.X Cluster
Resource and Node Managers
[Same cluster diagram, with the Resource Manager and the per-node Node Managers highlighted]
Wednesday, June 10, 15
Hadoop 2 uses YARN to manage resources via the master Resource Manager, which includes a pluggable job scheduler and an Applications
Manager. It coordinates with the Node Manager on each node to schedule jobs and provide resources. Other services involved, including
application-specific Containers and Application Masters are not shown.
Hadoop v2.X Cluster
Name Node and Data Nodes
[Same cluster diagram, with the Name Nodes and the per-node Data Nodes highlighted]
Wednesday, June 10, 15
Hadoop 2 clusters federate the Name node services that manage the file system, HDFS. They provide horizontal scalability of file-system
operations and resiliency when service instances fail. The data node services manage individual blocks for files.
MapReduce
The classic
compute model
for Hadoop
Wednesday, June 10, 15
1 map step
+ 1 reduce step
(wash, rinse, repeat)
MapReduce
Wednesday, June 10, 15
You get 1 map step (although there is limited support for chaining mappers) and 1 reduce step. If you can’t implement an algorithm in these
two steps, you can chain jobs together, but you’ll pay a tax of flushing the entire data set to disk between these jobs.
Example:
Inverted Index
MapReduce
Wednesday, June 10, 15
The inverted index is a classic algorithm needed for building search engines.
Wednesday, June 10, 15
Before running MapReduce, crawl teh interwebs, find all the pages, and build a data set of URLs -> doc contents, written to flat files in HDFS
or one of the more “sophisticated” formats.
Wednesday, June 10, 15
Now we’re running MapReduce. In the map step, a task (JVM process) per file *block* (64MB or larger) reads the rows, tokenizes the text and
outputs key-value pairs (“tuples”)...
Map Task
(hadoop,(wikipedia.org/hadoop,1))
(mapreduce,(wikipedia.org/hadoop,1))
(hdfs,(wikipedia.org/hadoop, 1))
(provides,(wikipedia.org/hadoop,1))
(and,(wikipedia.org/hadoop,1))
Wednesday, June 10, 15
... the keys are the words found and the values are themselves tuples: the URL and the count of the word in that document. In our simplified example, there is typically only a single occurrence of each word in each document. Real occurrence counts are interesting because if a word is mentioned a lot in a document, the chances are higher that you would want to find that document in a search for that word.
Wednesday, June 10, 15
Wednesday, June 10, 15
The output tuples are sorted by key locally in each map task, then “shuffled” over the cluster network to reduce tasks (each a JVM process,
too), where we want all occurrences of a given key to land on the same reduce task.
Wednesday, June 10, 15
Finally, each reducer just aggregates all the values it receives for each key, then writes out new files to HDFS with the words and a list of
(URL-count) tuples (pairs).
Altogether...
Wednesday, June 10, 15
Finally, each reducer just aggregates all the values it receives for each key, then writes out new files to HDFS with the words and a list of
(URL-count) tuples (pairs).
What’s
not to like?
Wednesday, June 10, 15
This seems okay, right? What’s wrong with it?
Restrictive model
makes most
algorithms hard to
implement.
Awkward
Wednesday, June 10, 15
Writing MapReduce jobs requires arcane, specialized skills that few master. For a good overview, see http://lintool.github.io/
MapReduceAlgorithms/.
Lack of flexibility
inhibits optimizations.
Awkward
Wednesday, June 10, 15
The inflexible compute model leads to complex code, where hacks are used to work around its limitations in order to improve performance. Hence, optimizations are hard to implement. The Spark team has commented on this; see http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html
Full dump of
intermediate data
to disk between jobs.
Performance
Wednesday, June 10, 15
Sequencing jobs wouldn’t be so bad if the “system” was smart enough to cache data in memory. Instead, each job dumps everything to disk,
then the next job reads it back in again. This makes iterative algorithms particularly painful.
MapReduce only
supports
“batch mode”
Streaming
Wednesday, June 10, 15
Processing data streams as soon as the data arrives has become very important. MapReduce can't do this, due to its coarse-grained nature and relative inefficiency, so alternatives have to be used.
Enter
Spark
spark.apache.org
Wednesday, June 10, 15
Can be run in:
•YARN (Hadoop 2)
•Mesos (Cluster management)
•EC2
•Standalone mode
•Cassandra, Riak, ...
•...
Cluster Computing
Wednesday, June 10, 15
If you have a Hadoop cluster, you can run Spark as a seamless compute engine on YARN. (You can also use pre-YARN Hadoop v1 clusters, but there you have to manually allocate resources to the embedded Spark cluster vs. Hadoop.) Mesos is a general-purpose cluster resource manager that can also be used to manage Hadoop resources. Scripts for running a Spark cluster in EC2 are available. Standalone just means you run Spark's built-in support for clustering (or run locally on a single box, e.g., for development). EC2 deployments are usually standalone... Finally, database vendors like Datastax are integrating Spark.
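As a hedged sketch of my own (not from the slides), the deployment choice mostly comes down to the master setting you give Spark; these master URL forms are the standard ones for Spark 1.x, and the host names and app name are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical app name; only the master URL changes per environment.
val conf = new SparkConf().setAppName("InvertedIndex")

conf.setMaster("local[*]")                 // all cores on one machine (development)
// conf.setMaster("spark://host:7077")     // standalone Spark cluster
// conf.setMaster("mesos://host:5050")     // Mesos cluster
// conf.setMaster("yarn-client")           // Hadoop YARN (Spark 1.x syntax)

val sc = new SparkContext(conf)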
Fine-grained
operators for
composing
algorithms.
Compute Model
Wednesday, June 10, 15
Once you learn the core set of primitives, it’s easy to compose non-trivial algorithms with little
code.
RDD:
Resilient,
Distributed
Dataset
Compute Model
Wednesday, June 10, 15
RDDs shard the data over a cluster, like a virtualized, distributed collection (analogous to HDFS). They support intelligent caching, which
means no naive flushes of massive datasets to disk. This feature alone allows Spark jobs to run 10-100x faster than comparable MapReduce
jobs! The “resilient” part means they will reconstitute shards lost due to process/server crashes.
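To make the caching point concrete, here is a small sketch of my own (not from the deck), assuming the SparkContext sc and the data/crawl input used later in the talk:

// Load once, mark the RDD for in-memory caching, then reuse it.
val lines = sc.textFile("data/crawl").cache()

val total   = lines.count()                                // first action reads from disk and caches
val hadoops = lines.filter(_.contains("hadoop")).count()   // later actions are served from memory
val sparks  = lines.filter(_.contains("spark")).count()

Every action after the first reuses the cached shards instead of rereading HDFS, which is where much of the speedup over MapReduce comes from.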
Compute Model
Wednesday, June 10, 15
Written in Scala,
with Java, Python,
and R APIs.
Compute Model
Wednesday, June 10, 15
Once you learn the core set of primitives, it’s easy to compose non-trivial algorithms with little
code.
Inverted Index
in Java MapReduce
Wednesday, June 10, 15
Let's see an actual implementation of the inverted index. First, a Hadoop MapReduce (Java) version, adapted from https://developer.yahoo.com/hadoop/tutorial/module4.html#solution. It's about 90 lines of code, but I reformatted it to fit better.
This is also a slightly simpler version than the one I diagrammed. It doesn't record a count of each word in a document; it just writes (word, doc-title) pairs out of the mappers, and the final (word, list) output by the reducers just has a list of documents, hence repeats.
A second job would be necessary to count the repeats.
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class LineIndexer {
public static void main(String[] args) {
JobClient client = new JobClient();
JobConf conf =
new JobConf(LineIndexer.class);
conf.setJobName("LineIndexer");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf,
new Path("input"));
Wednesday, June 10, 15
I'm not going to explain this in much detail. I used yellow for method calls, because methods do the real work!! But notice that the functions in this code don't really do a whole lot, so there's low information density and you do a lot of bit twiddling.
JobClient client = new JobClient();
JobConf conf =
new JobConf(LineIndexer.class);
conf.setJobName("LineIndexer");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf,
new Path("input"));
FileOutputFormat.setOutputPath(conf,
new Path("output"));
conf.setMapperClass(
LineIndexMapper.class);
conf.setReducerClass(
LineIndexReducer.class);
client.setConf(conf);
try {
JobClient.runJob(conf);
Wednesday, June 10, 15
boilerplate...
LineIndexMapper.class);
conf.setReducerClass(
LineIndexReducer.class);
client.setConf(conf);
try {
JobClient.runJob(conf);
} catch (Exception e) {
e.printStackTrace();
}
}
public static class LineIndexMapper
extends MapReduceBase
implements Mapper<LongWritable, Text,
Text, Text> {
private final static Text word =
new Text();
private final static Text location =
Wednesday, June 10, 15
main ends with a try-catch clause to run the
job.
public static class LineIndexMapper
extends MapReduceBase
implements Mapper<LongWritable, Text,
Text, Text> {
private final static Text word =
new Text();
private final static Text location =
new Text();
public void map(
LongWritable key, Text val,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
FileSplit fileSplit =
(FileSplit)reporter.getInputSplit();
String fileName =
fileSplit.getPath().getName();
location.set(fileName);
Wednesday, June 10, 15
This is the LineIndexMapper class for the mapper. The map method does the real work of tokenization and writing the (word, document-name)
tuples.
FileSplit fileSplit =
(FileSplit)reporter.getInputSplit();
String fileName =
fileSplit.getPath().getName();
location.set(fileName);
String line = val.toString();
StringTokenizer itr = new
StringTokenizer(line.toLowerCase());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, location);
}
}
}
public static class LineIndexReducer
extends MapReduceBase
implements Reducer<Text, Text,
Wednesday, June 10, 15
The rest of the LineIndexMapper class and map
method.
public static class LineIndexReducer
extends MapReduceBase
implements Reducer<Text, Text,
Text, Text> {
public void reduce(Text key,
Iterator<Text> values,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
boolean first = true;
StringBuilder toReturn =
new StringBuilder();
while (values.hasNext()) {
if (!first)
toReturn.append(", ");
first=false;
toReturn.append(
values.next().toString());
}
output.collect(key,
new Text(toReturn.toString()));
Wednesday, June 10, 15
The reducer class, LineIndexReducer, with the reduce method that is called for each key and a list of values for that key. The reducer is
stupid; it just reformats the values collection into a long string and writes the final (word,list-string) output.
boolean first = true;
StringBuilder toReturn =
new StringBuilder();
while (values.hasNext()) {
if (!first)
toReturn.append(", ");
first=false;
toReturn.append(
values.next().toString());
}
output.collect(key,
new Text(toReturn.toString()));
}
}
}
Wednesday, June 10, 15
EOF
Altogether
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class LineIndexer {
public static void main(String[] args) {
JobClient client = new JobClient();
JobConf conf =
new JobConf(LineIndexer.class);
conf.setJobName("LineIndexer");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf,
new Path("input"));
FileOutputFormat.setOutputPath(conf,
new Path("output"));
conf.setMapperClass(
LineIndexMapper.class);
conf.setReducerClass(
LineIndexReducer.class);
client.setConf(conf);
try {
JobClient.runJob(conf);
} catch (Exception e) {
e.printStackTrace();
}
}
public static class LineIndexMapper
extends MapReduceBase
implements Mapper<LongWritable, Text,
Text, Text> {
private final static Text word =
new Text();
private final static Text location =
new Text();
public void map(
LongWritable key, Text val,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
FileSplit fileSplit =
(FileSplit)reporter.getInputSplit();
String fileName =
fileSplit.getPath().getName();
location.set(fileName);
String line = val.toString();
StringTokenizer itr = new
StringTokenizer(line.toLowerCase());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, location);
}
}
}
public static class LineIndexReducer
extends MapReduceBase
implements Reducer<Text, Text,
Text, Text> {
public void reduce(Text key,
Iterator<Text> values,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
boolean first = true;
StringBuilder toReturn =
new StringBuilder();
while (values.hasNext()) {
if (!first)
toReturn.append(", ");
first=false;
toReturn.append(
values.next().toString());
}
output.collect(key,
new Text(toReturn.toString()));
}
}
}
Wednesday, June 10, 15
The whole shebang (6pt. font)
Inverted Index
in Spark
(Scala).
Wednesday, June 10, 15
This code is approximately 45 lines, but it does more than the previous Java example: it implements the original inverted index algorithm I diagrammed, where word counts are computed and included in the data.
import
org.apache.spark.SparkContext
import
org.apache.spark.SparkContext._
object InvertedIndex {
def main(a: Array[String]) = {
val sc = new SparkContext(
"local[*]", "Inverted Idx")
sc.textFile("data/crawl")
.map { line =>
Wednesday, June 10, 15
The InvertedIndex implemented in Spark. This time, we'll also count the occurrences in each document (as I originally described the algorithm) and sort the (url, N) pairs descending by N (count) and ascending by URL.
import
org.apache.spark.SparkContext
import
org.apache.spark.SparkContext._
object InvertedIndex {
def main(a: Array[String]) = {
val sc = new SparkContext(
"local[*]", "Inverted Idx")
sc.textFile("data/crawl")
.map { line =>
Wednesday, June 10, 15
It starts with imports, then declares a singleton object (a first-class concept in Scala) with a "main" routine (as in Java).
The methods are colored yellow again. Note how dense with meaning they are this time.
import
org.apache.spark.SparkContext
import
org.apache.spark.SparkContext._
object InvertedIndex {
def main(a: Array[String]) = {
val sc = new SparkContext(
"local[*]", "Inverted Idx")
sc.textFile("data/crawl")
.map { line =>
Wednesday, June 10, 15
You begin the workflow by declaring a SparkContext. We're running in "local[*]" mode in this case, meaning on a single machine, but using all available cores. Normally this argument would be a command-line parameter, so you can develop locally, then submit to a cluster, where "local" would be replaced by the appropriate cluster master URI.
sc.textFile("data/crawl")
.map { line =>
val Array(path, text) =
line.split("\t", 2)
(path, text)
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
Wednesday, June 10, 15
The rest of the program is a sequence of function calls, analogous to "pipes" we connect together to construct the data flow. Data will only start "flowing" when we ask for results.
We start by reading one or more text files from the directory "data/crawl". When running in Hadoop, if there are one or more Hadoop-style "part-NNNNN" files, Spark will process all of them (they will be processed synchronously in "local" mode).
sc.textFile("data/crawl")
.map { line =>
val Array(path, text) =
line.split("\t", 2)
(path, text)
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
Wednesday, June 10, 15
sc.textFile returns an RDD with a string for each line of input text. So, the first thing we do is map over these strings to extract the original document id (i.e., the file name), followed by the text in the document, all on one line. We assume tab is the separator. "(array(0), array(1))" returns a two-element "tuple". Think of the output RDD as having the schema "fileName: String, text: String".
sc.textFile("data/crawl")
.map { line =>
val Array(path, text) =
line.split("\t", 2)
(path, text)
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
Wednesday, June 10, 15
flatMap maps over each of these 2-element tuples. We split the text into words on non-alphanumeric characters, then output collections of (word, path) pairs, where word is our ultimate, final "key". That is, each line (one thing) is converted to a collection of (word, path) pairs (0 to many things), but we don't want an output collection of nested collections, so flatMap concatenates the nested collections into one long "flat" collection of (word, path) pairs.
}
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
case (n1, n2) => n1 + n2
}
.map {
case ((w, p), n) => (w, (p, n))
}
.groupByKey
.mapValues { iter =>
iter.toSeq.sortBy {
case (path, n) => (-n, path)
}.mkString(", ")
((word1, path1), n1)
((word2, path2), n2)
...
Wednesday, June 10, 15
Next, we map over these pairs and add a single "seed" count of 1. Note the structure of the returned tuple; it's a two-tuple where the first element is itself a two-tuple holding (word, path). The following special method, reduceByKey, is like a groupBy: it groups over those (word, path) "keys" and uses the function to sum the integers. The popup shows what the output data looks like.
.reduceByKey {
case (n1, n2) => n1 + n2
}
.map {
case ((w, p), n) => (w, (p, n))
}
.groupByKey
.mapValues { iter =>
iter.toSeq.sortBy {
case (path, n) => (-n, path)
}.mkString(", ")
}
.saveAsTextFile("/path/out")
sc.stop()
}
(word1, (path1, n1))
(word2, (path2, n2))
...
Wednesday, June 10, 15
So, the input to the next map is now ((word, path), n), where n is now >= 1. We transform these tuples into the form we actually want, (word, (path, n)). I love how concise and elegant this code is!
.map {
case ((w, p), n) => (w, (p, n))
}
.groupByKey
.mapValues { iter =>
iter.toSeq.sortBy {
case (path, n) => (-n, path)
}.mkString(", ")
}
.saveAsTextFile("/path/out")
sc.stop()
}
}
(word1, iter(
(path11, n11), (path12, n12)...))
(word2, iter(
(path21, n21), (path22, n22)...))
...
Wednesday, June 10, 15
Now we do an explicit group by to bring all the same words together. The output will be (word, seq((path1, n1), (path2, n2), ...)).
.map {
case ((w, p), n) => (w, (p, n))
}
.groupByKey
.mapValues { iter =>
iter.toSeq.sortBy {
case (path, n) => (-n, path)
}.mkString(", ")
}
.saveAsTextFile("/path/out")
sc.stop()
}
}
Wednesday, June 10, 15
The last map over just the values (keeping the same keys) sorts by count descending and path ascending. (Sorting by path is mostly useful for reproducibility, e.g., in tests!)
.map {
case ((w, p), n) => (w, (p, n))
}
.groupByKey
.mapValues { iter =>
iter.toSeq.sortBy {
case (path, n) => (-n, path)
}.mkString(", ")
}
.saveAsTextFile("/path/out")
sc.stop()
}
}
Wednesday, June 10, 15
Finally, write back to the file system and stop the job.
Altogether
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object InvertedIndex {
def main(a: Array[String]) = {
val sc = new SparkContext(
"local[*]", "Inverted Idx")
sc.textFile("data/crawl")
.map { line =>
val Array(path, text) =
line.split("\t", 2)
(path, text)
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
case (n1, n2) => n1 + n2
}
.map {
case ((w, p), n) => (w, (p, n))
}
.groupByKey
.mapValues { iter =>
iter.toSeq.sortBy {
case (path, n) => (-n, path)
}.mkString(", ")
}
.saveAsTextFile("/path/to/out")
sc.stop()
}
}
Wednesday, June 10, 15
The whole shebang (14pt. font, this time)
}
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
case (n1, n2) => n1 + n2
}
.map {
case ((w, p), n) => (w, (p, n))
}
.groupByKey
.mapValues { iter =>
iter.toSeq.sortBy {
case (path, n) => (-n, path)
}.mkString(", ")
Concise Operators!
Wednesday, June 10, 15
Once you have this arsenal of concise combinators (operators), you can compose sophisticated dataflows very quickly.
Wednesday, June 10, 15
Another example of a beautiful and profound DSL, in this case from the world of physics: Maxwell's equations. http://upload.wikimedia.org/wikipedia/commons/c/c4/Maxwell'sEquations.svg
The Spark version
took me ~30 minutes
to write.
Wednesday, June 10, 15
Once you learn the core primitives I used, and a few tricks for manipulating the RDD tuples, you can very quickly build complex algorithms for data processing!
The Spark API allowed us to focus almost exclusively on the "domain" of data transformations, while the Java MapReduce version (which does less) forced tedious attention to infrastructure mechanics.
Use a SQL query
when you can!!
Wednesday, June 10, 15
New DataFrame API
with query optimizer
(equal performance for Scala,
Java, Python, and R),
Python/R-like idioms.
Spark SQL!
Wednesday, June 10, 15
This API sits on top of a new query optimizer called Catalyst, that supports equally fast execution for all high-level languages, a first in the big
data world.
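As a rough sketch of the DataFrame style (my own example, not from the deck), assuming Spark 1.3+ and an existing SparkContext sc; the case class and sample rows are hypothetical:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

case class User(userId: Int, name: String, age: Int)   // hypothetical schema
val users = sc.parallelize(Seq(
  User(1, "a", 25), User(2, "b", 52), User(3, "c", 37))).toDF()

// Python/R-style column expressions; Catalyst builds and optimizes the same
// plan no matter which language API produced it.
users.filter(users("age") > 30)
     .groupBy("age")
     .count()
     .show()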
Mix SQL queries with
the RDD API.
Spark SQL!
Wednesday, June 10, 15
Use the best tool for a particular
problem.
Create, Read, and Delete
Hive Tables
Spark SQL!
Wednesday, June 10, 15
Interoperate with Hive, the original Hadoop SQL tool.
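A minimal sketch (mine, not from the slides) of the Hive side, assuming a Spark build with Hive support, an available metastore, and a SparkContext sc; the table name and columns are hypothetical:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

hiveContext.sql("CREATE TABLE IF NOT EXISTS users (userId INT, name STRING)")   // create
hiveContext.sql("SELECT name FROM users LIMIT 10").show()                       // read
hiveContext.sql("DROP TABLE IF EXISTS users")                                   // delete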
Read JSON and
Infer the Schema
Spark SQL!
Wednesday, June 10, 15
Read strings that are JSON records, infer the schema on the fly. Also, write RDD records as
JSON.
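Roughly, and assuming the Spark 1.3-era SQLContext API with hypothetical paths, where each line of the input file is one JSON record:

val events = sqlContext.jsonFile("data/events.json")   // schema inferred from the records

events.printSchema()
events.registerTempTable("events")
sqlContext.sql("SELECT action, COUNT(*) FROM events GROUP BY action").show()

// Going the other way: write the rows back out as JSON strings.
events.toJSON.saveAsTextFile("/path/events-json")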
Read and write
Parquet files
Spark SQL!
Wednesday, June 10, 15
Parquet is a newer file format developed by Twitter and Cloudera that is becoming very popular. It stores data in column order, which is better than row order when you have lots of columns and your queries only need a few of them. Also, columns of the same data type are easier to compress, which Parquet does for you. Finally, Parquet files carry the data schema.
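A short sketch of my own, using the Spark 1.3-era calls, hypothetical paths, and the hypothetical users DataFrame from the earlier sketch:

// Write in columnar Parquet form; the schema travels with the files.
users.saveAsParquetFile("/path/users.parquet")

// Read it back; queries only scan the columns they actually touch.
val users2 = sqlContext.parquetFile("/path/users.parquet")
users2.select("name", "age").show()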
~10-100x the performance of
Hive, due to in-memory
caching of RDDs & better
Spark abstractions.
SparkSQL
Wednesday, June 10, 15
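One concrete form of that caching, sketched under the same assumptions as the previous examples (a sqlContext and the hypothetical users DataFrame):

users.registerTempTable("Users")
sqlContext.cacheTable("Users")       // cache the table in memory, in compressed columnar form

sqlContext.sql("SELECT age, COUNT(*) FROM Users GROUP BY age").show()   // served from memory

sqlContext.uncacheTable("Users")     // release the memory when done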
Combine SparkSQL
queries with Machine
Learning code.
Wednesday, June 10, 15
We'll use the Spark "MLlib" in the example, then return to it in a moment.
CREATE TABLE Users(
userId INT,
name STRING,
email STRING,
age INT,
latitude DOUBLE,
longitude DOUBLE,
subscribed BOOLEAN);
CREATE TABLE Events(
userId INT,
action INT);
Equivalent HiveQL schema definitions.
Wednesday, June 10, 15
This example is adapted from the following blog post announcing Spark SQL:
http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html
It is adapted here to use Spark's own SQL, not the integration with Hive. Imagine we have a set of users and the events that have occurred as they used a system.
val trainingDataTable = sql("""
SELECT e.action, u.age,
u.latitude, u.longitude
FROM Users u
JOIN Events e
ON u.userId = e.userId""")
val trainingData =
trainingDataTable map { row =>
val features =
Array[Double](row(1), row(2), row(3))
LabeledPoint(row(0), features)
}
val model =
new LogisticRegressionWithSGD()
.run(trainingData)
val allCandidates = sql("""
Wednesday, June 10, 15
Here is some Spark (Scala) code with an embedded SQL query that joins the Users and Events tables. The """...""" string allows embedded line feeds.
The "sql" function returns an RDD. If we used the Hive integration and this was a query against a Hive table, we would use the hql(...) function instead.
val trainingDataTable = sql("""
SELECT e.action, u.age,
u.latitude, u.longitude
FROM Users u
JOIN Events e
ON u.userId = e.userId""")
val trainingData =
trainingDataTable map { row =>
val features =
Array[Double](row(1), row(2), row(3))
LabeledPoint(row(0), features)
}
val model =
new LogisticRegressionWithSGD()
.run(trainingData)
val allCandidates = sql("""
Wednesday, June 10, 15
We map over the RDD to create LabeledPoints, an object used in Spark's MLlib (machine learning library) for a recommendation engine. The "label" is the kind of event, and the user's age and lat/long coordinates are the "features" used for making recommendations. (E.g., if you're 25 and near a certain location in the city, you might be interested in a nightclub nearby...)
val model =
new LogisticRegressionWithSGD()
.run(trainingData)
val allCandidates = sql("""
SELECT userId, age, latitude, longitude
FROM Users
WHERE subscribed = FALSE""")
case class Score(
userId: Int, score: Double)
val scores = allCandidates map { row =>
val features =
Array[Double](row(1), row(2), row(3))
Score(row(0), model.predict(features))
}
// In-memory table
scores.registerTempTable("Scores")
Wednesday, June 10, 15
Next we train the recommendation engine, using a "logistic regression" fit to the training data, where "stochastic gradient descent" (SGD) is used to train it. (This is a standard tool set for recommendation engines; see for example http://www.cs.cmu.edu/~wcohen/10-605/assignments/sgd.pdf)
val model =
new LogisticRegressionWithSGD()
.run(trainingData)
val allCandidates = sql("""
SELECT userId, age, latitude, longitude
FROM Users
WHERE subscribed = FALSE""")
case class Score(
userId: Int, score: Double)
val scores = allCandidates map { row =>
val features =
Array[Double](row(1), row(2), row(3))
Score(row(0), model.predict(features))
}
// In-memory table
scores.registerTempTable("Scores")
Wednesday, June 10, 15
Now run a query against all users who aren't already subscribed to notifications.
case class Score(
userId: Int, score: Double)
val scores = allCandidates map { row =>
val features =
Array[Double](row(1), row(2), row(3))
Score(row(0), model.predict(features))
}
// In-memory table
scores.registerTempTable("Scores")
val topCandidates = sql("""
SELECT u.name, u.email
FROM Scores s
JOIN Users u ON s.userId = u.userId
ORDER BY score DESC
LIMIT 100""")
Wednesday, June 10, 15
Declare a class to hold each user's "score" as produced by the recommendation engine, and map the "all" query results to Scores.
case class Score(
userId: Int, score: Double)
val scores = allCandidates map { row =>
val features =
Array[Double](row(1), row(2), row(3))
Score(row(0), model.predict(features))
}
// In-memory table
scores.registerTempTable("Scores")
val topCandidates = sql("""
SELECT u.name, u.email
FROM Scores s
JOIN Users u ON s.userId = u.userId
ORDER BY score DESC
LIMIT 100""")
Wednesday, June 10, 15
Then "register" the scores RDD as a "Scores" table in memory. If you use the Hive binding instead, this would be a table in Hive's metadata storage.
// In-memory table
scores.registerTempTable("Scores")
val topCandidates = sql("""
SELECT u.name, u.email
FROM Scores s
JOIN Users u ON s.userId = u.userId
ORDER BY score DESC
LIMIT 100""")
Wednesday, June 10, 15
Finally, run a new query to find the people with the highest scores who aren't already subscribed to notifications. You might send them an email next, recommending that they subscribe...
val trainingDataTable = sql("""
SELECT e.action, u.age,
u.latitude, u.longitude
FROM Users u
JOIN Events e
ON u.userId = e.userId""")
val trainingData =
trainingDataTable map { row =>
val features =
Array[Double](row(1), row(2), row(3))
LabeledPoint(row(0), features)
}
val model =
new LogisticRegressionWithSGD()
.run(trainingData)
val allCandidates = sql("""
SELECT userId, age, latitude, longitude
FROM Users
WHERE subscribed = FALSE""")
case class Score(
userId: Int, score: Double)
val scores = allCandidates map { row =>
val features =
Array[Double](row(1), row(2), row(3))
Score(row(0), model.predict(features))
}
// In-memory table
scores.registerTempTable("Scores")
val topCandidates = sql("""
SELECT u.name, u.email
FROM Scores s
JOIN Users u ON s.userId = u.userId
ORDER BY score DESC
LIMIT 100""")
Altogether
Wednesday, June 10, 15
12 point font again.
Event Stream
Processing
Wednesday, June 10, 15
Use the same abstractions
for near real-time,
event streaming.
Spark Streaming
Wednesday, June 10, 15
Once you learn the core set of primitives, it’s easy to compose non-trivial algorithms with little
code.
“Mini batches”
Wednesday, June 10, 15
A DStream (discretized stream) wraps the RDDs for each "batch" of events. You can specify the granularity, such as all events in 1-second batches, and then your Spark job is passed each batch of data for processing. You can also work with moving windows of batches.
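Moving windows don't appear in the code that follows, so here is a rough sketch of my own, assuming the words DStream built in the upcoming word-count example and the usual org.apache.spark.streaming.Seconds import:

// Recount words over the last 30 seconds, recomputed every 10 seconds.
val windowedCounts = words
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

windowedCounts.print()   // dump a sample of each windowed result to the console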
Very similar code...
Wednesday, June 10, 15
val sc = new SparkContext(...)
val ssc = new StreamingContext(
sc, Seconds(10))
// A DStream that will listen
// for text on server:port
val lines =
ssc.socketTextStream(s, p)
// Word Count...
val words = lines flatMap {
line => line.split("""\W+""")
}
Wednesday, June 10, 15
This example is adapted from the following page on the Spark website:
http://spark.apache.org/docs/0.9.0/streaming-programming-guide.html#a-quick-example
val sc = new SparkContext(...)
val ssc = new StreamingContext(
sc, Seconds(10))
// A DStream that will listen
// for text on server:port
val lines =
ssc.socketTextStream(s, p)
// Word Count...
val words = lines flatMap {
line => line.split("""\W+""")
}
Wednesday, June 10, 15
We create a StreamingContext that wraps a SparkContext (there are alternative ways to construct it...). It will "clump" the events into 10-second intervals.
val sc = new SparkContext(...)
val ssc = new StreamingContext(
sc, Seconds(10))
// A DStream that will listen
// for text on server:port
val lines =
ssc.socketTextStream(s, p)
// Word Count...
val words = lines flatMap {
line => line.split("""\W+""")
}
Wednesday, June 10, 15
Next we set up a socket to stream text to us from another server and port (one of several ways to ingest data).
// Word Count...
val words = lines flatMap {
line => line.split("""\W+""")
}
val pairs = words map ((_, 1))
val wordCounts =
pairs reduceByKey ((i,j) => i+j)
wordCounts.saveAsTextFiles(outpath)
ssc.start()
ssc.awaitTermination()
Wednesday, June 10, 15
Now we "count words". For each mini-batch (10 seconds' worth of data), we split the input text into words (on non-word characters, which is too crude).
Once we set up the flow, we start it and wait for it to terminate through some means, such as the server socket closing.
// Word Count...
val words = lines flatMap {
line => line.split("""\W+""")
}
val pairs = words map ((_, 1))
val wordCounts =
pairs reduceByKey ((i,j) => i+j)
wordCounts.saveAsTextFiles(outpath)
ssc.start()
ssc.awaitTermination()
Wednesday, June 10, 15
We count these words just like we counted (word, path) pairs earlier.
val pairs = words map ((_, 1))
val wordCounts =
pairs reduceByKey ((i,j) => i+j)
wordCounts.saveAsTextFiles(outpath)
ssc.start()
ssc.awaitTermination()
Wednesday, June 10, 15
print is a useful diagnostic tool that prints a header and the first 10 records to the console at each iteration.
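The code on these slides saves text files rather than calling print; as a one-line sketch, the diagnostic described here would be applied to the wordCounts DStream:

wordCounts.print()   // prints a header and the first 10 records of each batch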
val pairs = words map ((_, 1))
val wordCounts =
pairs reduceByKey ((i,j) => i+j)
wordCounts.saveAsTextFiles(outpath)
ssc.start()
ssc.awaitTermination()
Wednesday, June 10, 15
Now start the data flow and wait for it to terminate (possibly forever).
Machine Learning
Library...
Wednesday, June 10, 15
MLlib, which we won't discuss further.
Distributed Graph
Computing...
Wednesday, June 10, 15
GraphX, which we won’t discuss further.
Some problems are more naturally represented as graphs.
Extends RDDs to support property graphs with directed edges.
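A tiny sketch of my own (not from the slides) of what a property graph looks like in GraphX, assuming a SparkContext sc; the vertices and edges are hypothetical:

import org.apache.spark.graphx._

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))

val graph = Graph(vertices, edges)

graph.inDegrees.collect()                                          // e.g., bob has in-degree 2
graph.triplets.map(t => s"${t.srcAttr} -> ${t.dstAttr}").collect() // "alice -> bob", "carol -> bob"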
A flexible, scalable distributed
compute platform with
concise, powerful APIs and
higher-order tools.
spark.apache.org
Spark
Wednesday, June 10, 15
polyglotprogramming.com/talks
@deanwampler
Wednesday, June 10, 15
Copyright (c) 2014-2015, Dean Wampler, All Rights Reserved, unless otherwise noted.
Image: The London Eye on one side of the Thames, Parliament on the other.

Software Project Health Check: Best Practices and Techniques for Your Product...
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
How To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROHow To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTRO
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 

Why Spark Is the Next Top (Compute) Model

  • 1. Why Spark Is the Next Top (Compute) Model @deanwampler Scala  Days  Amsterdam  2015 Wednesday, June 10, 15 Copyright (c) 2014-2015, Dean Wampler, All Rights Reserved, unless otherwise noted. Image: Detail of the London Eye
  • 2. dean.wampler@typesafe.com polyglotprogramming.com/talks @deanwampler “Trolling the Hadoop community since 2012...” Wednesday, June 10, 15 About me. You can find this presentation and others on Big Data and Scala at polyglotprogramming.com. Programming Scala, 2nd Edition is forthcoming. photo: Dusk at 30,000 ft above the Central Plains of the U.S. on a Winter’s Day.
  • 3. 3 Wednesday, June 10, 15 This page provides more information, as well as results of a recent survey of Spark usage, blog posts and webinars about the world of Spark.
  • 4. typesafe.com/reactive-big-data 4 Wednesday, June 10, 15 This page provides more information, as well as results of a recent survey of Spark usage, blog posts and webinars about the world of Spark.
  • 5. Hadoop circa 2013 Wednesday, June 10, 15 The state of Hadoop as of last year. Image: Detail of the London Eye
  • 6. Hadoop v2.X Cluster node DiskDiskDiskDiskDisk Node Mgr Data Node node DiskDiskDiskDiskDisk Node Mgr Data Node node DiskDiskDiskDiskDisk Node Mgr Data Node master Resource Mgr Name Node master Name Node Wednesday, June 10, 15 Schematic view of a Hadoop 2 cluster. For a more precise definition of the services and what they do, see e.g., http://hadoop.apache.org/ docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html We aren’t interested in great details at this point, but we’ll call out a few useful things to know.
  • 7. Hadoop v2.X Cluster node DiskDiskDiskDiskDisk Node Mgr Data Node node DiskDiskDiskDiskDisk Node Mgr Data Node node DiskDiskDiskDiskDisk Node Mgr Data Node master Resource Mgr Name Node master Name Node Resource and Node Managers Wednesday, June 10, 15 Hadoop 2 uses YARN to manage resources via the master Resource Manager, which includes a pluggable job scheduler and an Applications Manager. It coordinates with the Node Manager on each node to schedule jobs and provide resources. Other services involved, including application-specific Containers and Application Masters are not shown.
  • 8. Hadoop v2.X Cluster node DiskDiskDiskDiskDisk Node Mgr Data Node node DiskDiskDiskDiskDisk Node Mgr Data Node node DiskDiskDiskDiskDisk Node Mgr Data Node master Resource Mgr Name Node master Name Node Name Node and Data Nodes Wednesday, June 10, 15 Hadoop 2 clusters federate the Name node services that manage the file system, HDFS. They provide horizontal scalability of file-system operations and resiliency when service instances fail. The data node services manage individual blocks for files.
  • 9. MapReduce The classic compute model for Hadoop Wednesday, June 10, 15 Hadoop 2 clusters federate the Name node services that manage the file system, HDFS. They provide horizontal scalability of file-system operations and resiliency when service instances fail. The data node services manage individual blocks for files.
  • 10. 1 map step + 1 reduce step (wash, rinse, repeat) MapReduce Wednesday, June 10, 15 You get 1 map step (although there is limited support for chaining mappers) and 1 reduce step. If you can’t implement an algorithm in these two steps, you can chain jobs together, but you’ll pay a tax of flushing the entire data set to disk between these jobs.
  • 11. Example: Inverted Index MapReduce Wednesday, June 10, 15 The inverted index is a classic algorithm needed for building search engines.
  • 12. Wednesday, June 10, 15 Before running MapReduce, crawl the interwebs, find all the pages, and build a data set of URLs -> doc contents, written to flat files in HDFS or one of the more “sophisticated” formats.
  • 13. Wednesday, June 10, 15 Now we’re running MapReduce. In the map step, a task (JVM process) per file *block* (64MB or larger) reads the rows, tokenizes the text and outputs key-value pairs (“tuples”)...
  • 14. Map Task (hadoop,(wikipedia.org/hadoop,1)) (mapreduce,(wikipedia.org/hadoop,1)) (hdfs,(wikipedia.org/hadoop,1)) (provides,(wikipedia.org/hadoop,1)) (and,(wikipedia.org/hadoop,1)) Wednesday, June 10, 15 ... the keys are each word found and the values are themselves tuples, each URL and the count of the word. In our simplified example, there are typically only single occurrences of each word in each document. The real occurrences are interesting because if a word is mentioned a lot in a document, the chances are higher that you would want to find that document in a search for that word.
  • 16. Wednesday, June 10, 15 The output tuples are sorted by key locally in each map task, then “shuffled” over the cluster network to reduce tasks (each a JVM process, too), where we want all occurrences of a given key to land on the same reduce task.
  • 17. Wednesday, June 10, 15 Finally, each reducer just aggregates all the values it receives for each key, then writes out new files to HDFS with the words and a list of (URL-count) tuples (pairs).
  • 18. Altogether... Wednesday, June 10, 15 Finally, each reducer just aggregates all the values it receives for each key, then writes out new files to HDFS with the words and a list of (URL-count) tuples (pairs).
  • 19. What’s not to like? Wednesday, June 10, 15 This seems okay, right? What’s wrong with it?
  • 20. Restrictive model makes most algorithms hard to implement. Awkward Wednesday, June 10, 15 Writing MapReduce jobs requires arcane, specialized skills that few master. For a good overview, see http://lintool.github.io/MapReduceAlgorithms/.
  • 21. Lack of flexibility inhibits optimizations. Awkward Wednesday, June 10, 15 The inflexible compute model leads to complex code, where hacks are used to work around its limitations and improve performance. Hence, optimizations are hard to implement. The Spark team has commented on this; see http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html
  • 22. Full dump of intermediate data to disk between jobs. Performance Wednesday, June 10, 15 Sequencing jobs wouldn’t be so bad if the “system” was smart enough to cache data in memory. Instead, each job dumps everything to disk, then the next job reads it back in again. This makes iterative algorithms particularly painful.
  • 23. MapReduce only supports “batch mode” Streaming Wednesday, June 10, 15 Processing data streams as soon as possible has become very important. MR can’t do this, due to its coarse-grained nature and relative inefficiency, so alternatives have to be used.
  • 25. Can be run in: •YARN (Hadoop 2) •Mesos (Cluster management) •EC2 •Standalone mode •Cassandra, Riak, ... •... Cluster Computing Wednesday, June 10, 15 If you have a Hadoop cluster, you can run Spark as a seamless compute engine on YARN. (You can also use pre-YARN Hadoop v1 clusters, but there you have to manually allocate resources to the embedded Spark cluster vs Hadoop.) Mesos is a general-purpose cluster resource manager that can also be used to manage Hadoop resources. Scripts for running a Spark cluster in EC2 are available. Standalone just means you run Spark’s built-in support for clustering (or run locally on a single box - e.g., for development). EC2 deployments are usually standalone... Finally, database vendors like Datastax are integrating Spark.
  • 26. Fine-grained operators for composing algorithms. Compute Model Wednesday, June 10, 15 Once you learn the core set of primitives, it’s easy to compose non-trivial algorithms with little code.
  • 27. RDD: Resilient, Distributed Dataset Compute Model Wednesday, June 10, 15 RDDs shard the data over a cluster, like a virtualized, distributed collection (analogous to HDFS). They support intelligent caching, which means no naive flushes of massive datasets to disk. This feature alone allows Spark jobs to run 10-100x faster than comparable MapReduce jobs! The “resilient” part means they will reconstitute shards lost due to process/server crashes.
  • 28. Compute Model Wednesday, June 10, 15 RDDs shard the data over a cluster, like a virtualized, distributed collection (analogous to HDFS). They support intelligent caching, which means no naive flushes of massive datasets to disk. This feature alone allows Spark jobs to run 10-100x faster than comparable MapReduce jobs! The “resilient” part means they will reconstitute shards lost due to process/server crashes.
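To make the caching point concrete, here is a minimal, hypothetical sketch (the input path and local master are made up): the RDD is marked for in-memory caching, so the second action reuses the partitions computed by the first instead of rereading them from disk.

import org.apache.spark.SparkContext

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "Caching Sketch")
    // cache() only marks the RDD; nothing is computed until an action runs.
    val lines = sc.textFile("data/crawl").cache()
    val total  = lines.count()                              // first action: reads from disk, fills the cache
    val errors = lines.filter(_.contains("ERROR")).count()  // second action: served from memory
    println(s"$errors of $total lines contain ERROR")
    sc.stop()
  }
}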
  • 29. Written in Scala, with Java, Python, and R APIs. Compute Model Wednesday, June 10, 15 Once you learn the core set of primitives, it’s easy to compose non-trivial algorithms with little code.
  • 30. Inverted Index in Java MapReduce Wednesday, June 10, 15 Let’s see an actual implementation of the inverted index. First, a Hadoop MapReduce (Java) version, adapted from https://developer.yahoo.com/hadoop/tutorial/module4.html#solution. It’s about 90 lines of code, but I reformatted it to fit better. This is also a slightly simpler version than the one I diagrammed. It doesn’t record a count of each word in a document; it just writes (word,doc-title) pairs out of the mappers, and the final (word,list) output by the reducers just has a list of documents, hence repeats. A second job would be necessary to count the repeats.
  • 31. import java.io.IOException; import java.util.*; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.*; import org.apache.hadoop.mapred.*; public class LineIndexer { public static void main(String[] args) { JobClient client = new JobClient(); JobConf conf = new JobConf(LineIndexer.class); conf.setJobName("LineIndexer"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(Text.class); FileInputFormat.addInputPath(conf, new Path("input")); Wednesday, June 10, 15 I’m not going to explain this in much detail. I used yellow for method calls, because methods do the real work!! But notice that the functions in this code don’t really do a whole lot, so there’s low information density and you do a lot of bit twiddling.
  • 32. JobClient client = new JobClient(); JobConf conf = new JobConf(LineIndexer.class); conf.setJobName("LineIndexer"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(Text.class); FileInputFormat.addInputPath(conf, new Path("input")); FileOutputFormat.setOutputPath(conf, new Path("output")); conf.setMapperClass( LineIndexMapper.class); conf.setReducerClass( LineIndexReducer.class); client.setConf(conf); try { JobClient.runJob(conf); Wednesday, June 10, 15 boilerplate...
  • 33. LineIndexMapper.class); conf.setReducerClass( LineIndexReducer.class); client.setConf(conf); try { JobClient.runJob(conf); } catch (Exception e) { e.printStackTrace(); } } public static class LineIndexMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { private final static Text word = new Text(); private final static Text location = Wednesday, June 10, 15 main ends with a try-catch clause to run the job.
  • 34. public static class LineIndexMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { private final static Text word = new Text(); private final static Text location = new Text(); public void map( LongWritable key, Text val, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { FileSplit fileSplit = (FileSplit)reporter.getInputSplit(); String fileName = fileSplit.getPath().getName(); location.set(fileName); Wednesday, June 10, 15 This is the LineIndexMapper class for the mapper. The map method does the real work of tokenization and writing the (word, document-name) tuples.
  • 35. FileSplit fileSplit = (FileSplit)reporter.getInputSplit(); String fileName = fileSplit.getPath().getName(); location.set(fileName); String line = val.toString(); StringTokenizer itr = new StringTokenizer(line.toLowerCase()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, location); } } } public static class LineIndexReducer extends MapReduceBase implements Reducer<Text, Text, Wednesday, June 10, 15 The rest of the LineIndexMapper class and map method.
  • 36. public static class LineIndexReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> { public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { boolean first = true; StringBuilder toReturn = new StringBuilder(); while (values.hasNext()) { if (!first) toReturn.append(", "); first=false; toReturn.append( values.next().toString()); } output.collect(key, new Text(toReturn.toString())); Wednesday, June 10, 15 The reducer class, LineIndexReducer, with the reduce method that is called for each key and a list of values for that key. The reducer is stupid; it just reformats the values collection into a long string and writes the final (word,list-string) output.
  • 37. boolean first = true; StringBuilder toReturn = new StringBuilder(); while (values.hasNext()) { if (!first) toReturn.append(", "); first=false; toReturn.append( values.next().toString()); } output.collect(key, new Text(toReturn.toString())); } } } Wednesday, June 10, 15 EOF
  • 38. Altogether import java.io.IOException; import java.util.*; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.*; import org.apache.hadoop.mapred.*; public class LineIndexer { public static void main(String[] args) { JobClient client = new JobClient(); JobConf conf = new JobConf(LineIndexer.class); conf.setJobName("LineIndexer"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(Text.class); FileInputFormat.addInputPath(conf, new Path("input")); FileOutputFormat.setOutputPath(conf, new Path("output")); conf.setMapperClass( LineIndexMapper.class); conf.setReducerClass( LineIndexReducer.class); client.setConf(conf); try { JobClient.runJob(conf); } catch (Exception e) { e.printStackTrace(); } } public static class LineIndexMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { private final static Text word = new Text(); private final static Text location = new Text(); public void map( LongWritable key, Text val, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { FileSplit fileSplit = (FileSplit)reporter.getInputSplit(); String fileName = fileSplit.getPath().getName(); location.set(fileName); String line = val.toString(); StringTokenizer itr = new StringTokenizer(line.toLowerCase()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, location); } } } public static class LineIndexReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> { public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { boolean first = true; StringBuilder toReturn = new StringBuilder(); while (values.hasNext()) { if (!first) toReturn.append(", "); first=false; toReturn.append( values.next().toString()); } output.collect(key, new Text(toReturn.toString())); } } } Wednesday, June 10, 15 The  whole  shebang  (6pt.  font)
  • 39. Inverted Index in Spark (Scala). Wednesday, June 10, 15 This code is approximately 45 lines, but it does more than the previous Java example: it implements the original inverted index algorithm I diagrammed, where word counts are computed and included in the data.
  • 40. import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object InvertedIndex { def main(a: Array[String]) = { val sc = new SparkContext( "local[*]", "Inverted Idx") sc.textFile("data/crawl") .map { line => Wednesday, June 10, 15 The InvertedIndex implemented in Spark. This time, we’ll also count the occurrences in each document (as I originally described the algorithm) and sort the (url,N) pairs descending by N (count), and ascending by URL.
  • 41. import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object InvertedIndex { def main(a: Array[String]) = { val sc = new SparkContext( "local[*]", "Inverted Idx") sc.textFile("data/crawl") .map { line => Wednesday, June 10, 15 It starts with imports, then declares a singleton object (a first-class concept in Scala), with a “main” routine (as in Java). The methods are colored yellow again. Note how dense with meaning they are this time.
  • 42. import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object InvertedIndex { def main(a: Array[String]) = { val sc = new SparkContext( "local[*]", "Inverted Idx") sc.textFile("data/crawl") .map { line => Wednesday, June 10, 15 You begin the workflow by declaring a SparkContext. We’re running in “local[*]” mode, in this case, meaning on a single machine, but using all cores available. Normally this argument would be a command-line parameter, so you can develop locally, then submit to a cluster, where “local” would be replaced by the appropriate cluster master URI.
  • 43. sc.textFile("data/crawl") .map { line => val Array(path, text) = line.split("\t",2) (path, text) } .flatMap { case (path, text) => text.split("""\W+""") map { word => (word, path) } } .map { case (w, p) => ((w, p), 1) Wednesday, June 10, 15 The rest of the program is a sequence of function calls, analogous to “pipes” we connect together to construct the data flow. Data will only start “flowing” when we ask for results. We start by reading one or more text files from the directory “data/crawl”. If running in Hadoop and there are one or more Hadoop-style “part-NNNNN” files, Spark will process all of them (they will be processed synchronously in “local” mode).
  • 44. sc.textFile("data/crawl") .map { line => val Array(path, text) = line.split("\t",2) (path, text) } .flatMap { case (path, text) => text.split("""\W+""") map { word => (word, path) } } .map { case (w, p) => ((w, p), 1) Wednesday, June 10, 15 sc.textFile returns an RDD with a string for each line of input text. So, the first thing we do is map over these strings to extract the original document id (i.e., file name), followed by the text in the document, all on one line. We assume tab is the separator. “(array(0), array(1))” returns a two-element “tuple”. Think of the output RDD as having a schema “fileName: String, text: String”.
  • 45. sc.textFile("data/crawl") .map { line => val Array(path, text) = line.split("\t",2) (path, text) } .flatMap { case (path, text) => text.split("""\W+""") map { word => (word, path) } } .map { case (w, p) => ((w, p), 1) Wednesday, June 10, 15 flatMap maps over each of these 2-element tuples. We split the text into words on non-alphanumeric characters, then output collections of word (our ultimate, final “key”) and the path. That is, each line (one thing) is converted to a collection of (word,path) pairs (0 to many things), but we don’t want an output collection of nested collections, so flatMap concatenates nested collections into one long “flat” collection of (word,path) pairs.
  • 46. } .map { case (w, p) => ((w, p), 1) } .reduceByKey { case (n1, n2) => n1 + n2 } .map { case ((w, p), n) => (w, (p, n)) } .groupByKey .mapValues { iter => iter.toSeq.sortBy { case (path, n) => (-n, path) }.mkString(", ") ((word1, path1), n1) ((word2, path2), n2) ... Wednesday, June 10, 15 Next, we map over these pairs and add a single “seed” count of 1. Note the structure of the returned tuple; it’s a two-tuple where the first element is itself a two-tuple holding (word, path). The following special method, reduceByKey, is like a groupBy, where it groups over those (word, path) “keys” and uses the function to sum the integers. The popup shows what the output data looks like.
  • 47. .reduceByKey { case (n1, n2) => n1 + n2 } .map { case ((w, p), n) => (w, (p, n)) } .groupByKey .mapValues { iter => iter.toSeq.sortBy { case (path, n) => (-n, path) }.mkString(", ") } .saveAsTextFile("/path/out") sc.stop() } (word1, (path1, n1)) (word2, (path2, n2)) ... Wednesday, June 10, 15 So,  the  input  to  the  next  map  is  now  ((word,  path),  n),  where  n  is  now  >=  1.  We  transform  these  tuples  into  the  form  we  actually  want,  (word,   (path,  n)).  I  love  how  concise  and  elegant  this  code  is!
  • 48. .map { case ((w, p), n) => (w, (p, n)) } .groupByKey .mapValues { iter => iter.toSeq.sortBy { case (path, n) => (-n, path) }.mkString(", ") } .saveAsTextFile("/path/out") sc.stop() } } (word1, iter( (path11, n11), (path12, n12)...)) (word2, iter( (path21, n21), (path22, n22)...)) ... Wednesday, June 10, 15 Now  we  do  an  explicit  group  by  to  bring  all  the  same  words  together.  The  output    will  be  (word,  seq(  (path1,  n1),  (path2,  n2),  ...)).
  • 49. .map { case ((w, p), n) => (w, (p, n)) } .groupByKey .mapValues { iter => iter.toSeq.sortBy { case (path, n) => (-n, path) }.mkString(", ") } .saveAsTextFile("/path/out") sc.stop() } } Wednesday, June 10, 15 The last map over just the values (keeping the same keys) sorts by the count descending and path ascending. (Sorting by path is mostly useful for reproducibility, e.g., in tests!).
  • 50. .map { case ((w, p), n) => (w, (p, n)) } .groupByKey .mapValues { iter => iter.toSeq.sortBy { case (path, n) => (-n, path) }.mkString(", ") } .saveAsTextFile("/path/out") sc.stop() } } Wednesday, June 10, 15 Finally,  write  back  to  the  file  system  and  stop  the  job.
  • 51. Altogether import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object InvertedIndex { def main(a: Array[String]) = { val sc = new SparkContext( "local[*]", "Inverted Idx") sc.textFile("data/crawl") .map { line => val Array(path, text) = line.split("\t",2) (path, text) } .flatMap { case (path, text) => text.split("""\W+""") map { word => (word, path) } } .map { case (w, p) => ((w, p), 1) } .reduceByKey { case (n1, n2) => n1 + n2 } .map { case ((w, p), n) => (w, (p, n)) } .groupByKey .mapValues { iter => iter.toSeq.sortBy { case (path, n) => (-n, path) } } .saveAsTextFile("/path/to/out") sc.stop() } } Wednesday, June 10, 15 The whole shebang (14pt. font, this time)
  • 52. } .map { case (w, p) => ((w, p), 1) } .reduceByKey { case (n1, n2) => n1 + n2 } .map { case ((w, p), n) => (w, (p, n)) } .groupByKey .mapValues { iter => iter.toSeq.sortBy { case (path, n) => (-n, path) }.mkString(", ") Concise Operators! Wednesday, June 10, 15 Once you have this arsenal of concise combinators (operators), you can compose sophisticated dataflows very quickly.
  • 53. Wednesday, June 10, 15 Another example of a beautiful and profound DSL, in this case from the world of Physics: Maxwell’s equations: http://upload.wikimedia.org/wikipedia/commons/c/c4/Maxwell'sEquations.svg
  • 54. The Spark version took me ~30 minutes to write. Wednesday, June 10, 15 Once you learn the core primitives I used, and a few tricks for manipulating the RDD tuples, you can very quickly build complex algorithms for data processing! The Spark API allowed us to focus almost exclusively on the “domain” of data transformations, while the Java MapReduce version (which does less) forced tedious attention to infrastructure mechanics.
  • 55. Use a SQL query when you can!! Wednesday, June 10, 15
  • 56. New DataFrame API with query optimizer (equal performance for Scala, Java, Python, and R), Python/R-like idioms. Spark SQL! Wednesday, June 10, 15 This API sits on top of a new query optimizer called Catalyst, that supports equally fast execution for all high-level languages, a first in the big data world.
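A hedged sketch of what the DataFrame API looks like (Spark 1.3+); the Person case class and its values are invented for illustration, and sc is assumed to be an existing SparkContext. Catalyst optimizes the resulting logical plan, which is why the equivalent Python and R code runs as fast as the Scala version.

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)   // hypothetical schema

val sqlContext = new SQLContext(sc)          // sc: an existing SparkContext
import sqlContext.implicits._                // enables .toDF and the $"col" syntax

val people = sc.parallelize(Seq(Person("Ann", 32), Person("Bo", 25), Person("Cy", 41))).toDF()
people.filter($"age" > 30)    // a Column expression, optimized by Catalyst
      .groupBy("age")
      .count()
      .show()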
  • 57. Mix SQL queries with the RDD API. Spark SQL! Wednesday, June 10, 15 Use the best tool for a particular problem.
  • 58. Create, Read, and Delete Hive Tables Spark SQL! Wednesday, June 10, 15 Interoperate with Hive, the original Hadoop SQL tool.
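A sketch of the Hive interoperability, assuming a Spark build with Hive support and the same SparkContext sc; the table name and input path are hypothetical. Tables created through a HiveContext land in Hive’s metastore, so Hive itself can read them, and vice versa.

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql("CREATE TABLE IF NOT EXISTS users (userId INT, name STRING)")   // create
hiveContext.sql("LOAD DATA LOCAL INPATH 'data/users.txt' INTO TABLE users")     // load some rows
hiveContext.sql("SELECT name FROM users LIMIT 10").collect().foreach(println)   // read
hiveContext.sql("DROP TABLE IF EXISTS users")                                   // delete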
  • 59. Read JSON and Infer the Schema Spark SQL! Wednesday, June 10, 15 Read strings that are JSON records, infer the schema on the fly. Also, write RDD records as JSON.
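A sketch of JSON schema inference, with a hypothetical path and field names; sqlContext is the SQLContext from the DataFrame sketch above. Each input line is one JSON record; Spark scans the records, infers a schema, and the result can be queried like any other table. (This uses the Spark 1.4 reader API; on 1.3 the equivalent is sqlContext.jsonFile.)

val events = sqlContext.read.json("data/events.json")    // one JSON object per line
events.printSchema()                                      // show the inferred schema
events.registerTempTable("events")
sqlContext.sql("SELECT userId, action FROM events LIMIT 10").show()
events.toJSON.saveAsTextFile("out/events-json")           // write records back out as JSON strings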
  • 60. Read and write Parquet files Spark SQL! Wednesday, June 10, 15 Parquet is a newer file format developed by Twitter and Cloudera that is becoming very popular. It stores data in column order, which is better than row order when you have lots of columns and your queries only need a few of them. Also, columns of the same data types are easier to compress, which Parquet does for you. Finally, Parquet files carry the data schema.
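A sketch of Parquet I/O with hypothetical paths, reusing the people DataFrame from the earlier sketch. Because Parquet files carry their schema, the read side needs no schema declaration. (write/read are the Spark 1.4 API; Spark 1.3 uses saveAsParquetFile and parquetFile instead.)

people.write.parquet("out/people.parquet")                  // columnar, compressed, schema included
val people2 = sqlContext.read.parquet("out/people.parquet")
people2.select("name", "age").show()                        // only the needed columns are read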
  • 61. ~10-100x the performance of Hive, due to in-memory caching of RDDs & better Spark abstractions. SparkSQL Wednesday, June 10, 15
  • 62. Combine SparkSQL queries with Machine Learning code. Wednesday, June 10, 15 We’ll  use  the  Spark  “MLlib”  in  the  example,  then  return  to  it  in  a  moment.
  • 63. CREATE TABLE Users( userId INT, name STRING, email STRING, age INT, latitude DOUBLE, longitude DOUBLE, subscribed BOOLEAN); CREATE TABLE Events( userId INT, action INT); Equivalent HiveQL schema definitions. Wednesday, June 10, 15 This example is adapted from the following blog post announcing Spark SQL: http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html Adapted here to use Spark’s own SQL, not the integration with Hive. Imagine we have a table of users and a stream of events that have occurred as they used a system.
  • 64. val trainingDataTable = sql(""" SELECT e.action, u.age, u.latitude, u.longitude FROM Users u JOIN Events e ON u.userId = e.userId""") val trainingData = trainingDataTable map { row => val features = Array[Double](row(1), row(2), row(3)) LabeledPoint(row(0), features) } val model = new LogisticRegressionWithSGD() .run(trainingData) val allCandidates = sql(""" Wednesday, June 10, 15 Here is some Spark (Scala) code with an embedded SQL query that joins the Users and Events tables. The """...""" string allows embedded line feeds. The “sql” function returns an RDD. If we used the Hive integration and this was a query against a Hive table, we would use the hql(...) function instead.
  • 65. val trainingDataTable = sql(""" SELECT e.action, u.age, u.latitude, u.longitude FROM Users u JOIN Events e ON u.userId = e.userId""") val trainingData = trainingDataTable map { row => val features = Array[Double](row(1), row(2), row(3)) LabeledPoint(row(0), features) } val model = new LogisticRegressionWithSGD() .run(trainingData) val allCandidates = sql(""" Wednesday, June 10, 15 We map over the RDD to create LabeledPoints, an object used in Spark’s MLlib (machine learning library) for a recommendation engine. The “label” is the kind of event and the user’s age and lat/long coordinates are the “features” used for making recommendations. (E.g., if you’re 25 and near a certain location in the city, you might be interested in a nightclub nearby...)
  • 66. val model = new LogisticRegressionWithSGD() .run(trainingData) val allCandidates = sql(""" SELECT userId, age, latitude, longitude FROM Users WHERE subscribed = FALSE""") case class Score( userId: Int, score: Double) val scores = allCandidates map { row => val features = Array[Double](row(1), row(2), row(3)) Score(row(0), model.predict(features)) } // In-memory table scores.registerTempTable("Scores") Wednesday, June 10, 15 Next we train the recommendation engine, using a “logistic regression” fit to the training data, where “stochastic gradient descent” (SGD) is used to train it. (This is a standard tool set for recommendation engines; see for example: http://www.cs.cmu.edu/~wcohen/10-605/assignments/sgd.pdf)
  • 67. val model = new LogisticRegressionWithSGD() .run(trainingData) val allCandidates = sql(""" SELECT userId, age, latitude, longitude FROM Users WHERE subscribed = FALSE""") case class Score( userId: Int, score: Double) val scores = allCandidates map { row => val features = Array[Double](row(1), row(2), row(3)) Score(row(0), model.predict(features)) } // In-memory table scores.registerTempTable("Scores") Wednesday, June 10, 15 Now run a query against all users who aren’t already subscribed to notifications.
  • 68. case class Score( userId: Int, score: Double) val scores = allCandidates map { row => val features = Array[Double](row(1), row(2), row(3)) Score(row(0), model.predict(features)) } // In-memory table scores.registerTempTable("Scores") val topCandidates = sql(""" SELECT u.name, u.email FROM Scores s JOIN Users u ON s.userId = u.userId ORDER BY score DESC LIMIT 100""") Wednesday, June 10, 15 Declare a class to hold each user’s “score” as produced by the recommendation engine and map the “all” query results to Scores.
  • 69. case class Score( userId: Int, score: Double) val scores = allCandidates map { row => val features = Array[Double](row(1), row(2), row(3)) Score(row(0), model.predict(features)) } // In-memory table scores.registerTempTable("Scores") val topCandidates = sql(""" SELECT u.name, u.email FROM Scores s JOIN Users u ON s.userId = u.userId ORDER BY score DESC LIMIT 100""") Wednesday, June 10, 15 Then “register” the scores RDD as a “Scores” table in memory. If you use the Hive binding instead, this would be a table in Hive’s metadata storage.
  • 70. // In-memory table scores.registerTempTable("Scores") val topCandidates = sql(""" SELECT u.name, u.email FROM Scores s JOIN Users u ON s.userId = u.userId ORDER BY score DESC LIMIT 100""") Wednesday, June 10, 15 Finally, run a new query to find the people with the highest scores that aren’t already subscribing to notifications. You might send them an email next recommending that they subscribe...
  • 71. val trainingDataTable = sql(""" SELECT e.action, u.age, u.latitude, u.longitude FROM Users u JOIN Events e ON u.userId = e.userId""") val trainingData = trainingDataTable map { row => val features = Array[Double](row(1), row(2), row(3)) LabeledPoint(row(0), features) } val model = new LogisticRegressionWithSGD() .run(trainingData) val allCandidates = sql(""" SELECT userId, age, latitude, longitude FROM Users WHERE subscribed = FALSE""") case class Score( userId: Int, score: Double) val scores = allCandidates map { row => val features = Array[Double](row(1), row(2), row(3)) Score(row(0), model.predict(features)) } // In-memory table scores.registerTempTable("Scores") val topCandidates = sql(""" SELECT u.name, u.email FROM Scores s JOIN Users u ON s.userId = u.userId ORDER BY score DESC LIMIT 100""") Altogether Wednesday, June 10, 15 12  point  font  again.
  • 73. Use the same abstractions for near real-time, event streaming. Spark Streaming Wednesday, June 10, 15 Once you learn the core set of primitives, it’s easy to compose non-trivial algorithms with little code.
  • 74. “Mini batches” Wednesday, June 10, 15 A DSTream (discretized stream) wraps the RDDs for each “batch” of events. You can specify the granularity, such as all events in 1 second batches, then your Spark job is passed each batch of data for processing. You can also work with moving windows of batches.
  • 76. val sc = new SparkContext(...) val ssc = new StreamingContext( sc, Seconds(10)) // A DStream that will listen // for text on server:port val lines = ssc.socketTextStream(s, p) // Word Count... val words = lines flatMap { line => line.split("""\W+""") } Wednesday, June 10, 15 This example is adapted from the following page on the Spark website: http://spark.apache.org/docs/0.9.0/streaming-programming-guide.html#a-quick-example
  • 77. val sc = new SparkContext(...) val ssc = new StreamingContext( sc, Seconds(10)) // A DStream that will listen // for text on server:port val lines = ssc.socketTextStream(s, p) // Word Count... val words = lines flatMap { line => line.split("""\W+""") } Wednesday, June 10, 15 We create a StreamingContext that wraps a SparkContext (there are alternative ways to construct it...). It will “clump” the events into 10-second intervals.
  • 78. val sc = new SparkContext(...) val ssc = new StreamingContext( sc, Seconds(10)) // A DStream that will listen // for text on server:port val lines = ssc.socketTextStream(s, p) // Word Count... val words = lines flatMap { line => line.split("""\W+""") } Wednesday, June 10, 15 Next we set up a socket to stream text to us from another server and port (one of several ways to ingest data).
  • 79. // Word Count... val words = lines flatMap { line => line.split("""\W+""") } val pairs = words map ((_, 1)) val wordCounts = pairs reduceByKey ((i,j) => i+j) wordCounts.saveAsTextFiles(outpath) ssc.start() ssc.awaitTermination() Wednesday, June 10, 15 Now we “count words”. For each mini-batch (10 seconds’ worth of data), we split the input text into words (on non-word characters, which is too crude). Once we set up the flow, we start it and wait for it to terminate through some means, such as the server socket closing.
  • 80. // Word Count... val words = lines flatMap { line => line.split("""\W+""") } val pairs = words map ((_, 1)) val wordCounts = pairs reduceByKey ((i,j) => i+j) wordCounts.saveAsTextFiles(outpath) ssc.start() ssc.awaitTermination() Wednesday, June 10, 15 We count these words just like we counted (word, path) pairs earlier.
  • 81. val pairs = words map ((_, 1)) val wordCounts = pairs reduceByKey ((i,j) => i+j) wordCounts.saveAsTextFiles(outpath) ssc.start() ssc.awaitTermination() Wednesday, June 10, 15 print is a useful diagnostic tool that prints a header and the first 10 records to the console at each iteration.
  • 82. val pairs = words map ((_, 1)) val wordCounts = pairs reduceByKey ((i,j) => i+j) wordCounts.saveAsTextFiles(outpath) ssc.start() ssc.awaitTermination() Wednesday, June 10, 15 Now start the data flow and wait for it to terminate (possibly forever).
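The speaker notes mention moving windows of batches; here is a hedged sketch reusing the pairs DStream from the word-count example above. The window and slide durations are made up, but both must be multiples of the 10-second batch interval: counts are recomputed every 10 seconds over the last 60 seconds of mini-batches.

import org.apache.spark.streaming.Seconds

val windowedCounts = pairs.reduceByKeyAndWindow(
  (i: Int, j: Int) => i + j,  // combine counts that fall inside the window
  Seconds(60),                // window length
  Seconds(10))                // slide interval
windowedCounts.print()        // print the first records of each windowed batch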
  • 83. Machine Learning Library... Wednesday, June 10, 15 MLlib,  which  we  won’t  discuss  further.
  • 84. Distributed Graph Computing... Wednesday, June 10, 15 GraphX, which we won’t discuss further. Some problems are more naturally represented as graphs. Extends RDDs to support property graphs with directed edges.
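A tiny, hypothetical GraphX sketch (made-up vertices and edges, sc the existing SparkContext): a property graph is just two RDDs, one of (id, attribute) vertices and one of directed, attributed edges, and graph operators such as inDegrees are built on top of them.

import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 1L, "follows")))
val graph    = Graph(vertices, edges)
graph.inDegrees.collect().foreach { case (id, n) => println(s"vertex $id has in-degree $n") }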
  • 85. A flexible, scalable distributed compute platform with concise, powerful APIs and higher-order tools. spark.apache.org Spark Wednesday, June 10, 15
  • 86. polyglotprogramming.com/talks @deanwampler Wednesday, June 10, 15 Copyright (c) 2014-2015, Dean Wampler, All Rights Reserved, unless otherwise noted. Image: The London Eye on one side of the Thames, Parliament on the other.