In this talk we will show how large-scale data analytics can be done with Spark and Cassandra on the DataStax Enterprise platform. First we will give an overview of the Spark Cassandra Connector and how it enables working with large data sets. Then we will use the Spark Notebook to show live examples, in the browser, of interacting with the data. The examples will load a large movies database from Cassandra into Spark and then show how that data can be transformed and analyzed using Spark.
6. What is Spark?
• Fast and general compute engine for large-scale data processing
• Fault-Tolerant Distributed Datasets
• Distributed Transformations on Datasets
• Integrated Batch, Iterative and Streaming Analysis
• In-Memory Storage with Spill-over to Disk
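As a minimal Scala sketch of those bullets, assuming an existing SparkContext sc (the dataset here is illustrative):

import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 1000000)      // a fault-tolerant distributed dataset (RDD)
val evens = data.filter(_ % 2 == 0)          // a distributed transformation on the dataset
evens.persist(StorageLevel.MEMORY_AND_DISK)  // in-memory storage with spill-over to disk
println(evens.count())                       // an action triggers the actual computation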
9. Spark Components
• Application (Spark Driver) - your application code, which creates the SparkContext; hosts the Application UI on :4040
• Spark Master - a process which manages the resources of the Spark cluster; hosts the Spark Master UI on :7080
• Worker - a process on each node which shells out to create an Executor JVM
• These processes are all separate and require networking to communicate
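A hedged sketch of the driver side of this picture (the app name and master host are hypothetical; :7077 is the usual master RPC port, distinct from the UI ports above):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MoviesAnalytics")          // appears in the Application UI on :4040
  .setMaster("spark://master-host:7077")  // the Spark Master process (its UI is on :7080)
val sc = new SparkContext(conf)           // the driver; Workers spawn Executor JVMs for it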
11. Spark is about Data Analytics
• How do we get data into Spark?
• How can we work with large datasets?
• What do we do with the results of the analytics?
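The Spark Cassandra Connector answers all three. A minimal sketch, assuming a hypothetical movies_ks keyspace and the table names used here:

import com.datastax.spark.connector._

// Get data into Spark: read a Cassandra table as an RDD
val movies = sc.cassandraTable("movies_ks", "movies")
// Work with the large dataset: transformations run distributed across the cluster
val countsByYear = movies.map(row => (row.getInt("year"), 1)).reduceByKey(_ + _)
// Do something with the results: write them back to Cassandra
countsByYear.saveToCassandra("movies_ks", "movies_by_year", SomeColumns("year", "movie_count"))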
19. DSE 4.7 Analytics + Search
• Allows Analytics Jobs to use Solr Queries
• Allows searching for data across partitions
• Example:
import com.datastax.spark.connector._  // enables sc.cassandraTable

val table = sc.cassandraTable("music", "solr")
// The solr_query predicate is pushed down to DSE Search
val result = table.select("id", "artist_name").where("solr_query = 'artist_name:Miles*'").collect()
39. Imperative Code
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final List<Integer> numbers = Arrays.asList(1, 2, 3);
// Must be mutable: Collections.emptyList() is immutable, so add() would throw
final List<Integer> numbersPlusOne = new ArrayList<>();
for (final Integer number : numbers) {
    numbersPlusOne.add(number + 1);
}
40. We Want Declarative Code
• Remove temporary lists - List<Integer> numbersPlusOne = new ArrayList<>();
• Remove looping - for (Integer number : numbers)
• Focus on what - x + 1
43. Java 8
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final List<Integer> numbers = Arrays.asList(1, 2, 3);
final List<Integer> numbersPlusOne = numbers.stream()
    .map(number -> number + 1)
    .collect(Collectors.toList());
44. Scala
val numbers = 1 to 20
val incFunc = (x: Int) => x + 1
numbers.map(incFunc)     // pass a named function value
numbers.map(x => x + 1)  // inline anonymous function
numbers.map(_ + 1)       // placeholder syntax
45. SQL - Declarative or Imperative?
• SQL is declarative
• It states what operation to perform, not how to perform it
• A SELECT defines just what data we want, not how to fetch it
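A hedged Scala sketch of the same idea through Spark SQL on Cassandra (CassandraSQLContext comes from the connector of that era; the keyspace and table reuse the earlier example):

import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sc)
// Declarative: we state what rows we want, not how to scan for them
val rows = cc.sql("SELECT id, artist_name FROM music.solr WHERE artist_name LIKE 'Miles%'")
rows.collect().foreach(println)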
46. Closures vs Functions?
• A closure is a function which closes over its surrounding context
• Closures can access variables in the surrounding context
• A Spark job passes closures to the cluster to operate on the data
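A minimal Scala sketch of the distinction (sc is assumed; bonus is a hypothetical captured variable):

val addOne = (x: Int) => x + 1        // a plain function: uses only its argument

val bonus = 10                        // a variable in the surrounding context
val addBonus = (x: Int) => x + bonus  // a closure: it closes over bonus

// A Spark job serializes the closure, captured bonus included,
// and ships it to the executors holding the data
sc.parallelize(1 to 5).map(addBonus).collect()  // Array(11, 12, 13, 14, 15)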
48. Why Functions with Spark?
• Declarative programming - define what, not how
• Define what operations to perform and Spark figures out how to operate on the data
• Easy to handle events and async results with functional callbacks
• Avoid the boilerplate of anonymous inner classes
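A short sketch of that style (the input path and names are hypothetical): we declare what to compute, and Spark plans the distributed execution.

val titles = sc.textFile("/data/movie_titles.txt")  // hypothetical input file
val longTitles = titles
  .map(_.trim)            // what: normalize each line
  .filter(_.length > 20)  // what: keep only long titles
// Spark decides how: partitioning, scheduling, locality, retries on failure
longTitles.take(10).foreach(println)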