Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014

Johnny Miller – Cassandra + Spark = Awesome

This talk discusses how Cassandra and Spark can work together to deliver real-time analytics. It is a technical discussion that introduces attendees to the basic principles of Cassandra and Spark, why they work well together, and example use cases.

Published in: Data & Analytics


  1. Training Day | December 3rd
     Beginner Track
     • Introduction to Cassandra
     • Introduction to Spark, Shark, Scala and Cassandra
     Advanced Track
     • Data Modeling
     • Performance Tuning
     Conference Day | December 4th
     Cassandra Summit Europe 2014 will be the single largest gathering of Cassandra users in Europe. Learn how the world's most successful companies are transforming their businesses and growing faster than ever using Apache Cassandra.
     http://bit.ly/cassandrasummit2014
  2. Cassandra + Spark = Awesome
     Johnny Miller, Solutions Architect
     @CyanMiller
     www.linkedin.com/in/johnnymiller
  3. Who is DataStax?
     Founded in April 2010
     DataStax by the numbers:
     • 500+ customers
     • 30% of the Fortune 100
     • 300+ employees
     • 38 countries worldwide
     • Powering critical systems
     [Figure: our investors]
     ©2014 DataStax Confidential. Do not distribute without consent.
  4. DataStax Enterprise
     www.datastax.com
  5. DataStax Enterprise is free for startups
     • Unlimited, free use of the software in DataStax Enterprise.
     • No limit on number of nodes or other hidden restrictions.
     • If you’re a startup, it’s free!
     www.datastax.com/startups
  7. What is Apache Cassandra?
     Apache Cassandra™ is a massively scalable NoSQL OLTP database. Cassandra is designed to handle big data workloads across multiple data centers with no single point of failure, providing enterprises with continuous availability without compromising performance.
     Cassandra is:
     • A highly distributed database
     • Low latency: very near real-time
     • 100% availability: no single point of failure (SPOF)
     • Highly scalable: linear scalability
     • Wide column store
     • Disk optimised
  8. What is Apache Cassandra?
     • Masterless architecture with read/write anywhere design.
     • Continuous availability with no single point of failure.
     • Multi-data center and cloud availability zone support.
     • Linear scale performance with online capacity expansion.
     • CQL: a SQL-like language.
     [Diagram: clusters of increasing node count, labelled 100,000, 200,000 and 400,000 txns/sec]
  9. Cassandra: A Leader in Performance
     “In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear increasing throughput.”
     Solving Big Data Challenges for Enterprise Application Performance Management, Tilmann Rabl, et al., August 2013, p. 10. Benchmark paper presented at the Very Large Database Conference, 2013.
     http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2013.pdf
     Netflix Cloud Benchmark…
     End Point Independent NoSQL Benchmark
     http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
     Highest in throughput… Lowest in latency…
  10. Cassandra Architecture Overview
      • Cassandra was designed with the understanding that system/hardware failures can and do occur
      • Peer-to-peer, distributed system
      • All nodes the same
      • Data partitioned among all nodes in the cluster
      • Custom data replication to ensure fault tolerance
      [Diagram: ring of five nodes, Node 1 to Node 5]
  11. Cassandra Architecture Overview
      • Multi data centre support out of the box
      • Configurable replication factor
      • Configurable data consistency per request
      • Active-Active replication architecture
      [Diagram: two five-node rings, DC: USA and DC: EU, with the 1st, 2nd and 3rd copies of a row placed on different nodes]
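
The partitioning, replication factor and per-request consistency points on slides 10 and 11 can be sketched in a few lines of Python. This is a toy model rather than Cassandra's actual partitioner or replication strategy: the node names, the MD5-based `token` function, and the `replicas_for` and `is_strongly_consistent` helpers are all assumptions made for illustration.

```python
import hashlib
from bisect import bisect_right

# Hypothetical five-node ring, mirroring the diagram on the slide.
NODES = ["node1", "node2", "node3", "node4", "node5"]


def token(value: str) -> int:
    """Map a string to a position on the ring (toy hash, not Cassandra's)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


# Each node owns one token; the ring is the nodes sorted by token.
RING = sorted((token(name), name) for name in NODES)


def replicas_for(partition_key: str, replication_factor: int) -> list:
    """Walk clockwise from the key's token and take the next RF nodes."""
    tokens = [t for t, _ in RING]
    start = bisect_right(tokens, token(partition_key)) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(replication_factor)]


def is_strongly_consistent(read_cl: int, write_cl: int, replication_factor: int) -> bool:
    """True when every read replica set must overlap every write replica set."""
    return read_cl + write_cl > replication_factor


replicas = replicas_for("Mighty Mutts", replication_factor=3)
quorum = 3 // 2 + 1  # a majority of 3 replicas
```

With a replication factor of 3, reading and writing at quorum (2 replicas each) satisfies `is_strongly_consistent`, while reading and writing a single replica does not; that trade-off is what "configurable data consistency per request" lets each query choose.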
  12. Cassandra Query Language
      CREATE TABLE sporty_league (
        team_name varchar,
        player_name varchar,
        jersey int,
        PRIMARY KEY (team_name, player_name)
      );

      SELECT * FROM sporty_league
      WHERE team_name = 'Mighty Mutts' AND player_name = 'Lucky';

      INSERT INTO sporty_league (team_name, player_name, jersey)
      VALUES ('Mighty Mutts', 'Felix', 90);
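
In the `sporty_league` table above, the first primary key column (`team_name`) is the partition key and the second (`player_name`) is a clustering column. A rough way to picture that layout is a dictionary of partitions, each holding rows ordered by clustering key. The Python sketch below is only a mental model under that assumption: the `insert` and `select` helpers and the jersey number for Lucky are hypothetical, not part of the talk.

```python
from collections import defaultdict

# partition key (team_name) -> clustering key (player_name) -> other columns
sporty_league = defaultdict(dict)


def insert(team_name: str, player_name: str, jersey: int) -> None:
    """Upsert a row: writing the same primary key again overwrites the row."""
    sporty_league[team_name][player_name] = {"jersey": jersey}


def select(team_name: str, player_name: str = None) -> list:
    """Return rows from one partition, ordered by clustering key."""
    partition = sporty_league.get(team_name, {})
    names = [player_name] if player_name is not None else sorted(partition)
    return [
        (team_name, name, partition[name]["jersey"])
        for name in names
        if name in partition
    ]


insert("Mighty Mutts", "Felix", 90)
insert("Mighty Mutts", "Lucky", 7)    # hypothetical jersey number
insert("Mighty Mutts", "Felix", 91)   # same primary key: replaces jersey 90

rows = select("Mighty Mutts")
```

Both queries touch a single partition, which is why the `WHERE` clause on the slide starts with `team_name`.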
  13. Adoption
      http://db-engines.com/en/ranking
      November 2014
  14. Performance & Scale
      DataStax works for small to huge deployments.
      • DataStax Enterprise footprint @ Netflix
        • 80+ clusters
        • 2500+ nodes
        • 4 data centres (Amazon regions)
        • > 1 trillion transactions per day
      See: http://www.datastax.com/resources/casestudies/netflix
  15. Cassandra Use Cases
      • Playlists/Collections
      • Personalisation/Recommendation
      • Messaging
      • Fraud Detection
      • Internet of Things/Sensor Data
      • Time Series
  16. Apache Spark
      • Distributed computing framework
      • Created at UC Berkeley's AMPLab in 2009
      • Open sourced in 2010; an Apache project since 2013
      • Solves problems Hadoop is bad at
        • Iterative algorithms
        • Interactive machine learning
      • More general purpose than MapReduce
      • Streaming!
  17. Fast
      [Chart: logistic regression performance]
  18. Components
      [Diagram: Shark or Spark SQL, Streaming, ML and Graph on top of Spark (general execution engine), over Cassandra; the diagram also carries the label "Compatible"]
  19. Analytics Workload Isolation
  20. Analytics High Availability
      • All nodes are Spark Workers
      • By default resilient to Worker failures
      • First Spark node promoted as Spark Master
      • Standby Master promoted on failure
      • Master HA available in DataStax Enterprise
  21. API
      map, reduce
  22. API
      map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
  23. API
      • Resilient Distributed Datasets
        • Collections of objects spread across a cluster, stored in RAM or on disk
        • Built through parallel transformations
        • Automatically rebuilt upon failure
      • Operations
        • Transformations (e.g. map, filter, groupBy)
        • Actions (e.g. count, collect, save)
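
The split between transformations and actions, and the idea of rebuilding data from its lineage, can be illustrated without a cluster. The `ToyRDD` class below is a hypothetical single-machine stand-in written for this transcript, not Spark's API: transformations only record what to do, and actions replay that record from the source.

```python
class ToyRDD:
    """A tiny, single-machine stand-in for an RDD (illustration only)."""

    def __init__(self, source, lineage=()):
        self._source = source      # callable that produces the base data
        self._lineage = lineage    # recorded transformations, applied lazily

    # Transformations: return a new ToyRDD and compute nothing yet.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        """Replay the lineage from the source, as a rebuild after a failure would."""
        data = iter(self._source())
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: trigger the computation and return a result.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []


def load():
    calls.append("load")           # record every time the source is read
    return range(1, 11)


evens_squared = ToyRDD(load).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
reads_before_action = len(calls)   # transformations alone have read nothing
result = evens_squared.collect()   # the action triggers the actual work
```

Because nothing is cached in this sketch, every action re-reads the source; Spark's `cache()` exists to avoid exactly that when a dataset is reused.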
  24. A Quick Comparison to Hadoop
      [Diagram: HDFS, map(), reduce(), map(), reduce()]
  25. A Quick Comparison to Hadoop
      [Diagram: the chain of HDFS, map(), reduce(), map(), reduce() beside a flow over Data Source 1 and Data Source 2 using map(), join(), cache(), transform(), transform()]
  26. Word Count Example
      package org.myorg;

      import java.io.IOException;
      import java.util.*;

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.conf.*;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapred.*;
      import org.apache.hadoop.util.*;

      public class WordCount {

        public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              output.collect(word, one);
            }
          }
        }

        public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");

          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);

          conf.setMapperClass(Map.class);
          conf.setCombinerClass(Reduce.class);
          conf.setReducerClass(Reduce.class);

          conf.setInputFormat(TextInputFormat.class);
          conf.setOutputFormat(TextOutputFormat.class);

          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));

          JobClient.runJob(conf);
        }
      }
  27. Word Count Example
      (The Hadoop MapReduce Java listing from slide 26 is shown again, next to the Spark version.)

      file = spark.textFile("hdfs://...")
      counts = file.flatMap(lambda line: line.split(" ")) \
                   .map(lambda word: (word, 1)) \
                   .reduceByKey(lambda a, b: a + b)
      counts.saveAsTextFile("hdfs://...")
  28. Word Count Example
      (The Hadoop MapReduce Java listing from slide 26 and the Spark listing from slide 27 are shown again.)
      10x to 100x the speed of MapReduce
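
The Spark word count on the slides chains three steps: `flatMap`, `map` and `reduceByKey`. To see what each step does without a cluster, the sketch below reproduces them on an ordinary Python list. The `flat_map` and `reduce_by_key` helpers and the two input lines are assumptions made for illustration; they mimic the behaviour of the Spark operations on one machine and are not Spark code.

```python
from itertools import chain


def flat_map(fn, items):
    """Apply fn to every item and flatten the resulting lists into one stream."""
    return chain.from_iterable(fn(item) for item in items)


def reduce_by_key(fn, pairs):
    """Combine the values of each key with fn, in the spirit of reduceByKey."""
    result = {}
    for key, value in pairs:
        result[key] = fn(result[key], value) if key in result else value
    return result


lines = ["to be or not to be", "that is the question"]  # hypothetical input

words = flat_map(lambda line: line.split(" "), lines)   # one word per element
pairs = map(lambda word: (word, 1), words)              # (word, 1) pairs
counts = reduce_by_key(lambda a, b: a + b, pairs)       # sum the 1s per word
```

The three lambdas are the same ones used in the Spark listing; only the collection they run on differs.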
  29. Spark / Shark Benchmark
  30. Spark Streaming
  31. Spark Streaming
      • Scales to 100s of nodes
      • High performance streaming
      • In-memory processing
      • Data processed in small batches
      • Designed to be fault tolerant
      • Keeps state in low-level data abstractions that can be rebuilt after a fault
  32. Spark Streaming
      • Spark's primary data abstractions
        • Resilient Distributed Dataset (RDD)
          • Immutable collection of elements that can be processed in parallel
          • An RDD can be reconstructed from source in case of node failures
        • Discretized Stream (DStream)
          • Continuous stream of RDDs
  33. Spark Streaming
      • Micro batching (each batch represented as an RDD)
      • Fault tolerant
      • Exactly-once processing
      • Unified stream and batch processing framework
      • Supports Kafka, Flume, ZeroMQ, Kinesis and MQTT producers
      [Diagram: Data Stream, DStream, RDD]
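
Micro batching, as described on slides 31 to 33, means cutting the incoming stream into small fixed-width batches and running an ordinary batch job on each one. The Python sketch below shows that idea on a handful of timestamped lines. It is a simplified illustration, not Spark Streaming: the `micro_batches` and `word_counts` helpers and the sample events are hypothetical.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into fixed-width batches by arrival time."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_seconds)].append(value)
    return [batches[index] for index in sorted(batches)]


def word_counts(batch):
    """The per-batch job: the same word count a batch program would run."""
    return Counter(word for line in batch for word in line.split())


# Hypothetical stream: (arrival time in seconds, line of text).
events = [
    (0.2, "spark streaming"),
    (0.7, "spark"),
    (1.1, "cassandra"),
    (2.5, "spark cassandra"),
]

per_batch = [word_counts(batch) for batch in micro_batches(events, batch_seconds=1)]
```

One simplification to note: this sketch skips intervals in which nothing arrived, whereas a real stream would still produce an (empty) batch for them.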
  34. Spark Streaming Example
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import com.datastax.spark.connector.streaming._

      // Spark connection options
      val conf = new SparkConf(true)...

      // streaming with 1 second batch window
      val ssc = new StreamingContext(conf, Seconds(1))

      // stream input (serverIP and serverPort are defined elsewhere)
      val lines = ssc.socketTextStream(serverIP, serverPort)

      // count words
      val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

      // stream output: write each batch to keyspace "test", table "words"
      wordCounts.saveToCassandra("test", "words")

      // start processing
      ssc.start()
      ssc.awaitTermination()
  35. Spark SQL
      • SQL-92 and HiveQL compatible query engine
      • Currently only SELECT and INSERT queries
      • Support for in-memory computation
      • Pushdown of predicates to Cassandra when possible
  36. Spark SQL and HQL Example
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.cassandra.CassandraSQLContext
      import com.datastax.spark.connector._

      // Connect to the Spark cluster
      val conf = new SparkConf(true)...
      val sc = new SparkContext(conf)

      // Create Cassandra SQL context
      val cc = new CassandraSQLContext(sc)

      // Execute SQL query
      val rdd1 = cc.sql("INSERT INTO ks.t1 SELECT c1,c2 FROM ks.t2")

      // Execute HQL query (named rdd2 so the two vals do not clash)
      val rdd2 = cc.hql("SELECT * FROM keyspace.table JOIN ... WHERE ...")
  37. Spark
      • The next big thing!
      • Simple to use
      • Works great with Cassandra
      • Fast distributed processing: faster than MapReduce
      • Streaming
      • Machine Learning
        • Classification, Collaborative filtering, Clustering, Optimization
  38. Real-time Big Data!
      [Diagram labels: Data Enrichment, Data, Pre-computed aggregates, NO ETL, Batch Processing, Machine Learning]
  39. Real-Time Big Data Use Cases
      • Recommendation Engine
      • Internet of Things
      • Fraud Detection
      • Risk Analysis
      • Buyer Behaviour Analytics
      • Telematics, Logistics
      • Business Intelligence
      • Infrastructure Monitoring
  40. Partnership
  41. How to use Spark with Cassandra?
      • DataStax Cassandra Spark driver
        • Open source: https://github.com/datastax/cassandra-driver-spark
      • DataStax Enterprise Analytics
  42. Thank You
      We power the big data apps that transform business.
