SlideShare ist ein Scribd-Unternehmen logo
1 von 22
Downloaden Sie, um offline zu lesen
Lightning-fast analytics with Spark and Cassandra 
Nick Bailey 
@nickmbailey 
©2014 DataStax Confidential. Do not distribute without consent. 
1
What is Spark? 
*Apache Project since 2010" 
*Fast" 
* 10x-100x faster than Hadoop MapReduce" 
* In-memory storage" 
* Single JVM process per node" 
*Easy" 
*Rich Scala, Java and Python APIs" 
* 2x-5x less code" 
* Interactive shell 
Analytic 
Analytic 
Search
API 
map reduce
API 
map" 
filter" 
groupBy 
sort" 
union" 
join" 
leftOuterJoin 
rightOuterJoin 
reduce" 
count" 
fold" 
reduceByKey 
groupByKey 
cogroup 
cross" 
zip 
sample" 
take" 
first 
partitionBy 
mapWith 
pipe" 
save 
...
API 
*Resilient Distributed Datasets (RDD)" 
*Collections of objects spread across a cluster, stored in RAM or on Disk" 
* Built through parallel transformations" 
* Automatically rebuilt on failure" 
" 
*Operations" 
* Transformations (e.g. map, filter, groupBy)" 
* Actions (e.g. count, collect, save)
Operator Graph: Optimization and Fault Tolerance 
join 
groupBy 
filter 
Stage 3 
Stage 1 
Stage 2 
A: B: 
C: D: E: 
F: 
map 
= RDD = Cached partition
Fast 
Running Time (s) 
4000 
3000 
2000 
1000 
0 
1 5 10 20 30 
Number of Iterations 
110 sec / iteration 
Hadoop Spark 
first iteration 80 sec" 
further iterations 1 sec 
* Logistic Regression Performance"
Why Spark on Cassandra? 
*Data model independent queries" 
*Cross-table operations (JOIN, UNION, etc.)" 
*Complex analytics (e.g. machine learning)" 
*Data transformation, aggregation, etc." 
*Stream processing
How to Spark on Cassandra? 
*DataStax Cassandra Spark driver" 
*Open source: https://github.com/datastax/spark-cassandra-connector" 
*Compatible with" 
* Spark 0.9+" 
*Cassandra 2.0+" 
*DataStax Enterprise 4.5+
Cassandra Spark Driver 
*Cassandra tables exposed as Spark RDDs" 
*Read from and write to Cassandra" 
*Mapping of C* tables and rows to Scala objects" 
*All Cassandra types supported and converted to Scala types" 
*Server side data selection" 
*Spark Streaming support" 
*Scala and Java support
Connecting to Cassandra 
// Import Cassandra-specific functions on SparkContext and RDD objects" 
import com.datastax.driver.spark._" 
" 
" 
// Spark connection options" 
val conf = new SparkConf(true)" 
! ! .setMaster("spark://192.168.123.10:7077")" 
! ! .setAppName("cassandra-demo")" 
.set(“cassandra.connection.host", "192.168.123.10") // initial 
contact! 
.set("cassandra.username", "cassandra")! 
.set("cassandra.password", "cassandra") " 
" 
val sc = new SparkContext(conf)
Accessing Data 
CREATE TABLE test.words (word text PRIMARY KEY, count int);" 
" 
INSERT INTO test.words (word, count) VALUES ('bar', 30);" 
INSERT INTO test.words (word, count) VALUES ('foo', 20); 
*Accessing table above as RDD: 
// Use table as RDD" 
val rdd = sc.cassandraTable("test", "words")" 
// rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]" 
" 
rdd.toArray.foreach(println)" 
// CassandraRow[word: bar, count: 30]! 
// CassandraRow[word: foo, count: 20]" 
" 
rdd.columnNames // Stream(word, count) " 
rdd.size // 2" 
" 
val firstRow = rdd.first // firstRow: CassandraRow = CassandraRow[word: bar, 
count: 30]! 
firstRow.getInt("count") // Int = 30
Saving Data 
val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))" 
// newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]" 
" 
newRdd.saveToCassandra("test", "words", Seq("word", "count")) 
*RDD above saved to Cassandra: 
SELECT * FROM test.words;" 
" 
word | count" 
------+-------" 
bar | 30" 
foo | 20" 
cat | 40" 
fox | 50" 
" 
(4 rows)
Type Mapping 
CQL Type Scala Type 
ascii String 
bigint Long 
boolean Boolean 
counter Long 
decimal BigDecimal, java.math.BigDecimal 
double Double 
float Float 
inet java.net.InetAddress 
int Int 
list Vector, List, Iterable, Seq, IndexedSeq, java.util.List 
map Map, TreeMap, java.util.HashMap 
set Set, TreeSet, java.util.HashSet 
text, varchar String 
timestamp Long, java.util.Date, java.sql.Date, org.joda.time.DateTime 
timeuuid java.util.UUID 
uuid java.util.UUID 
varint BigInt, java.math.BigInteger 
*nullable values Option
Mapping Rows to Objects 
*Mapping rows to Scala Case Classes" 
*CQL underscore case column mapped to Scala camel case 
property" 
*Custom mapping functions (see docs) 
CREATE TABLE test.cars (" 
! id text PRIMARY KEY," 
! model text," 
! fuel_type text," 
! year int! 
); 
case class Vehicle(" 
! id: String," 
! model: String," 
! fuelType: String," 
! year: Int! 
)" 
" 
sc.cassandraTable[Vehicle]("test", "cars").toArray! 
//Array(Vehicle(KF334L, Ford Mondeo, Petrol, 2009)," 
// Vehicle(MT8787, Hyundai x35, Diesel, 2011) 
!
Server Side Data Selection 
*Reduce the amount of data transferred" 
*Selecting columns 
sc.cassandraTable("test", 
"users").select("username").toArray.foreach(println)" 
// CassandraRow{username: john} ! 
// CassandraRow{username: tom} 
" 
*Selecting rows (by clustering columns and/or secondary indexes) 
sc.cassandraTable("test", "cars").select("model").where("color = ?", 
"black").toArray.foreach(println)" 
// CassandraRow{model: Ford Mondeo}! 
// CassandraRow{model: Hyundai x35}
Spark SQL 
Compatible 
Spark SQL Streaming ML 
Spark (General execution engine) 
Graph 
Cassandra
Spark SQL 
*SQL query engine on top of Spark" 
*Hive compatible (JDBC, UDFs, types, metadata, etc.)" 
*Support for in-memory processing" 
*Pushdown of predicates to Cassandra when possible
Spark SQL Example 
" 
import com.datastax.spark.connector._" 
" 
// Connect to the Spark cluster" 
val conf = new SparkConf(true)...! 
val sc = new SparkContext(conf)" 
" 
// Create Cassandra SQL context! 
val cc = new CassandraSQLContext(sc)! 
" 
// Execute SQL query" 
val rdd = cc.sql("SELECT * FROM keyspace.table WHERE ...”)"
Analytics Workload Isolation 
Cassandra" 
+ Spark DC 
Cassandra" 
Only DC 
Online 
App 
Analytical 
App 
Mixed Load Cassandra Cluster
Analytics High Availability 
*Spark Workers run on all Cassandra nodes" 
*Workers are resilient by default" 
*First Spark node promoted as Spark Master" 
*Standby Master promoted on failure" 
*Master HA available in DataStax Enterprise 
Spark Master 
Spark Standby Master 
Spark Worker
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayAnalytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and Spark
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
 
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
 
Introduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and HadoopIntroduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and Hadoop
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
 
Spark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousSpark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 Furious
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
 
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
 
Heuritech: Apache Spark REX
Heuritech: Apache Spark REXHeuritech: Apache Spark REX
Heuritech: Apache Spark REX
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long version
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
 
Intro to py spark (and cassandra)
Intro to py spark (and cassandra)Intro to py spark (and cassandra)
Intro to py spark (and cassandra)
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into Cassandra
 
Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0
 
Escape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsEscape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* Ops
 

Andere mochten auch

Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series data
Patrick McFadin
 

Andere mochten auch (20)

Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series data
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
 
Transform & Analyze Time Series Data via Apache Spark @Windward
Transform & Analyze Time Series Data via Apache Spark @WindwardTransform & Analyze Time Series Data via Apache Spark @Windward
Transform & Analyze Time Series Data via Apache Spark @Windward
 
Building Scalable IoT Apps (QCon S-F)
Building Scalable IoT Apps (QCon S-F)Building Scalable IoT Apps (QCon S-F)
Building Scalable IoT Apps (QCon S-F)
 
CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...
CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...
CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...
 
Data Modeling with Cassandra and Time Series Data
Data Modeling with Cassandra and Time Series DataData Modeling with Cassandra and Time Series Data
Data Modeling with Cassandra and Time Series Data
 
Spark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til Piffl
 
Time Series Analysis with Spark
Time Series Analysis with SparkTime Series Analysis with Spark
Time Series Analysis with Spark
 
Cassandra 3.0
Cassandra 3.0Cassandra 3.0
Cassandra 3.0
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache Spark
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
 
Cassandra 3 new features 2016
Cassandra 3 new features 2016Cassandra 3 new features 2016
Cassandra 3 new features 2016
 
Why Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data WorldWhy Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data World
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
 
Spark & Zeppelin을 활용한 머신러닝 실전 적용기
Spark & Zeppelin을 활용한 머신러닝 실전 적용기Spark & Zeppelin을 활용한 머신러닝 실전 적용기
Spark & Zeppelin을 활용한 머신러닝 실전 적용기
 
Time Series Analysis with Spark by Sandy Ryza
Time Series Analysis with Spark by Sandy RyzaTime Series Analysis with Spark by Sandy Ryza
Time Series Analysis with Spark by Sandy Ryza
 
[D2 COMMUNITY] Spark User Group - 머신러닝 인공지능 기법
[D2 COMMUNITY] Spark User Group - 머신러닝 인공지능 기법[D2 COMMUNITY] Spark User Group - 머신러닝 인공지능 기법
[D2 COMMUNITY] Spark User Group - 머신러닝 인공지능 기법
 
Apache Spark 입문에서 머신러닝까지
Apache Spark 입문에서 머신러닝까지Apache Spark 입문에서 머신러닝까지
Apache Spark 입문에서 머신러닝까지
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
Spark overview 이상훈(SK C&C)_스파크 사용자 모임_20141106
Spark overview 이상훈(SK C&C)_스파크 사용자 모임_20141106Spark overview 이상훈(SK C&C)_스파크 사용자 모임_20141106
Spark overview 이상훈(SK C&C)_스파크 사용자 모임_20141106
 

Ähnlich wie Lightning fast analytics with Spark and Cassandra

Apache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William BentonApache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William Benton
Databricks
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
jlacefie
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Duyhai Doan
 

Ähnlich wie Lightning fast analytics with Spark and Cassandra (20)

Lightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and SparkLightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and Spark
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
 
Apache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William BentonApache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William Benton
 
Escape from Hadoop
Escape from HadoopEscape from Hadoop
Escape from Hadoop
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
 
An Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark MeetupAn Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark Meetup
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
Declarative & workflow based infrastructure with Terraform
Declarative & workflow based infrastructure with TerraformDeclarative & workflow based infrastructure with Terraform
Declarative & workflow based infrastructure with Terraform
 

Mehr von nickmbailey

An Introduction to Cassandra on Linux
An Introduction to Cassandra on LinuxAn Introduction to Cassandra on Linux
An Introduction to Cassandra on Linux
nickmbailey
 
Introduction to Cassandra and Data Modeling
Introduction to Cassandra and Data ModelingIntroduction to Cassandra and Data Modeling
Introduction to Cassandra and Data Modeling
nickmbailey
 
CFS: Cassandra backed storage for Hadoop
CFS: Cassandra backed storage for HadoopCFS: Cassandra backed storage for Hadoop
CFS: Cassandra backed storage for Hadoop
nickmbailey
 

Mehr von nickmbailey (9)

Clojure at DataStax: The Long Road From Python to Clojure
Clojure at DataStax: The Long Road From Python to ClojureClojure at DataStax: The Long Road From Python to Clojure
Clojure at DataStax: The Long Road From Python to Clojure
 
Introduction to Cassandra Architecture
Introduction to Cassandra ArchitectureIntroduction to Cassandra Architecture
Introduction to Cassandra Architecture
 
Cassandra and Spark
Cassandra and SparkCassandra and Spark
Cassandra and Spark
 
Cassandra and Clojure
Cassandra and ClojureCassandra and Clojure
Cassandra and Clojure
 
Introduction to Cassandra Basics
Introduction to Cassandra BasicsIntroduction to Cassandra Basics
Introduction to Cassandra Basics
 
An Introduction to Cassandra on Linux
An Introduction to Cassandra on LinuxAn Introduction to Cassandra on Linux
An Introduction to Cassandra on Linux
 
Introduction to Cassandra and Data Modeling
Introduction to Cassandra and Data ModelingIntroduction to Cassandra and Data Modeling
Introduction to Cassandra and Data Modeling
 
CFS: Cassandra backed storage for Hadoop
CFS: Cassandra backed storage for HadoopCFS: Cassandra backed storage for Hadoop
CFS: Cassandra backed storage for Hadoop
 
Clojure and the Web
Clojure and the WebClojure and the Web
Clojure and the Web
 

Kürzlich hochgeladen

CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
anilsa9823
 

Kürzlich hochgeladen (20)

Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 

Lightning fast analytics with Spark and Cassandra

  • 1. Lightning-fast analytics with Spark and Cassandra Nick Bailey @nickmbailey ©2014 DataStax Confidential. Do not distribute without consent. 1
  • 2. What is Spark? *Apache Project since 2010" *Fast" * 10x-100x faster than Hadoop MapReduce" * In-memory storage" * Single JVM process per node" *Easy" *Rich Scala, Java and Python APIs" * 2x-5x less code" * Interactive shell Analytic Analytic Search
  • 4. API map" filter" groupBy sort" union" join" leftOuterJoin rightOuterJoin reduce" count" fold" reduceByKey groupByKey cogroup cross" zip sample" take" first partitionBy mapWith pipe" save ...
  • 5. API *Resilient Distributed Datasets (RDD)" *Collections of objects spread across a cluster, stored in RAM or on Disk" * Built through parallel transformations" * Automatically rebuilt on failure" " *Operations" * Transformations (e.g. map, filter, groupBy)" * Actions (e.g. count, collect, save)
  • 6. Operator Graph: Optimization and Fault Tolerance join groupBy filter Stage 3 Stage 1 Stage 2 A: B: C: D: E: F: map = RDD = Cached partition
  • 7. Fast Running Time (s) 4000 3000 2000 1000 0 1 5 10 20 30 Number of Iterations 110 sec / iteration Hadoop Spark first iteration 80 sec" further iterations 1 sec * Logistic Regression Performance"
  • 8. Why Spark on Cassandra? *Data model independent queries" *Cross-table operations (JOIN, UNION, etc.)" *Complex analytics (e.g. machine learning)" *Data transformation, aggregation, etc." *Stream processing
  • 9. How to Spark on Cassandra? *DataStax Cassandra Spark driver" *Open source: https://github.com/datastax/spark-cassandra-connector" *Compatible with" * Spark 0.9+" *Cassandra 2.0+" *DataStax Enterprise 4.5+
  • 10. Cassandra Spark Driver *Cassandra tables exposed as Spark RDDs" *Read from and write to Cassandra" *Mapping of C* tables and rows to Scala objects" *All Cassandra types supported and converted to Scala types" *Server side data selection" *Spark Streaming support" *Scala and Java support
  • 11. Connecting to Cassandra // Import Cassandra-specific functions on SparkContext and RDD objects" import com.datastax.driver.spark._" " " // Spark connection options" val conf = new SparkConf(true)" ! ! .setMaster("spark://192.168.123.10:7077")" ! ! .setAppName("cassandra-demo")" .set(“cassandra.connection.host", "192.168.123.10") // initial contact! .set("cassandra.username", "cassandra")! .set("cassandra.password", "cassandra") " " val sc = new SparkContext(conf)
  • 12. Accessing Data CREATE TABLE test.words (word text PRIMARY KEY, count int);" " INSERT INTO test.words (word, count) VALUES ('bar', 30);" INSERT INTO test.words (word, count) VALUES ('foo', 20); *Accessing table above as RDD: // Use table as RDD" val rdd = sc.cassandraTable("test", "words")" // rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]" " rdd.toArray.foreach(println)" // CassandraRow[word: bar, count: 30]! // CassandraRow[word: foo, count: 20]" " rdd.columnNames // Stream(word, count) " rdd.size // 2" " val firstRow = rdd.first // firstRow: CassandraRow = CassandraRow[word: bar, count: 30]! firstRow.getInt("count") // Int = 30
  • 13. Saving Data val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))" // newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]" " newRdd.saveToCassandra("test", "words", Seq("word", "count")) *RDD above saved to Cassandra: SELECT * FROM test.words;" " word | count" ------+-------" bar | 30" foo | 20" cat | 40" fox | 50" " (4 rows)
  • 14. Type Mapping CQL Type Scala Type ascii String bigint Long boolean Boolean counter Long decimal BigDecimal, java.math.BigDecimal double Double float Float inet java.net.InetAddress int Int list Vector, List, Iterable, Seq, IndexedSeq, java.util.List map Map, TreeMap, java.util.HashMap set Set, TreeSet, java.util.HashSet text, varchar String timestamp Long, java.util.Date, java.sql.Date, org.joda.time.DateTime timeuuid java.util.UUID uuid java.util.UUID varint BigInt, java.math.BigInteger *nullable values Option
  • 15. Mapping Rows to Objects *Mapping rows to Scala Case Classes" *CQL underscore case column mapped to Scala camel case property" *Custom mapping functions (see docs) CREATE TABLE test.cars (" ! id text PRIMARY KEY," ! model text," ! fuel_type text," ! year int! ); case class Vehicle(" ! id: String," ! model: String," ! fuelType: String," ! year: Int! )" " sc.cassandraTable[Vehicle]("test", "cars").toArray! //Array(Vehicle(KF334L, Ford Mondeo, Petrol, 2009)," // Vehicle(MT8787, Hyundai x35, Diesel, 2011) !
  • 16. Server Side Data Selection *Reduce the amount of data transferred" *Selecting columns sc.cassandraTable("test", "users").select("username").toArray.foreach(println)" // CassandraRow{username: john} ! // CassandraRow{username: tom} " *Selecting rows (by clustering columns and/or secondary indexes) sc.cassandraTable("test", "cars").select("model").where("color = ?", "black").toArray.foreach(println)" // CassandraRow{model: Ford Mondeo}! // CassandraRow{model: Hyundai x35}
  • 17. Spark SQL Compatible Spark SQL Streaming ML Spark (General execution engine) Graph Cassandra
  • 18. Spark SQL *SQL query engine on top of Spark" *Hive compatible (JDBC, UDFs, types, metadata, etc.)" *Support for in-memory processing" *Pushdown of predicates to Cassandra when possible
  • 19. Spark SQL Example " import com.datastax.spark.connector._" " // Connect to the Spark cluster" val conf = new SparkConf(true)...! val sc = new SparkContext(conf)" " // Create Cassandra SQL context! val cc = new CassandraSQLContext(sc)! " // Execute SQL query" val rdd = cc.sql("SELECT * FROM keyspace.table WHERE ...”)"
  • 20. Analytics Workload Isolation Cassandra" + Spark DC Cassandra" Only DC Online App Analytical App Mixed Load Cassandra Cluster
  • 21. Analytics High Availability *Spark Workers run on all Cassandra nodes" *Workers are resilient by default" *First Spark node promoted as Spark Master" *Standby Master promoted on failure" *Master HA available in DataStax Enterprise Spark Master Spark Standby Master Spark Worker