SlideShare ist ein Scribd-Unternehmen logo
1 von 33
Downloaden Sie, um offline zu lesen
Cassandra and Spark:
Optimizing for Data
Locality
Russell Spitzer

Software Engineer @ DataStax
Lex Luther Was Right:
Location is Important
The value of many things is based upon it's
location. Developed land near the beach is
valuable but desert and farmland is generally
much cheaper. Unfortunately moving land is
generally impossible.
Spark Summit
or lake, or swamp or whatever body of water
is "Data" at the time this slide is viewed.
Lex Luther Was Wrong:
Don't ETL the Data Ocean
Spark Summit
My House
Spark is Our Hero Giving us the
Ability to Do Our Analytics Without the ETL
PARK
Moving Data Between Machines Is Expensive
Do Work Where the Data Lives!
Our Cassandra Nodes are like Cities and our Spark Executors
are like Super Heroes. We'd rather they spend their time locally
rather than flying back and forth all the time.
Moving Data Between Machines Is Expensive
Do Work Where the Data Lives!
Our Cassandra Nodes are like Cities and our Spark Executors
are like Super Heroes. We'd rather they spend their time locally
rather than flying back and forth all the time.
Metropolis
Superman
Gotham
Batman
Spark Executors
DataStax Open Source
Spark Cassandra Connector is Available on Github
https://github.com/datastax/spark-cassandra-connector
• Compatible with Spark 1.3
• Read and Map C* Data Types
• Saves To Cassandra
• Intelligent write batching
• Supports Collections
• Secondary index pushdown
• Arbitrary CQL Execution!
How the Spark Cassandra Connector
Reads Data Node Local
Cassandra Locates a Row Based on
Partition Key and Token Range
All of the rows in a Cassandra Cluster
are stored based based on their
location in the Token Range.
Metropolis
Gotham
Coast City
Each of the Nodes in a 

Cassandra Cluster is primarily
responsible for one set of
Tokens.
0999
500
Cassandra Locates a Row Based on
Partition Key and Token Range
Metropolis
Gotham
Coast City
Each of the Nodes in a 

Cassandra Cluster is primarily
responsible for one set of
Tokens.
0999
500
750 - 99
350 - 749
100 - 349
Cassandra Locates a Row Based on
Partition Key and Token Range
Jacek 514 Red
The CQL Schema designates
at least one column to be the
Partition Key.
Metropolis
Gotham
Coast City
Cassandra Locates a Row Based on
Partition Key and Token Range
Jacek 514 Red
The hash of the Partition Key
tells us where a row
should be stored.
Metropolis
Gotham
Coast City
Cassandra Locates a Row Based on
Partition Key and Token Range
830
With VNodes the ranges are
not contiguous but the same
mechanism controls row
location.
Jacek 514 Red
Metropolis
Gotham
Coast City
Cassandra Locates a Row Based on
Partition Key and Token Range
830
Loading Huge Amounts of Data
Table Scans involve loading most of the data in Cassandra
Cassandra RDD Use the Token Range to Create Node
Local Spark Partitions
sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra)
Token Ranges
spark.cassandra.input.split.size
The (estimated) number of C* Partitions to be placed in a Spark Partition
Cassandra RDD Use the Token Range to Create Node
Local Spark Partitions
sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra)
Token Ranges
spark.cassandra.input.split.size
The (estimated) number of C* Partitions to be placed in a Spark Partition
CassandraRDD
Spark
Partition
Cassandra RDD Use the Token Range to Create Node
Local Spark Partitions
sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra)
Token Ranges
spark.cassandra.input.split.size
The (estimated) number of C* Partitions to be placed in a Spark Partition
CassandraRDD
Cassandra RDD Use the Token Range to Create Node
Local Spark Partitions
sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra)
Token Ranges
spark.cassandra.input.split.size
The (estimated) number of C* Partitions to be placed in a Spark Partition
CassandraRDD
Cassandra RDD Use the Token Range to Create Node
Local Spark Partitions
sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra)
Token Ranges
spark.cassandra.input.split.size
The (estimated) number of C* Partitions to be placed in a Spark Partition
CassandraRDD
Spark Driver
Spark Partitions Are Annotated With the Location For
TokenRanges they Span
Metropolis
Metropolis
Superman
Gotham
Batman
Coast City
Green L.
The Driver waits spark.locality.wait for the
preferred location to have an open executor
Assigns Task
Metropolis
Spark Executor (Superman)
The Spark Executor uses the Java Driver to
Pull Rows from the Local Cassandra Instance
On the Executor the task is transformed into CQL queries which are
executed via the Java Driver.
SELECT * FROM keyspace.table WHERE
token(pk) > 780 and token(pk) <= 830
Tokens 780 - 830
Metropolis
Spark Executor (Superman)
The Spark Executor uses the Java Driver to
Pull Rows from the Local Cassandra Instance
SELECT * FROM keyspace.table WHERE
token(pk) > 780 and token(pk) <= 830
Tokens 780 - 830
The C* Java Driver pages spark.cassandra.input.page.row.size
CQL rows at a time
Metropolis
Spark Executor (Superman)
The Spark Executor uses the Java Driver to
Pull Rows from the Local Cassandra Instance
SELECT * FROM keyspace.table WHERE 

(Pushed Down Clauses) AND
token(pk) > 780 and token(pk) <= 830
Tokens 780 - 830
Because we are utilizing CQL we can also pushdown predicates which
can be handled by C*.
Loading Sizable But Defined Amounts of Data
Retrieving sets of Partition Keys can be done in parallel
joinWithCassandraTable Provides an Interface for
Obtaining a Set of C* Partitions
Generic RDD
Metropolis
Superman
Gotham
Batman
Coast City
Green L.
Generic RDDs can Be Joined But the Spark Tasks will Not be Node Local
Generic RDD
Metropolis
Superman
Gotham
Batman
Coast City
Green L.
Generic RDDs can Be Joined But the Spark Tasks will Not be Node Local
joinWithCassandraTable Provides an Interface for
Obtaining a Set of C* Partitions
repartitionByCassandraReplica Repartitions RDD's to be
C* Local
Generic RDD CassandraPartitionedRDD
This operation requires a shuffle
joinWithCassandraTable on CassandraPartitionedRDDs
(or CassandraTableScanRDDs) will be Node local
CassandraPartitionedRDDs are partitioned to be executed node local
CassandraPartitionedRDD
Metropolis
Superman
Gotham
Batman
Coast City
Green L.
Metropolis
Spark Executor (Superman)
The Spark Executor uses the Java Driver to
Pull Rows from the Local Cassandra Instance
The C* Java Driver pages spark.cassandra.input.page.row.size
CQL rows at a time
SELECT * FROM keyspace.table WHERE
pk =
DataStax Enterprise Comes Bundled
with Spark and the Connector
Apache Spark Apache Solr
DataStax Delivers
Apache Cassandra
In A Database Platform
DataStax Enterprise Enables This Same Machinery 

with Solr Pushdown
Metropolis
Spark Executor (Superman)
DataStax
Enterprise
SELECT * FROM keyspace.table
WHERE solr_query = 'title:b'
AND
token(pk) > 780 and token(pk) <= 830
Tokens 780 - 830
Learn More Online and at Cassandra Summit
https://academy.datastax.com/

Weitere ähnliche Inhalte

Was ist angesagt?

Spark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousSpark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousRussell Spitzer
 
Spark cassandra connector.API, Best Practices and Use-Cases
Spark cassandra connector.API, Best Practices and Use-CasesSpark cassandra connector.API, Best Practices and Use-Cases
Spark cassandra connector.API, Best Practices and Use-CasesDuyhai Doan
 
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...DataStax
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureRussell Spitzer
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...DataStax
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosRahul Kumar
 
Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0Russell Spitzer
 
Escape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsEscape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsRussell Spitzer
 
Maximum Overdrive: Tuning the Spark Cassandra Connector
Maximum Overdrive: Tuning the Spark Cassandra ConnectorMaximum Overdrive: Tuning the Spark Cassandra Connector
Maximum Overdrive: Tuning the Spark Cassandra ConnectorRussell Spitzer
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fireApache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the firePatrick McFadin
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with CassandraJacek Lewandowski
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015Patrick McFadin
 
Spark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark Summit
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016StampedeCon
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraJim Hatcher
 
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics PlatformHow We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics PlatformDataStax Academy
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and SparkEvan Chan
 
An Introduction to time series with Team Apache
An Introduction to time series with Team ApacheAn Introduction to time series with Team Apache
An Introduction to time series with Team ApachePatrick McFadin
 
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraEscape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraPiotr Kolaczkowski
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionPatrick McFadin
 

Was ist angesagt? (20)

Spark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousSpark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 Furious
 
Spark cassandra connector.API, Best Practices and Use-Cases
Spark cassandra connector.API, Best Practices and Use-CasesSpark cassandra connector.API, Best Practices and Use-Cases
Spark cassandra connector.API, Best Practices and Use-Cases
 
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
 
Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0
 
Escape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsEscape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* Ops
 
Maximum Overdrive: Tuning the Spark Cassandra Connector
Maximum Overdrive: Tuning the Spark Cassandra ConnectorMaximum Overdrive: Tuning the Spark Cassandra Connector
Maximum Overdrive: Tuning the Spark Cassandra Connector
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fireApache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fire
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015
 
Spark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher Batey
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into Cassandra
 
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics PlatformHow We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and Spark
 
An Introduction to time series with Team Apache
An Introduction to time series with Team ApacheAn Introduction to time series with Team Apache
An Introduction to time series with Team Apache
 
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraEscape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long version
 

Andere mochten auch

Gluster.community.day.2013
Gluster.community.day.2013Gluster.community.day.2013
Gluster.community.day.2013Udo Seidel
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataWANdisco Plc
 
Performance comparison of Distributed File Systems on 1Gbit networks
Performance comparison of Distributed File Systems on 1Gbit networksPerformance comparison of Distributed File Systems on 1Gbit networks
Performance comparison of Distributed File Systems on 1Gbit networksMarian Marinov
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalabilityWANdisco Plc
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 

Andere mochten auch (6)

Gluster.community.day.2013
Gluster.community.day.2013Gluster.community.day.2013
Gluster.community.day.2013
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big Data
 
Performance comparison of Distributed File Systems on 1Gbit networks
Performance comparison of Distributed File Systems on 1Gbit networksPerformance comparison of Distributed File Systems on 1Gbit networks
Performance comparison of Distributed File Systems on 1Gbit networks
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 

Ähnlich wie Cassandra and Spark: Optimizing for Data Locality

Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousJen Aman
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark StreamingGerard Maas
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...Duyhai Doan
 
Akka Cluster in Production
Akka Cluster in ProductionAkka Cluster in Production
Akka Cluster in Productionbilyushonak
 
Spark_tutorial (1).pptx
Spark_tutorial (1).pptxSpark_tutorial (1).pptx
Spark_tutorial (1).pptx0111002
 
Apache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William BentonApache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William BentonDatabricks
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connectorDuyhai Doan
 
Manchester Hadoop Meetup: Cassandra Spark internals
Manchester Hadoop Meetup: Cassandra Spark internalsManchester Hadoop Meetup: Cassandra Spark internals
Manchester Hadoop Meetup: Cassandra Spark internalsChristopher Batey
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionLightbend
 
Cassandra and Spark
Cassandra and Spark Cassandra and Spark
Cassandra and Spark datastaxjp
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for SysadminsNathan Milford
 
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...DataStax Academy
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Summit
 
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...DataStax Academy
 
Cassandra London - C* Spark Connector
Cassandra London - C* Spark ConnectorCassandra London - C* Spark Connector
Cassandra London - C* Spark ConnectorChristopher Batey
 
Virtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in CassandraVirtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in CassandraEric Evans
 
C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans
C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric EvansC* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans
C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric EvansDataStax Academy
 
Cassandra + Spark + Elk
Cassandra + Spark + ElkCassandra + Spark + Elk
Cassandra + Spark + ElkVasil Remeniuk
 

Ähnlich wie Cassandra and Spark: Optimizing for Data Locality (20)

Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
Escape from Hadoop
Escape from HadoopEscape from Hadoop
Escape from Hadoop
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
 
Akka Cluster in Production
Akka Cluster in ProductionAkka Cluster in Production
Akka Cluster in Production
 
Spark_tutorial (1).pptx
Spark_tutorial (1).pptxSpark_tutorial (1).pptx
Spark_tutorial (1).pptx
 
Apache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William BentonApache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William Benton
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connector
 
Manchester Hadoop Meetup: Cassandra Spark internals
Manchester Hadoop Meetup: Cassandra Spark internalsManchester Hadoop Meetup: Cassandra Spark internals
Manchester Hadoop Meetup: Cassandra Spark internals
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Cassandra and Spark
Cassandra and Spark Cassandra and Spark
Cassandra and Spark
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
 
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Cassandra London - C* Spark Connector
Cassandra London - C* Spark ConnectorCassandra London - C* Spark Connector
Cassandra London - C* Spark Connector
 
Virtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in CassandraVirtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in Cassandra
 
C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans
C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric EvansC* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans
C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans
 
Cassandra + Spark + Elk
Cassandra + Spark + ElkCassandra + Spark + Elk
Cassandra + Spark + Elk
 

Kürzlich hochgeladen

Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Jon Hansen
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyRafigAliyev2
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfscitechtalktv
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group MeetingAlison Pitt
 
how can i exchange pi coins for others currency like Bitcoin
how can i exchange pi coins for others currency like Bitcoinhow can i exchange pi coins for others currency like Bitcoin
how can i exchange pi coins for others currency like BitcoinDOT TECH
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsalex933524
 
How I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prisonHow I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prisonPayment Village
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理pyhepag
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理cyebo
 
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...elinavihriala
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理cyebo
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?DOT TECH
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfMichaelSenkow
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIAlejandraGmez176757
 
Machine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptxMachine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptxbenishzehra469
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxStephen266013
 
basics of data science with application areas.pdf
basics of data science with application areas.pdfbasics of data science with application areas.pdf
basics of data science with application areas.pdfvyankatesh1
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理pyhepag
 
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptxMALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptxNidaFaviankaNawawi
 

Kürzlich hochgeladen (20)

Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdf
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting
 
how can i exchange pi coins for others currency like Bitcoin
how can i exchange pi coins for others currency like Bitcoinhow can i exchange pi coins for others currency like Bitcoin
how can i exchange pi coins for others currency like Bitcoin
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
How I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prisonHow I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prison
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
 
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?
 
Machine Learning for Accident Severity Prediction
Machine Learning for Accident Severity PredictionMachine Learning for Accident Severity Prediction
Machine Learning for Accident Severity Prediction
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdf
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
Machine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptxMachine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptx
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
basics of data science with application areas.pdf
basics of data science with application areas.pdfbasics of data science with application areas.pdf
basics of data science with application areas.pdf
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptxMALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
 

Cassandra and Spark: Optimizing for Data Locality

  • 1. Cassandra and Spark: Optimizing for Data Locality Russell Spitzer
 Software Engineer @ DataStax
  • 2. Lex Luther Was Right: Location is Important The value of many things is based upon it's location. Developed land near the beach is valuable but desert and farmland is generally much cheaper. Unfortunately moving land is generally impossible. Spark Summit
  • 3. or lake, or swamp or whatever body of water is "Data" at the time this slide is viewed. Lex Luther Was Wrong: Don't ETL the Data Ocean Spark Summit My House
  • 4. Spark is Our Hero Giving us the Ability to Do Our Analytics Without the ETL PARK
  • 5. Moving Data Between Machines Is Expensive Do Work Where the Data Lives! Our Cassandra Nodes are like Cities and our Spark Executors are like Super Heroes. We'd rather they spend their time locally rather than flying back and forth all the time.
  • 6. Moving Data Between Machines Is Expensive Do Work Where the Data Lives! Our Cassandra Nodes are like Cities and our Spark Executors are like Super Heroes. We'd rather they spend their time locally rather than flying back and forth all the time. Metropolis Superman Gotham Batman Spark Executors
  • 7. DataStax Open Source Spark Cassandra Connector is Available on Github https://github.com/datastax/spark-cassandra-connector • Compatible with Spark 1.3 • Read and Map C* Data Types • Saves To Cassandra • Intelligent write batching • Supports Collections • Secondary index pushdown • Arbitrary CQL Execution!
  • 8. How the Spark Cassandra Connector Reads Data Node Local
  • 9. Cassandra Locates a Row Based on Partition Key and Token Range All of the rows in a Cassandra Cluster are stored based based on their location in the Token Range.
  • 10. Metropolis Gotham Coast City Each of the Nodes in a 
 Cassandra Cluster is primarily responsible for one set of Tokens. 0999 500 Cassandra Locates a Row Based on Partition Key and Token Range
  • 11. Metropolis Gotham Coast City Each of the Nodes in a 
 Cassandra Cluster is primarily responsible for one set of Tokens. 0999 500 750 - 99 350 - 749 100 - 349 Cassandra Locates a Row Based on Partition Key and Token Range
  • 12. Jacek 514 Red The CQL Schema designates at least one column to be the Partition Key. Metropolis Gotham Coast City Cassandra Locates a Row Based on Partition Key and Token Range
  • 13. Jacek 514 Red The hash of the Partition Key tells us where a row should be stored. Metropolis Gotham Coast City Cassandra Locates a Row Based on Partition Key and Token Range 830
  • 14. With VNodes the ranges are not contiguous but the same mechanism controls row location. Jacek 514 Red Metropolis Gotham Coast City Cassandra Locates a Row Based on Partition Key and Token Range 830
  • 15. Loading Huge Amounts of Data Table Scans involve loading most of the data in Cassandra
  • 16. Cassandra RDD Use the Token Range to Create Node Local Spark Partitions sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra) Token Ranges spark.cassandra.input.split.size The (estimated) number of C* Partitions to be placed in a Spark Partition
  • 17. Cassandra RDD Use the Token Range to Create Node Local Spark Partitions sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra) Token Ranges spark.cassandra.input.split.size The (estimated) number of C* Partitions to be placed in a Spark Partition CassandraRDD Spark Partition
  • 18. Cassandra RDD Use the Token Range to Create Node Local Spark Partitions sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra) Token Ranges spark.cassandra.input.split.size The (estimated) number of C* Partitions to be placed in a Spark Partition CassandraRDD
  • 19. Cassandra RDD Use the Token Range to Create Node Local Spark Partitions sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra) Token Ranges spark.cassandra.input.split.size The (estimated) number of C* Partitions to be placed in a Spark Partition CassandraRDD
  • 20. Cassandra RDD Use the Token Range to Create Node Local Spark Partitions sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra) Token Ranges spark.cassandra.input.split.size The (estimated) number of C* Partitions to be placed in a Spark Partition CassandraRDD
  • 21. Spark Driver Spark Partitions Are Annotated With the Location For TokenRanges they Span Metropolis Metropolis Superman Gotham Batman Coast City Green L. The Driver waits spark.locality.wait for the preferred location to have an open executor Assigns Task
  • 22. Metropolis Spark Executor (Superman) The Spark Executor uses the Java Driver to Pull Rows from the Local Cassandra Instance On the Executor the task is transformed into CQL queries which are executed via the Java Driver. SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830 Tokens 780 - 830
  • 23. Metropolis Spark Executor (Superman) The Spark Executor uses the Java Driver to Pull Rows from the Local Cassandra Instance SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830 Tokens 780 - 830 The C* Java Driver pages spark.cassandra.input.page.row.size CQL rows at a time
  • 24. Metropolis Spark Executor (Superman) The Spark Executor uses the Java Driver to Pull Rows from the Local Cassandra Instance SELECT * FROM keyspace.table WHERE 
 (Pushed Down Clauses) AND token(pk) > 780 and token(pk) <= 830 Tokens 780 - 830 Because we are utilizing CQL we can also pushdown predicates which can be handled by C*.
  • 25. Loading Sizable But Defined Amounts of Data Retrieving sets of Partition Keys can be done in parallel
  • 26. joinWithCassandraTable Provides an Interface for Obtaining a Set of C* Partitions Generic RDD Metropolis Superman Gotham Batman Coast City Green L. Generic RDDs can Be Joined But the Spark Tasks will Not be Node Local
  • 27. Generic RDD Metropolis Superman Gotham Batman Coast City Green L. Generic RDDs can Be Joined But the Spark Tasks will Not be Node Local joinWithCassandraTable Provides an Interface for Obtaining a Set of C* Partitions
  • 28. repartitionByCassandraReplica Repartitions RDD's to be C* Local Generic RDD CassandraPartitionedRDD This operation requires a shuffle
  • 29. joinWithCassandraTable on CassandraPartitionedRDDs (or CassandraTableScanRDDs) will be Node local CassandraPartitionedRDDs are partitioned to be executed node local CassandraPartitionedRDD Metropolis Superman Gotham Batman Coast City Green L.
  • 30. Metropolis Spark Executor (Superman) The Spark Executor uses the Java Driver to Pull Rows from the Local Cassandra Instance The C* Java Driver pages spark.cassandra.input.page.row.size CQL rows at a time SELECT * FROM keyspace.table WHERE pk =
  • 31. DataStax Enterprise Comes Bundled with Spark and the Connector Apache Spark Apache Solr DataStax Delivers Apache Cassandra In A Database Platform
  • 32. DataStax Enterprise Enables This Same Machinery 
 with Solr Pushdown Metropolis Spark Executor (Superman) DataStax Enterprise SELECT * FROM keyspace.table WHERE solr_query = 'title:b' AND token(pk) > 780 and token(pk) <= 830 Tokens 780 - 830
  • 33. Learn More Online and at Cassandra Summit https://academy.datastax.com/