Cassandra and Spark: Optimizing for Data Locality
Russell Spitzer, Software Engineer @ DataStax

An overview of how the Spark Cassandra Connector achieves data locality on the read path.



  1. Cassandra and Spark: Optimizing for Data Locality. Russell Spitzer, Software Engineer @ DataStax
  2. Lex Luthor Was Right: Location Is Important. The value of many things is based upon its location: developed land near the beach is valuable, but desert and farmland is generally much cheaper. Unfortunately, moving land is generally impossible.
  3. Lex Luthor Was Wrong: Don't ETL the Data Ocean (or lake, or swamp, or whatever body of water "Data" is at the time this slide is viewed).
  4. Spark Is Our Hero, Giving Us the Ability to Do Our Analytics Without the ETL.
  5–6. Moving Data Between Machines Is Expensive: Do Work Where the Data Lives! Our Cassandra nodes are like cities (Metropolis, Gotham) and our Spark executors are like superheroes (Superman, Batman). We'd rather they spend their time locally than flying back and forth all the time.
  7. DataStax Open Source: the Spark Cassandra Connector is available on GitHub, https://github.com/datastax/spark-cassandra-connector
     • Compatible with Spark 1.3
     • Read and map C* data types
     • Saves to Cassandra
     • Intelligent write batching
     • Supports collections
     • Secondary index pushdown
     • Arbitrary CQL execution
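     A minimal read/write sketch, assuming the connector is on the classpath, a C* node is reachable at 127.0.0.1, and a hypothetical table test.kv (key text PRIMARY KEY, value int) exists:

     import org.apache.spark.{SparkConf, SparkContext}
     import com.datastax.spark.connector._

     // Hypothetical example table: test.kv (key text PRIMARY KEY, value int)
     val conf = new SparkConf()
       .setAppName("connector-demo")
       .set("spark.cassandra.connection.host", "127.0.0.1") // any C* contact point

     val sc = new SparkContext(conf)

     // Full-table read as an RDD[CassandraRow]
     val rdd = sc.cassandraTable("test", "kv")
     println(rdd.map(_.getInt("value")).sum())

     // Write back; the connector batches writes intelligently
     sc.parallelize(Seq(("k1", 42)))
       .saveToCassandra("test", "kv", SomeColumns("key", "value"))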
  8. How the Spark Cassandra Connector Reads Data Node-Local
  9. Cassandra Locates a Row Based on Partition Key and Token Range. All of the rows in a Cassandra cluster are stored based on their location in the token range.
  10–11. Each of the nodes in a Cassandra cluster is primarily responsible for one set of tokens, e.g. Metropolis: 750–99, Gotham: 350–749, Coast City: 100–349 (token ring 0–999).
  12. The CQL schema designates at least one column to be the Partition Key (example row: Jacek | 514 | Red).
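     For concreteness, a sketch of a schema that could hold the example row; the keyspace, table, and column names are assumptions, and the API shown is the Java driver 2.x line the connector used at the time:

     import com.datastax.driver.core.Cluster

     // The first PRIMARY KEY component (`name`) is the partition key whose hash picks the node
     val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
     val session = cluster.connect()
     session.execute(
       """CREATE TABLE IF NOT EXISTS test.users (
         |  name  text,   -- partition key
         |  id    int,
         |  color text,
         |  PRIMARY KEY (name)
         |)""".stripMargin)
     session.execute("INSERT INTO test.users (name, id, color) VALUES ('Jacek', 514, 'Red')")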
  13. The hash of the Partition Key tells us where a row should be stored: the example row hashes to token 830, which falls in Metropolis's range.
  14. With VNodes the ranges are not contiguous, but the same mechanism controls row location.
  15. Loading Huge Amounts of Data: table scans involve loading most of the data in Cassandra.
  16–20. CassandraRDDs use the token range to create node-local Spark partitions. sc.cassandraTable or sqlContext.load("org.apache.spark.sql.cassandra") builds a CassandraRDD whose Spark partitions each cover one or more token ranges; spark.cassandra.input.split.size sets the (estimated) number of C* partitions to be placed in a Spark partition.
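     A sketch of tuning the split size and inspecting the resulting placement (table and host names assumed as before):

     // Fewer C* partitions per split means more, smaller Spark partitions
     val conf = new SparkConf()
       .set("spark.cassandra.connection.host", "127.0.0.1")
       .set("spark.cassandra.input.split.size", "100000")

     val sc = new SparkContext(conf)
     val rdd = sc.cassandraTable("test", "kv")

     // Each Spark partition covers token ranges and carries its preferred (local) nodes
     println(s"Spark partitions: ${rdd.partitions.length}")
     rdd.partitions.take(3).foreach(p => println(rdd.preferredLocations(p)))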
  21. Spark partitions are annotated with the locations of the token ranges they span (Metropolis/Superman, Gotham/Batman, Coast City/Green L.). The driver waits spark.locality.wait for the preferred location to have an open executor before assigning the task elsewhere.
  22. The Spark executor uses the Java Driver to pull rows from the local Cassandra instance. On the executor, the task is transformed into CQL queries which are executed via the Java Driver: SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830 (tokens 780–830).
  23. The C* Java Driver pages spark.cassandra.input.page.row.size CQL rows at a time.
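     The paging knob can be set on the same SparkConf; the property name here is the connector 1.x one the slide uses:

     // Rows fetched per driver round trip while streaming a token range
     val conf = new SparkConf()
       .set("spark.cassandra.connection.host", "127.0.0.1")
       .set("spark.cassandra.input.page.row.size", "1000")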
  24. Because we are utilizing CQL, we can also push down predicates which can be handled by C*: SELECT * FROM keyspace.table WHERE (pushed-down clauses) AND token(pk) > 780 AND token(pk) <= 830.
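     A sketch of predicate and column pushdown via the RDD API; it assumes `value` is a column Cassandra can filter server-side (e.g. a clustering column):

     // The .where() text is appended to the generated SELECT next to the token-range bounds
     val filtered = sc.cassandraTable("test", "kv")
       .select("key", "value")  // column pruning, also pushed down
       .where("value > ?", 10)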
  25. Loading Sizable but Defined Amounts of Data: retrieving sets of Partition Keys can be done in parallel.
  26–27. joinWithCassandraTable provides an interface for obtaining a set of C* partitions. Generic RDDs can be joined, but the Spark tasks will not be node-local.
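     A sketch of the join; the keys are wrapped in tuples matching the table's partition key (names assumed as before):

     import com.datastax.spark.connector._

     // Fetch only the C* partitions named by the generic RDD, one targeted query per key
     val keys = sc.parallelize(Seq(Tuple1("Jacek"), Tuple1("Bruce"), Tuple1("Hal")))
     val joined = keys.joinWithCassandraTable("test", "kv")
     joined.collect().foreach { case (key, row) => println(s"$key -> $row") }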
  28. repartitionByCassandraReplica repartitions RDDs to be C*-local (Generic RDD → CassandraPartitionedRDD). This operation requires a shuffle.
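     Continuing the sketch above, the shuffle-then-join pattern that makes the lookups node-local:

     // Shuffle keys to executors on nodes that replicate them; the join then stays local
     val localKeys = keys.repartitionByCassandraReplica("test", "kv", partitionsPerHost = 10)
     val localJoin = localKeys.joinWithCassandraTable("test", "kv")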
  29. joinWithCassandraTable on CassandraPartitionedRDDs (or CassandraTableScanRDDs) will be node-local, because CassandraPartitionedRDDs are partitioned to be executed node-local.
  30. The Spark executor uses the Java Driver to pull rows from the local Cassandra instance, paging spark.cassandra.input.page.row.size CQL rows at a time: SELECT * FROM keyspace.table WHERE pk = …
  31. DataStax Enterprise comes bundled with Spark and the Connector: DataStax delivers Apache Cassandra in a database platform, alongside Apache Spark and Apache Solr.
  32. DataStax Enterprise enables this same machinery with Solr pushdown: SELECT * FROM keyspace.table WHERE solr_query = 'title:b' AND token(pk) > 780 AND token(pk) <= 830 (tokens 780–830).
  33. Learn More Online and at Cassandra Summit: https://academy.datastax.com/
