SlideShare ist ein Scribd-Unternehmen logo
1 von 48
Downloaden Sie, um offline zu lesen
Maximum Overdrive: 

Tuning the Spark Cassandra Connector
Russell Spitzer, Datastax
© DataStax, All Rights Reserved.
Who is this guy and why should I listen to him?
2
Russell Spitzer, Passing Software Engineer
•Been working at DataStax since 2013
•Worked in Test Engineering and now Analytics Dev
•Working with Spark since 0.9
•Working with Cassandra since 1.2
•Main focus: the Spark Cassandra Connector
•Surgically grafted to the Spark Cassandra Connector Mailing List
© DataStax, All Rights Reserved.
The Spark Cassandra Connector
Connects Spark to Cassandra
3
It's all there in the name
•Provides a DataSource for Datasets/Data Frames
•Provides methods for Writing DataSets/Data Frames
•Reading and Writing RDD
•Connection Pooling
•Type Conversions and Mapping
•Data Locality
•Open Source Software!
https://github.com/datastax/spark-cassandra-connector
© DataStax, All Rights Reserved. 4
WARNING: THIS TALK WILL CONTAIN TECHNICAL DETAILS AND EXPLICIT SCALA
DISTRIBUTED SYSTEMS
Tuning the Spark Cassandra Connector
DISTRIBUTED SYSTEMS
© DataStax, All Rights Reserved.
1 Lots of Write Tuning
2 A Bit of Read Tuning
5
© DataStax, All Rights Reserved. 6
Context is Very Important
Knowing your Data is Key for Maximum Performance
© DataStax, All Rights Reserved.
Write Tuning in the SCC Is all about Batching
7
Batches aren't good
for performance in
Cassandra.
Not when the writes
within the batch are in the same
Partition and they are unlogged!

I keep telling you this!
© DataStax, All Rights Reserved.
Multi-Partition Key Batches put load on the
Coordinator
8
THE BATCH
Cassandra
Cluster
Row Row
Row
Row
© DataStax, All Rights Reserved.
Multi-Partition Key Batches put load on the
Coordinator
9
Cassandra
Cluster
A batch moves as a single entity
to the Coordinator for that write



This batch has to sit there until
all the portions of it get confirmed
at their set consistency level
© DataStax, All Rights Reserved.
Multi-Partition Key Batches put load on the
Coordinator
10
Cassandra
Cluster
Even when some portions of the
batch finish early we have to wait
until the entire thing is done before
we can respond to the client.
© DataStax, All Rights Reserved.
We end up with a lot of rows just sitting around in
memory waiting for others to get out of the way
11
© DataStax, All Rights Reserved.
Single Partition Batches are Treated as
A Single Mutation in Cassandra
12
THE BATCH
Row Row Row
RowRow Row
Row Row Row
Row Row Row
Cassandra
Cluster
© DataStax, All Rights Reserved.
Single Partition Batches are Treated as
A Single Mutation in Cassandra
13
Cassandra
Cluster
Now the entire batch can be
treated as a single mutation. We
also only have to wait for one set
of replicas
© DataStax, All Rights Reserved.
When all of the Rows are Going to the Same Place
Writing to Cassandra is Fast
14
The Connector Will Automatically Batch Writes
15
rdd.saveToCassandra("bestkeyspace",	"besttable")
df.write

		.format("org.apache.spark.sql.cassandra")

		.options(Map("table"	->	"besttable",	"keyspace"	->	"bestkeyspace"))

		.save()
import	org.apache.spark.sql.cassandra._	
df.write

		.cassandraFormat("besttable",	"bestkeyspace")

		.save()
RDD
DataFrame
By default batching happens on 

Identical Partition Key
16
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
WriteConf(batchGroupingKey=	?)
Change it as a SparkConf or DataFrame Parameter
Or directly pass in a WriteConf
Batches are Placed in Holding Until Certain
Thresholds are hit
17
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
Batches are Placed in Holding Until Certain
Thresholds are hit
18
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
output.batch.grouping.buffer.size
output.batch.size.bytes / output.batch.size.rows
output.concurrent.writes
output.consistency.level
Batches are Placed in Holding Until Certain
Thresholds are hit
19
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
output.batch.grouping.buffer.size
output.batch.size.bytes / output.batch.size.rows
output.concurrent.writes
output.consistency.level
Batches are Placed in Holding Until Certain
Thresholds are hit
20
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
output.batch.grouping.buffer.size
output.batch.size.bytes / output.batch.size.rows
output.concurrent.writes
output.consistency.level
Batches are Placed in Holding Until Certain
Thresholds are hit
21
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
output.batch.grouping.buffer.size
output.batch.size.bytes / output.batch.size.rows
output.concurrent.writes
output.consistency.level
Spark Cassandra Stress for Running Basic
Benchmarks
22
https://github.com/datastax/spark-cassandra-stress
Running Benchmarks on a bunch of AWS Machines, 

5 M3.2XLarge

DSE 5.0.1
Spark 1.6.1

Spark CC 1.6.0
RF = 3
2 M Writes/ 100K C* Partitions and 400 Spark Partitions



Caveat: Don't benchmark exactly like this
I'm making some bad decisions to to make some broad points
Depending on your use case Sorting within Partitions
can Greatly Increase Write Performance
23
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
0
20
40
60
80
Rows Out of Order Rows In Order
77
28
Default Conf kOps/s
Grouping on Partition Key

The Safest Thing You Can Do
Including everything in the Batch
24
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
0
32.5
65
97.5
130
Rows Out of Order Rows In Order
125
69
77
28
Default Conf kOps/s
No Batch Key
May be Safe For Short Durations
BUT WILL LEAD TO SYSTEM

INSTABILITY
Grouping on Replica Set
25
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
0
32.5
65
97.5
130
Rows Out of Order Rows In Order
125
70
77
28
Default Conf kOps/s
Grouped on Replica Set
Safer, But still will put
extra load on the Coordinator
Remember the Tortoise vs the Hare
26
Overwhelming Cassandra will slow you down
Limit the amount of writes per executor : output.throughput_mb_per_sec

Limit maximum executor cores : spark.max.cores
Lower concurrency : output.concurrent.writes
DEPENDING ON DISK PERFORMANCE YOUR

INITIAL SPEEDS IN BENCHMARKING MAY 

NOT BE SUSTAINABLE
For Example Lets run with Batch Key None for a
Longer Test (20M writes)
27
[Stage 0:=========================> (191 + 15) / 400]WARN 2016-08-19 21:11:55,817
org.apache.spark.scheduler.TaskSetManager: Lost task 192.0 in stage 0.0 (TID 193, ip-172-31-13-127.us-west-1.compute.internal):
java.io.IOException: Failed to write statements to ks.tab.
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:166)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:134)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:134)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
For Example Lets run with Batch Key None for a
Longer Test (20M writes)
28
[Stage 0:=========================> (191 + 15) / 400]WARN 2016-08-19 21:11:55,817
org.apache.spark.scheduler.TaskSetManager: Lost task 192.0 in stage 0.0 (TID 193, ip-172-31-13-127.us-west-1.compute.internal):
java.io.IOException: Failed to write statements to ks.tab.
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:166)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:134)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:134)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Back to Default PartitionKey Batching
29
0
50
100
150
200
Rows Out of Order Rows In Order
190
39
77
28
Initial Run kOps/s
10X Length Run
So why are we doing so
much better over a longer
run?
Back to Default PartitionKey Batching
30
0
50
100
150
200
Rows Out of Order Rows In Order
190
39
77
28
Initial Run kOps/s
10X Length Run
400 Spark Partitions in Both Cases

2M/ 400 = 5000

20M / 400 = 50000
Having Too Many Partitions will Slow Down your
Writes
31
Every task has Setup and Teardown and
we can only build up good batches if there
are enough elements to build them from
Depending on your use case Sorting within Partitions
can Greatly Increase Write Performance
32
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
0
50
100
150
200
Rows Out of Order Rows In Order
190
39
10X Length Run
A spark sort on partition key

may speed up your total operation
by several fold
Maximizing performance for out of Order Writes or No
Clustering Keys
33
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
0
50
100
150
200
Rows Out of Order Rows In Order
59
47
190
39
10X Length Run
Modified Conf kOps/s
Turn Off Batching
Increase Concurrency
spark.cassandra.output.batch.size.rows	1	
spark.cassandra.output.concurrent.writes	2000
Maximizing performance for out of Order Writes or No
Clustering Keys
34
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
0
50
100
150
200
Rows Out of Order Rows In Order
59
47
190
39
10X Length Run
Modified Conf kOps/s
Turn Off Batching
Increase Concurrency
spark.cassandra.output.batch.size.rows	1	
spark.cassandra.output.concurrent.writes	2000	
Single
Partition Batches
are good! I keep
telling you!
This turns the connector into a Multi-Machine Cassandra
Loader (Basically just executeAsync as fast as possible)
35
https://github.com/brianmhess/cassandra-loader
Now Let's Talk About Reading!
36
Read Tuning mostly About Partitioning
37
• RDDs are a large Dataset Broken Into Bits,
• These bits are call Partitions
• Cassandra Partitions != Spark Partitions
• Spark Partitions are sized based on the estimated data size of the underlying C* table
• input.split.size_in_mb
TokenRange
Spark Partitions
OOMs Caused by Spark Partitions Holding Too Much
Data
38
Executor JVM Heap
Core 1
Core 2
Core 3
As a general rule of thumb your Executor should be
set to hold 



Number of Cores * Size of Partition * 1.2



See a lot of GC? OOM? Increase the amount of partitions
Some Caveats
• We don't know the actual partition size until runtime
• Cassandra on disk memory usage != in memory size
OOMs Caused by Spark Partitions Holding Too Much
Data
39
Executor JVM Heap
Core 1
Core 2
Core 3
input.split.size_in_mb	 64	 	
Approx	amount	of	data	to	be	fetched	into	a	

Spark	partition.	Minimum	number	of	resulting	

Spark	partitions	is	

1	+	2	*	SparkContext.defaultParallelism
split.size_in_mb compares uses the system table size_esitmates
to determine how many Cassandra Partitions should be in a 

Spark Partition. 



Due to Compression and Inflation, the actual in memory size
can be much larger
Certain Queries can't be broken Up
40
• Hot Spots Make a Spark Partition OOM
• Full C* Partition in Spark Partiton
Certain Queries can't be broken Up
41
• Hot Spots Make a Spark Partition OOM
• Full C* Partition in Spark Partiton
• Single Partition Lookups
• Can't do anything about this
• Don't know how partition is distributed
Certain Queries can't be broken Up
42
• Hot Spots Make a Spark Partition OOM
• Full C* Partition in Spark Partiton
• Single Partition Lookups
• Can't do anything about this
• Don't know how partition is distributed
• IN clauses
• Replace with JoinWithCassandraTable







• If all else fails use CassandraConnector
Read speed is mostly dictated by Cassandra's
Paging Speed
43
input.fetch.size_in_rows 1000 Number of CQL rows fetched per driver request
Cassandra of the Future, As Fast as CSV!?!
44
https://issues.apache.org/jira/browse/CASSANDRA-9259 :
Bulk Reading from Cassandra

Stefania Alborghetti
In Summation, Know your Data
• Write Tuning
• Batching Key
• Sorting
• Turning off Batching When Beneficial
• Having enough Data in a Task
• Read Tuning
• Number of Partitions
• Some Queries can't be Broken Up
• Changing paging from Cassandra
• Future bulk read speedup!
45
46
The End
Don't Let it End Like That!
Contribute to the Spark Cassandra Connector
47
• OSS Project that loves community involvement
• Bug Reports
• Feature Requests
• Write Code
• Doc Improvements
• Come join us!
https://github.com/datastax/spark-cassandra-connector
See you on the mailing list!
https://github.com/datastax/spark-cassandra-connector

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
Chandler Huang
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 

Was ist angesagt? (20)

Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Patroni - HA PostgreSQL made easy
Patroni - HA PostgreSQL made easyPatroni - HA PostgreSQL made easy
Patroni - HA PostgreSQL made easy
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
Deep Dive into Cassandra
Deep Dive into CassandraDeep Dive into Cassandra
Deep Dive into Cassandra
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Go Programming Patterns
Go Programming PatternsGo Programming Patterns
Go Programming Patterns
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 

Ähnlich wie Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, DataStax) | C* Summit 2016

Ähnlich wie Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, DataStax) | C* Summit 2016 (20)

Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector Dataframes
 
Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skills
 
The Apache Cassandra ecosystem
The Apache Cassandra ecosystemThe Apache Cassandra ecosystem
The Apache Cassandra ecosystem
 
Managing Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al TobeyManaging Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al Tobey
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fireApache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fire
 
Nyc summit intro_to_cassandra
Nyc summit intro_to_cassandraNyc summit intro_to_cassandra
Nyc summit intro_to_cassandra
 
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
 
Scaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosqlScaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosql
 
Cassandra Database
Cassandra DatabaseCassandra Database
Cassandra Database
 
TechEvent Apache Cassandra
TechEvent Apache CassandraTechEvent Apache Cassandra
TechEvent Apache Cassandra
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
 
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
 
Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)
Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)
Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)
 
C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark
C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark
C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark
 
Spark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and FurureSpark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and Furure
 
Oscon 2019 - Optimizing analytical queries on Cassandra by 100x
Oscon 2019 - Optimizing analytical queries on Cassandra by 100xOscon 2019 - Optimizing analytical queries on Cassandra by 100x
Oscon 2019 - Optimizing analytical queries on Cassandra by 100x
 
Percona XtraDB 集群文档
Percona XtraDB 集群文档Percona XtraDB 集群文档
Percona XtraDB 集群文档
 
Why Cassandra?
Why Cassandra?Why Cassandra?
Why Cassandra?
 
Staying Ahead of the Curve with Spring and Cassandra 4
Staying Ahead of the Curve with Spring and Cassandra 4Staying Ahead of the Curve with Spring and Cassandra 4
Staying Ahead of the Curve with Spring and Cassandra 4
 

Mehr von DataStax

Mehr von DataStax (20)

Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?
 
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
 
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid EnvironmentsRunning DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
 
Best Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise GraphBest Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise Graph
 
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyWebinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
 
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
 
Webinar | Better Together: Apache Cassandra and Apache Kafka
Webinar  |  Better Together: Apache Cassandra and Apache KafkaWebinar  |  Better Together: Apache Cassandra and Apache Kafka
Webinar | Better Together: Apache Cassandra and Apache Kafka
 
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseTop 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
 
Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0
 
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
 
Designing a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDesigning a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for Dummies
 
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudHow to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
 
How to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerceHow to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerce
 
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
 
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
 
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
 
Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)
 
An Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking ApplicationsAn Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking Applications
 
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design ThinkingBecoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
 

Kürzlich hochgeladen

%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
masabamasaba
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
masabamasaba
 

Kürzlich hochgeladen (20)

OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 
WSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security Program
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
WSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaSWSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaS
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 

Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, DataStax) | C* Summit 2016

  • 1. Maximum Overdrive: 
 Tuning the Spark Cassandra Connector Russell Spitzer, Datastax
  • 2. © DataStax, All Rights Reserved. Who is this guy and why should I listen to him? 2 Russell Spitzer, Passing Software Engineer •Been working at DataStax since 2013 •Worked in Test Engineering and now Analytics Dev •Working with Spark since 0.9 •Working with Cassandra since 1.2 •Main focus: the Spark Cassandra Connector •Surgically grafted to the Spark Cassandra Connector Mailing List
  • 3. © DataStax, All Rights Reserved. The Spark Cassandra Connector Connects Spark to Cassandra 3 It's all there in the name •Provides a DataSource for Datasets/Data Frames •Provides methods for Writing DataSets/Data Frames •Reading and Writing RDD •Connection Pooling •Type Conversions and Mapping •Data Locality •Open Source Software! https://github.com/datastax/spark-cassandra-connector
  • 4. © DataStax, All Rights Reserved. 4 WARNING: THIS TALK WILL CONTAIN TECHNICAL DETAILS AND EXPLICIT SCALA DISTRIBUTED SYSTEMS Tuning the Spark Cassandra Connector DISTRIBUTED SYSTEMS
  • 5. © DataStax, All Rights Reserved. 1 Lots of Write Tuning 2 A Bit of Read Tuning 5
  • 6. © DataStax, All Rights Reserved. 6 Context is Very Important Knowing your Data is Key for Maximum Performance
  • 7. © DataStax, All Rights Reserved. Write Tuning in the SCC Is all about Batching 7 Batches aren't good for performance in Cassandra. Not when the writes within the batch are in the same Partition and they are unlogged!
 I keep telling you this!
  • 8. © DataStax, All Rights Reserved. Multi-Partition Key Batches put load on the Coordinator 8 THE BATCH Cassandra Cluster Row Row Row Row
  • 9. © DataStax, All Rights Reserved. Multi-Partition Key Batches put load on the Coordinator 9 Cassandra Cluster A batch moves as a single entity to the Coordinator for that write
 
 This batch has to sit there until all the portions of it get confirmed at their set consistency level
  • 10. © DataStax, All Rights Reserved. Multi-Partition Key Batches put load on the Coordinator 10 Cassandra Cluster Even when some portions of the batch finish early we have to wait until the entire thing is done before we can respond to the client.
  • 11. © DataStax, All Rights Reserved. We end up with a lot of rows just sitting around in memory waiting for others to get out of the way 11
  • 12. © DataStax, All Rights Reserved. Single Partition Batches are Treated as A Single Mutation in Cassandra 12 THE BATCH Row Row Row RowRow Row Row Row Row Row Row Row Cassandra Cluster
  • 13. © DataStax, All Rights Reserved. Single Partition Batches are Treated as A Single Mutation in Cassandra 13 Cassandra Cluster Now the entire batch can be treated as a single mutation. We also only have to wait for one set of replicas
  • 14. © DataStax, All Rights Reserved. When all of the Rows are Going to the Same Place Writing to Cassandra is Fast 14
  • 15. The Connector Will Automatically Batch Writes 15 rdd.saveToCassandra("bestkeyspace", "besttable") df.write
 .format("org.apache.spark.sql.cassandra")
 .options(Map("table" -> "besttable", "keyspace" -> "bestkeyspace"))
 .save() import org.apache.spark.sql.cassandra._ df.write
 .cassandraFormat("besttable", "bestkeyspace")
 .save() RDD DataFrame
  • 16. By default batching happens on 
 Identical Partition Key 16 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters WriteConf(batchGroupingKey= ?) Change it as a SparkConf or DataFrame Parameter Or directly pass in a WriteConf
  • 17. Batches are Placed in Holding Until Certain Thresholds are hit 17 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
  • 18. Batches are Placed in Holding Until Certain Thresholds are hit 18 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters output.batch.grouping.buffer.size output.batch.size.bytes / output.batch.size.rows output.concurrent.writes output.consistency.level
  • 19. Batches are Placed in Holding Until Certain Thresholds are hit 19 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters output.batch.grouping.buffer.size output.batch.size.bytes / output.batch.size.rows output.concurrent.writes output.consistency.level
  • 20. Batches are Placed in Holding Until Certain Thresholds are hit 20 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters output.batch.grouping.buffer.size output.batch.size.bytes / output.batch.size.rows output.concurrent.writes output.consistency.level
  • 21. Batches are Placed in Holding Until Certain Thresholds are hit 21 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters output.batch.grouping.buffer.size output.batch.size.bytes / output.batch.size.rows output.concurrent.writes output.consistency.level
  • 22. Spark Cassandra Stress for Running Basic Benchmarks 22 https://github.com/datastax/spark-cassandra-stress Running Benchmarks on a bunch of AWS Machines, 
 5 M3.2XLarge
 DSE 5.0.1 Spark 1.6.1
 Spark CC 1.6.0 RF = 3 2 M Writes/ 100K C* Partitions and 400 Spark Partitions
 
 Caveat: Don't benchmark exactly like this I'm making some bad decisions to to make some broad points
  • 23. Depending on your use case Sorting within Partitions can Greatly Increase Write Performance 23 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters 0 20 40 60 80 Rows Out of Order Rows In Order 77 28 Default Conf kOps/s Grouping on Partition Key
 The Safest Thing You Can Do
  • 24. Including everything in the Batch 24 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters 0 32.5 65 97.5 130 Rows Out of Order Rows In Order 125 69 77 28 Default Conf kOps/s No Batch Key May be Safe For Short Durations BUT WILL LEAD TO SYSTEM
 INSTABILITY
  • 25. Grouping on Replica Set 25 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters 0 32.5 65 97.5 130 Rows Out of Order Rows In Order 125 70 77 28 Default Conf kOps/s Grouped on Replica Set Safer, But still will put extra load on the Coordinator
  • 26. Remember the Tortoise vs the Hare 26 Overwhelming Cassandra will slow you down Limit the amount of writes per executor : output.throughput_mb_per_sec
 Limit maximum executor cores : spark.max.cores Lower concurrency : output.concurrent.writes DEPENDING ON DISK PERFORMANCE YOUR
 INITIAL SPEEDS IN BENCHMARKING MAY 
 NOT BE SUSTAINABLE
  • 27. For Example Lets run with Batch Key None for a Longer Test (20M writes) 27 [Stage 0:=========================> (191 + 15) / 400]WARN 2016-08-19 21:11:55,817 org.apache.spark.scheduler.TaskSetManager: Lost task 192.0 in stage 0.0 (TID 193, ip-172-31-13-127.us-west-1.compute.internal): java.io.IOException: Failed to write statements to ks.tab. at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:166) at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:134) at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110) at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109) at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139) at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109) at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:134) at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37) at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)
  • 28. For Example Lets run with Batch Key None for a Longer Test (20M writes) 28 [Stage 0:=========================> (191 + 15) / 400]WARN 2016-08-19 21:11:55,817 org.apache.spark.scheduler.TaskSetManager: Lost task 192.0 in stage 0.0 (TID 193, ip-172-31-13-127.us-west-1.compute.internal): java.io.IOException: Failed to write statements to ks.tab. at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:166) at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:134) at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110) at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109) at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139) at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109) at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:134) at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37) at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)
  • 29. Back to Default PartitionKey Batching 29 0 50 100 150 200 Rows Out of Order Rows In Order 190 39 77 28 Initial Run kOps/s 10X Length Run So why are we doing so much better over a longer run?
  • 30. Back to Default PartitionKey Batching 30 0 50 100 150 200 Rows Out of Order Rows In Order 190 39 77 28 Initial Run kOps/s 10X Length Run 400 Spark Partitions in Both Cases
 2M/ 400 = 5000
 20M / 400 = 50000
  • 31. Having Too Many Partitions will Slow Down your Writes 31 Every task has Setup and Teardown and we can only build up good batches if there are enough elements to build them from
  • 32. Depending on your use case Sorting within Partitions can Greatly Increase Write Performance 32 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters 0 50 100 150 200 Rows Out of Order Rows In Order 190 39 10X Length Run A spark sort on partition key
 may speed up your total operation by several fold
  • 33. Maximizing performance for out of Order Writes or No Clustering Keys 33 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters 0 50 100 150 200 Rows Out of Order Rows In Order 59 47 190 39 10X Length Run Modified Conf kOps/s Turn Off Batching Increase Concurrency spark.cassandra.output.batch.size.rows 1 spark.cassandra.output.concurrent.writes 2000
  • 34. Maximizing performance for out of Order Writes or No Clustering Keys 34 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters 0 50 100 150 200 Rows Out of Order Rows In Order 59 47 190 39 10X Length Run Modified Conf kOps/s Turn Off Batching Increase Concurrency spark.cassandra.output.batch.size.rows 1 spark.cassandra.output.concurrent.writes 2000 Single Partition Batches are good! I keep telling you!
  • 35. This turns the connector into a Multi-Machine Cassandra Loader (Basically just executeAsync as fast as possible) 35 https://github.com/brianmhess/cassandra-loader
  • 36. Now Let's Talk About Reading! 36
  • 37. Read Tuning mostly About Partitioning 37 • RDDs are a large Dataset Broken Into Bits, • These bits are call Partitions • Cassandra Partitions != Spark Partitions • Spark Partitions are sized based on the estimated data size of the underlying C* table • input.split.size_in_mb TokenRange Spark Partitions
  • 38. OOMs Caused by Spark Partitions Holding Too Much Data 38 Executor JVM Heap Core 1 Core 2 Core 3 As a general rule of thumb your Executor should be set to hold 
 
 Number of Cores * Size of Partition * 1.2
 
 See a lot of GC? OOM? Increase the amount of partitions Some Caveats • We don't know the actual partition size until runtime • Cassandra on disk memory usage != in memory size
  • 39. OOMs Caused by Spark Partitions Holding Too Much Data 39 Executor JVM Heap Core 1 Core 2 Core 3 input.split.size_in_mb 64 Approx amount of data to be fetched into a 
 Spark partition. Minimum number of resulting 
 Spark partitions is 
 1 + 2 * SparkContext.defaultParallelism split.size_in_mb compares uses the system table size_esitmates to determine how many Cassandra Partitions should be in a 
 Spark Partition. 
 
 Due to Compression and Inflation, the actual in memory size can be much larger
  • 40. Certain Queries can't be broken Up 40 • Hot Spots Make a Spark Partition OOM • Full C* Partition in Spark Partiton
  • 41. Certain Queries can't be broken Up 41 • Hot Spots Make a Spark Partition OOM • Full C* Partition in Spark Partiton • Single Partition Lookups • Can't do anything about this • Don't know how partition is distributed
  • 42. Certain Queries can't be broken Up 42 • Hot Spots Make a Spark Partition OOM • Full C* Partition in Spark Partiton • Single Partition Lookups • Can't do anything about this • Don't know how partition is distributed • IN clauses • Replace with JoinWithCassandraTable
 
 
 
 • If all else fails use CassandraConnector
  • 43. Read speed is mostly dictated by Cassandra's Paging Speed 43 input.fetch.size_in_rows 1000 Number of CQL rows fetched per driver request
  • 44. Cassandra of the Future, As Fast as CSV!?! 44 https://issues.apache.org/jira/browse/CASSANDRA-9259 : Bulk Reading from Cassandra
 Stefania Alborghetti
  • 45. In Summation, Know your Data • Write Tuning • Batching Key • Sorting • Turning off Batching When Beneficial • Having enough Data in a Task • Read Tuning • Number of Partitions • Some Queries can't be Broken Up • Changing paging from Cassandra • Future bulk read speedup! 45
  • 47. Don't Let it End Like That! Contribute to the Spark Cassandra Connector 47 • OSS Project that loves community involvement • Bug Reports • Feature Requests • Write Code • Doc Improvements • Come join us! https://github.com/datastax/spark-cassandra-connector
  • 48. See you on the mailing list! https://github.com/datastax/spark-cassandra-connector