Rahul Singh of Anant Corporation covers the three common problems in DataStax / Cassandra operations that stem from data modeling and outlines strategies and best practices for dealing with them.
3. Business Platform Success
We build business success platforms,
which are collections of systems that
serve business processes that have
information needs for people.
9. Cassandra
Architecture
Cluster / Data Centers
01
Cassandra is not for tiny data. Do you NEED:
1. Fast read and write of terabytes of data?
2. Replication / availability around the world?
3. Never go down, always up?
Don’t use Cassandra if:
1. You only have gigabytes of data.
2. Your application can chill in one datacenter.
3. Your system can go down whenever it wants.
4. You just want to be cool.
10. Cassandra Data Model
Keyspaces & Tables
02
Cassandra Tables / Column Families look like SQL Server /
MySQL / Postgres tables & databases. They are not.
1. CQL supports queries by partition key and optional
clustering key.
2. CQL does not support arbitrary queries on columns.
3. Cassandra shouldn’t be managing more than 100-150
tables across any number of keyspaces.
11. Cassandra Operations
Read / Write Paths
03
Cassandra does these things well.
1. Write: It writes data in an append-only way, first into
a commit log, then adds it to the memtable so it is available,
and then flushes it to disk as immutable sstables.
2. Read: It figures out if the data is on a node (a
Bloom filter is involved), reads from different
sstables, and reconciles the immutable data + deletes into
the latest data.
3. It spreads the load around the ring so that you can
have hundreds of nodes doing this and not break a sweat:
beast-like performance.
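The write and read paths above can be sketched as a toy model. This is a minimal illustration, not Cassandra's actual implementation: the class name `ToyNode`, the `memtable_limit` parameter, and the logical `clock` are all hypothetical, and a plain `key in table` check stands in for the Bloom filter.

```python
class ToyNode:
    """Toy model of one node's write/read path (not Cassandra's real code)."""

    TOMBSTONE = object()  # marker standing in for a delete

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory view of recent writes
        self.sstables = []     # flushed tables, never modified afterwards
        self.memtable_limit = memtable_limit  # hypothetical flush trigger
        self.clock = 0         # hypothetical logical timestamp

    def write(self, key, value):
        self.clock += 1
        self.commit_log.append((self.clock, key, value))  # durability first
        self.memtable[key] = (self.clock, value)           # now readable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        # A delete is just another write: it records a tombstone.
        self.write(key, self.TOMBSTONE)

    def flush(self):
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def read(self, key):
        # Collect every version of the key, then keep the newest one.
        versions = [t[key] for t in self.sstables if key in t]
        if key in self.memtable:
            versions.append(self.memtable[key])
        if not versions:
            return None
        _, value = max(versions, key=lambda v: v[0])
        return None if value is self.TOMBSTONE else value


node = ToyNode(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)   # memtable is full, so it is flushed to an sstable
node.write("a", 3)   # newer version of "a"
node.delete("b")     # tombstone for "b"; triggers a second flush
print(node.read("a"), node.read("b"), len(node.sstables))
```

Note how a read has to look at every sstable that holds the key and reconcile versions, which is why many versions or many tombstones make reads more expensive.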
13. Wide Partitions
01
Wide partitions will completely screw you over on reads
and take a node out if there’s traffic.
1. Monitor using cfstats
(CompactedPartitionMaximumBytes)
2. Monitor in system.log “Compacting large partition”
3. Monitor using toppartitions
4. Monitor using OpsCenter (if using DataStax)
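As a rough sketch of the log-monitoring step, the helper below scans lines for the phrase quoted above. It only does a substring match, so it assumes nothing about the layout of the log line. The function name `find_log_warnings` is hypothetical, and the sample lines are made-up placeholders, not real Cassandra output.

```python
def find_log_warnings(lines, phrase="Compacting large partition"):
    """Return (line_number, text) for each line containing the phrase.

    Only a substring match is used, so no particular log layout is assumed.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if phrase in line:
            hits.append((number, line.rstrip("\n")))
    return hits


# Made-up placeholder lines, NOT real Cassandra output.
sample = [
    "placeholder: node started\n",
    "placeholder: Compacting large partition <details elided>\n",
    "placeholder: flush finished\n",
]
print(find_log_warnings(sample))
```

To run it against a real node, pass an open file object for that node's system.log; the file's location depends on how Cassandra was installed.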
14. Data Skew
02
Bad key design can lead to really, really bad data skew. In
some cases, if the number of keys is only 1 or 2, the data
exists in only one or two partitions (plus their replicas).
1. Monitor using cfstats (NumberOfKeys,
SpaceUsedLive, ReadCounts, WriteCounts)
2. Monitor using OpsCenter (if using DataStax)
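To see why a handful of keys causes skew, here is a small simulation. It is a sketch under loud assumptions: MD5 modulo the node count stands in for Cassandra's real partitioner and token ranges, and the function names `owner` and `rows_per_node` are hypothetical.

```python
import hashlib
from collections import Counter


def owner(partition_key, node_count):
    """Pick a node for a key. MD5 here is a stand-in, not the real partitioner."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return token % node_count


def rows_per_node(keys, node_count):
    """Count how many rows land on each node, one row per key occurrence."""
    counts = Counter(owner(k, node_count) for k in keys)
    return [counts.get(n, 0) for n in range(node_count)]


# 10,000 rows but only two distinct partition keys: at most two nodes get data.
skewed = rows_per_node(["US"] * 5000 + ["CA"] * 5000, node_count=6)

# 10,000 rows with 10,000 distinct keys: the rows spread across all nodes.
even = rows_per_node([f"user-{i}" for i in range(10000)], node_count=6)

print("two keys:  ", skewed)
print("many keys: ", even)
```

The row count is the same in both cases; only the number of distinct partition keys changes how the load is spread.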
15. Tombstones
03
How to check for tombstones.
1. Monitor using cfstats (*Tombstones)
2. Monitor using system.log (“Tombstone Warn Threshold”)
3. Monitor using OpsCenter (if using DataStax)
17. Good Key Design
01
Some things to NOT DO.
1. Avoid using Integer/Long keys unless you couple them
with another column in a composite partition key. (Unless you
can somehow show through realistic data generation
that it won’t coalesce data in some nodes)
2. Avoid using Time/Date based keys or TimeUUID
unless you know for damn sure that you are going to
continuously create data at a given interval all day,
every day.
3. Don’t just import relational data and expect it to
magically work.
Some things TO DO.
1. UUID will most likely work fine for any given table,
but how do you find it again? You will need to have
another table that has that information.
2. If you must use human readable keys, you can use a
synthetic sharding mechanism (see the next slide).
3. You can combine known things and take a chance, but
you should test with load: (String, Integer, String,
Integer).
Some things to REMEMBER
1. Clustering Keys don’t spread data around the cluster.
2. Remember that (Partition Key, Clustering Key) is
different from ((Partition Key 1, Partition Key 2)).
3. Use Realistic Data: To properly scale Cassandra or
any other system you need to create realistic data.
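The difference between the two key shapes can be illustrated by counting partitions. This is only a sketch: the column names `sensor` and `hour` and the function `partitions` are hypothetical, and rows are grouped in a plain dict rather than in a real cluster.

```python
def partitions(rows, partition_columns):
    """Group rows by the columns that make up the partition key."""
    groups = {}
    for row in rows:
        key = tuple(row[c] for c in partition_columns)
        groups.setdefault(key, []).append(row)
    return groups


# Hypothetical data: 2 sensors, 24 hourly readings each.
rows = [{"sensor": s, "hour": h} for s in ("s1", "s2") for h in range(24)]

# PRIMARY KEY (sensor, hour): sensor is the partition key, hour only
# orders rows inside the partition.
by_sensor = partitions(rows, ["sensor"])

# PRIMARY KEY ((sensor, hour)): both columns form the partition key.
by_both = partitions(rows, ["sensor", "hour"])

print(len(by_sensor), "partitions with (sensor, hour)")
print(len(by_both), "partitions with ((sensor, hour))")
```

With the first shape all 24 readings of a sensor sit in one partition, so adding the clustering column did nothing to spread them; with the second shape every reading is its own partition, but each read must then supply both columns.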
18. Spreading Data via
Synthetic Sharding
01
Sometimes you need to use the key that you have, which is
human readable, because that is the query path. How do you deal
with that?
1. Primary Key: ((CountryName, StateName,
CityName, CompanyName))
2. Integer Shard Added: ((CountryName, StateName,
CityName, CompanyName, ShardNumber))
3. ShardNumber could be 1-10, or 1-100, depending on
how badly your data is spreading.
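One way the shard number could be chosen and used is sketched below. The slide does not say how ShardNumber is assigned, so this is an assumption: it is derived from a hypothetical per-row identifier (`row_id`) with CRC32, and the function names are illustrative only.

```python
import zlib

SHARD_COUNT = 10  # the slide suggests 1-10 or 1-100


def shard_for(row_id, shard_count=SHARD_COUNT):
    """Deterministic shard number in 1..shard_count (assumed scheme)."""
    return zlib.crc32(row_id.encode("utf-8")) % shard_count + 1


def write_key(country, state, city, company, row_id):
    """Partition key used when writing one row."""
    return (country, state, city, company, shard_for(row_id))


def read_keys(country, state, city, company, shard_count=SHARD_COUNT):
    """All partition keys that must be queried to read a company back."""
    return [(country, state, city, company, s)
            for s in range(1, shard_count + 1)]


key = write_key("USA", "DC", "Washington", "Anant", row_id="order-42")
fan_out = read_keys("USA", "DC", "Washington", "Anant")
print(key)
print(len(fan_out), "partitions to query on read")
```

The trade-off is visible in `read_keys`: writes spread over more partitions, but a read of the human-readable key now has to query every shard and merge the results, so the shard count should be no larger than the skew requires.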
If you are using a time-based key and notice coalescing
around a particular time of day, you could consider the
weekday itself as part of the key.
1. Primary Key: (CreatedDate)
2. Week Day Number Added: ((CreatedDate, WeekDay))
3. WeekDay would be 0-6 mapped to Sunday-Saturday
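A small sketch of the weekday mapping follows. The function names are hypothetical. The one detail worth showing is that Python's `date.weekday()` numbers Monday as 0, so it needs a shift to match the Sunday-to-Saturday mapping above.

```python
from datetime import date


def weekday_number(d):
    """Map a date to 0-6 with Sunday=0 ... Saturday=6."""
    # date.weekday() is Monday=0 ... Sunday=6, so shift by one and wrap.
    return (d.weekday() + 1) % 7


def partition_key(created):
    """Build the (CreatedDate, WeekDay) pair for a row (illustrative only)."""
    return (created.isoformat(), weekday_number(created))


print(partition_key(date(2024, 1, 7)))   # a Sunday
print(partition_key(date(2024, 1, 6)))   # a Saturday
```

Whether this actually improves the spread depends on the data, so it should be checked with a realistic load.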
19. Avoiding Tombstones
01
Just say no to Tombstones! The reason tombstones exist is
to make it possible to do insanely fast writes and updates and
still be able to send the data back performantly. (Side
conversation on Queues as Anti-pattern)
1. There is no need to set null values or delete data
actively.
2. You can always do soft deletes or use TTL values that
expire data automatically.
3. Watch out for prepared statements sending nulls.
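The points about nulls and TTLs can be sketched as a helper that builds an INSERT statement from only the non-null columns and optionally adds a TTL. This is an illustration, not a driver API: `build_insert`, the table name `ks.events`, and the column names are hypothetical, and the function only builds text and parameters without talking to a cluster.

```python
def build_insert(table, row, ttl_seconds=None):
    """Build a CQL INSERT that skips None columns so no null is written."""
    present = {col: val for col, val in row.items() if val is not None}
    if not present:
        raise ValueError("row has no non-null columns")
    columns = ", ".join(present)
    placeholders = ", ".join("?" for _ in present)
    cql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    if ttl_seconds is not None:
        # Let the data expire on its own instead of deleting it later.
        cql += f" USING TTL {int(ttl_seconds)}"
    return cql, list(present.values())


cql, params = build_insert(
    "ks.events",                                     # hypothetical table
    {"id": "a1", "note": None, "kind": "click"},     # "note" is left out
    ttl_seconds=3600,
)
print(cql)
print(params)
```

One cost of this approach is that each distinct set of columns yields a different statement to prepare, so it suits rows with a small number of column combinations.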
21. We’re Partnering / Hiring
1. Professional Services
DataStax, Sitecore, Spark, Docker, Solr, Cassandra, Kafka, Elastic, AWS, Azure
2. Digital Services
React/Angular, TypeScript, ASP.NET, Node, Python
22. www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037
Data & Analytics
Cassandra, DataStax, Kafka, Spark
Customer Experience
Sitecore
Information Systems
Salesforce, QuickBooks, and more