DataStax / Cassandra Data Modeling Strategies
Avoiding The Three Stooges: Wide Partitions, Tombstones, Data Skew
Rahul Xavier Singh, Anant Corporation
TOC
Core Concepts
Wide Partitions
Data Modeling
Synthetic Sharding
Key Design
Tombstones
Data Skew
Avoid tombstones
Business Platform Success
We build business success platforms,
which are collections of systems that
serve business processes that have
information needs for people.
Platform Thinking
How?
Project Information
Client Service Information
Corporate Guides
Collaborative Documents
Assets & Files
Corporate Assets
Business Platform
● Curate a framework of systems.
● Work with a vetted team of experts.
● Connect it all together.
● Focus on finding, analyzing, and acting on knowledge & communication towards business success.
Streamline. Organize. Unify. Business Platform
Who we help Succeed
Cassandra / DataStax
Core Concepts
Cassandra Architecture
Cluster / Data Centers
01
Cassandra is not for tiny data. Do you NEED:
1. Fast read and write of terabytes of data?
2. Replication / availability around the world?
3. Never go down, always up?
Don’t use Cassandra if:
1. You only have gigabytes of data.
2. Your application can chill in one datacenter.
3. Your system can go down whenever it wants.
4. You just want to be cool.
Cassandra Data Model
Keyspaces & Tables
02
Cassandra tables / column families look like SQL Server / MySQL / Postgres tables & databases. They are not.
1. CQL supports queries by partition key and, optionally, clustering key (together they form the primary key).
2. CQL does not support arbitrary queries on columns.
3. Cassandra shouldn’t be managing more than 100-150 tables across any number of keyspaces.
Cassandra Operations
Read / Write Paths
03
Cassandra does these things well.
1. Write: It first appends the data to a commit log, adds it to the memtable so it is available for reads, and later flushes it to disk as immutable sstables.
2. Read: It figures out whether the data is on a node (a Bloom filter is involved), reads from the different sstables, and reconciles the immutable data + deletes into the latest data.
3. It spreads the load around the ring so that you can have hundreds of nodes doing this and not break a sweat: beast-like performance.
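The write and read paths above can be sketched as a toy, single-node model. This is only an illustration, not Cassandra's implementation: the class `ToyNode`, the `TOMBSTONE` marker, and the integer clock are assumptions made for the example, and Bloom filters, compaction, and replication are left out.

```python
import itertools

TOMBSTONE = object()  # hypothetical marker stored in place of a deleted value


class ToyNode:
    """Toy model of one node's write and read path (not Cassandra's real code)."""

    def __init__(self):
        self.commit_log = []   # append-only log of every write
        self.memtable = {}     # in-memory view of the most recent writes
        self.sstables = []     # flushed tables, never modified after flush
        self._clock = itertools.count(1)

    def write(self, key, value):
        ts = next(self._clock)
        self.commit_log.append((ts, key, value))
        self.memtable[key] = (ts, value)

    def delete(self, key):
        # A delete is just another write: it records a tombstone.
        self.write(key, TOMBSTONE)

    def flush(self):
        # Freeze the current memtable into a new sstable.
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def read(self, key):
        # Gather every stored version of the key; the newest timestamp wins.
        tables = self.sstables + [self.memtable]
        versions = [t[key] for t in tables if key in t]
        if not versions:
            return None
        ts, value = max(versions, key=lambda v: v[0])
        return None if value is TOMBSTONE else value


node = ToyNode()
node.write("user:1", "alice")
node.flush()
node.write("user:1", "alicia")
node.flush()
print(node.read("user:1"))   # newest version across two sstables
node.delete("user:1")
print(node.read("user:1"))   # the tombstone hides both older versions
```

Note that the read has to look at every table holding the key and that deleted data stays on disk behind the tombstone, which is why wide partitions and tombstones hurt reads.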
Cassandra Operational Pitfalls Visualized
Wide Partitions
01
Wide partitions will completely screw you over on reads and can take a node out if there’s traffic.
1. Monitor using cfstats (CompactedPartitionMaximumBytes)
2. Monitor in system.log (“Compacting large partition”)
3. Monitor using toppartitions
4. Monitor using OpsCenter (if using DataStax)
Data Skew
02
Bad key design can lead to really, really bad data skew. In the worst case, if a table has only 1 or 2 distinct keys, all of its data lives in one or two partitions (plus their replicas).
1. Monitor using cfstats (NumberOfKeys, SpaceUsedLive, ReadCounts, WriteCounts)
2. Monitor using OpsCenter (if using DataStax)
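Skew can also be estimated offline, before a key design ever reaches a cluster. The sketch below is an assumption-laden illustration rather than a Cassandra tool: the function `partition_skew` and the sample key lists are hypothetical, and it only counts rows per partition key value.

```python
from collections import Counter


def partition_skew(partition_keys):
    """Return (distinct partitions, largest partition's rows / mean rows per partition)."""
    counts = Counter(partition_keys)
    if not counts:
        return 0, 0.0
    mean = sum(counts.values()) / len(counts)
    return len(counts), max(counts.values()) / mean


# Hypothetical samples: one partition key value per row.
skewed = ["US"] * 98 + ["CA", "MX"]            # almost everything in one partition
even = [f"user-{i % 50}" for i in range(100)]  # 50 partitions, 2 rows each

print(partition_skew(skewed))
print(partition_skew(even))
```

A ratio close to 1.0 means rows are spread evenly across partitions; a small partition count or a large ratio is a warning sign worth testing under load.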
Tombstones
03
How to check for tombstones.
1. Monitor using cfstats (*Tombstones)
2. Monitor using syslog (“Tombstone Warn Threshold”)
3. Monitor using OpsCenter (if using DataStax)
Cassandra Data Modeling Best Practices
Good Key Design
01
Some things NOT to do.
1. Avoid using Integer/Long keys unless you couple them with another column in a composite partition key (or unless you can show, through realistic data generation, that they won’t coalesce data on a few nodes).
2. Avoid using time/date-based keys or TimeUUID unless you know for damn sure that you are going to continuously create data at a given interval all day, every day.
3. Don’t just import relational data and expect it to magically work.
Some things TO do.
1. A UUID will most likely work fine for any given table, but how do you find it again? You will need another table that has that information.
2. If you must use human-readable keys, you can use a synthetic sharding mechanism (next slide).
3. You can combine known values into a composite key and take a chance, but you should test it under load: (String, Integer, String, Integer).
Some things to REMEMBER.
1. Clustering keys don’t spread data around the cluster.
2. Remember that (PartitionKey, ClusteringKey) is different from ((PartitionKey1, PartitionKey2)): the first is one partition key plus a clustering key, the second is a composite partition key.
3. Use realistic data: to properly scale Cassandra or any other system, you need to create realistic data.
Spreading Data via Synthetic Sharding
01
Sometimes you need to use the human-readable key that you have, because that is the query path. How do you deal with that?
1. Primary Key: ((CountryName, StateName, CityName, CompanyName))
2. Integer Shard Added: ((CountryName, StateName, CityName, CompanyName, ShardNumber))
3. ShardNumber could be 1-10 or 1-100, depending on how badly your data is spreading.
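One way to assign the shard number is sketched below. It is a minimal illustration under stated assumptions: the helper names (`shard_for`, `write_key`, `read_keys`), the `row_id` argument, and the choice of CRC32 as a stable hash are all hypothetical and not part of the original deck.

```python
import zlib

SHARD_COUNT = 10  # assumption: shards are numbered 1-10


def shard_for(row_id: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a row identifier to a stable shard number in 1..shard_count."""
    # CRC32 is used because Python's built-in hash() changes between processes.
    return zlib.crc32(row_id.encode("utf-8")) % shard_count + 1


def write_key(country, state, city, company, row_id):
    """Partition key used when writing one row."""
    return (country, state, city, company, shard_for(row_id))


def read_keys(country, state, city, company, shard_count: int = SHARD_COUNT):
    """Every partition key a reader must query to see all rows for the company."""
    return [(country, state, city, company, s) for s in range(1, shard_count + 1)]


key = write_key("USA", "DC", "Washington", "Acme", "order-1001")
print(key)
print(len(read_keys("USA", "DC", "Washington", "Acme")))
```

The trade-off is on the read side: a query for one company now fans out across every shard, so the shard count should be only as large as the skew requires.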
Let’s say you are using a time-based key and notice coalescing around a particular time of day. You could consider making the weekday itself a part of the key.
1. Primary Key: (CreatedDate)
2. Week Day Number Added: ((CreatedDate, WeekDay))
3. WeekDay would be 0-6, mapped to Sunday-Saturday.
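The 0-6 mapping for Sunday-Saturday can be derived from a date as in the sketch below. The function names `week_day` and `partition_key`, and the use of an ISO date string for CreatedDate, are assumptions made for illustration.

```python
from datetime import date


def week_day(created: date) -> int:
    """Return 0-6 for Sunday-Saturday."""
    # isoweekday() is 1 (Monday) .. 7 (Sunday); modulo 7 moves Sunday to 0.
    return created.isoweekday() % 7


def partition_key(created: date):
    """Composite partition key (CreatedDate, WeekDay) for one row."""
    return (created.isoformat(), week_day(created))


print(partition_key(date(2024, 1, 7)))  # a Sunday
print(partition_key(date(2024, 1, 6)))  # a Saturday
```

Note that Python's own `date.weekday()` starts the week on Monday, so it would not match the Sunday-first numbering without this adjustment.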
Avoiding Tombstones
01
Just say no to tombstones! The reason tombstones exist is to make it possible to do insanely fast writes and updates and still be able to send the data back performantly. (Side conversation on queues as an anti-pattern.)
1. There is no need to set null values or delete data actively.
2. You can always do soft deletes or use TTL values that expire data automatically.
3. Watch out for prepared statements sending nulls.
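On the point about prepared statements sending nulls: binding a null to a column writes a tombstone for that column, so one option is to leave absent columns out of the statement entirely. The sketch below only builds the CQL text and bind values; `build_insert`, the `companies` table, and its columns are hypothetical, and the table and column names are assumed to come from trusted code rather than user input.

```python
def build_insert(table: str, row: dict):
    """Build an INSERT that names only the columns that actually have values."""
    present = {col: val for col, val in row.items() if val is not None}
    if not present:
        raise ValueError("row has no non-null columns")
    columns = ", ".join(present)
    placeholders = ", ".join("?" for _ in present)
    cql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    return cql, list(present.values())


cql, values = build_insert("companies", {"id": 1, "name": "Acme", "phone": None})
print(cql)     # the phone column is not mentioned at all
print(values)
```

Each distinct set of present columns produces a different statement, so an application using this approach would prepare and cache one statement per column combination.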
Questions?
We’re Partnering / Hiring
1. Professional Services
Datastax, Sitecore, Spark, Docker, Solr, Cassandra, Kafka, Elastic, AWS, Azure
2. Digital Services
React/Angular, TypeScript, ASP.NET, Node, Python
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037
Data & Analytics
Cassandra, DataStax, Kafka, Spark
Customer Experience
Sitecore
Information Systems
Salesforce, Quickbooks, and more
