DATA MODELING
CASSANDRA USING CQL3
MIKE BIGLAN & ELIJAH HAMOVITZ
ABOUT US
Mike Biglan, M.S.
• Twenty Ideas, Inc – twentyideas.com
• Analytic Spot – spo.td
Elijah Hamovitz
• code.org
• Analytic Spot – spo.td
SIMPLIFIED REPRESENTATION OF REALITY
A DATA MODEL MODELS DATA
• Framework to store and organize data
• Models things, their differences, and relationships between them
• Things can be real or virtual
IDEAL DATA MODEL PROPERTIES
• Easy to create
• Easy to interface with
• Quick, flexible querying
• Writing is direct and simple
• Easily understandable
• Scalable: can read, write, and store huge amounts of data safely
SCALE AWAY, SCALE AWAY, SCALE AWAY
• Availability: fault tolerance, redundancy, support for multiple data centers
• Consistency: strong or tunable
• Huge amounts of data (that won’t fit on a single server)
• High velocity of incoming and/or accessed data
COLUMNAR DATABASE
• Document-Store (e.g. MongoDB) and Columnar (e.g. Cassandra, HBase, Dynamo) are both “NoSQL”
• But modeling in Document-Store is quite different from modeling in Columnar
• Atomic unit of data storage
• Document-Store: document
• Relational Database: row
• Columnar: column
CASSANDRA COLUMN:
- NAME
- VALUE (OR TOMBSTONE)
- TIMESTAMP
- TTL
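The four parts of a column listed above can be sketched as a small Python record. This is only an illustration: the `Cell` class, the `reconcile` helper, and the use of whole seconds for both timestamp and TTL are assumptions made for the example, not Cassandra's internal types. It shows the two behaviours the fields imply: the newest timestamp wins when two versions of a column meet, and a tombstone or an expired TTL makes a column invisible to reads.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """Hypothetical model of one stored column: name, value, timestamp, TTL."""
    name: str
    value: Optional[object]    # None stands in for a tombstone (deleted value)
    timestamp: int             # write time, in seconds for this sketch
    ttl: Optional[int] = None  # seconds until expiry; None means no expiry

    def is_live(self, now: int) -> bool:
        """A cell is readable if it is not a tombstone and has not expired."""
        if self.value is None:
            return False
        return self.ttl is None or now < self.timestamp + self.ttl


def reconcile(a: Cell, b: Cell) -> Cell:
    """Last write wins: keep the version with the newer timestamp.

    Tie-breaking on equal timestamps is not modeled here.
    """
    return a if a.timestamp >= b.timestamp else b


written = Cell("Fred:role", "coder", timestamp=100)
deleted = Cell("Fred:role", None, timestamp=200)   # tombstone written later
expiring = Cell("Fred:badge", "visitor", timestamp=100, ttl=10)
```

Because every version carries its own timestamp, a write never has to look at what is already stored; the newer version is picked when the versions are compared later.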
CASSANDRA
Highly Available Distributed Columnar Datastore that’s:
• Near-linearly scalable
• Fault tolerant, no master
• Tunable consistency
• Performant, especially for writes – a write does not need to read first
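"Tunable consistency" is usually explained with a little arithmetic, sketched below. This is the general replica-overlap rule rather than anything stated on the slides, and the function names are made up for the example: if the number of replicas that acknowledge a write plus the number consulted on a read exceeds the replication factor, every read overlaps at least one replica that saw the latest write.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def reads_overlap_writes(replication_factor: int,
                         write_replicas: int,
                         read_replicas: int) -> bool:
    """True when any read set must share a replica with any write set."""
    return write_replicas + read_replicas > replication_factor


rf = 3
strong = reads_overlap_writes(rf, quorum(rf), quorum(rf))  # 2 + 2 > 3
weak = reads_overlap_writes(rf, 1, 1)                      # 1 + 1 <= 3
```

Lowering either number trades that overlap guarantee for lower latency and higher availability, which is the "tunable" part.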
GETTING ON THE SCALE
Phase 1 Install Cassandra
Phase 2
Phase 3 Scale!
CQL != SQL
SQL / CQL STATEMENT FLOW
• SQL execution is complex
• CQL execution is relatively simple, hence the tiny subset of syntax
• Much of CQL Query complexity is which node(s) to fetch/write/confirm data from/to
• So denormalize!
Relational Database: SQL Statement → Syntax/Semantic Check → Query Plan & Optimization → Query Execution (Data Store) → Result
Cassandra: CQL Statement → Syntax/Semantic Check → Query Execution (Data Store) → Result
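The "which node(s)" step in the Cassandra flow can be pictured with a toy placement function. Everything here is an assumption for illustration: the node names are hypothetical, MD5 stands in for Cassandra's partitioner (Murmur3 by default), and simple modulo placement stands in for token ranges on a ring. The point is only that the partition key alone decides where data lives, so a query that supplies it can go straight to the right replicas.

```python
import hashlib
from typing import List

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster


def token(partition_key: str) -> int:
    """Hash the partition key to a number (stand-in for the real partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def replicas_for(partition_key: str, replication_factor: int = 2) -> List[str]:
    """Pick the owning node, then the next nodes in order as extra replicas."""
    start = token(partition_key) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(replication_factor)]


owners = replicas_for("Foo, inc")
```

A query without the partition key gives this function nothing to hash, which is why such queries have to ask every node.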
CQL != SQL
SELECT <col1>, <col2>, …
FROM <table>
LEFT JOIN <table2>…
WHERE <where-clause>
GROUP BY <colx> HAVING …
ORDER BY <order-clause>
SEVERELY LIMITED
• CQL syntax is a small subset of SQL
mechanics.flite.com/blog/2013/11/05/breaking-down-the-cql-where-clause/
SO WHY THE LIMITATIONS?
If you think of Cassandra as a relational database, it’s hard to understand:
• what is easy
• what is hard
• what is impossible
“Language serves not only to
express thought but to make
possible thoughts which could not
exist without it.”
— Bertrand Russell
THE DISTORTION OF CQL
• Broken mental model hinders optimal modeling
• CQL falsely implies a relational data model and the design patterns that go with it
• To model Cassandra well, know the underlying data structure
DATA MODELING IN SQL
(NO SHARDING)
1. What are the data?
2. What is the normalized data model?
… months pass …
3. How are the data going to be queried?
4. Optimize any slow areas and/or bottlenecks
• Add indexes, memcached/redis, sphinx/solr/elasticsearch, etc.
DATA MODELING IN CQL
1. What are the data?
2. What read-queries are needed?
3. How to denormalize during writes?
• on initial write, or use external tools to make this sane
(Some) “premature” optimization is inherent and unavoidable
DATA ECOSYSTEM
To fully (and efficiently) get everything you are used to from SQL, you must rely on the big(ish) data ecosystem:
• Elasticsearch, Solr, Sphinx
• Redis, Memcached
• Spark
• Spark Streaming or Storm (and Kafka)
COMPLEXITY OF INITIAL DATA MODEL
• Modeling with Relational DB
  • Items & their relationships
• Modeling with Cassandra
  • Items & their relationships
  • How/Where they are stored (sharding and hot spots)
  • What data we want to read
  • How (and how often) we write data into those models
CAN OPTIMIZE LATER
TO MODEL, OPEN THE BLACK BOX
• Goal of a good black box is you can do a lot without knowing much about what’s inside
• CQL DOES NOT allow you to ignore what’s inside Cassandra
IN THE BEGINNING: THRIFT
• Around Cassandra version 0.8, CQL started replacing Thrift
• Thrift was too low-level, but its interface mapped closely to the underlying Cassandra data structure
THRIFT & CQL TERMINOLOGY
THRIFT | CQL
--------------------------------------------------------
COLUMN FAMILY | TABLE
ROW | PARTITION
COLUMN | CELL
[CELL NAME COMPONENT OR VALUE] | COLUMN
[GROUP OF CELLS WITH SHARED COMPONENT PREFIXES] | ROW
www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
CQL TABLE
CREATE TABLE
employees (
company text,
name text,
age int,
role text,
PRIMARY KEY
((company), name)
);
AND CQL SELECT
> SELECT * from employees;
company | name | age | role
----------------------------
Foo, inc | Fred | 31 | coder
Foo, inc | Sara | 39 | boss
BarCo | Bill | 50 | SQL guru
BarCo | Jane | 20 | hotshot
CREATE TABLE
employees (
company text,
name text,
age int,
role text,
PRIMARY KEY
((company), name)
);
VS HOW IT’S STORED
employees = {
"Foo, inc" : {
"Fred:age" : 31,
"Fred:role" : "coder",
"Sara:age" : 39,
"Sara:role" : "boss"
},
"BarCo" : {
"Bill:age" : 50,
"Bill:role" : "SQL guru",
"Jane:age" : 20,
"Jane:role" : "hotshot"
}
}
CREATE TABLE
employees (
company text,
name text,
age int,
role text,
PRIMARY KEY
((company), name)
);
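The step from the SELECT output to the stored form can be sketched in a few lines of Python. This is a simplified model of the layout shown above, not Cassandra's actual storage code, and `to_storage` is a name invented for the example. Each row is filed under its partition key, and every remaining column becomes one entry named "<clustering value>:<column name>".

```python
def to_storage(rows, partition_key, clustering_key):
    """Fold CQL-style rows into {partition: {"<clustering>:<column>": value}}."""
    storage = {}
    for row in rows:
        partition = storage.setdefault(row[partition_key], {})
        for column, value in row.items():
            if column in (partition_key, clustering_key):
                continue  # key columns become part of the names, not values
            partition[f"{row[clustering_key]}:{column}"] = value
    return storage


rows = [
    {"company": "Foo, inc", "name": "Fred", "age": 31, "role": "coder"},
    {"company": "Foo, inc", "name": "Sara", "age": 39, "role": "boss"},
    {"company": "BarCo", "name": "Bill", "age": 50, "role": "SQL guru"},
    {"company": "BarCo", "name": "Jane", "age": 20, "role": "hotshot"},
]

employees = to_storage(rows, partition_key="company", clustering_key="name")
```

Four CQL rows become two partitions of four entries each, which is why a partition grows wider with every row that shares its partition key.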
WIDE PARTITIONS
(FORMERLY WIDE “ROWS”)
www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure (p 51)
www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
• Based on Thrift “rows”, so actually wide “partitions”
• Stored column names are the clustering key values with the column name as a suffix
• Up to 2 billion cells per partition (but don’t do this)
SETS, MAPS, AND LISTS: OH MY
• Sets, maps, and lists are still stored at the column level
• This enables schemaless modeling, but can result in long column names
www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure (p 52-71)
PARTITION AND CLUSTERING KEY
CQL Where-Clause variants
• None (kind of)
• key1 & key2
• key1 & key2 & key3
• key1 & key2 & key3 & key4
company | year | month | day | employee | reason
-------------------------------------------------
Foo, Inc | 2014 | 11 | 27 | Fred | Thanksgiving
Foo, Inc | 2014 | 12 | 25 | Fred | Christmas
Foo, Inc | 2014 | 12 | 25 | Sara | Christmas
Foo, Inc | 2014 | 12 | 26 | Sara | Boxing day
CREATE TABLE breaks (
company text,
year int,
month int,
day int,
employee text,
reason text,
PRIMARY KEY
((company, year),
month, day, employee)
);
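The where-clause variants listed earlier follow one rule for this table, sketched below under simplifying assumptions: only equality restrictions are considered, and secondary indexes, range conditions, and ALLOW FILTERING are ignored. The function name is invented for the example. A lookup must name every partition key column, and any clustering columns it names must form a prefix of month, day, employee. The unrestricted "None (kind of)" case is not modeled.

```python
PARTITION_KEY = ("company", "year")
CLUSTERING_KEY = ("month", "day", "employee")


def is_key_lookup(restricted_columns: set) -> bool:
    """Check whether equality restrictions on these columns form a key lookup."""
    if not set(PARTITION_KEY) <= restricted_columns:
        return False  # the whole partition key is needed to find the node
    clustering = restricted_columns - set(PARTITION_KEY)
    if not clustering <= set(CLUSTERING_KEY):
        return False  # a non-key column such as "reason" is restricted
    # Clustering columns must be used left to right with no gaps.
    return clustering == set(CLUSTERING_KEY[:len(clustering)])


ok = is_key_lookup({"company", "year", "month"})
skips_month = is_key_lookup({"company", "year", "day"})
```

Restricting day without month fails because the data inside a partition is ordered by month first, so there is no contiguous run of entries for "any month, day 25".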
COMPOSITE KEYS
CREATE TABLE breaks (
company text,
year int,
month int,
day int,
employee text,
reason text,
PRIMARY KEY
((company, year),
month, day, employee)
);
breaks = {
"Foo, Inc:2014" : {
"11" : {
"27" : {
"Fred:reason" : "Thanksgiving"
}
},
"12" : {
"25" : {
"Fred:reason" : "Christmas",
"Sara:reason" : "Christmas"
},
"26" : {
"Sara:reason" : "Boxing day"
}
}
}
}
THEN WHAT IS EASY?
With a dictionary of ordered dictionaries:
• Grabbing the data (or subset) from a partition-key
• Getting a slice of data (uses linear search) based on a partition-key
breaks = {
"Foo, Inc:2014" : {
"11" : {
"27" : {
"Fred:reason" : "Thanksgiving"
}
},
"12" : {
"25" : {
"Fred:reason" : "Christmas",
"Sara:reason" : "Christmas"
},
"26" : {
"Sara:reason" : "Boxing day"
}
}
}
}
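A slice can be sketched as an in-order scan of one partition. The data layout here is an assumption chosen for the example: each partition is a list of (clustering key, reason) pairs already sorted by clustering key, which flattens the nested dictionaries above into one ordered sequence. `slice_partition` is an invented helper, not a Cassandra API.

```python
breaks = {
    ("Foo, Inc", 2014): [
        ((11, 27, "Fred"), "Thanksgiving"),
        ((12, 25, "Fred"), "Christmas"),
        ((12, 25, "Sara"), "Christmas"),
        ((12, 26, "Sara"), "Boxing day"),
    ],
}


def slice_partition(table, partition_key, prefix):
    """Return the entries whose clustering key starts with `prefix`.

    Scans the partition in order and stops once the matching run ends,
    which works because entries are sorted by clustering key.
    """
    result = []
    for clustering, reason in table.get(partition_key, []):
        if clustering[:len(prefix)] == prefix:
            result.append((clustering, reason))
        elif result:
            break  # past the contiguous block of matches
    return result


december = slice_partition(breaks, ("Foo, Inc", 2014), (12,))
christmas = slice_partition(breaks, ("Foo, Inc", 2014), (12, 25))
```

Both slices are contiguous runs inside a single partition, which is what makes them cheap; a question such as "all breaks on the 25th of any month" has no such run.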
WHAT IS HARD?
(I.E. COMMON SQL PATTERNS)
• Unique and Group by
• Ordered
• Inverted Index
GROUP-BY & COUNTERS
• Often group-by is used for counting
• Use counter columns or other tools (e.g. Elasticsearch)
CREATE TABLE
employee_break_counts (
company text,
employee text,
break_counts counter,
PRIMARY KEY
((company), employee)
);
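The counter table replaces a read-time GROUP BY with a write-time increment. A rough Python model of that idea, using nested dictionaries in place of the table and an invented `record_break` helper:

```python
from collections import defaultdict

# company -> employee -> break_counts (models the counter table above)
employee_break_counts = defaultdict(lambda: defaultdict(int))


def record_break(company: str, employee: str) -> None:
    """Increment the counter at write time.

    Plays the role of a CQL counter update of the form
    UPDATE employee_break_counts SET break_counts = break_counts + 1
    WHERE company = ? AND employee = ?
    """
    employee_break_counts[company][employee] += 1


# Replay the four rows of the breaks table shown earlier.
for company, employee in [
    ("Foo, Inc", "Fred"),
    ("Foo, Inc", "Fred"),
    ("Foo, Inc", "Sara"),
    ("Foo, Inc", "Sara"),
]:
    record_break(company, employee)
```

Reading the count for one employee is then a single key lookup instead of a scan followed by aggregation.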
ORDERING OR INVERTED-INDEX
• Redundant table, but ordered by new “column”
• Depending on needs, this can store the order-field and lookup key OR some/all of the other data in that table
• If a read will generate more than a few subsequent child reads then some/all the other data should be included
CREATE TABLE
employees_by_age (
company text,
id int,
age int,
name text,
role text,
PRIMARY KEY
((company), age, id)
);
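The redundant-table pattern amounts to writing each record twice, once per read path. The sketch below models both tables as in-memory structures; the `add_employee` helper and the numeric ids are hypothetical, and in a real system the two writes would be two CQL inserts. The second structure keeps its entries sorted by (age, id), mirroring the clustering order of employees_by_age.

```python
import bisect

employees = {}         # company -> {name: {"age": ..., "role": ...}}
employees_by_age = {}  # company -> list of (age, id, name, role), kept sorted


def add_employee(company, emp_id, name, age, role):
    """Denormalize on write: store the record under both access paths."""
    employees.setdefault(company, {})[name] = {"age": age, "role": role}
    # insort keeps the list ordered by (age, id), like the clustering key
    bisect.insort(employees_by_age.setdefault(company, []),
                  (age, emp_id, name, role))


add_employee("Foo, inc", 1, "Fred", 31, "coder")
add_employee("Foo, inc", 2, "Sara", 39, "boss")
add_employee("BarCo", 3, "Bill", 50, "SQL guru")
add_employee("BarCo", 4, "Jane", 20, "hotshot")

youngest_first = [name for _, _, name, _ in employees_by_age["BarCo"]]
```

Because name and role are copied into the age-ordered structure, an "ordered by age" read needs no follow-up lookups, which is the trade-off the last bullet above describes.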
C* MODELING ANTI-PATTERNS
C* GUIDELINES | SQL GUIDELINES
-----------------------------------------------------
WRITES ARE CHEAP/FAST | MINIMIZE WRITES
STORAGE IS CHEAP | MINIMIZE DUPLICATION OF DATA
PARTITIONS ARE INHERENT | SHARD AT YOUR OWN RISK
STRICT COMPOSITE KEYS | FLEXIBLE SECONDARY INDEXES
SIMPLE QUERIES | COMPLEX QUERIES
C* MODELING PATTERNS
C* GUIDELINES | C* PATTERNS
-----------------------------------------------------
WRITES ARE CHEAP/FAST, STORAGE IS CHEAP | DUPLICATE YOUR DATA
PARTITIONS ARE INHERENT | AVOID HOT SPOTS
STRICT COMPOSITE KEYS, SIMPLE QUERIES | DESIGN TABLES AROUND QUERIES
Questions?
Mike Biglan
mike@twentyideas.com
@twentyideas
Elijah Hamovitz
elijah@code.org

Cassandra Data Modelling with CQL (OSCON 2015)