I recently read about Cassandra concepts and internals to understand how it works and why it is suited to handling large volumes of data. This is a very interesting and also complex subject, and I have merely scratched the surface so far.
My experience with Cassandra concepts
Cassandra is an open source, scalable and highly available "NoSQL" distributed database
management system from Apache. It is classified under the column-family NoSQL
category. It was initially developed at Facebook and later became an Apache project. Its
core design draws on Amazon’s Dynamo (for the distributed architecture) and Google’s
Bigtable (for the data model).
Its support for dynamic columns and distributed counters addresses a major problem: many
statistics can be aggregated as the data is written, rather than being computed with
map/reduce at a later stage.
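To make the idea of aggregating at write time concrete, here is a minimal single-process sketch in Python. It only models the concept: the names `page_views` and `record_view` and the sample data are made up for illustration, and real Cassandra counters are replicated across nodes, which this sketch does not attempt to show.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for a counter column family:
# row key (a page) -> column key (a day) -> running count.
page_views = defaultdict(Counter)


def record_view(page, day, amount=1):
    """Bump the counter as the event arrives, instead of storing
    raw events and summing them in a later batch job."""
    page_views[page][day] += amount


# Illustrative events only; not data from the article.
sample_events = [
    ("/home", "2013-01-01"),
    ("/home", "2013-01-01"),
    ("/about", "2013-01-01"),
    ("/home", "2013-01-02"),
]
for page, day in sample_events:
    record_view(page, day)

# The statistic is already aggregated when we read it.
print(page_views["/home"]["2013-01-01"])
```

Reading a statistic is then a plain lookup by row key and column key, with no batch aggregation step in between.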
Another beautiful thing about Cassandra is that it can serve much of its data from cache (if
given enough RAM).
Cassandra Data Model
The Cassandra data model consists of a keyspace (analogous to a database), column
families (analogous to tables in the relational model), keys and columns. Here’s what the
basic Cassandra table (also known as a column family) structure looks like:
Figure 1-1: Structure of a super column family in Cassandra
Don’t think of a relational table
Instead, think of a nested, sorted map data structure.
The following relational model analogy is often used to introduce Cassandra to newcomers:
Figure 1-2: Relational vs. Cassandra Model
This analogy helps make the transition from the relational to the non-relational world. But don’t
use this analogy while designing Cassandra column families. Instead, think of the Cassandra
column family as a map of a map: an outer map keyed by a row key, and an inner map
keyed by a column key. Both maps are sorted.
SortedMap<RowKey, SortedMap<ColumnKey, ColumnValue>>
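The map-of-a-map view can be sketched in a few lines of Python. This is an in-memory illustration only, not how Cassandra stores data; `column_family`, `put` and `sorted_view` are hypothetical names, and the sample rows are invented.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for one column family:
# outer map: row key -> inner map; inner map: column key -> column value.
column_family = defaultdict(dict)


def put(row_key, column_key, value):
    """Store one column under a row."""
    column_family[row_key][column_key] = value


def sorted_view():
    """Return rows and their columns in key order, mimicking the
    two sorted maps in the analogy."""
    return [
        (row_key, sorted(columns.items()))
        for row_key, columns in sorted(column_family.items())
    ]


# Rows do not need to share the same set of columns.
put("user:2", "name", "Bob")
put("user:1", "name", "Alice")
put("user:1", "email", "alice@example.com")

for row_key, columns in sorted_view():
    print(row_key, columns)
```

Note that each row carries its own set of column keys, which is exactly where the relational-table picture starts to mislead.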
Why?
A nested sorted map is a more accurate analogy than a relational table, and will help you
make the right decisions about your Cassandra data model.
Figure 1-3: Cassandra Data Model
How?
A map gives efficient key lookup, and the sorted nature gives efficient scans. In
Cassandra, we can use row keys and column keys to do efficient lookups and range
scans.
The number of column keys is unbounded. In other words, you can have wide rows.
A key can itself hold a value. In other words, you can have a valueless column.
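With the default hash-based partitioning, however, rows are not stored in row-key order
across the cluster, so in practice it is safer to think of the outer map as unsorted:
Map<RowKey, SortedMap<ColumnKey, ColumnValue>>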
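The properties listed above (key lookup, range scans over column keys, wide rows and valueless columns) can be illustrated with a small Python sketch. The `Row` class, its method names and the sample keys are assumptions made for this example; it models the idea with a sorted list plus the standard `bisect` module and says nothing about Cassandra's actual storage engine.

```python
from bisect import bisect_left, bisect_right, insort


class Row:
    """One row whose column keys are kept sorted (illustrative only)."""

    def __init__(self):
        self._keys = []    # column keys, always in sorted order
        self._values = {}  # column key -> column value

    def put(self, column_key, value=b""):
        # An empty value models a "valueless" column:
        # the column key itself carries the information.
        if column_key not in self._values:
            insort(self._keys, column_key)
        self._values[column_key] = value

    def get(self, column_key):
        """Direct lookup by column key."""
        return self._values[column_key]

    def column_slice(self, start, end):
        """Return columns with start <= key <= end, in key order."""
        lo = bisect_left(self._keys, start)
        hi = bisect_right(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


rows = {}  # outer map (unsorted here): row key -> Row

# A wide row: one valueless column per event date (made-up data).
events = rows.setdefault("sensor-42", Row())
for day in ("2013-01-03", "2013-01-01", "2013-01-02", "2013-02-01"):
    events.put(day)

january = events.column_slice("2013-01-01", "2013-01-31")
print([key for key, _ in january])
```

Because the column keys stay sorted, the slice is found with two binary searches rather than a scan of the whole row, which is why wide rows remain practical to query by range.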
Conclusion
It’s important to think carefully about your data and your technology choices, and
sometimes it can be difficult to do that in a data vacuum. Cassandra, Hive, and Hadoop are
often considered good tools for many large-scale data challenges.
Your mileage may vary, but feel free to ask us questions in the comments!