CASSANDRA - ARCHITECTURE & DATA MODELING
REQUISITE SLIDE – WHO AM I?
- Brian Enochson
- Home is the Jersey Shore
- SW Engineer who has worked as designer / developer on Cassandra
- Consultant – HBO, ACS, CIBER –
- Specialize in SW Development, architecture and training
REQUISITE SLIDE #2 – WHAT ARE WE TALKING ABOUT?
• Cassandra Intro & Architecture
• Why Cassandra
• Architecture
• Internals
• Development
• Data Modeling Concepts
• Old vs. New Way
• Basics
• Composite Types
• Collections
• Time Series Data
• Counters
REQUISITE SLIDE #3 – THE NAME CASSANDRA
Cassandra was the most beautiful of the daughters of Priam and Hecuba, the
king and queen of Troy. She was given the gift of prophecy by Apollo who
wished to seduce her; when she accepted his gift but refused his sexual
advances, he deprived her prophecies of the power to persuade.**
Relate this to a database ->
Can see into the future. Good result for a database. This is good…
Can predict the future, but no one believed what was said!!! This is not so
good….
** http://www.pantheon.org/articles/c/cassandra.html
Anyway, Cassandra is also universally known as C*
HISTORY
• Developed At Facebook, based on Google Big Table and Amazon Dynamo **
• Open Sourced in mid 2008
• Apache Project March 2009
• Commercial support through DataStax (originally known as Riptano, founded
2010)
• Used at Netflix, eBay and many more. The largest reported installation is
300 TB on 400 machines
• Current version is 1.2.x, 2.0 in RC1.
WHY EVEN CONSIDER C*
• Large data sets
• Require high availability
• Multi Data Center
• Require large scaling
• Write heavy applications
• Can design for queries
• Understand tunable consistency and implications (more to come)
• Willing to make the effort upfront for the reward
SOME BASICS
• ACID
• CAP Theorem
• BASE
ACID
Everyone has heard of ACID
• Atomic – All or None
• Consistency – What is written is valid
• Isolation – One operation at a time
• Durability – Once committed to the DB, it stays
This is the world we have lived in for a long time…
CAP THEOREM (BREWER'S)
Many may have heard this one
CAP stands for Consistency, Availability and Partition Tolerance
• Consistency – every node sees the same data at the same time. Not quite the
C in ACID, which is about valid state.
• Availability – service is available.
• Partition Tolerance – no failure short of complete network failure causes the
system to stop responding
So.. What does this mean?
** http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
YOU CAN ONLY HAVE 2 OF THEM
Or, better said in C* terms: you can have Availability and Partition Tolerance
AND Eventual Consistency.
This means that eventually all accesses will return the last updated value.
BASE
But maybe you have not heard this one…
Somewhat contrived but gives CAP Theorem an acronym to use against ACID…
Also created by Eric Brewer.
Basically Available – system does guarantee availability, as much as possible.
Soft State – state may change even without input. Required because of
eventual consistency
Eventually Consistent – it will become consistent over time.
** Also, as engineers we cannot believe in anything that isn’t an acronym!
C* - WHAT PROBLEM IS BEING SOLVED?
• Database for modern application requirements.
• Web Scale – massive amounts of data
• Scale Out – commodity hardware
• Flexible Schema (we will see how this concept is evolving)
• Online admin (add to cluster, load balancing). Simpler operations
• CAP Theorem Aware
• Built based on
• Amazon Dynamo – Took partition and replication from here **
• Google Bigtable – log structured column family from here ***
** http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
*** http://research.google.com/archive/bigtable.html
C* BASICS
• No Single Point of Failure – highly available.
• Peer to Peer – no master
• Data Center Aware – distributed architecture
• Linear Scaling – just add hardware
• Eventual Consistency, tunable tradeoff between latency and consistency
• Architecture is optimized for writes.
• Can have 2 billion columns!
• Data modeling for reads. Design starts with looking at your queries.
• With CQL became more SQL-Like, but no joins, no subqueries, limited ordering (but very useful)
• Column names can be part of the data, e.g. Time Series
Don’t be afraid of denormalized and redundant data for read performance.
In fact embrace it! Remember, writes are fast.
NOTE ABOUT EVENTUAL CONSISTENCY
** Important Term **
Quorum: Q = floor(N / 2) + 1 (integer division, so N = 3 gives Q = 2).
We get consistency in a BASE world by satisfying W + R > N
3 obvious ways:
1. W = 1, R = N
2. W = N, R = 1
3. W = Q, R = Q
(N is replication factor, R = read replica count, W = write replica count)
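A minimal sketch of the quorum arithmetic above, in Python. The function names are made up for illustration; they are not part of any Cassandra API.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas (integer division)."""
    return n // 2 + 1


def read_sees_latest_write(w: int, r: int, n: int) -> bool:
    """True when every read set must overlap every write set: W + R > N."""
    return w + r > n


n = 3
q = quorum(n)
# The three combinations from the slide, plus W = R = 1 as a counter-example.
for w, r in [(1, n), (n, 1), (q, q), (1, 1)]:
    print(f"N={n} W={w} R={r} overlap={read_sees_latest_write(w, r, n)}")
```

With N = 3 the first three combinations overlap; W = 1, R = 1 does not, which is the purely eventual case.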
THE C* DATA MODEL
C* data model is made of these:
Column – a name, a value and a timestamp. Applications can use the name as
the data and not use value. (RDBMS like a column).
Row – a collection of columns identified by a unique key. Key is called a partition
key (RDBMS like a row).
Column Family – container for an ordered collection of rows. Each row is an
ordered collection of columns. Each column has a name and maybe a value.
(RDBMS like a table). This is also known as a table now in C* terms.
Keyspace – administrative container for CFs. It is a namespace. Also has a
replication strategy – more later. (RDBMS like a DB or schema).
Super Column Family – say what?
SUPER COLUMN FAMILY
Not recommended, but they exist. Rarely discussed.
It is a key that maps to one or more nested keys, each of which
contains a collection of columns.
Can think of it as a hash table of hash tables that contain columns.
ARCHITECTURE (CONT.)
http://www.slideshare.net/gdusbabek/data-modeling-with-cassandra-column-families
OR CAN ALSO BE VIEWED AS…
http://www.slideshare.net/gdusbabek/data-modeling-with-cassandra-column-families
TOKENS
Tokens – partitioner dependent element on the ring.
Each node has a single unique token assigned.
Each node claims a range of tokens that is from its token to token of the previous node on the
ring.
Use this formula
Initial_Token= Zero_Indexed_Node_Number * ((2^127) / Number_Of_Nodes)
In cassandra.yaml
initial_token: 42535295865117307932921825928971026432
** http://blog.milford.io/cassandra-token-calculator/
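The token formula above can be sketched in a few lines of Python. This assumes the 2^127 ring of RandomPartitioner, as on the slide; the function name is illustrative only.

```python
def initial_tokens(number_of_nodes: int) -> list[int]:
    """Evenly spaced tokens on a ring of size 2**127, one per node."""
    ring_size = 2 ** 127
    return [node * (ring_size // number_of_nodes) for node in range(number_of_nodes)]


for node, token in enumerate(initial_tokens(4)):
    print(f"node {node}: initial_token: {token}")
```

For a four-node cluster, node 1 gets the value shown in the cassandra.yaml line above.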
C* PARTITIONER
RandomPartitioner – MD5 hash of key is token (128 bit number), gives you
even distribution in cluster. Default <= version 1.1
OrderPreservingPartitioner – tokens are UTF-8 strings. Uneven distribution.
Murmur3Partitioner – same functionally as RandomPartitioner, but is 3 – 5
times faster. Uses Murmur3 algorithm. Default >= 1.2
Set in cassandra.yaml
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
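To make the "hash of key is token" idea concrete, here is a rough Python sketch of an MD5-based token. It is a simplified stand-in for what RandomPartitioner does, not its actual code: it reads the digest as a signed integer and takes the absolute value, which is an assumption made here so that tokens land in the 0 to 2^127 range.

```python
import hashlib


def md5_token(partition_key: str) -> int:
    """Map a partition key to a ring position in 0 .. 2**127 (illustrative)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    # Read the 16 bytes as a signed 128-bit integer, then drop the sign.
    return abs(int.from_bytes(digest, byteorder="big", signed=True))


for key in ["039", "057", "064"]:
    print(key, md5_token(key))
```

Similar keys hash to unrelated ring positions, which is where the even distribution comes from.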
REPLICATION
• Replication is how many copies of each piece of data should be stored.
In C* terms it is the Replication Factor or “RF”.
• In C* RF is set at the keyspace level:
CREATE KEYSPACE drg_compare WITH replication = {'class':'SimpleStrategy',
'replication_factor':3};
• How the data is replicated is called the Replication Strategy
• SimpleStrategy – returns nodes “next” to each other on ring, Assumes
single DC
• NetworkTopologyStrategy – for configuring per data center. Rack and
DC’s aware.
update keyspace UserProfile with strategy_options=[{DC1:3, DC2:3}];
SNITCH
• Snitch maps IP’s to racks and data centers.
• Several kinds that are configured in cassandra.yaml. Must be same across the
cluster.
SimpleSnitch - does not recognize data center or rack information. Use it for
single-data center deployments (or single-zone in public clouds)
PropertyFileSnitch - This snitch uses a user-defined description of the network
details located in the property file cassandra-topology.properties. Use this snitch
when your node IPs are not uniform or if you have complex replication grouping
requirements.
RackInferringSnitch - The RackInferringSnitch infers (assumes) the topology of
the network from the octets of the node's IP address.
EC2* - Ec2Snitch, Ec2MultiRegionSnitch
RING TOPOLOGY
When thinking of Cassandra, it is best to think of the nodes as part of a ring
topology, even for multiple DCs.
SimpleStrategy
Using token generation values from before. 4 node cluster. Write value with
token 32535295865117307932921825928971026432
SimpleStrategy #2
SimpleStrategy #3
With RF of 3 replication works like this:
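A minimal Python sketch of SimpleStrategy placement, assuming the four-node token layout from the TOKENS slide and the rule that a node owns the range ending at its own token. The function name is hypothetical.

```python
from bisect import bisect_left

RING_SIZE = 2 ** 127


def simple_strategy_replicas(key_token: int, node_tokens: list[int], rf: int) -> list[int]:
    """Return the node indices (in token order) that hold a key."""
    tokens = sorted(node_tokens)
    # The first node whose token is >= the key token owns the key;
    # past the last token we wrap around to the first node.
    first = bisect_left(tokens, key_token) % len(tokens)
    count = min(rf, len(tokens))
    # The remaining replicas are simply the next nodes clockwise on the ring.
    return [(first + i) % len(tokens) for i in range(count)]


node_tokens = [node * (RING_SIZE // 4) for node in range(4)]
key_token = 32535295865117307932921825928971026432
print(simple_strategy_replicas(key_token, node_tokens, rf=3))  # prints [1, 2, 3]
```

The example token sits between node 0 and node 1, so node 1 takes the first copy and nodes 2 and 3 take the other two.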
NetworkTopologyStrategy
Using LOCAL_QUORUM allows the write to DC #2 to be asynchronous. The write is
marked as a success once 2 of the 3 local nodes acknowledge it
(http://www.datastax.com/dev/blog/deploying-cassandra-across-multiple-data-centers)
COORDINATOR & CL
• When writing, a Coordinator Node will be selected. Selected at write (or read) time.
Not a SPOF (single point of failure)!
• Using Gossip Protocol nodes share information with each other. Who is up, who
is down, who is taking which token ranges, etc. Every second, each node shares
with 1 to 3 nodes.
• Consistency Level (CL) – says how many nodes must agree before an operation
is a success. Set at read or write operation.
• ONE – coordinator will wait for one node to ack write (also TWO, THREE). One is
default if none provided.
• QUORUM – we saw that before: N / 2 + 1. Also LOCAL_QUORUM, EACH_QUORUM
• ANY – waits for any node to accept the write (a stored hint counts). If all
replicas are down, still succeeds. Only for writes. Doesn’t guarantee it can be read.
• ALL– Blocks waiting for all replicas
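As a rough illustration of the consistency levels above, this Python sketch works out how many replica acknowledgements each level needs for a given per-data-center replication factor. The function and its arguments are hypothetical; ANY is left out because a stored hint can satisfy it without any replica responding.

```python
def quorum(n: int) -> int:
    return n // 2 + 1


def replica_acks_required(level: str, rf_by_dc: dict[str, int], local_dc: str) -> int:
    """Number of replica acks the coordinator waits for (illustrative)."""
    total_rf = sum(rf_by_dc.values())
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        return fixed[level]
    if level == "QUORUM":
        return quorum(total_rf)
    if level == "LOCAL_QUORUM":
        return quorum(rf_by_dc[local_dc])
    if level == "EACH_QUORUM":
        # A quorum is needed in every DC; this is the total across DCs.
        return sum(quorum(rf) for rf in rf_by_dc.values())
    if level == "ALL":
        return total_rf
    raise ValueError(f"unsupported consistency level: {level}")


rf = {"DC1": 3, "DC2": 3}
for cl in ["ONE", "QUORUM", "LOCAL_QUORUM", "EACH_QUORUM", "ALL"]:
    print(cl, replica_acks_required(cl, rf, local_dc="DC1"))
```

With three replicas in each of two data centers, LOCAL_QUORUM needs only 2 local acks while QUORUM needs 4 across both.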
ENSURING CONSISTENCY
3 important concepts:
Read Repair - At time of read, inconsistencies are noticed between nodes and
replicas are updated. Direct and background. Direct is determined by CL.
Anti-Entropy Node Repair - For data that is not read frequently, or to update
data on a node that has been down for a while, run the nodetool repair process
(also called anti-entropy repair). It builds Merkle trees, compares nodes and
does the repair.
Hinted Handoff - Writes are always sent to all replicas for the specified row
regardless of the consistency level specified by the client. If a node happens
to be down at the time of write, its corresponding replicas will save hints
about the missed writes, and then handoff the affected rows once the node
comes back online. This notification happens via Gossip. Default 1 hour.
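The Merkle tree comparison used by anti-entropy repair can be sketched roughly as follows. This is a toy Python version, assuming both replicas split their data into the same power-of-two number of ranges; it is not how Cassandra's repair code is organized.

```python
import hashlib


def _hash(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_levels(ranges: list[bytes]) -> list[list[bytes]]:
    """Hash levels from the leaves (index 0) up to the root (last index)."""
    level = [_hash(r) for r in ranges]
    levels = [level]
    while len(level) > 1:
        level = [_hash(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_ranges(a: list[bytes], b: list[bytes]) -> list[int]:
    """Walk down from the root, following only subtrees whose hashes differ."""
    la, lb = merkle_levels(a), merkle_levels(b)
    suspects = [0]  # start at the root
    for depth in range(len(la) - 1, 0, -1):
        suspects = [i for i in suspects if la[depth][i] != lb[depth][i]]
        suspects = [child for i in suspects for child in (2 * i, 2 * i + 1)]
    return [i for i in suspects if la[0][i] != lb[0][i]]


replica_1 = [b"range0", b"range1", b"range2", b"range3"]
replica_2 = [b"range0", b"range1", b"STALE", b"range3"]
print(differing_ranges(replica_1, replica_2))  # prints [2]
```

Only the ranges whose hashes differ need to be streamed between nodes; identical roots mean nothing is exchanged.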
ENTER VNODES
• Virtual Nodes introduced with Cassandra 1.2
• Each physical server owns many small token ranges instead of one
• Why
• Distribution of load
• No token management
• Concurrent streaming across hosts
• Two ways to specify in cassandra.yaml
• initial_token: <token1>,<token2>,<token3>,<token4>, …
Or
• num_tokens: 256 # recommended as a starter
VNODES - HOW DOES IT LOOK
STORAGE
Commit Log – Cassandra appends to the commit log first. Uses fsync in the
background to force writing of these changes. Gives durability: writes can be
redone in case of a crash.
Memtable – in-memory cache sorted by key. After appending to the commit log,
Cassandra writes to the Memtable. Then the write is considered successful. Each
CF has its own Memtable.
SSTables – immutable files. When a memtable runs out of space or hits a defined key
limit, it is written out to an SSTable asynchronously. When the number of SSTables
reaches a threshold, (minor) compaction takes place and they are merged.
Bloom Filter – for each SSTable, checks here first if key exists.
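The interplay of commit log, memtable, SSTables and Bloom filter can be sketched as a toy in-memory store in Python. All class and parameter names are invented for illustration, and the sketch skips fsync, compaction and commit log recycling.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, bits: int = 256, hashes: int = 3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.field |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.field >> pos) & 1 for pos in self._positions(key))


class ColumnFamilyStore:
    def __init__(self, memtable_limit: int = 2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory, flushed in key order
        self.sstables = []     # immutable (bloom, data) pairs, newest last
        self.memtable_limit = memtable_limit

    def write(self, key: str, value) -> None:
        self.commit_log.append((key, value))   # 1. commit log first
        self.memtable[key] = value             # 2. then the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.sstables.append((bloom, dict(sorted(self.memtable.items()))))
        self.memtable = {}

    def read(self, key: str):
        if key in self.memtable:
            return self.memtable[key]
        for bloom, data in reversed(self.sstables):   # newest SSTable first
            if bloom.might_contain(key) and key in data:
                return data[key]
        return None


store = ColumnFamilyStore(memtable_limit=2)
store.write("a", 1)
store.write("b", 2)   # memtable is full, so it is flushed to an SSTable
store.write("a", 3)   # newer value lives in the fresh memtable
print(store.read("a"), store.read("b"), store.read("c"))  # prints 3 2 None
```

Reads check the memtable first and then the SSTables from newest to oldest, skipping any SSTable whose Bloom filter rules the key out.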
C* PERSISTENCE MODEL
WRITE PATH
1. Determine all applicable replica nodes across all DC’s
2. Coordinator Node sends to all replicas in the local DC
3. Coordinator sends to ONE replica in remote DC
4. Selected remote replica then sends on to other remote replica nodes
5. All respond back to coordinator
• The CL is how long the coordinator blocks before it returns success or failure
to client. (remember ANY, ONE, TWO, LOCAL_QUORUM etc.)
• If a replica node is down (or does not respond), HINTED HANDOFF kicks in.
The hint is target replica id + mutation data.
• The maximum hint window is configurable; after it elapses, hints are no longer stored.
• Hinted handoff runs every 10 minutes and tries to send hints to nodes.
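A very small Python sketch of the coordinator's bookkeeping described above: live replicas acknowledge, down replicas get a hint, and the consistency level decides whether the client sees success. The function and its arguments are hypothetical, and real network calls and timeouts are left out.

```python
def coordinate_write(replica_up: dict[str, bool], acks_required: int, mutation):
    """Return (success, hints) for one write sent to all replicas."""
    acks = 0
    hints = []
    for replica, is_up in replica_up.items():
        if is_up:
            acks += 1                          # replica applied the write
        else:
            hints.append((replica, mutation))  # hint = target replica + mutation
    return acks >= acks_required, hints


ok, hints = coordinate_write(
    {"node1": True, "node2": True, "node3": False},
    acks_required=2,                           # e.g. QUORUM with RF = 3
    mutation=("row1", "value"),
)
print(ok, hints)  # prints True [('node3', ('row1', 'value'))]
```

The write succeeds at QUORUM even though node3 is down; the hint is what lets node3 catch up later.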
READ PATH
• Consistency level determines how many nodes must have data
• Read request goes to a single node, Storage Proxy, it determines nodes that
hold replicas of the data, uses Snitch to determine “proximity”.
• CL of ONE – returns the first response found
• CL of QUORUM – waits for a majority. Better guarantee of getting the most recent value.
• After the first read, a digest of the data is calculated and used to determine whether
the nodes are consistent. If not, Read Repair is activated in the background.
• Different SSTables may have different columns, so results need to be merged.
Reads involve checking the Bloom Filter to see if an SSTable has the key, the key cache to get
direct access to the data, and the row cache. Very dependent on configuration.
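The digest comparison and read repair step can be illustrated with a short Python sketch. It assumes each replica returns a (value, timestamp) pair and that the newest timestamp wins; the names are made up for this example.

```python
import hashlib


def digest(cell: tuple) -> str:
    """Hash of a (value, timestamp) pair, standing in for a replica's digest."""
    return hashlib.md5(repr(cell).encode()).hexdigest()


def coordinate_read(replica_cells: dict[str, tuple]):
    """Return the newest value and the replicas that need read repair."""
    newest = max(replica_cells.values(), key=lambda cell: cell[1])
    stale = sorted(
        name for name, cell in replica_cells.items()
        if digest(cell) != digest(newest)
    )
    return newest[0], stale


value, to_repair = coordinate_read({
    "node1": ("old", 1),
    "node2": ("new", 2),
    "node3": ("new", 2),
})
print(value, to_repair)  # prints new ['node1']
```

The client gets the newest value, and node1 is the replica that would be updated in the background.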
DELETIONS
Distributed systems present a unique problem for deletes. If C* actually deleted the data
and a node was down and didn’t receive the delete notice, that node would try to
re-create the record when it came back online. So…
C* uses the concept of a Tombstone. The data is replaced with a special value
called a Tombstone, which works within the distributed architecture.
Tombstones are cleaned up after a defined time (GCGraceSeconds, default 10
days) during compaction.
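Why a tombstone is needed can be shown with a small Python sketch. It assumes replicas reconcile by keeping the cell with the newest timestamp; the marker object and helper names are illustrative only.

```python
TOMBSTONE = object()  # marker value meaning "this cell was deleted"


def merge(a: dict, b: dict) -> dict:
    """Reconcile two replicas: per key, the cell with the newest timestamp wins."""
    merged = dict(a)
    for key, cell in b.items():
        if key not in merged or cell[1] > merged[key][1]:
            merged[key] = cell
    return merged


def visible(cells: dict) -> dict:
    """What a client sees: tombstoned cells are hidden."""
    return {key: value for key, (value, ts) in cells.items() if value is not TOMBSTONE}


node_b = {"row1": ("hello", 1)}   # node B was down and missed the delete

# Naive delete: node A just drops the row, so B's old copy comes back.
node_a_hard_delete = {}
resurrected = visible(merge(node_a_hard_delete, node_b))

# Tombstone delete: node A keeps a newer marker that beats B's old copy.
node_a_tombstone = {"row1": (TOMBSTONE, 2)}
after_tombstone = visible(merge(node_a_tombstone, node_b))

print(resurrected, after_tombstone)  # prints {'row1': 'hello'} {}
```

The tombstone has to outlive any node outage for this to work, which is the reason for the grace period before it is compacted away.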
THRIFT TO CQL
• Thrift - original RPC based API and is still fully supported. Used by clients
such as Hector, Pycassa, etc.
• Interact with Cassandra using the cassandra-cli.
• CQL3 – new and fully supported API. Table-oriented, schema-driven query
language. It resembles (very closely at times) SQL.
• Interact with Cassandra using cqlsh.
• Cassandra Table – defined as a sparse collection of well known and
ordered columns.
• Still uses same underlying storage as Thrift.
• More intuitive data modeling for many.
APPLICATION DEVELOPMENT
• Interaction with Cassandra can be done using one of supplied clients such as CLI
or CQL. Otherwise client applications are built using a language client library.
• Many clients in multiple languages. Including Java, .NET, Python, Scala, Go, PHP,
Node.js, Perl, Ruby, etc.
• Java:
• Hector wraps the underlying Thrift API. Hector is one of the most commonly
used client libraries.
• Astyanax is a client library developed by Netflix.
• DataStax CQL – newest CQL driver, will feel very familiar to JDBC
developers
• And many more … (JPA)
• DataStax OpsCenter, various other GUIs and a REST API (Virgil) also exist
HECTOR EXAMPLE
DATASTAX CQL DRIVER
From version 1.2 onward. Uses the new CQL method for C* interaction.
INSTALLATION
You have 3 options
• Cassandra Cluster Manager – installs multi node cluster automatically with just a few
parameters. Great for dev and testing (and VERY cool).
• https://github.com/pcmanus/ccm
• Datastax Community Edition – Ease of installation and you get free version of
OpsCenter. Has a Windows installer (msi) if using Windows.
• http://planetcassandra.org/Download/DataStaxCommunityEdition **
• Open Source Version – from Cassandra, most recent versions, RC’s etc.
• http://cassandra.apache.org/download/.
• CQL Driver - http://www.datastax.com/download/clientdrivers
** Planet Cassandra - Great source of Cassandra info!!
DATA MODELING
• With NoSQL you don’t need to model your data in Cassandra. Or do you?
• Flexible schema = no (or little) data modeling done.
• Proved to be problematic from application and query perspective.
• Cassandra is extremely good at reading data in the order it was stored.
• CQL makes data modeling much easier, a nice bridge from the SQL world if that
knowledge is there.
• Typically when modeling in C*, you will denormalize data to use the storage engine order.
• Essential to creating a good data model is understanding the queries that will be used.
• Remember no joins, no enforced foreign keys. The app (client) heavily influences the
model and we don’t think in terms of “normal form”.
OUR USE CASES
Our use case involves health care data and comparison data.
General requirement is we have 3 files.
• One summary by DRG (diagnosis related group) codes
• One broken down by state
• One by practitioner.
• We want to be able to store data (entities) and run fast queries by
state, DRG code and provider to get comparisons.
• Let’s model the data, have an app load it and look at the decisions we will
make.
• Will use CQL3, no compact storage for backward compatibility.
• Will use Datastax CQL Driver
• OpenCSV for reading files.
SAMPLE DATA – DIAGNOSIS RELATED GROUP
DRG Definition: 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC
Provider Id: 10001
Provider Name: SOUTHEAST ALABAMA MEDICAL CENTER
Provider Street Address: 1108 ROSS CLARK CIRCLE
Provider City: DOTHAN
State: AL
Zip: 36301
Total Discharges: 91
Average Covered Charges: 32963.07692
Average Total Payments: 5777.241758
RDBMS ENTITY MODEL
CASSANDRA DATA MODEL
• Tables (Column Family) for complete data storage
• Index tables with compound keys for query
• Application will handle required joining and foreign keys etc.
• Cassandra handles the quick writes and other important matters (replication
and availability).
STANDARD ENTITY TABLES
• Used to store all the data for the different inputs
• Can be used for lookups, but remember don’t be afraid to denormalize.
• Following entities
• Summary
• State
• Provider
MATERIALIZED VIEWS
• These are optimized for queries.
• Composite columns for utilizing storage order.
• Denormalized for accessing required info
• The index tables make sense initially
• Drg codes
• Providers
• State
• Can use secondary indexes to add in criteria, e.g. state
WHAT ABOUT TIME SERIES?
• Cassandra is highly performant at storing time series (event) data
• First data modeling pattern uses a single DRG Code per row: the DRG Code
is the partition key and the timestamp is the column.
• Works well as again we are using C* built-in sorting
• Second pattern can be used if there are too many columns and row size must be limited
• Use row partitioning by adding a date portion to the row key, so all data for one
day is in one row.
• Gets interesting when you use TTL, which gives you “automatic” data
expiring!
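The second pattern, bucketing a time series by day, can be sketched in Python with plain dictionaries standing in for rows and columns. The class, the key format and the sample values are all assumptions for illustration; TTL is not modeled.

```python
from datetime import datetime


class TimeSeriesStore:
    """Rows keyed by 'DRG code:day'; columns keyed by timestamp."""

    def __init__(self):
        self.rows: dict[str, dict[datetime, float]] = {}

    @staticmethod
    def row_key(drg_code: str, ts: datetime) -> str:
        # Adding the date to the row key caps each row at one day of data.
        return f"{drg_code}:{ts:%Y%m%d}"

    def record(self, drg_code: str, ts: datetime, value: float) -> None:
        self.rows.setdefault(self.row_key(drg_code, ts), {})[ts] = value

    def read_day(self, drg_code: str, day: datetime) -> list:
        # Columns come back in timestamp order, mimicking C* column sorting.
        return sorted(self.rows.get(self.row_key(drg_code, day), {}).items())


store = TimeSeriesStore()
store.record("039", datetime(2013, 8, 1, 12, 0), 2.0)
store.record("039", datetime(2013, 8, 1, 9, 0), 1.0)
store.record("039", datetime(2013, 8, 2, 9, 0), 3.0)
print(sorted(store.rows))                            # one row per day
print(store.read_day("039", datetime(2013, 8, 1)))   # two events, oldest first
```

Reading one day touches a single row, and the events within it are already in time order.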
COLLECTIONS
• Can have set, list or map
• Set – return sorted when queried
• List – when sorted order is not natural order. Also, a value can appear multiple
times
• Map – key / value as name implies
• Inserts, updates and deletes all allowed
• Syntax takes some practice, + and -
• Can expire each element setting a TTL
• Each element is stored as a column
COUNTERS
• Distributed counters present another problem.
• C* has a counter type
• Only the PK and counter columns are allowed in a counter table
• Update right away, no inserts
• Can only increment or decrement.
• Test thoroughly under load.
SUMMARY
C* provides a highly available, distributed, DC-aware DB with tunable consistency out
of the box.
A lot of tools at your disposal.
Work closely with ops or devops.
Test, test and test again.
Don’t be afraid to use the C* community.
Thank you!

Weitere ähnliche Inhalte

Was ist angesagt?

Cloud schedulers and Scheduling in Hadoop
Cloud schedulers and Scheduling in HadoopCloud schedulers and Scheduling in Hadoop
Cloud schedulers and Scheduling in HadoopPallav Jha
 
Scylla Summit 2017: SMF: The Fastest RPC in the West
Scylla Summit 2017: SMF: The Fastest RPC in the WestScylla Summit 2017: SMF: The Fastest RPC in the West
Scylla Summit 2017: SMF: The Fastest RPC in the WestScyllaDB
 
Draft sas and r and sas (may, 2018 asa meeting)
Draft sas and r and sas (may, 2018 asa meeting)Draft sas and r and sas (may, 2018 asa meeting)
Draft sas and r and sas (may, 2018 asa meeting)Barry DeCicco
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015Chris Fregly
 
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...Spark Summit
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...Chris Fregly
 
Scylla Summit 2017: Planning Your Queries for Maximum Performance
Scylla Summit 2017: Planning Your Queries for Maximum PerformanceScylla Summit 2017: Planning Your Queries for Maximum Performance
Scylla Summit 2017: Planning Your Queries for Maximum PerformanceScyllaDB
 
Simplifying the Creation of Machine Learning Workflow Pipelines for IoT Appli...
Simplifying the Creation of Machine Learning Workflow Pipelines for IoT Appli...Simplifying the Creation of Machine Learning Workflow Pipelines for IoT Appli...
Simplifying the Creation of Machine Learning Workflow Pipelines for IoT Appli...ScyllaDB
 
Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...
Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...
Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...ScyllaDB
 
If You Care About Performance, Use User Defined Types
If You Care About Performance, Use User Defined TypesIf You Care About Performance, Use User Defined Types
If You Care About Performance, Use User Defined TypesScyllaDB
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMHolden Karau
 
Data Storage Formats in Hadoop
Data Storage Formats in HadoopData Storage Formats in Hadoop
Data Storage Formats in HadoopBotond Balázs
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Scylla Summit 2017: How We Got to 1 Millisecond Latency in 99% Under Repair, ...
Scylla Summit 2017: How We Got to 1 Millisecond Latency in 99% Under Repair, ...Scylla Summit 2017: How We Got to 1 Millisecond Latency in 99% Under Repair, ...
Scylla Summit 2017: How We Got to 1 Millisecond Latency in 99% Under Repair, ...ScyllaDB
 
OrientDB - the 2nd generation of (Multi-Model) NoSQL
OrientDB - the 2nd generation  of  (Multi-Model) NoSQLOrientDB - the 2nd generation  of  (Multi-Model) NoSQL
OrientDB - the 2nd generation of (Multi-Model) NoSQLLuigi Dell'Aquila
 
Scylla Summit 2017: A Deep Dive on Heat Weighted Load Balancing
Scylla Summit 2017: A Deep Dive on Heat Weighted Load BalancingScylla Summit 2017: A Deep Dive on Heat Weighted Load Balancing
Scylla Summit 2017: A Deep Dive on Heat Weighted Load BalancingScyllaDB
 
Scylla Summit 2017: Scylla on Kubernetes
Scylla Summit 2017: Scylla on KubernetesScylla Summit 2017: Scylla on Kubernetes
Scylla Summit 2017: Scylla on KubernetesScyllaDB
 

Was ist angesagt? (19)

Cloud schedulers and Scheduling in Hadoop
Cloud schedulers and Scheduling in HadoopCloud schedulers and Scheduling in Hadoop
Cloud schedulers and Scheduling in Hadoop
 
Scylla Summit 2017: SMF: The Fastest RPC in the West
Scylla Summit 2017: SMF: The Fastest RPC in the WestScylla Summit 2017: SMF: The Fastest RPC in the West
Scylla Summit 2017: SMF: The Fastest RPC in the West
 
Draft sas and r and sas (may, 2018 asa meeting)
Draft sas and r and sas (may, 2018 asa meeting)Draft sas and r and sas (may, 2018 asa meeting)
Draft sas and r and sas (may, 2018 asa meeting)
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
 
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
 
Scylla Summit 2017: Planning Your Queries for Maximum Performance
Scylla Summit 2017: Planning Your Queries for Maximum PerformanceScylla Summit 2017: Planning Your Queries for Maximum Performance
Scylla Summit 2017: Planning Your Queries for Maximum Performance
 
Simplifying the Creation of Machine Learning Workflow Pipelines for IoT Appli...
Simplifying the Creation of Machine Learning Workflow Pipelines for IoT Appli...Simplifying the Creation of Machine Learning Workflow Pipelines for IoT Appli...
Simplifying the Creation of Machine Learning Workflow Pipelines for IoT Appli...
 
Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...
Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...
Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...
 
If You Care About Performance, Use User Defined Types
If You Care About Performance, Use User Defined TypesIf You Care About Performance, Use User Defined Types
If You Care About Performance, Use User Defined Types
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
 
Cassandra1.2
Cassandra1.2Cassandra1.2
Cassandra1.2
 
Data Storage Formats in Hadoop
Data Storage Formats in HadoopData Storage Formats in Hadoop
Data Storage Formats in Hadoop
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Scylla Summit 2017: How We Got to 1 Millisecond Latency in 99% Under Repair, ...
Scylla Summit 2017: How We Got to 1 Millisecond Latency in 99% Under Repair, ...Scylla Summit 2017: How We Got to 1 Millisecond Latency in 99% Under Repair, ...
Scylla Summit 2017: How We Got to 1 Millisecond Latency in 99% Under Repair, ...
 
Spark Streaming into context
Spark Streaming into contextSpark Streaming into context
Spark Streaming into context
 
OrientDB - the 2nd generation of (Multi-Model) NoSQL
OrientDB - the 2nd generation  of  (Multi-Model) NoSQLOrientDB - the 2nd generation  of  (Multi-Model) NoSQL
OrientDB - the 2nd generation of (Multi-Model) NoSQL
 
Scylla Summit 2017: A Deep Dive on Heat Weighted Load Balancing
Scylla Summit 2017: A Deep Dive on Heat Weighted Load BalancingScylla Summit 2017: A Deep Dive on Heat Weighted Load Balancing
Scylla Summit 2017: A Deep Dive on Heat Weighted Load Balancing
 
Scylla Summit 2017: Scylla on Kubernetes
Scylla Summit 2017: Scylla on KubernetesScylla Summit 2017: Scylla on Kubernetes
Scylla Summit 2017: Scylla on Kubernetes
 

Cassandra Deep Diver & Data Modeling

  • 1. C A S S A N D R A - A R C H I T E C T U R E & D A T A M O D E L I N G 1
  • 2. REQUISITE SLIDE – WHO AM I? - Brian Enochson - Home is the Jersey Shore - SW Engineer who has worked as designer / developer on Cassandra - Consultant – HBO, ACS, CIBER - Specialize in SW Development, architecture and training
  • 3. REQUISITE SLIDE # 2 – WHAT ARE WE TALKING ABOUT? • Cassandra Intro & Architecture • Why Cassandra • Architecture • Internals • Development • Data Modeling Concepts • Old vs. New Way • Basics • Composite Types • Collections • Time Series Data • Counters
  • 4. REQUISITE SLIDE #3 – THE NAME CASSANDRA
      Cassandra was the most beautiful of the daughters of Priam and Hecuba, the king and queen of Troy. She was given the gift of prophecy by Apollo, who wished to seduce her; when she accepted his gift but refused his sexual advances, he deprived her prophecies of the power to persuade.**
      Relate this to a database ->
      • Can see into the future. Good result for a database. This is good…
      • Can predict the future, but no one believed what was said!!! This is not so good….
      ** http://www.pantheon.org/articles/c/cassandra.html
      Anyway, Cassandra is also universally known as C*
  • 5. HISTORY
      • Developed at Facebook, based on Google Bigtable and Amazon Dynamo **
      • Open sourced in mid 2008
      • Apache project since March 2009
      • Commercial support through DataStax (originally known as Riptano, founded 2010)
      • Used at Netflix, eBay and many more. The largest installation is reportedly 300 TB on 400 machines
      • Current version is 1.2.x, with 2.0 in RC1.
  • 6. WHY EVEN CONSIDER C* • Large data sets • Require high availability • Multi Data Center • Require large scaling • Write heavy applications • Can design for queries • Understand tunable consistency and implications (more to come) • Willing to make the effort upfront for the reward
  • 7. SOME BASICS • ACID • CAP Theorem • BASE
  • 8. ACID
      Everyone has heard of ACID
      • Atomicity – all or nothing
      • Consistency – what is written is valid
      • Isolation – concurrent operations behave as if they ran one at a time
      • Durability – once committed to the DB, it stays
      This is the world we have lived in for a long time…
  • 9. CAP THEOREM (BREWER'S)
      Many may have heard this one. CAP stands for Consistency, Availability and Partition Tolerance.
      • Consistency – every node sees the same data; a read returns the most recent write. (Despite the shared letter, this is not the C in ACID.)
      • Availability – the service is available: every request gets a response.
      • Partition Tolerance – no failure other than complete network failure causes the system not to respond.
      So, what does this mean?
      ** http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
  • 10. YOU CAN ONLY HAVE 2 OF THEM
      Or, better said in C* terms, you can have Availability and Partition Tolerance, plus Eventual Consistency. That means eventually all accesses will return the last updated value.
  • 11. BASE
      But maybe you have not heard this one… Somewhat contrived, but it gives the CAP Theorem an acronym to use against ACID… Also created by Eric Brewer.
      • Basically Available – the system guarantees availability, as much as possible.
      • Soft State – state may change even without input. Required because of eventual consistency.
      • Eventually Consistent – it will become consistent over time.
      ** Also, as engineers we cannot believe in anything that isn’t an acronym!
  • 12. C* - WHAT PROBLEM IS BEING SOLVED?
      • Database for modern application requirements.
          • Web Scale – massive amounts of data
          • Scale Out – commodity hardware
          • Flexible Schema (we will see how this concept is evolving)
          • Online admin (add to cluster, load balancing). Simpler operations
          • CAP Theorem aware
      • Built based on
          • Amazon Dynamo – took partitioning and replication from here **
          • Google Bigtable – took the log-structured column family from here ***
      ** http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
      *** http://research.google.com/archive/bigtable.html
  • 13. C* BASICS
      • No Single Point of Failure – highly available.
      • Peer to Peer – no master
      • Data Center Aware – distributed architecture
      • Linear Scaling – just add hardware
      • Eventual Consistency, with a tunable tradeoff between latency and consistency
      • Architecture is optimized for writes.
      • A row can have 2 billion columns!
      • Data modeling for reads. Design starts with looking at your queries.
      • With CQL it became more SQL-like, but no joins, no subqueries, limited ordering (but very useful)
      • Column names can be part of the data, e.g. Time Series
      Don’t be afraid of denormalized and redundant data for read performance. In fact, embrace it! Remember, writes are fast.
  • 14. NOTE ABOUT EVENTUAL CONSISTENCY
      ** Important Term ** Quorum: Q = floor(N / 2) + 1.
      We get consistency in a BASE world by satisfying W + R > N, so that every read overlaps the latest write on at least one replica.
      3 obvious ways:
      1. W = 1, R = N
      2. W = N, R = 1
      3. W = Q, R = Q
      (N is the replication factor, R = read replica count, W = write replica count)
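The quorum arithmetic on this slide can be checked with a few lines of Python. This is only an illustrative sketch; the function names `quorum` and `read_sees_latest_write` are made up for this example and are not part of any Cassandra API.

```python
def quorum(n: int) -> int:
    """Quorum size for replication factor n: floor(n / 2) + 1."""
    return n // 2 + 1


def read_sees_latest_write(w: int, r: int, n: int) -> bool:
    """True when W + R > N, i.e. every read set overlaps every write set."""
    return w + r > n


if __name__ == "__main__":
    n = 3
    q = quorum(n)
    # The three combinations from the slide, plus ONE/ONE as a counterexample.
    combos = {
        "W=1, R=N": (1, n),
        "W=N, R=1": (n, 1),
        "W=Q, R=Q": (q, q),
        "W=1, R=1": (1, 1),
    }
    for label, (w, r) in combos.items():
        print(label, read_sees_latest_write(w, r, n))
```

With N = 3 the first three combinations satisfy the inequality, while W = 1, R = 1 does not, which is why ONE/ONE reads may return stale data.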
  • 15. THE C* DATA MODEL
      The C* data model is made of these:
      • Column – a name, a value and a timestamp. Applications can use the name as the data and not use the value. (Like a column in an RDBMS.)
      • Row – a collection of columns identified by a unique key. The key is called a partition key. (Like a row in an RDBMS.)
      • Column Family – container for an ordered collection of rows. Each row is an ordered collection of columns. Each column has a name and maybe a value. (Like a table in an RDBMS.) This is also known as a table now in C* terms.
      • Keyspace – administrative container for CFs. It is a namespace. It also has a replication strategy – more later. (Like a DB or schema in an RDBMS.)
      • Super Column Family – say what?
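One way to picture the nesting described above is as dictionaries inside dictionaries. The sketch below is an in-memory analogy only, not how Cassandra stores data; the names `put` and `get_row` are hypothetical, and the last-write-wins rule by timestamp is included because each column carries a timestamp.

```python
from typing import Dict, List, Tuple

Column = Tuple[str, int]            # (value, timestamp)
Row = Dict[str, Column]             # column name -> column
ColumnFamily = Dict[str, Row]       # row (partition) key -> row
Keyspace = Dict[str, ColumnFamily]  # column family name -> column family


def put(ks: Keyspace, cf: str, row_key: str, name: str, value: str, ts: int) -> None:
    """Write one column; an older timestamp never overwrites a newer one."""
    row = ks.setdefault(cf, {}).setdefault(row_key, {})
    current = row.get(name)
    if current is None or ts >= current[1]:
        row[name] = (value, ts)


def get_row(ks: Keyspace, cf: str, row_key: str) -> List[Tuple[str, str]]:
    """Return (name, value) pairs ordered by column name."""
    row = ks.get(cf, {}).get(row_key, {})
    return [(name, row[name][0]) for name in sorted(row)]


if __name__ == "__main__":
    ks: Keyspace = {}
    put(ks, "users", "u1", "name", "Ann", ts=2)
    put(ks, "users", "u1", "name", "Bob", ts=1)   # stale write, ignored
    put(ks, "users", "u1", "email", "ann@example.com", ts=3)
    print(get_row(ks, "users", "u1"))
```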
  • 16. SUPER COLUMN FAMILY
      Not recommended, but they exist. Rarely discussed.
      It is a key that maps to one or more nested row keys, each of which contains a collection of columns.
      You can think of it as a hash table of hash tables that contain columns.
  • 17. ARCHITECTURE (CONT.) http://www.slideshare.net/gdusbabek/data-modeling-with-cassandra-column-families
  • 18. OR CAN ALSO BE VIEWED AS… http://www.slideshare.net/gdusbabek/data-modeling-with-cassandra-column-families
  • 19. TOKENS
      Tokens – partitioner-dependent element on the ring.
      Each node has a single unique token assigned.
      Each node claims a range of tokens that runs from the token of the previous node on the ring up to its own token.
      Use this formula:
      Initial_Token = Zero_Indexed_Node_Number * ((2^127) / Number_Of_Nodes)
      In cassandra.yaml:
      initial_token: 42535295865117307932921825928971026432
      ** http://blog.milford.io/cassandra-token-calculator/
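The token formula above is easy to evaluate directly. The following sketch (the function name `initial_tokens` is hypothetical) computes evenly spaced tokens for a ring of size 2^127, as used in the formula on the slide; for a 4-node cluster the second node gets the value shown in the cassandra.yaml example.

```python
from typing import List


def initial_tokens(number_of_nodes: int, ring_size: int = 2**127) -> List[int]:
    """Evenly spaced tokens: node i gets i * (ring_size / number_of_nodes)."""
    step = ring_size // number_of_nodes
    return [i * step for i in range(number_of_nodes)]


if __name__ == "__main__":
    for node, token in enumerate(initial_tokens(4)):
        print(f"node {node}: initial_token: {token}")
```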
  • 20. C* PARTITIONER
      • RandomPartitioner – the MD5 hash of the key is the token (128-bit number); gives you even distribution in the cluster. Default <= version 1.1
      • OrderPreservingPartitioner – tokens are UTF-8 strings. Uneven distribution.
      • Murmur3Partitioner – functionally the same as RandomPartitioner, but 3 – 5 times faster. Uses the Murmur3 algorithm. Default >= 1.2
      Set in cassandra.yaml:
      partitioner: org.apache.cassandra.dht.Murmur3Partitioner
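To make "MD5 hash of key is token" concrete, here is a small sketch using only the Python standard library. It assumes the digest is read as a signed big-endian integer and then made non-negative, which is my understanding of how RandomPartitioner derives its token; treat that detail, and the function name `random_partitioner_token`, as assumptions rather than a reference implementation. Murmur3 is not in the standard library, so it is not shown.

```python
import hashlib


def random_partitioner_token(key: bytes) -> int:
    """MD5-based token: signed 128-bit digest, absolute value (assumed scheme)."""
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, byteorder="big", signed=True))


if __name__ == "__main__":
    for key in (b"user1", b"user2", b"user3"):
        # Similar keys land on unrelated tokens, which spreads rows evenly.
        print(key.decode(), random_partitioner_token(key))
```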
  • 21. REPLICATION
      • Replication is how many copies of each piece of data should be stored. In C* terms it is the Replication Factor or “RF”.
      • In C*, RF is set at the keyspace level:
        CREATE KEYSPACE drg_compare WITH replication = {'class':'SimpleStrategy', 'replication_factor':3};
      • How the data is replicated is called the Replication Strategy
          • SimpleStrategy – returns nodes “next” to each other on the ring. Assumes a single DC
          • NetworkTopologyStrategy – for configuring per data center. Rack and DC aware.
            (cassandra-cli syntax) update keyspace UserProfile with strategy_options=[{DC1:3, DC2:3}];
  • 22. SNITCH
      • A snitch maps IPs to racks and data centers.
      • There are several kinds, configured in cassandra.yaml. Must be the same across the cluster.
          • SimpleSnitch – does not recognize data center or rack information. Use it for single-data center deployments (or single-zone in public clouds).
          • PropertyFileSnitch – uses a user-defined description of the network details located in the property file cassandra-topology.properties. Use this snitch when your node IPs are not uniform or if you have complex replication grouping requirements.
          • RackInferringSnitch – infers (assumes) the topology of the network from the octets of the node's IP address (second octet = data center, third octet = rack).
          • EC2* – Ec2Snitch, Ec2MultiRegionSnitch
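As a rough illustration of what "inferring topology from the IP address" means, the sketch below treats the second octet as the data center and the third as the rack, which is the convention RackInferringSnitch is documented to follow. The function name `infer_topology` is hypothetical and this is not Cassandra code.

```python
from typing import Tuple


def infer_topology(ip: str) -> Tuple[str, str]:
    """Return (data_center, rack) taken from the 2nd and 3rd IPv4 octets."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError(f"not an IPv4 address: {ip!r}")
    return octets[1], octets[2]


if __name__ == "__main__":
    for address in ("10.20.30.40", "10.20.31.7", "10.21.30.9"):
        dc, rack = infer_topology(address)
        print(f"{address}: DC {dc}, rack {rack}")
```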
  • 23. RING TOPOLOGY When thinking of Cassandra, it is best to think of nodes as part of a ring topology, even for multiple DCs.
  • 24. SimpleStrategy Using token generation values from before. 4 node cluster. Write value with token 32535295865117307932921825928971026432
  • 25. SimpleStrategy #2
  • 26. SimpleStrategy #3 With RF of 3 replication works like this:
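The slide images are not part of this transcript, so here is a hedged sketch of the placement rule the SimpleStrategy slides describe: the row goes to the first node whose token is greater than or equal to the row's token (wrapping around the ring), and the remaining RF - 1 copies go to the next nodes clockwise. The function name `simple_strategy_replicas` is made up for this example.

```python
from bisect import bisect_left
from typing import List


def simple_strategy_replicas(key_token: int, node_tokens: List[int], rf: int) -> List[int]:
    """Return the tokens of the nodes holding a row, walking the ring clockwise."""
    ring = sorted(node_tokens)
    # First node whose token is >= the key's token; past the end wraps to index 0.
    start = bisect_left(ring, key_token) % len(ring)
    count = min(rf, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]


if __name__ == "__main__":
    # 4-node cluster with evenly spaced tokens on a 2^127 ring.
    ring = [i * 2**125 for i in range(4)]
    key_token = 32535295865117307932921825928971026432  # value from slide 24
    for token in simple_strategy_replicas(key_token, ring, rf=3):
        print("replica on node", ring.index(token))
```

For the token from slide 24 the sketch picks nodes 1, 2 and 3 of the 4-node ring.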
  • 27. NetworkTopologyStrategy Using LOCAL_QUORUM allows the write to DC #2 to be asynchronous. The write is marked as a success once 2 of the 3 local replicas have acknowledged it (http://www.datastax.com/dev/blog/deploying-cassandra-across-multiple-data-centers)
  • 28. COORDINATOR & CL
      • When writing, a Coordinator Node will be selected. It is selected at write (or read) time. Not a SPOF!
      • Using the Gossip Protocol, nodes share information with each other: who is up, who is down, who is taking which token ranges, etc. Every second, each node shares with 1 to 3 nodes.
      • Consistency Level (CL) – says how many nodes must agree before an operation is a success. Set per read or write operation.
          • ONE – the coordinator will wait for one node to ack the write (also TWO, THREE). ONE is the default if none is provided.
          • QUORUM – we saw that before: N / 2 + 1. Also LOCAL_QUORUM, EACH_QUORUM
          • ANY – waits for some replica. If all are down, it still succeeds. Only for writes. Doesn’t guarantee the data can be read.
          • ALL – blocks waiting for all replicas
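The consistency levels listed above boil down to "how many acknowledgements does the coordinator wait for". The sketch below maps the single-data-center levels to a count for a given replication factor. It is a simplification under stated assumptions: LOCAL_QUORUM and EACH_QUORUM need per-data-center replica counts and are left out, ANY is modelled as one acknowledgement (which in Cassandra may be a stored hint rather than a replica), and the names `required_acks` and `operation_succeeds` are hypothetical.

```python
def required_acks(level: str, rf: int) -> int:
    """Number of acknowledgements the coordinator waits for (simplified)."""
    table = {
        "ANY": 1,             # may be satisfied by a stored hint, writes only
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": rf // 2 + 1,
        "ALL": rf,
    }
    if level not in table:
        raise ValueError(f"unsupported consistency level in this sketch: {level}")
    return table[level]


def operation_succeeds(level: str, rf: int, acks_received: int) -> bool:
    """True once enough replicas have answered for the chosen level."""
    return acks_received >= required_acks(level, rf)


if __name__ == "__main__":
    rf = 3
    for level in ("ONE", "QUORUM", "ALL"):
        print(level, "needs", required_acks(level, rf), "of", rf, "replicas")
```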
  • 29. ENSURING CONSISTENCY
      3 important concepts:
      • Read Repair – at the time of a read, inconsistencies between nodes are noticed and replicas are updated. Direct and background. Direct is determined by CL.
      • Anti-Entropy Node Repair – for data that is not read frequently, or to update data on a node that has been down for a while, run the nodetool repair process (also called anti-entropy repair). It builds Merkle trees, compares nodes and does the repair.
      • Hinted Handoff – writes are always sent to all replicas for the specified row, regardless of the consistency level specified by the client. If a node happens to be down at the time of the write, its corresponding replicas will save hints about the missed writes, and then hand off the affected rows once the node comes back online. This notification happens via Gossip. Default 1 hour.
  • 30. ENTER VNODES • Virtual Nodes were introduced with Cassandra 1.2 • Each physical server owns many small token ranges instead of one large range • Why • Distribution of load • No manual token management • Concurrent streaming across hosts • Two ways to specify in cassandra.yaml • initial_token: <token1>,<token2>,<token3>,<token4>, ….. Or • num_tokens: 256 # recommended as a starter
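The effect of `num_tokens` can be sketched with a toy token ring. This is an illustration under simplifying assumptions: tokens are derived from an MD5 hash of a made-up `node#index` label so the example is deterministic, whereas Cassandra assigns random tokens and uses its configured partitioner. The names `token`, `build_ring` and `owner` are hypothetical.

```python
import bisect
import hashlib


def token(value: str) -> int:
    # Toy hash-based token; stands in for the partitioner.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def build_ring(nodes, num_tokens=256):
    """Give every node num_tokens positions (vnodes) on the ring."""
    return sorted((token(f"{node}#{i}"), node)
                  for node in nodes
                  for i in range(num_tokens))


def owner(ring, key: str) -> str:
    """Return the node owning the first token at or after the key's token."""
    tokens = [t for t, _ in ring]
    idx = bisect.bisect_left(tokens, token(key)) % len(ring)  # wrap around
    return ring[idx][1]
```

Because each server holds many small ranges scattered around the ring, adding or removing a server moves a little data to or from every other server rather than a lot of data between two neighbors.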
  • 31. VNODES - HOW DOES IT LOOK
  • 32. STORAGE Commit Log – Cassandra appends to the commit log first. Uses fsync in the background to force writing of these changes. Gives durability: writes can be replayed in case of a crash. Memtable – in-memory cache sorted by key. After appending to the commit log, Cassandra writes to the Memtable. Then the write is considered successful. Each CF has its own Memtable. SSTables – Immutable files. When a memtable runs out of space or hits a defined key limit, it is written out to an SSTable asynchronously. When the number of SSTables reaches a threshold, a (minor) compaction takes place and they are merged. Bloom Filter – one for each SSTable; checked first to see whether the key may exist in that SSTable.
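The commit log, memtable and SSTable flow described above can be sketched as a toy in-memory store. This is not how Cassandra is implemented: `TinyStore` and `memtable_limit` are hypothetical names, "SSTables" are plain dict snapshots, and an exact key-membership test stands in for the Bloom filter (a real Bloom filter is probabilistic and can report false positives).

```python
class TinyStore:
    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only; would be replayed after a crash
        self.memtable = {}        # in-memory, per column family
        self.sstables = []        # flushed snapshots, oldest first, never modified
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. append to commit log first
        self.memtable[key] = value            # 2. then update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out sorted by key and start a fresh one.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable wins
            if key in table:                   # stand-in for the Bloom filter check
                return table[key]
        return None

    def compact(self):
        # Merge all SSTables into one, keeping the newest value per key.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [dict(sorted(merged.items()))]
```

Note that a read may have to consult several SSTables until compaction merges them, which is why the Bloom filter check matters for read latency.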
  • 33. C* PERSISTENCE MODEL
  • 34. WRITE PATH 1. Determine all applicable replica nodes across all DCs 2. Coordinator Node sends the write to all replicas in the local DC 3. Coordinator sends the write to ONE replica in each remote DC 4. The selected remote replica then forwards it to the other replica nodes in its DC 5. All replicas respond back to the coordinator • The CL determines how many acknowledgements the coordinator blocks for before it returns success or failure to the client. (remember ANY, ONE, TWO, LOCAL_QUORUM etc.) • If a replica node is down (or does not respond), HINTED HANDOFF kicks in. The hint is the target replica id + the mutation data. • A max hint window can be configured; once a node has been down longer than that, hints for it are no longer stored. • Hinted handoff runs every 10 minutes and tries to send the stored hints to their target nodes.
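The interplay between acknowledgements and hinted handoff on the write path can be sketched as follows. This is a single-process toy: `Coordinator`, `down`, `hints` and `replay_hints` are hypothetical names, replicas are plain dicts, and the hint window and multi-DC forwarding are left out.

```python
class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas  # replica name -> dict acting as its storage
        self.down = set()         # replicas currently unreachable
        self.hints = []           # (target replica id, key, value)

    def write(self, key, value, required_acks):
        """Send the write to every replica; succeed if enough of them ack."""
        acks = 0
        for name, store in self.replicas.items():
            if name in self.down:
                # Replica missed the write: remember it for later delivery.
                self.hints.append((name, key, value))
            else:
                store[key] = value
                acks += 1
        return acks >= required_acks

    def replay_hints(self):
        """Deliver stored hints to replicas that are back up."""
        remaining = []
        for name, key, value in self.hints:
            if name in self.down:
                remaining.append((name, key, value))
            else:
                self.replicas[name][key] = value
        self.hints = remaining
```

The write is attempted on all replicas regardless of the consistency level; the level only decides how many acknowledgements are needed before the client is told the write succeeded.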
  • 35. READ PATH • Consistency level determines how many replicas must respond with the data • The read request goes to a single node; its Storage Proxy determines which nodes hold replicas of the data and uses the Snitch to determine “proximity”. • CL of ONE – returns the first result found • CL of QUORUM – waits for a majority. Better guarantee of getting the most recent data. • After the first read, a digest of the data is calculated and used to determine whether the replicas are consistent. If not, Read Repair is activated in the background. • Different SSTables may hold different columns of a row, so results need to be merged. Reads involve checking the Bloom Filter to see whether an SSTable may have the key, the key cache to get direct access to the data, and the row cache. Very dependent on configuration.
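The digest comparison and read repair step can be sketched like this. It is a simplification: replicas are dicts mapping a key to a `(value, timestamp)` pair, the repair happens inline rather than in the background, and `digest` and `read_with_repair` are hypothetical names.

```python
import hashlib


def digest(cell):
    """Short fingerprint of a (value, timestamp) cell."""
    value, timestamp = cell
    return hashlib.md5(f"{value}:{timestamp}".encode()).hexdigest()


def read_with_repair(replicas, key):
    """Return the newest value for key and bring stale replicas up to date."""
    cells = [replica[key] for replica in replicas if key in replica]
    if not cells:
        return None
    newest = max(cells, key=lambda cell: cell[1])  # latest timestamp wins
    digests = {digest(cell) for cell in cells}
    if len(digests) > 1 or len(cells) < len(replicas):
        # Replicas disagree (or one is missing the key): read repair.
        for replica in replicas:
            replica[key] = newest
    return newest[0]
```

Comparing digests instead of full rows keeps the consistency check cheap: only when the fingerprints differ does the full data need to be compared and rewritten.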
  • 36. DELETIONS Distributed systems present a unique problem for deletes. If the data were actually deleted while a node was down and missed the delete notice, that node would try to re-create the record when it came back online. So… C* uses the concept of a Tombstone. The data is replaced with a special marker called a Tombstone, which works within the distributed architecture. Tombstones are cleaned up during compaction after a defined time (GCGraceSeconds, default 10 days).
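Why a tombstone prevents deleted data from reappearing can be shown with a small sketch. The model is simplified: each replica is a dict of `key -> (value, timestamp)`, timestamps are plain seconds, and `delete`, `reconcile` and `compact` are hypothetical helper names.

```python
TOMBSTONE = object()                  # special marker that replaces deleted data
GC_GRACE_SECONDS = 10 * 24 * 3600     # 10 days, the default mentioned above


def delete(store, key, now):
    """A delete is just a write of a tombstone with a newer timestamp."""
    store[key] = (TOMBSTONE, now)


def reconcile(a, b, key):
    """Sync two replicas: the cell with the latest timestamp wins."""
    cells = [store[key] for store in (a, b) if key in store]
    if cells:
        a[key] = b[key] = max(cells, key=lambda cell: cell[1])


def compact(store, now):
    """Drop tombstones that are older than the grace period."""
    expired = [key for key, (value, ts) in store.items()
               if value is TOMBSTONE and now - ts > GC_GRACE_SECONDS]
    for key in expired:
        del store[key]
```

If the delete had simply removed the key from one replica, `reconcile` would have copied the old value back from the other replica. The grace period exists so that every replica has a chance to see the tombstone before it is purged.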
  • 37. THRIFT TO CQL • Thrift – the original RPC-based API, still fully supported. Used by clients such as Hector, Pycassa, etc. • Interact with Cassandra using the cassandra-cli. • CQL3 – new and fully supported API. Table-oriented, schema-driven query language. It resembles SQL (very closely at times). • Interact with Cassandra using cqlsh. • Cassandra Table – defined as a sparse collection of well-known and ordered columns. • Still uses the same underlying storage as Thrift. • More intuitive data modeling for many.
  • 38. APPLICATION DEVELOPMENT • Interaction with Cassandra can be done using one of the supplied clients, such as the CLI or cqlsh. Otherwise, client applications are built using a language client library. • Many clients in multiple languages, including Java, .NET, Python, Scala, Go, PHP, Node.js, Perl, Ruby, etc. • Java: • Hector wraps the underlying Thrift API. Hector is one of the most commonly used client libraries. • Astyanax is a client library developed by Netflix. • Datastax CQL – newest CQL driver, will be very familiar to JDBC developers • And many more … (JPA) • There are also Datastax OpsCenter, various other GUIs and a REST API (Virgil)
  • 39. HECTOR EXAMPLE
  • 40. DATASTAX CQL DRIVER From version 1.2 onward. Uses the new CQL method for C* interaction.
  • 41. INSTALLATION You have 3 options • Cassandra Cluster Manager – installs a multi-node cluster automatically with just a few parameters. Great for dev and testing (and VERY cool). • https://github.com/pcmanus/ccm • Datastax Community Edition – Easy to install and you get a free version of OpsCenter. Has a Windows installer (msi) if using Windows. • http://planetcassandra.org/Download/DataStaxCommunityEdition ** • Open Source Version – from Apache Cassandra, most recent versions, RCs etc. • http://cassandra.apache.org/download/ • CQL Driver - http://www.datastax.com/download/clientdrivers ** Planet Cassandra - Great source of Cassandra info!!
  • 42. DATA MODELING • It’s NoSQL, so you don’t need to model your data in Cassandra. Or do you? • Flexible schema = no (or little) data modeling done. • This proved to be problematic from an application and query perspective. • Cassandra is extremely good at reading data in the order it was stored. • CQL makes data modeling much easier; a nice bridge from the SQL world if that knowledge is there. • Typically when modeling in C*, you will denormalize data to take advantage of the storage engine order. • Essential to creating a good data model is an understanding of the queries that will be used. • Remember: no joins, no enforced foreign keys. The app (client) heavily influences the model and we don’t think in terms of “normal form”.
  • 43. OUR USE CASES Our use case involves health care data and comparison data. The general requirement is that we have 3 files. • One summary by DRG (diagnosis related group) codes • One broken down by state • One by practitioner. • We want to be able to store the data (entities) and run performant queries by state, DRG code and provider to get comparisons. • Let’s model the data, have an app load it and look at the decisions we will make. • Will use CQL3, without compact storage for backward compatibility. • Will use the Datastax CQL Driver • OpenCSV for reading files.
  • 44. SAMPLE DATA – DIAGNOSIS RELATED GROUP • DRG Definition: 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC • Provider Id: 10001 • Provider Name: SOUTHEAST ALABAMA MEDICAL CENTER • Provider Street Address: 1108 ROSS CLARK CIRCLE • Provider City: DOTHAN • State: AL • Zip: 36301 • Total Discharges: 91 • Average Covered Charges: 32963.07692 • Average Total Payments: 5777.241758
  • 45. RDBMS ENTITY MODEL
  • 46. CASSANDRA DATA MODEL • Tables (Column Families) for complete data storage • Index tables with compound keys for queries • The application will handle any required joining, foreign keys, etc. • Cassandra handles the quick writes and other important matters (replication and availability).
  • 47. STANDARD ENTITY TABLES • Used to store all the data for the different inputs • Can be used for lookups, but remember: don’t be afraid to denormalize. • The entities are • Summary • State • Provider
  • 48. MATERIALIZED VIEWS • These are optimized for queries. • Composite columns for utilizing storage order. • Denormalized for accessing the required info • The index tables that make sense initially • DRG codes • Providers • State • Can use secondary indexes to add in criteria, e.g. state
  • 49. WHAT ABOUT TIME SERIES? • Cassandra is highly performant at storing time series (event) data • The first data modeling pattern uses a single DRG Code per row: the DRG Code is the partition key and the timestamp is the column. • Works well as again we are using C*’s built-in sorting • The second pattern can be used to limit row size if there would be too many columns • Use row partitioning by adding the date portion into the row key, so all data for one day is in one row. • Gets interesting when you use TTL, which gives you “automatic” data expiring!
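The second time series pattern, adding the date to the row key, can be sketched in plain Python. This only models the key layout, not a real table: the `table` dict, the `drg:YYYYMMDD` key format and the function names are assumptions made for illustration.

```python
from datetime import datetime


def partition_key(drg_code: str, ts: datetime) -> str:
    """Row key = DRG code plus the date portion, so each day gets its own row."""
    return f"{drg_code}:{ts.strftime('%Y%m%d')}"


def store_event(table, drg_code, ts, value):
    # Within a row, the timestamp plays the role of the column name.
    row = table.setdefault(partition_key(drg_code, ts), {})
    row[ts] = value


def read_day(table, drg_code, day):
    """Return one day's events in timestamp order."""
    row = table.get(partition_key(drg_code, day), {})
    return [row[ts] for ts in sorted(row)]
```

Bucketing by day bounds the size of each row, at the cost of one query per day when reading a longer range. A coarser or finer bucket (month, hour) is the same idea with a different format string.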
  • 50. COLLECTIONS • Can have a set, list or map • Set – returned sorted when queried • List – for when the required order is not the natural sort order. Also, a value can appear multiple times • Map – key / value as the name implies • Inserts, updates and deletes are all allowed • Syntax takes some practice: + and - to add and remove elements • Each element can be expired by setting a TTL • Each element is stored as a column
  • 51. COUNTERS • Distributed counters present another problem. • C* has a counter type • Besides the primary key, a counter table may contain only counter columns • Write with UPDATE right away; INSERT is not used • Can only increment or decrement. • Test thoroughly under load.
  • 52. SUMMARY C* provides a highly available, distributed, DC-aware DB with tunable consistency out of the box. A lot of tools at your disposal. Work closely with ops or devops. Test, test and test again. Don’t be afraid to use the C* community. Thank you!