Cassandra from the trenches: migrating Netflix

Jason Brown
Senior Software Engineer
Netflix
@jasobrown jasedbrown@gmail.com

http://www.linkedin.com/in/jasedbrown

Your host for the evening
• Sr. Software Engineer at Netflix > 3 years
– Currently lead a team developing and operating
AB testing infrastructure in EC2
– Spent time migrating core e-commerce
functionality out of PL/SQL and scaling it up
• MLB Advanced Media
– Ran Ecommerce engineering group
• Wandered about in the wireless space
(J2ME, BREW)

History
• In the beginning, there was the webapp
– And a database, too
– In one datacenter
• Then we grew, and grew, and grew
– More databases, all conjoined
– Database links with PL/SQL and M views
– Multi-Master replication

History,2
• Then it melted down (2008)
– Oracle MMR between two databases
– SPOF – one Oracle instance for website (no
backup)
• Couldn’t ship DVDs for ~3 days

History,3
• Time to rethink everything
– Abandon datacenter for EC2
• We’re not in the business of building datacenters
– Ditch monolithic webapp for distributed systems
• Greater independence for all teams/initiatives
– Migrate SPOF database to …

History,4
• SimpleDB/S3
– Somebody else manages your database (yeah!)
– Tried it out, but didn’t quite work well for us
– Problems: high latency, rate limiting (throttling), no
auto-sharding, no backups
• Time to try out one of them (other) newfangled
NoSQL things…

Shiny new toy
• We selected Cassandra
– Dynamo-model appealed to us
– Column-based, key-value data model seemed
sufficient for most needs
– Performance looked great (rudimentary tests)
• Now what?
– Put something into it
– Run it in EC2
– Sounds easy enough…

• Data Modeling
– Where the rubber meets the road

About Netflix’s AB Testing
• We use it everywhere (no, really)
• Basic concepts
– Test – An experiment where several competing
behaviors are implemented and compared
– Cell – different experiences within a test that are
being compared against each other
– Allocation – a customer-specific assignment to a
cell within a test
• Customer can only be in one cell of a test at a time
• Generally immutable (very important for analysis)

Data Modeling - background
• AB has two sets of data
– metadata about tests
– allocations
• Both need to be migrated out of Oracle and
into Cassandra in the cloud

AB - allocations
• Single table to hold allocations
– Currently at ~950 million records
– Plus indices!
• One record for every test that every customer
is allocated into
• Unique constraint on customer/test

AB - metadata
• Fairly typical parent-child table relationship
• Not updated frequently, so service can cache

Data modeling in cassandra
• Everywhere I looked, the internets told me to
understand my data use patterns
– Understand the questions that you need to
answer from the data
• Meaning: know how you will query your data, and structure
the persistence model to match

• There’s no free lunch here, apparently

Identifying the AB questions that need
to be answered
• get all allocations for a customer
• get count of customers in test/cell
• find all customers in a test/cell
– So we can kick them out of the test
– So we can clean up ancient data
– So we can move them to a different cell in test
• find all customers allocated to test within a
date range
– So we can kick them out of the test

Modeling allocations in cassandra
• As we’re read-heavy, read all allocations for a
customer as fast as possible
– Denormalize allocations into a single row
– But, how do I denormalize?
• Find all customers in a test/cell = reverse
index
• Get count of customers in test/cell = count the
entries in the reverse index

Denormalization-HOWTO
• The internets talk about it, but no real world
examples
– ‘Normalization is for sissies’, Pat Helland
• Denormalizing allocations per customer
– Trivial with a schema-less database

Denormalized allocations
• Sample normalized data

• Sample denormalized data (sparse!)
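
The sample tables from the original slides did not survive extraction. As a stand-in, here is a minimal sketch of the idea with made-up data: one normalized record per customer/test is folded into a single sparse row per customer. The customer ids, the field names (`cell`, `enabled`) and the `denormalize` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical normalized records: one per (customer, test), as in the Oracle table.
normalized = [
    {"cust_id": 1001, "test_id": 42, "cell": 2, "enabled": "Y"},
    {"cust_id": 1001, "test_id": 47, "cell": 0, "enabled": "Y"},
    {"cust_id": 1002, "test_id": 42, "cell": 1, "enabled": "Y"},
]


def denormalize(records):
    """Fold all of a customer's allocations into one row keyed by customer id."""
    rows = defaultdict(dict)
    for rec in records:
        rows[rec["cust_id"]][rec["test_id"]] = {
            "cell": rec["cell"],
            "enabled": rec["enabled"],
        }
    return dict(rows)


rows_by_customer = denormalize(normalized)

# Sparse: customer 1002 simply has no entry for test 47.
for cust_id, allocations in sorted(rows_by_customer.items()):
    print(cust_id, allocations)
```

Reading every allocation for a customer is then a single row lookup, which is the read pattern the model is optimized for.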

Implementing allocations
• As each allocation for a customer has a handful of
data points, they can logically be grouped
together
• Hello, super columns
• Avoided blobs, json or otherwise
– data race concerns
– BI integration
– Serialization alg changes could tank the data

Implementing allocations, second
round
• But Cassandra devs don’t enjoy (read: secretly
despise) super columns
• Switched to standard column family, using
composite columns
• Composite columns are sorted by each ‘token’
in name
– This sorts each allocation’s data together (by
testId)

Composite columns
• Allocation column naming convention
– <testId>:<field>
– 42:cell = 2
– 42:enabled = Y
– 47:cell = 0
– 47:enabled = Y
• Using terse field names, but still have column
name overhead (~15 bytes)
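
A minimal sketch of the naming convention above, using plain Python tuples in place of Cassandra composite column names. It only illustrates the sort order; the dict and the `slice_test` helper are illustrative assumptions, not a client API.

```python
# One customer's row: composite column name (test_id, field) -> value.
# Columns are inserted out of order on purpose.
row = {}
row[(47, "enabled")] = "Y"
row[(42, "cell")] = 2
row[(47, "cell")] = 0
row[(42, "enabled")] = "Y"


def slice_test(row, test_id):
    """Return the fields of one allocation, i.e. all columns sharing a test_id."""
    return {field: row[(tid, field)] for (tid, field) in sorted(row) if tid == test_id}


# Sorting by each component keeps an allocation's columns next to each other.
for test_id, field in sorted(row):
    print(f"{test_id}:{field} = {row[(test_id, field)]}")

print(slice_test(row, 42))
```

Because the first component sorts first, all columns of test 42 come before any column of test 47, so one allocation can be read as a contiguous slice.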

Implementing indices
• Cassandra’s secondary indices vs. hand-built
and maintained alternate indices
• Secondary indices work great on uniform data
between rows
• But sparse column data not so easy

Hand-built Indices, 1

• Reverse index
– Test/cell (key) to custIds (columns)
• Column value is timestamp
• Mutate on allocating a customer into test

Hand-built indices, 2
• Counter column family
– Test/cell (key) to count of customers (counter columns)
– Mutate on allocating a customer into test
• Counters are not idempotent!
• Mutates need to write to every node that
hosts that key
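
Why non-idempotent counters hurt can be shown with a small simulation. This is a sketch, not Cassandra code: `CounterColumn` and `allocate_with_retries` are hypothetical, and the retry loop stands in for a client that resends a mutation after a timeout.

```python
class CounterColumn:
    """Stand-in for a counter column: it only supports increments."""

    def __init__(self):
        self.value = 0

    def incr(self, delta=1):
        self.value += delta


def allocate_with_retries(counter, index_row, cust_id, ts, attempts):
    """Apply the same logical allocation `attempts` times, as a retrying client would."""
    for _ in range(attempts):
        index_row[cust_id] = ts  # idempotent: rewriting the same column changes nothing
        counter.incr()           # not idempotent: every resend is counted again


counter = CounterColumn()
index_row = {}
allocate_with_retries(counter, index_row, cust_id=1001, ts=100.0, attempts=2)

print(len(index_row), counter.value)
```

One customer was allocated, the reverse index holds one entry, but the counter has drifted to two.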

Index rebuilding
• Yeah, even Oracle needs to have its indices
rebuilt
• Easy enough to rebuild the reverse index, but
how about that counter column?
– Read the reverse index for the count and write
that as counter’s value
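
The counter rebuild described above can be sketched as follows, again with plain dicts standing in for the two column families. The data and the `rebuild_counts` name are hypothetical.

```python
# Reverse index: (test_id, cell) -> {cust_id: allocation timestamp}
reverse_index = {
    (42, 2): {1001: 100.0, 1002: 200.0},
    (42, 1): {1003: 150.0},
}

# Counts that have drifted, e.g. after retried increments.
counts = {(42, 2): 3, (42, 1): 1}


def rebuild_counts(index):
    """Recompute every count from the number of entries in the reverse index."""
    return {key: len(columns) for key, columns in index.items()}


counts = rebuild_counts(reverse_index)
print(counts)
```

With real counter columns there is no plain "set", so writing the recomputed value would likely mean applying a correcting delta or resetting the counter first; that detail is left out here.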

Modeling AB metadata in cassandra
• Explored several models, including json
blobs, spreading across multiple CFs, differing
degrees of denormalization
• Reverse index to identify all tests for loading

Implementing metadata
• One CF, one row for all of a test’s data
– Every data point is a column – no blobs
• Composite columns
– type:id:field
• Types = base info, cells, allocation plans
• Id = cell number, allocation plan (gu)id
• Field = type-specific
– Base info = test name, description, enabled
– Cell’s name / description
– Plan’s start/end dates, country to allocate to
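
A sketch of the `type:id:field` scheme with made-up values. The type tags (`base`, `cell`, `plan`), the ids and the field names are placeholders chosen for illustration; only the three-part structure comes from the slide.

```python
# One test's row: (type, id, field) -> value. Base info has no id, so it is left empty.
test_row = {
    ("plan", "p-1", "start"): "2012-01-01",
    ("cell", "1", "name"): "variant",
    ("base", "", "name"): "hypothetical test",
    ("cell", "0", "name"): "control",
    ("base", "", "enabled"): "Y",
    ("plan", "p-1", "country"): "US",
}


def regroup(row):
    """Rebuild the parent-child structure: type -> id -> {field: value}."""
    grouped = {}
    for type_, id_, field in sorted(row):
        grouped.setdefault(type_, {}).setdefault(id_, {})[field] = row[(type_, id_, field)]
    return grouped


metadata = regroup(test_row)
print(metadata["cell"])
print([":".join(name) for name in sorted(test_row)])
```

Sorting on the composite name groups all base info, then all cells, then all plans, so a service can load the whole test with one row read and rebuild the object graph in a single pass.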

Into the real world … here comes the hurt

Allocation mutates
• AB allocations are immutable, so how do you
prevent mutating?
– Oracle – unique constraint on table
– Cassandra – read before write
• Read before write in a distributed system is a
data race
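
The race can be reproduced deterministically by interleaving two clients by hand. This is a single-process simulation with a dict as the store; the `read` and `write` helpers are hypothetical and there is no real concurrency involved.

```python
store = {}  # (cust_id, test_id) -> cell


def read(cust_id, test_id):
    return store.get((cust_id, test_id))


def write(cust_id, test_id, cell):
    store[(cust_id, test_id)] = cell


# Both clients read before either one writes.
seen_by_a = read(1001, 42)
seen_by_b = read(1001, 42)

# Each client saw "no allocation yet", so each believes its write is safe.
if seen_by_a is None:
    write(1001, 42, cell=1)
if seen_by_b is None:
    write(1001, 42, cell=2)

print(store[(1001, 42)])
```

Client A believes the customer is in cell 1, but the stored allocation is cell 2: the "immutable" allocation was mutated without either client noticing.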

Running cassandra
• Compactions happen
– Part of the Cassandra lifestyle
– Mutations are written to memory (memtable)
– Flushed to disk (sstable) on triggering threshold
• Time
• Size
• Operations against column family
– Eventually, Cassandra decides to merge sstables as
data for individual rows becomes scattered
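
A toy model of the write path above, to show why reads slow down before a compaction. It is a simplified sketch: the only flush trigger modelled is a column-count threshold (a stand-in for size), and `ColumnFamily` is a hypothetical class, not Cassandra's implementation.

```python
class ColumnFamily:
    def __init__(self, flush_threshold=2):
        self.memtable = {}   # row_key -> {column: value}, in memory
        self.sstables = []   # flushed tables, oldest first, never modified
        self.flush_threshold = flush_threshold

    def write(self, row_key, column, value):
        self.memtable.setdefault(row_key, {})[column] = value
        if sum(len(cols) for cols in self.memtable.values()) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memtable:
            self.sstables.append(self.memtable)
            self.memtable = {}

    def read(self, row_key):
        """Merge the row from every table holding a piece of it; newer data wins."""
        merged, touched = {}, 0
        for table in self.sstables + [self.memtable]:
            if row_key in table:
                merged.update(table[row_key])
                touched += 1
        return merged, touched

    def compact(self):
        """Merge all sstables into one so each row lives in a single place again."""
        merged = {}
        for table in self.sstables:
            for row_key, cols in table.items():
                merged.setdefault(row_key, {}).update(cols)
        self.sstables = [merged] if merged else []


cf = ColumnFamily(flush_threshold=2)
for column, value in [("42:cell", 2), ("42:enabled", "Y"), ("47:cell", 0), ("47:enabled", "Y")]:
    cf.write("1001", column, value)

before = cf.read("1001")
cf.compact()
after = cf.read("1001")
print(before[1], after[1])
```

The row's data is identical before and after, but the read touches two tables before the compaction and one afterwards.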

Compactions, 2
• Spikes happen, esp. on read-heavy systems
– Everything can slow down
– Sometimes, average latency > 95%ile
– Throttling in newer Cass versions helps, I think
– Affects clients (hector, astyanax)

Repairs
• Different from read repair!
• Fix all the data in a single node by pulling
shared ranges from neighbor nodes

Repairs, 2
• Replication factor determines number of
nodes involved in repair of single node
• Neighbor nodes will perform validation
compaction
– Pushes disk and network hard, depending on data size
• Guess what happens when you run a multi-
region cluster?

Client libraries
• Round-robin is not the way to go for
connection pooling
– Coordinator Cassandra nodes will incorrectly be
marked down rather than the slow target node
• Token-aware is safer, faster, but harder to
implement
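
A minimal sketch of the difference between the two routing strategies. Assumptions: node tokens are derived by hashing the node name (real clusters assign tokens explicitly), an MD5-based token is used for keys, and replication is ignored so each key has exactly one owner. `Ring` and `token` are hypothetical names, not part of hector or astyanax.

```python
import bisect
import hashlib
import itertools


def token(key):
    """Map a key to a position on the ring using an MD5 hash."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        self._entries = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._entries]

    def owner(self, key):
        """First node whose token is >= the key's token, wrapping around the ring."""
        i = bisect.bisect_left(self._tokens, token(key)) % len(self._tokens)
        return self._entries[i][1]


nodes = ["node-a", "node-b", "node-c"]
ring = Ring(nodes)

# Round-robin ignores the key: the chosen coordinator usually has to forward the request.
round_robin = itertools.cycle(nodes)
print("round-robin picks:", next(round_robin), next(round_robin))

# Token-aware sends the request straight to a node that owns the key.
print("token-aware picks:", ring.owner("cust:1001"))
```

With token-aware routing a slow node only affects requests for the keys it owns, and the client sees the slowness on that node's own connections rather than on whichever coordinator happened to be next in the rotation.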

Tunings, 1
• Key and row caches
– Left unbounded can chew up jvm memory needed
for normal work
– Latencies will spike as the jvm needs to fight for
memory
– Off-heap row cache is better but still maintains
data structures on-heap

Tunings, 2
• mmap() as in-memory cache
– When process terminated, mmap pages are added
to the free list

Tunings, 3
• Sizing memtable flushes for optimizing
compactions
– Easier when writes are uniformly
distributed, timewise – easier to reason about
flush patterns
– Best to optimize flushes based on memtable
size, not time

Tunings, 4
• Sharding
– Not dead yet!
– If a single row has disproportionately high
gets/mutates, the nodes holding it will become
hot spots
– If a row grows too large, it won’t fit into memory

Takeaways
• Netflix is making all of our components
distributed and fault tolerant as we grow
domestically and internationally.

• Cassandra is a core piece of our cloud
infrastructure.

終わり(The End)

• Q&A


References
• Pat Helland, “Normalization Is for Sissies”,
http://blogs.msdn.com/b/pathelland/archive/2007/07/23/normalization-is-for-sissies.aspx
• btoddb, “Storage Sizing”,
http://btoddb-cass-storage.blogspot.com/
