This is a talk about Netflix's path to Cassandra. The first few slides may look similar to previous presentations, but they are just to set the context. Most of the content is brand new!
3. Motivation
Circa late 2008, Netflix had a single data center
Single-point-of-failure (a.k.a. SPOF)
Approaching limits on cooling, power, space, and traffic capacity
Alternatives
Build more data centers
Outsource the majority of capacity planning and scale out
Allows us to focus on core competencies
4. Motivation
Winner: Outsource the majority of capacity planning and scale out
Leverage a leading Infrastructure-as-a-Service provider
Amazon Web Services
6. Cloud Migration Strategy
Components
Applications and Software Infrastructure
Data
Migration Considerations
Avoid sensitive data for now
PII and PCI DSS data stay in our DC; the rest can go to the cloud
Favor web-scale applications & data
7. Cloud Migration Strategy
Examples of Data that can be moved
Video-centric data
Critics’ and Users’ reviews
Video Metadata (e.g. director, actors, plot description, etc…)
User-video-centric data – some of our largest data sets
Video Queue
Watched History
Video Ratings (i.e. a 5-star rating system)
Video Playback Metadata (e.g. streaming bookmarks, activity logs)
13. Pick a Data Store in the Cloud
An ideal storage solution should have the following features:
• Be hosted in AWS
• We wanted a database-as-a-service
• Be highly scalable and available and have acceptable latencies
• It should automatically scale with Netflix’s traffic growth
• It should be as available as television – i.e. zero downtime
• Support SQL
• Developers already familiar with the model
14. Pick a Data Store in the Cloud
We picked SimpleDB and S3
SimpleDB was targeted as the AP (cf. CAP theorem) equivalent of our RDBMS databases in our data center
S3 was used for data sets where item or row data exceeded SimpleDB limits and could be looked up purely by a single key (i.e. does not require secondary indices and complex query semantics)
Video encodes
Streaming device activity logs (i.e. CLOB, BLOB, etc…)
Compressed (old) Rental History
16. Technology Overview : SimpleDB
Terminology
SimpleDB   | Hash Table              | Relational Databases
Domain     | Hash Table              | Table
Item       | Entry                   | Row
Item Name  | Key                     | Mandatory Primary Key
Attribute  | Part of the Entry Value | Column
17. Technology Overview : SimpleDB
Soccer Players
Key         | Value
ab12ocs12v9 | First Name = Harold, Last Name = Kewell, Nickname = Wizard of Oz, Teams = Leeds United, Liverpool, Galatasaray
b24h3b3403b | First Name = Pavel, Last Name = Nedved, Nickname = Czech Cannon, Teams = Lazio, Juventus
cc89c9dc892 | First Name = Cristiano, Last Name = Ronaldo, Teams = Sporting, Manchester United, Real Madrid
SimpleDB’s salient characteristics
• SimpleDB offers a range of consistency options
• SimpleDB domains are sparse and schema-less
• The Key and all Attributes are indexed
• Each item must have a unique Key
• An item contains a set of Attributes
• Each Attribute has a name
• Each Attribute has a set of values
• All data is stored as UTF-8 character strings (i.e. no support for types such as numbers or dates)
18. Technology Overview : SimpleDB
What does the API look like?
Manage Domains
CreateDomain
DeleteDomain
ListDomains
DomainMetaData
Access Data
Retrieving Data
GetAttributes – returns a single item
Select – returns multiple items using SQL syntax
Writing Data
PutAttributes – put single item
BatchPutAttributes – put multiple items
Removing Data
DeleteAttributes – delete single item
BatchDeleteAttributes – delete multiple items
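For concreteness, here is a minimal sketch of a few of these calls using the AWS SDK for Java (v1) SimpleDB client; the domain, item name, and attributes reuse the Soccer Players example from slide 17, and client construction is omitted. This is illustrative only, not how Netflix wired it up.

```java
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.model.CreateDomainRequest;
import com.amazonaws.services.simpledb.model.GetAttributesRequest;
import com.amazonaws.services.simpledb.model.Item;
import com.amazonaws.services.simpledb.model.PutAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;
import com.amazonaws.services.simpledb.model.SelectRequest;

public class SimpleDbSketch {

    // sdb is an already-configured SimpleDB client (construction omitted).
    static void demo(AmazonSimpleDB sdb) {
        // Manage domains
        sdb.createDomain(new CreateDomainRequest("Players"));

        // Write a single item (the Ronaldo row from the earlier slide)
        sdb.putAttributes(new PutAttributesRequest()
                .withDomainName("Players")
                .withItemName("cc89c9dc892")
                .withAttributes(
                        new ReplaceableAttribute("FirstName", "Cristiano", true),
                        new ReplaceableAttribute("LastName", "Ronaldo", true),
                        new ReplaceableAttribute("Teams", "Real Madrid", false)));

        // Read a single item by its key
        System.out.println(
                sdb.getAttributes(new GetAttributesRequest("Players", "cc89c9dc892")));

        // Read multiple items with the SQL-like select syntax
        for (Item item : sdb.select(
                new SelectRequest("select * from Players where LastName = 'Ronaldo'")).getItems()) {
            System.out.println(item.getName() + " -> " + item.getAttributes());
        }
    }
}
```

The batch calls (BatchPutAttributes, BatchDeleteAttributes) follow the same style, taking a list of items per request.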
19. Technology Overview : SimpleDB
Options available on reads and writes
Consistent Read
Read the most recently committed write
May have lower throughput / higher latency / lower availability
Conditional Put/Delete
i.e. Optimistic Locking
Useful if you want to build a consistent multi-master data store – you will still require your own consistency checking
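A sketch of a conditional put, again assuming the AWS SDK for Java v1 client; the `version` attribute here is a hypothetical column used purely as the optimistic-locking check.

```java
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.model.PutAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;
import com.amazonaws.services.simpledb.model.UpdateCondition;

public class ConditionalPutSketch {

    // Optimistic locking: only apply the write if the stored version is still "3".
    static void conditionalPut(AmazonSimpleDB sdb) {
        sdb.putAttributes(new PutAttributesRequest()
                .withDomainName("Players")
                .withItemName("cc89c9dc892")
                .withAttributes(
                        new ReplaceableAttribute("Teams", "Real Madrid", true),
                        new ReplaceableAttribute("version", "4", true))
                // The put is rejected if another writer changed "version" since we read it.
                .withExpected(new UpdateCondition("version", "3", true)));
    }
}
```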
21. Major Issues with SimpleDB
Manual data set partitioning was needed to work around size and throughput limits
Read & Write latency variance could be large
SimpleDB multi-tenancy created both throughput and latency issues
Bad cost model – poor performance cost us more money
Not global – new requirement: global expansion
No (external) back-up and recovery available – new requirement: refresh Test DBs from Prod DBs
23. Pick Another Data Store in the Cloud
An ideal storage solution should have the following features:
• New Requirements
• Support Global Use-cases
• Support Back-up and Recovery
• Retained Requirements
• Be highly scalable and available and have acceptable latencies
• Obsolete Requirements
• Be hosted in AWS
• Support SQL
24. Pick Another Data Store in the Cloud
We picked Cassandra (i.e. Dynamo + BigTable)
• Support Global Use-cases
• Easy to add new nodes to the cluster, even if the new nodes are on a different continent
• Be easy to own and operate
• Dynamo’s masterless design avoids SPOF
• Dynamo’s repair mechanisms (i.e. read repair, anti-entropy repair, and hinted handoff) promote self-management
25. Pick Another Data Store in the Cloud
We picked Cassandra (i.e. Dynamo + BigTable)
• Be highly scalable and available and have acceptable latencies
• Known to scale for writes
• Netflix achieved >1 million writes/sec
• Netflix leverages caches for reads
• Bonus
• Data model identical to SimpleDB – i.e. one less thing for developers to learn
27. Technology Overview : Cassandra
Features
• Consistent Hashing
• Tunable Consistency at the Request-level (like SimpleDB)
• Automatic Healing
• Read Repair
• Anti-entropy Repair
• Hinted-Handoff
• Failure Detection
• Clusters are configurable & upgradeable without restart
• Infinite Incremental Write Scalability
28. Consistent Hashing
How does Consistent Hashing work in Cassandra?
• Take a number line from [0, 159]
• Wrap it on itself, so you now have a number ring
29. Consistent Hashing
• Given a key k, map the key to the ring using:
• hash_func(k) % 160 = token = position on the ring
30. Consistent Hashing
• You can then manually map machines to the ring
• 8 machines are mapped to the ring here
• N1 → 0, N2 → 20, N3 → 40, …, N8 → 140
31. Consistent Hashing
Now map key ranges to machines:
• Host N2 owns all tokens in the range (0, 20]
• In other words, “foo” is mapped to the bucket 9 but assigned to server N2
• Host N5 owns all tokens in the range (60, 80]
• “bar” is mapped to bucket 79 and assigned to server N5
32. Consistent Hashing : Dead node?
What happens if a node dies or becomes unresponsive?
With a replication factor of, say, 3, data is always written to 3 places
• Writes to tokens in N2’s primary token range are also written to N3 and N4
• Writes to tokens in N5’s primary token range are also written to N6 and N7
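The placement rule described on slides 30-32 can be sketched in a few lines of Java. This is a toy model, not Cassandra source: tokens live on a ring of size 160, each node owns the range ending at its own token, and with RF=3 the next two nodes clockwise hold the replicas. `String.hashCode()` is only a stand-in for the real hash function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class ToyRing {
    private static final int RING_SIZE = 160;
    private final TreeMap<Integer, String> ring = new TreeMap<>(); // token -> node name

    void addNode(int token, String name) { ring.put(token, name); }

    // Stand-in for hash_func(k) % 160 from slide 29.
    int token(String key) { return Math.abs(key.hashCode()) % RING_SIZE; }

    // Primary owner: the first node clockwise whose token is >= the key's token,
    // wrapping back to the lowest token (so the node at 20 owns (0, 20], etc.).
    String primaryFor(int keyToken) {
        Integer t = ring.ceilingKey(keyToken);
        return ring.get(t != null ? t : ring.firstKey());
    }

    // With replication factor rf, the write also lands on the next (rf - 1) nodes clockwise.
    List<String> replicasFor(int keyToken, int rf) {
        List<String> replicas = new ArrayList<>();
        Integer t = ring.ceilingKey(keyToken);
        if (t == null) t = ring.firstKey();
        while (replicas.size() < Math.min(rf, ring.size())) {
            replicas.add(ring.get(t));
            t = ring.higherKey(t);
            if (t == null) t = ring.firstKey();
        }
        return replicas;
    }

    public static void main(String[] args) {
        ToyRing ring = new ToyRing();
        for (int i = 1; i <= 8; i++) ring.addNode((i - 1) * 20, "N" + i); // N1=0 ... N8=140
        System.out.println(ring.primaryFor(9));       // N2, like "foo" on slide 31
        System.out.println(ring.replicasFor(79, 3));  // [N5, N6, N7], like "bar" on slide 32
    }
}
```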
34. The Write Path
• Cassandra clients are not required to know the token-ring mapping
• A client can send a request to any node in the cluster
• The receiving node is called the “coordinator” for that request
• The coordinator will always execute <Replication Factor> number of writes (e.g. 3)
35. The Write Path
• Coordinators take care of:
• Key routing
• Executing the consistency level of the request
• e.g. “CL=1” means that the coordinator will wait till 1 node ACKs the write before sending a response to the client
• e.g. “CL=Quorum” means that the coordinator will wait till 2 of 3 nodes ACK the write
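Conceptually (this is a sketch, not Cassandra's implementation), "executing the consistency level" means the coordinator sends the write to all RF replicas but answers the client as soon as the required number of ACKs arrive, e.g. 1 for CL=1 or 2 of 3 for CL=Quorum with RF=3:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class CoordinatorSketch {

    interface Replica {
        CompletableFuture<Void> write(String key, String value); // async replica write
    }

    // Returns true once `required` replicas have ACKed within the timeout;
    // the remaining replica writes keep going in the background.
    static boolean coordinateWrite(List<Replica> replicas, String key, String value,
                                   int required, long timeoutMs) throws InterruptedException {
        CountDownLatch acks = new CountDownLatch(required);
        for (Replica replica : replicas) {
            replica.write(key, value).thenRun(acks::countDown); // hinted handoff omitted
        }
        return acks.await(timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```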
36. The Write Path
• If node N3 is presumed dead by the failure detection algorithm, or if N3 is too busy to respond
• N5 will log the write to N3 and deliver it as soon as N3 comes back
• This is called Hinted Handoff
37. The Write Path
• Now that N5 is holding uncommitted writes, what happens if N5 dies before it can replay the failed write?
• How will the logged write ever make it to N3?
• Another repair mechanism helps in this case: Read Repair (stay tuned)
39. The Read Path
• Again, N5 acts as a proxy, this time for a get request
• Assume that the consistency_level on the request is “Quorum”; N5 will send a full read request to N2 and digest requests to N3 & N4
• This is a network optimization
40. The Read Path : Read Repair
• N5 will compare the responses as soon as any 2 requests return
• At least one response will be a digest (e.g. say from N2 and N3)
• If the digest is more recent than the full read, N5 doesn’t have the latest data
• N5 then asks the N3 replica for a full read
41. The Read Path : Read Repair
• Once N5 receives the response of the full read from N3, N5 returns this data to the client
• But wait! What do we do about the stale data on N2?
• N5 then schedules an async update for the stale node(s)
• This is called read repair
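A conceptual sketch (not Cassandra source) of the quorum read and read repair just described: the coordinator issues one full read and one digest read, uses timestamps as a stand-in for real digests, returns the newest copy to the client, and asynchronously repairs the stale replica.

```java
import java.util.concurrent.CompletableFuture;

public class ReadRepairSketch {

    static final class Versioned {
        final String value;
        final long timestamp;
        Versioned(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    interface Replica {
        Versioned fullRead(String key);   // full data plus its timestamp
        long digestRead(String key);      // timestamp only (stand-in for a digest)
        void repair(String key, Versioned latest);
    }

    // CL=Quorum with one full read (e.g. from N2) and one digest read (e.g. from N3).
    static Versioned quorumRead(String key, Replica fullReplica, Replica digestReplica) {
        Versioned data = fullReplica.fullRead(key);
        long digestTimestamp = digestReplica.digestRead(key);

        if (digestTimestamp > data.timestamp) {
            // The digest replica is newer: fetch the full row from it before answering the client.
            data = digestReplica.fullRead(key);
            Versioned latest = data;
            // Read repair: asynchronously bring the stale replica up to date.
            CompletableFuture.runAsync(() -> fullReplica.repair(key, latest));
        } else if (digestTimestamp < data.timestamp) {
            Versioned latest = data;
            CompletableFuture.runAsync(() -> digestReplica.repair(key, latest));
        }
        return data; // answer the client with the newest copy seen
    }
}
```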
43. Consistent Hashing : Growing the Ring
• To double the capacity of the ring and keep the data distributed evenly, we add nodes (N9, N10, etc…) in the interstitial positions (10, 30, etc…)
• Cassandra bootstraps new nodes via data streaming
• For large data sets, Netflix recovers from backup and runs AER (anti-entropy repair)
44. Consistent Hashing : Growing the Ring
• After new nodes have been added, they take over half of the primary token ranges of each old node
45. A Global Ring
• One of the benefits of Cassandra is that it can support deployment of a global ring
46. A Global Ring
• In January 2012, Netflix rolled out its streaming service to the UK and Ireland
• It did this by running a few Cassandra clusters between Virginia and the UK
47. A Global Ring
• Growing the ring into the UK from its origin in Virginia was easy
• Double the ring as mentioned before
• New interstitial nodes are in the UK
• Increase the global replication factor from 3 to 6
• Now each key is covered by 6 nodes, 3 in Virginia, 3 in the UK
49. Single Node Design
A single node is implemented as per the LSM tree (i.e. log-structured merge tree) pattern:
• Writes first append to the commit log and then write to the memtable in memory
• This model gives fast writes
When the memtable is full, it is flushed to disk as a new SSTable
The SSTables are immutable
50. Single Node Design
In a pure KV LSM store (e.g. LevelDB), when a read occurs, the memtable is first consulted
If the key is found, the data is returned from the memtable
This is a fast read
If the data is not in the memtable, then it is read from the latest SSTable
This is a slower read
51. Single Node Design
Since Cassandra supports multiple value columns, only a subset of which may be written at any time, a read must consult the memtable and potentially multiple SSTables
Hence, reads can be slower than they would be for pure LSM KV stores
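A toy sketch of the single-node write and read paths from slides 49-51 (not Cassandra's implementation): writes append to a commit log and land in an in-memory memtable, a full memtable is frozen as an immutable sorted "SSTable", and reads check the memtable first and then SSTables newest-first. The commit-log path and flush threshold are illustrative.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

public class LsmSketch {
    private final FileWriter commitLog;
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final Deque<TreeMap<String, String>> sstables = new ArrayDeque<>(); // newest first
    private final int flushThreshold;

    LsmSketch(String commitLogPath, int flushThreshold) throws IOException {
        this.commitLog = new FileWriter(commitLogPath, true); // append mode
        this.flushThreshold = flushThreshold;
    }

    void put(String key, String value) throws IOException {
        commitLog.write(key + "=" + value + "\n"); // 1. durable append to the commit log
        commitLog.flush();
        memtable.put(key, value);                  // 2. fast in-memory write
        if (memtable.size() >= flushThreshold) {   // 3. freeze a full memtable as an immutable SSTable
            sstables.addFirst(memtable);
            memtable = new TreeMap<>();
        }
    }

    String get(String key) {
        String value = memtable.get(key);          // fast path: memtable
        if (value != null) return value;
        for (Map<String, String> sstable : sstables) { // slower path: newest SSTable first
            value = sstable.get(key);
            if (value != null) return value;
        }
        return null;
    }
}
```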
52. Single Node Design
To reduce the number of SSTables that need to be read, periodic compaction needs to be run
This puts an additional load on GC and I/O
The end result of compaction is fewer files with hopefully fewer total rows
To mitigate the problems inherent in compaction, Cassandra 1.x supports Leveled Compaction
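And a matching toy sketch of what compaction accomplishes, not of Cassandra's size-tiered or leveled algorithms: merge several SSTables into one, keeping only the newest value per key so later reads touch fewer files.

```java
import java.util.List;
import java.util.TreeMap;

public class CompactionSketch {

    // Input is ordered newest-first, matching the read path above; the merged output
    // keeps the first (newest) value seen for each key and drops older duplicates.
    static TreeMap<String, String> compact(List<TreeMap<String, String>> sstablesNewestFirst) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> sstable : sstablesNewestFirst) {
            sstable.forEach(merged::putIfAbsent);
        }
        return merged;
    }
}
```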
54. Current Status
Timeline of Migration
• Jan 2009 – Start investigating the AWS cloud & SimpleDB
• Feb 2010 – Deliver the app to production
• Dec 2010 – ~95% of Netflix traffic has been moved into both the AWS cloud and SimpleDB (+10 months)
• Apr 2011 – Start migration to Cassandra
• Jan 2012 – Netflix EU launched on Cassandra (+9 months)
Editor's Notes
Existing functionality needs to move in phases. This limits the risk and exposure to bugs, and limits conflicts with new product launches.