Speaker: Rick Branson, Infrastructure Engineer at Instagram
Cassandra is a critical part of Instagram's large-scale site infrastructure, which supports more than 100 million active users. This talk is a practical deep dive into data models, systems architecture, and challenges encountered during the implementation process.
18. commit 3f2e4f2e5da6fe99d7f3fc13c0da09b464b3a9e0
Author: Rick Branson
Date: Wed Nov 21 09:50:21 2012 -0800
Drop key cache size on C*UA cluster: was causing heap
issues, and apparently 1GB is _WAY_ outside of the normal
range of operation for nodes of this size.
23. commit 84982635d5c807840d625c22a8bd4407c1879eba
Author: Rick Branson
Date: Thu Jan 31 09:43:56 2013 -0800
Switch Cassandra from tokens to vnodes
commit e990acc5dc69468c8a96a848695fca56e79f8b83
Author: Rick Branson
Date: Sun Feb 10 20:26:32 2013 -0800
We aren't ready for vnodes yet guys
41. user_id: TimeUUID1 TimeUUID2 ... TimeUUID101
user_id: <activity> <activity> ... <activity>
user_id: timestamp1 timestamp2 ... timestamp101
delete(<user_id>, timestamp=<timestamp101>)
Row Delete
Deletes any data on a row with a timestamp
value equal to or less than the timestamp
provided in the delete operation.
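The row delete rule above can be modeled in a few lines. This is a toy sketch, not Cassandra code: the `apply_row_delete` helper, the dict layout, and the 103-column inbox are assumptions made up for illustration.

```python
def apply_row_delete(columns, delete_timestamp):
    """Return the columns that survive a row-level delete.

    `columns` maps a column name to a (value, write_timestamp) pair.
    """
    return {
        name: (value, ts)
        for name, (value, ts) in columns.items()
        if ts > delete_timestamp  # data at or before the delete timestamp is dropped
    }


# Hypothetical inbox row: 103 activities written at timestamps 1..103.
inbox = {f"TimeUUID{i}": (f"<activity {i}>", i) for i in range(1, 104)}

# Trim everything up to and including timestamp 101 with a single write.
trimmed = apply_row_delete(inbox, delete_timestamp=101)
print(sorted(trimmed))
```

The point of the sketch is that the trim is expressed purely as a timestamp, so the writer never has to read the row to decide which columns to remove.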
48. SuperColumn = Old/Busted
AntiColumn = New/Hotness
user_id
(0, <TimeUUID>) (1, <TimeUUID>) (1, <TimeUUID>)
user_id
anti-column activity activity
"Anti-Column"
Contains an MD5 hash of the activity data it
is marking as deleted.
49. user_id
(0, <TimeUUID>) (1, <TimeUUID>) (1, <TimeUUID>)
user_id
anti-column activity activity
Composite Column
First component is zero for anti-columns,
splitting the row into two independent lists,
and ensuring the anti-columns always appear
at the head.
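A minimal sketch of the anti-column layout, assuming plain integers stand in for TimeUUIDs and a Python dict stands in for the row. The helper names (`add_activity`, `undo_activity`, `read_inbox`) are hypothetical and only illustrate the idea from the slides: composite names sort anti-columns first, and each anti-column carries the MD5 hash of the activity it cancels.

```python
import hashlib

ANTI, ACTIVITY = 0, 1  # first composite component; 0 sorts ahead of 1


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def add_activity(row, time_id, data: bytes):
    row[(ACTIVITY, time_id)] = data


def undo_activity(row, time_id, data: bytes):
    # Write-only undo: record the hash of what should disappear.
    row[(ANTI, time_id)] = md5_hex(data).encode()


def read_inbox(row):
    deleted, visible = set(), []
    # Sorting by composite name yields every anti-column before any activity.
    for kind, time_id in sorted(row):
        value = row[(kind, time_id)]
        if kind == ANTI:
            deleted.add(value.decode())
        elif md5_hex(value) not in deleted:
            visible.append(value)
    return visible


row = {}
add_activity(row, 1, b"like Z")
add_activity(row, 2, b"comment Y")
undo_activity(row, 3, b"like Z")
print(read_inbox(row))
```

Because the anti-columns are read first, a single pass over the row is enough to filter out undone activities.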
50. Replica
[A, B, C]
Replica
[A, C]
Writer Writer
insert B insert C
OK
Replica
[A, B, C]
FAIL
"like Z" undo "like Z"
Diverging Replicas: Solved
OK
51. TAKEAWAY
Read-before-write is a smell. Try to model data as
a log of user "intent" rather than manhandling the
data into place.
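One way to read this takeaway: if every write is an independent column rather than a rewritten list, replicas that see the same writes in different orders still end up identical. The sketch below is a toy model with hypothetical names (`merge_write`, the replica dicts) and a simple newest-timestamp-wins rule per column, which is an assumption for illustration.

```python
def merge_write(replica, column, value, timestamp):
    """Apply a column write; the newest timestamp wins per column."""
    current = replica.get(column)
    if current is None or timestamp > current[1]:
        replica[column] = (value, timestamp)


# Two writers each append their own column; neither reads the row first.
writes = [("B", "activity B", 1), ("C", "activity C", 2)]

replica_1 = {"A": ("activity A", 0)}
replica_2 = {"A": ("activity A", 0)}

for write in writes:
    merge_write(replica_1, *write)
for write in reversed(writes):  # same writes, opposite arrival order
    merge_write(replica_2, *write)

print(replica_1 == replica_2)
```

By contrast, a writer that reads the list `[A]` and writes back `[A, B]` can overwrite a concurrent `[A, C]`, which is the kind of divergence the log-of-intent model avoids.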
52. • Keep 30% "buffer" for trims.
• Undo without read. (thumbsup)
• Large lists suck for this. (thumbsdown)
• CASSANDRA-5527
62. For vnode clusters, multiple tokens are
selected randomly when a node is
bootstrapped.
63. IP address is effectively the "primary key"
for nodes in a ring.
64. What had happened was:
1. Rebuilding node generated entirely
new tokens and joined cluster.
2. Rest of cluster dropped the previously
stored token data associated with the
rebuilding node's IP address.
3. Token ranges shifted massively.
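The sequence above can be simulated roughly. This sketch assumes a simplified ring where each key belongs to the node holding the first token at or after it, ignores replication, and uses made-up IP addresses, a 64-bit token space, and 16 tokens per node. None of these values come from the talk.

```python
import random
from bisect import bisect_left

RING_SIZE = 2 ** 64  # assumed token space, for illustration only


def random_tokens(rng, count):
    return [rng.randrange(RING_SIZE) for _ in range(count)]


def build_ring(tokens_by_ip):
    # Flatten to a sorted list of (token, ip) pairs.
    return sorted((token, ip) for ip, tokens in tokens_by_ip.items() for token in tokens)


def primary_owner(ring, key_token):
    tokens = [token for token, _ in ring]
    index = bisect_left(tokens, key_token) % len(ring)  # wrap past the last token
    return ring[index][1]


rng = random.Random(42)
tokens_by_ip = {f"10.0.0.{n}": random_tokens(rng, 16) for n in range(1, 5)}
keys = random_tokens(rng, 1000)

ring_before = build_ring(tokens_by_ip)
before = [primary_owner(ring_before, key) for key in keys]

# The rebuilt node keeps its IP but generates entirely new tokens,
# so the tokens previously stored for that IP are replaced.
tokens_by_ip["10.0.0.3"] = random_tokens(rng, 16)
ring_after = build_ring(tokens_by_ip)
after = [primary_owner(ring_after, key) for key in keys]

moved = sum(1 for old, new in zip(before, after) if old != new)
print(f"{moved} of {len(keys)} sampled keys changed primary owner")
```

Even in this simplified model, replacing one node's tokens reassigns ranges across the whole ring, not just the ranges the rebuilt node used to hold.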
68. kill -3 <cassandra>
"AntiEntropyStage:1"
java.lang.Thread.State: RUNNABLE
<...>
at io.sstable.SSTableReader.decodeKey(SSTableReader.java:1014)
at io.sstable.SSTableReader.getPosition(SSTableReader.java:802)
at io.sstable.SSTableReader.getPosition(SSTableReader.java:717)
at io.sstable.SSTableReader.getPositionsForRanges(SSTableReader.java:664)
at streaming.StreamOut.createPendingFiles(StreamOut.java:155)
at streaming.StreamOut.transferSSTables(StreamOut.java:140)
at streaming.StreamingRepairTask.initiateStreaming(StreamingRepairTask.java:
at streaming.StreamingRepairTask.run(StreamingRepairTask.java:115)
<...>
Every repair task was scanning every
SSTable file to find ranges to repair.
69. Scan all the things.
• Standard Compaction: Only a few
dozen SSTables.
• Non-VNodes: Repair is done once per
token, and there is only one token.
72. TAKEAWAY
If you want to use VNodes and
LeveledCompactionStrategy, wait until the 1.2.6
release when CASSANDRA-5569 is merged in.
73. Where were we?
It was bad that we did not know the data was
inconsistent until we saw an increase in
user-reported problems.
74. CASSANDRA-5618
$ nodetool netstats
Mode: NORMAL
Not sending any streams.
Not receiving any streams.
Read Repair Statistics:
Attempted: 3192520
Mismatch (Blocking): 0
Mismatch (Background): 11584
Pool Name    Active   Pending    Completed
Commands        n/a         0   1837765727
Responses       n/a         1   1750784545
UPDATE COLUMN FAMILY
InboxActivitiesByUserID
WITH read_repair_chance = 0.01;
99.63% consistent
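The slide does not show how the consistency figure was derived. One plausible reading, sketched here as an assumption, is the share of attempted read repairs that found no mismatch, using the counters from the `nodetool netstats` output above.

```python
# Counters copied from the netstats output on this slide.
attempted = 3192520
mismatch_blocking = 0
mismatch_background = 11584

# Assumed formula: reads whose repair attempt found no mismatch.
mismatches = mismatch_blocking + mismatch_background
consistent_fraction = 1 - mismatches / attempted

print(f"{consistent_fraction:.4%} of sampled reads were consistent")
```

Since `read_repair_chance = 0.01` samples only about 1% of reads, the result is an estimate over that sample rather than an exact measure of the whole dataset.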
75. TAKEAWAY
The way to rebuild a box in a vnode cluster is to
build a brand new node, then remove the old one
with "nodetool removenode".
77. Fetch & Deserialize Time (measured from app)
Mean vs P90 (ms), trough-to-peak
78. Column Family: InboxActivitiesByUserID
SSTable count: 3264
SSTables in each level: [1, 10, 105/100, 1053/1000, 2095, 0, 0]
Space used (live): 80114509324
Space used (total): 80444164726
Memtable Columns Count: 2315159
Memtable Data Size: 112197632
Memtable Switch Count: 1312
Read Count: 316192445
Read Latency: 1.982 ms.
Write Count: 1581610760
Write Latency: 0.031 ms.
Pending Tasks: 0
Bloom Filter False Positives: 481617
Bloom Filter False Ratio: 0.08558
Bloom Filter Space Used: 54723960
Compacted row minimum size: 25
Compacted row maximum size: 545791
Compacted row mean size: 3020
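The "SSTables in each level" line can be checked against the SSTable count with a short parser. This sketch assumes that an entry written as `count/limit` marks a level holding more SSTables than its target; `parse_levels` is a hypothetical helper, not part of any Cassandra tooling.

```python
def parse_levels(line):
    """Parse 'SSTables in each level: [...]' into (count, limit) pairs."""
    body = line.split(":", 1)[1].strip().strip("[]")
    levels = []
    for entry in body.split(","):
        entry = entry.strip()
        if "/" in entry:
            # Assumed meaning: actual count / target count for the level.
            count, limit = (int(part) for part in entry.split("/"))
        else:
            count, limit = int(entry), None
        levels.append((count, limit))
    return levels


line = "SSTables in each level: [1, 10, 105/100, 1053/1000, 2095, 0, 0]"
levels = parse_levels(line)

total = sum(count for count, _ in levels)
over_limit = [
    index
    for index, (count, limit) in enumerate(levels)
    if limit is not None and count > limit
]

print(f"total SSTables: {total}, levels over target: {over_limit}")
```

The per-level counts add up to the reported SSTable count of 3264, which is a quick sanity check that the line was read correctly.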