The document discusses how Infochimps refactored their Cube analytics platform to scale from terabytes to petabytes of data. They measured performance to identify bottlenecks, harmonized the system design to better utilize resources like separating events and metrics into different databases, and tuned configurations like using capped collections. This allowed them to scale to billions of events per day while keeping metrics calculations fast. They also discussed techniques for handling increased volume like using queues, caching, and cloud infrastructure.
5. Lightweight Dashboards
• Understand what’s happening
• Understand data in context
• NOT exploratory analytics
• real-time insight...
but not just about real-time
mainline: j.mp/sqcube
hi-scale branch: j.mp/icscube
13. Events vs Metrics
Event:
• { time: "2013-02-15T01:02:03Z",
    type: "tweet", id: 8675309,
    data: {
      text: "MongoDB talk yay",
      retweet_count: 121,
      user: { screen_name: "infochimps",
              followers_count: 7851,
              lang: "en", ... } } }
Metrics:
• “# of tweets in 10s bucket at 1:02:10 on 2013-02-15”
• “# of non-english-language tweets in 1hr bucket at ...”
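A minimal sketch of how count metrics like these could be derived from events, assuming a bucket is found by flooring the event time to a multiple of the bucket width. `bucketStart` and `countInBucket` are hypothetical helpers for illustration, not Cube's API, and both counts use a 10s bucket to keep the example short.

```javascript
const TEN_SECONDS = 10 * 1000;

// Hypothetical helper: floor a timestamp to the start of its bucket.
function bucketStart(time, widthMs) {
  const ms = new Date(time).getTime();
  return new Date(Math.floor(ms / widthMs) * widthMs);
}

// Count events of one type in the bucket starting at `start`,
// optionally keeping only events that pass `filter`.
function countInBucket(events, type, start, widthMs, filter = () => true) {
  const startMs = new Date(start).getTime();
  return events.filter((e) =>
    e.type === type &&
    bucketStart(e.time, widthMs).getTime() === startMs &&
    filter(e)
  ).length;
}

// Made-up sample events in the shape shown above.
const events = [
  { time: "2013-02-15T01:02:03Z", type: "tweet", data: { user: { lang: "en" } } },
  { time: "2013-02-15T01:02:07Z", type: "tweet", data: { user: { lang: "fr" } } },
  { time: "2013-02-15T01:02:12Z", type: "tweet", data: { user: { lang: "en" } } },
];

const bucket = "2013-02-15T01:02:00Z";
const total = countInBucket(events, "tweet", bucket, TEN_SECONDS);
const nonEnglish = countInBucket(events, "tweet", bucket, TEN_SECONDS,
  (e) => e.data.user.lang !== "en");
console.log(total, nonEnglish);
```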
14. Events vs Metrics
Event:
• { time: "2013-02-15T01:02:03Z",
    type: "webreq",
    data: {
      path: "/order", method: "POST",
      duration: 50.7, status: 400,
      ua: "...MSIE 6.0..." } }
Metrics:
• “# of requests in 10s bucket at 3:05:10 on 2013-02-15”
• “Average duration of requests with 4xx status in the 5
minute bucket at 3:05:00 on 2013-02-15”
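The averaged metric can be sketched the same way: filter by type, time range, and status, then divide a sum by a count. `average4xxDuration` is a hypothetical function and the sample requests are made up; this is not how Cube's own evaluator is written.

```javascript
const FIVE_MINUTES = 5 * 60 * 1000;

// Hypothetical helper: mean of data.duration over 4xx "webreq" events
// whose time falls in [bucketStart, bucketStart + width).
function average4xxDuration(events, bucketStartIso, widthMs) {
  const start = Date.parse(bucketStartIso);
  const end = start + widthMs;
  let sum = 0;
  let count = 0;
  for (const e of events) {
    const t = Date.parse(e.time);
    const is4xx = e.data.status >= 400 && e.data.status < 500;
    if (e.type === "webreq" && t >= start && t < end && is4xx) {
      sum += e.data.duration;
      count += 1;
    }
  }
  // No matching events: report null rather than dividing by zero.
  return count === 0 ? null : sum / count;
}

const requests = [
  { time: "2013-02-15T03:05:10Z", type: "webreq", data: { duration: 50, status: 400 } },
  { time: "2013-02-15T03:07:30Z", type: "webreq", data: { duration: 70, status: 404 } },
  { time: "2013-02-15T03:08:00Z", type: "webreq", data: { duration: 900, status: 200 } },
  { time: "2013-02-15T03:10:00Z", type: "webreq", data: { duration: 30, status: 400 } },
];

const avg = average4xxDuration(requests, "2013-02-15T03:05:00Z", FIVE_MINUTES);
console.log(avg);
```

Keeping the sum and the count separately, rather than only the average, is what lets a wider bucket be computed later from narrower ones.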
28. Grok: client-side
• Made a sprayer to inject data
• invalidate a time range at max speed
• writes variously-shaped data: noise, ramp, sine, etc
• Or just reach into the DB and poke
• delete range of metrics, leave events
• delete range of events, leave metrics
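A sprayer along these lines might look like the following sketch. The `shapes` table, the `spray` function, and the `"test"` event type are all assumptions made for illustration; the actual sprayer is not shown in the slides.

```javascript
// Hypothetical shape generators: each maps step i of n to a value.
const shapes = {
  ramp: (i, n) => i / (n - 1),                      // linear rise from 0 to 1 (needs n > 1)
  sine: (i, n) => Math.sin((2 * Math.PI * i) / n),  // one full period over n steps
  noise: () => Math.random(),                       // uniform in [0, 1)
};

// Build n synthetic events spaced stepMs apart, starting at startIso.
function spray(shape, n, startIso, stepMs) {
  const start = Date.parse(startIso);
  const events = [];
  for (let i = 0; i < n; i++) {
    events.push({
      time: new Date(start + i * stepMs).toISOString(),
      type: "test",
      data: { shape, value: shapes[shape](i, n) },
    });
  }
  return events;
}

const ramp = spray("ramp", 5, "2013-02-15T01:00:00Z", 1000);
console.log(ramp.map((e) => e.data.value));
```

Known shapes make it easy to see at a glance whether a recomputed metric matches what was injected.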
29. Fault injection
• raise when packet comes in with certain flag
•{ time: "2013...", data: {...},
_raise:"db_write" }
• (only in development mode, obvs.)
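One way the flag check could be wired in, as a sketch: `makePutter`, the `store` array standing in for the database, and the `env` argument are hypothetical, and only the `_raise: "db_write"` flag comes from the slide.

```javascript
// Hypothetical event writer; `store` stands in for the database layer.
function makePutter(store, env) {
  return function putEvent(event) {
    // Honor the fault flag only in development mode.
    if (env === "development" && event._raise === "db_write") {
      throw new Error("injected fault: db_write");
    }
    store.push(event);
    return store.length;
  };
}

const devStore = [];
const putDev = makePutter(devStore, "development");

let injected = null;
try {
  putDev({ time: "2013-02-15T01:02:03Z", data: {}, _raise: "db_write" });
} catch (err) {
  injected = err.message;
}
console.log(injected, devStore.length);
```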
30. app-side tracing
metalog.event('connect',
{ method: 'ws',
ip: connection.remoteAddress,
path: request.url }, 'minor');
• “Metalog” announces lifecycle progress:
• writes to log...
• ... or as cube metrics!
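A rough sketch of the idea, assuming a logger with pluggable sinks. `makeMetalog` is a hypothetical stand-in, not the Metalog implementation in the Cube branch; the default level and the sample IP and path are also assumptions.

```javascript
// Hypothetical stand-in for Metalog: fan each lifecycle record out to sinks.
function makeMetalog(sinks) {
  return {
    // name: lifecycle step, info: details, level: e.g. 'minor'
    event(name, info, level = "major") {
      const record = { time: new Date().toISOString(), type: name, level, data: info };
      sinks.forEach((sink) => sink(record));
      return record;
    },
  };
}

const logLines = [];   // sink 1: plain log lines
const asEvents = [];   // sink 2: records shaped like Cube events

const metalog = makeMetalog([
  (r) => logLines.push(JSON.stringify(r)),
  (r) => asEvents.push({ time: r.time, type: r.type, data: r.data }),
]);

const rec = metalog.event("connect",
  { method: "ws", ip: "127.0.0.1", path: "/example" }, "minor");
console.log(logLines[0]);
```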
37. Still CPU and Memory Use
• Problem
• Mongo seems to be working
• but high resident memory and fault rate
• Memory-mapped Files
• 1 TB of data served by 4 GB of RAM is no good
38. Capped Collections
• Fixed size circular queue
• records are in order of insertion
A B C D A E F
• oldest records are discarded when full
...G H C D A E F G ...
39. Capped Collections
• Extremely efficient on write
A B C D A E F
• Extremely efficient for insertion-order reads
• Very efficient if queries are ‘local’
• events in same timebucket
typically arrived at nearby times
and so are nearby on disk
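The circular-queue behavior can be modeled in a few lines. `RingBuffer` is a toy written for illustration: it counts records, whereas a real MongoDB capped collection is created with `db.createCollection(name, { capped: true, size: <bytes> })` and is sized in bytes.

```javascript
// Toy model of a capped collection: fixed slots, oldest record overwritten.
class RingBuffer {
  constructor(capacity) {
    this.capacity = capacity;
    this.slots = new Array(capacity);
    this.count = 0;               // total records ever inserted
  }

  insert(record) {
    // The write position just advances and wraps.
    this.slots[this.count % this.capacity] = record;
    this.count += 1;
  }

  // Surviving records, oldest first (insertion order).
  inOrder() {
    const n = Math.min(this.count, this.capacity);
    const out = [];
    for (let i = this.count - n; i < this.count; i++) {
      out.push(this.slots[i % this.capacity]);
    }
    return out;
  }
}

const buf = new RingBuffer(6);
"ABCDEFGH".split("").forEach((r) => buf.insert(r));
console.log(buf.slots.join(" "));      // physical layout
console.log(buf.inOrder().join(" "));  // insertion order
```

After eight inserts into six slots, G and H have overwritten A and B, and records that arrived together still sit next to each other.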
50. Pyramidal Aggregation
5min: 90
1min: 10 20 15 25 10 10
10s: 1 5 2 0 2 0 6 4 7 1 0 2 2 3 2 4 2 2 5 5 4 6 4 1 2 7 0 0 0 1 6 0 0 1 0 3
events: ev ev ev ev ev ev ...
51. Pyramidal Aggregation
5min: (not yet computed)
1min: (not yet computed)
10s: 1 5 2 0 2 0 6 4 7 1 0 2 2 3 2 4 2 2
events: ev ev ev ev ev ev ...
52. Uses Cached Results
5min: (not yet computed)
1min: 10 20 15 25 10
10s: 1 5 2 0 2 0 6 4 7 1 0 2 2 3 2 4 2 2 5 5 4 6 4 1 2 7 0 0 0 1
events: ev ev ev ev ev ev ...
53. Pyramidal Aggregation
• calculates metrics...
• from metrics and constants ... from metrics ...
• from events
• (then stores them, cached)
5 min
1 min
10 sec
ev ev ev ev ev....
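The pyramid can be sketched as a recursive count with a cache: the lowest tier scans events, and every higher tier sums the tier below it. `makePyramid`, the in-memory `Map` cache, and the use of bare millisecond timestamps as events are assumptions for this example; Cube stores its computed metrics in MongoDB. The 10s counts reuse thirty values from the slides.

```javascript
const TIERS = [10 * 1000, 60 * 1000, 5 * 60 * 1000];  // 10s, 1min, 5min

// eventTimes: event timestamps in milliseconds.
function makePyramid(eventTimes) {
  const cache = new Map();   // "width:start" -> count
  let eventScans = 0;        // how often the raw events were scanned

  function count(tier, start) {
    const width = TIERS[tier];
    const key = width + ":" + start;
    if (cache.has(key)) return cache.get(key);

    let value = 0;
    if (tier === 0) {
      // Lowest tier: computed from events.
      eventScans += 1;
      value = eventTimes.filter((t) => t >= start && t < start + width).length;
    } else {
      // Higher tiers: computed from the metrics one tier down.
      const child = TIERS[tier - 1];
      for (let s = start; s < start + width; s += child) {
        value += count(tier - 1, s);
      }
    }
    cache.set(key, value);   // store the result for reuse
    return value;
  }

  return { count, scans: () => eventScans };
}

// Rebuild events from thirty 10s counts (five minutes of data).
const tenSecondCounts = [1, 5, 2, 0, 2, 0, 6, 4, 7, 1, 0, 2, 2, 3, 2, 4, 2, 2,
                         5, 5, 4, 6, 4, 1, 2, 7, 0, 0, 0, 1];
const eventTimes = [];
tenSecondCounts.forEach((c, i) => {
  for (let j = 0; j < c; j++) eventTimes.push(i * TIERS[0] + j);
});

const pyramid = makePyramid(eventTimes);
const fiveMin = pyramid.count(2, 0);          // fills all three tiers
const scansAfterFirst = pyramid.scans();
const secondMinute = pyramid.count(1, 60000); // served from the cache
console.log(fiveMin, secondMinute, scansAfterFirst, pyramid.scans());
```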
59. Inserts Stop Every 5s
• working
• working
• ANGRY
• ANGRY
• working
• working
60. Thanks, mongostat!
• working
• working
• ANGRY
...
• ANGRY
• working
• working
(simulated)
61. Inserts Stop Every 5s
Events Collection
...G H C D A E F G ...
• hi-speed writes
• localized reads
Metrics Collection
(diagram: records scattered across the collection)
• randomish reads
• hi-speed deletes
• updates
63. Inserts Stop Every 5s
• What’s really going on?
• Database write locks
• Events and metrics have conflicting locks
• Solution: split the databases
Events Collection
...G H C D A E F G ...
• hi-speed writes
• localized reads
Metrics Collection
(diagram: records scattered across the collection)
• randomish reads
• hi-speed deletes
67. Non-pyramidal Aggregates
• Can’t calculate from warmed metrics
• Store values with counts in metrics
• Counts can be vivified for aggregations
• Smaller footprint than full events
• Works best for dense, finite values
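A median is a typical non-pyramidal aggregate: the median of a wide bucket cannot be derived from the medians of its narrow buckets. The sketch below assumes each stored metric keeps a `{ value: count }` histogram and reads "vivified" as expanding those counts back into an ordered walk over the values. `mergeCounts` and `median` are hypothetical names, and the data is made up.

```javascript
// Merge per-bucket histograms of the form { value: count }.
function mergeCounts(histograms) {
  const merged = new Map();
  for (const h of histograms) {
    for (const [value, count] of Object.entries(h)) {
      const v = Number(value);
      merged.set(v, (merged.get(v) || 0) + count);
    }
  }
  return merged;
}

// Walk the values in order until half of the total count has been seen.
// Returns the lower median, or null for an empty histogram.
function median(merged) {
  const values = [...merged.keys()].sort((a, b) => a - b);
  const total = values.reduce((n, v) => n + merged.get(v), 0);
  if (total === 0) return null;
  const target = Math.ceil(total / 2);
  let seen = 0;
  for (const v of values) {
    seen += merged.get(v);
    if (seen >= target) return v;
  }
  return null;
}

// Two narrow buckets of durations, stored as value -> count.
const bucketA = { 10: 3, 20: 1 };   // 10, 10, 10, 20
const bucketB = { 10: 1, 50: 2 };   // 10, 50, 50

const wide = median(mergeCounts([bucketA, bucketB]));
const ofMedians = (median(mergeCounts([bucketA])) + median(mergeCounts([bucketB]))) / 2;
console.log(wide, ofMedians);
```

The merged median is 10, while averaging the two per-bucket medians gives 30, which is why the counts have to be stored. The histogram stays small only when the set of distinct values is limited, matching the "dense, finite values" caveat.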
78. Locks: update VS remove
• Uncapped metrics allow ‘remove’ as
invalidation option
• Remove doesn’t help with database locks
• It was a stupid idea anyway: that’s OK
• “Hey, poke it and see what happens!”
79. Mongo Aggregations
• Mongo has aggregations!
• Node ends up working better
• Mongo aggregations aren’t faster
• Less flexible
• Would require query language rewrite
80. Why not Graphite?
• Data model
• Metrics-centric vs Events-centric
(metrics code not intertwingled with app code)
• Environment familiarity
• Cube: d3, node.js, mongo
• Graphite: Django, Whisper, C