1. Realtime Analytics
with Apache
Cassandra
Tom Wilkie
Founder & CTO, Acunu Ltd
@tom_wilkie
2. 101
• BigTable-style data model combined with Dynamo-style consistency
• Simple queries: put, get, range queries
• Multi-master architecture: no SPOF
• Tunable consistency, multi-DC aware
• Optimised for random writes & random range queries
• Atomic counters, wide rows, composite keys
Analytics
3. Combining “big” and “real-time” is hard
Live & historical aggregates...
Trends...
Drill downs and roll ups
4. Solutions & their cons
• Scalability / $$$
• Not realtime
• Spartan query semantics => complex, DIY solutions
5. Example I
e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”
A batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data
http://blog.twitter.com/2011/03/numbers.html
6. Okay, so how are we going to do it?
For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters
7. Preparing the data
Step 1: Get a feed of the tweets
  12:32:15 I like #trafficlights
  12:33:43 Nobody expects...
  12:33:49 I ate a #bee; woe is...
  12:34:04 Man, @acunu rocks!
Step 2: Tokenise the tweet
Step 3: Increment counters in time buckets for each token
  [1234, man]   +1
  [1234, acunu] +1
  [1234, rock]  +1
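The three steps above can be sketched in a few lines of Python. The in-memory `Counter` is a stand-in for Cassandra's atomic counter columns (a real pipeline would issue counter increments through a Cassandra client), and the tokeniser is deliberately naive: no stemming, so "rocks" stays "rocks".

```python
import re
from collections import Counter

# In-memory stand-in for Cassandra counter columns; a real pipeline
# would issue atomic counter increments via a Cassandra client instead.
counters = Counter()

def process_tweet(bucket, text):
    # Step 2: tokenise -- naive lowercase word split, no stemming.
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    # Step 3: increment a counter per (time bucket, token).
    for token in tokens:
        counters[(bucket, token)] += 1

# Step 1's feed reduces to calling this once per tweet:
process_tweet(1234, "Man, @acunu rocks!")
```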
8. Querying
Step 1: Do a range query
  start: [01/05/11, acunu]
  end:   [30/05/11, acunu]
Step 2: Result table
  Key                      #Mentions
  [01/05/11 00:01, acunu]  3
  [01/05/11 00:02, acunu]  5
  ...                      ...
Step 3: Plot a pretty graph
  [bar chart: mentions per day, May–Nov]
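A sketch of the query side, under the same assumptions: the dict stands in for a Cassandra column family, and the filter-and-sort stands in for a real range (slice) query over sorted keys. Note the DD/MM/YY strings only sort correctly within a month; a real schema would use sortable timestamps.

```python
# Hypothetical in-memory counter table; keys are (time bucket, token).
counters = {
    ("01/05/11 00:01", "acunu"): 3,
    ("01/05/11 00:02", "acunu"): 5,
    ("02/05/11 00:07", "acunu"): 2,
}

def range_query(token, start, end):
    # Step 1: a Cassandra range query over sorted keys, emulated here
    # by filtering and sorting the dict.
    return sorted(
        (ts, n) for (ts, tok), n in counters.items()
        if tok == token and start <= ts <= end
    )

# Step 2: the result table for the first of May.
rows = range_query("acunu", "01/05/11", "01/05/11 23:59")
```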
9. [ring diagram: keys k1–k4 scattered around the ring]
Cassandra keys are distributed based on the hash of the row key, i.e. randomly
10. Instead of this...
  Key                      #Mentions
  [01/05/11 00:01, acunu]  3
  [01/05/11 00:02, acunu]  5
  ...                      ...
...we do this:
  Key                00:01  00:02  ...
  [01/05/11, acunu]  3      5      ...
  [02/05/11, acunu]  12     4      ...
  ...                ...    ...
Row key is the ‘big’ time bucket; column key is the ‘small’ time bucket
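The big/small bucket split can be sketched as follows; the nested dict stands in for wide rows (row key outer, column key inner), with day/minute as assumed bucket sizes.

```python
from collections import defaultdict
from datetime import datetime

# Wide-row sketch: row key = (day, token), column key = minute,
# so one day's counts land in a single contiguous row.
rows = defaultdict(lambda: defaultdict(int))

def increment(ts, token):
    day = ts.strftime("%d/%m/%y")    # 'big' bucket -> row key
    minute = ts.strftime("%H:%M")    # 'small' bucket -> column key
    rows[(day, token)][minute] += 1

increment(datetime(2011, 5, 1, 0, 1), "acunu")
increment(datetime(2011, 5, 1, 0, 1), "acunu")
increment(datetime(2011, 5, 1, 0, 2), "acunu")
```

Because the row key is the coarse bucket, a whole day of counts is a single row read, while the fine-grained buckets remain individually addressable as columns.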
11. Towards a more general solution... (Example II)
12. Metrics:
• count
• count distinct(session)
• avg(duration)
• ...
grouped by day, geography, browser, ...
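One way to generalise the scheme, sketched under assumed dimension names: on each event, increment a counter for every subset of the group-by dimensions, so any grouping, including the grand total (keyed by the empty tuple), is a single counter read.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("hour", "geo", "user")   # hypothetical dimensions
counters = defaultdict(int)

def record(event):
    dims = [(d, event[d]) for d in DIMENSIONS]
    # Increment one counter per subset of dimensions; () is the total.
    for r in range(len(dims) + 1):
        for combo in combinations(dims, r):
            counters[combo] += 1

record({"hour": "22:00", "geo": "UK", "user": "user01"})
record({"hour": "22:00", "geo": "US", "user": "user01"})
```

The write cost grows with 2^(number of dimensions), which is the trade the slides make: more work per event in exchange for constant-time queries.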
17–20. Answering queries from the counter table
  where time 21:00-22:00, count(*)        → row [21:00], column all
  where time 22:00-23:00, group by minute → row [22:00], minute columns
  where geography=UK, group all by user   → row [UK], user columns
  count all                               → row [∅], column all
  group all by geo                        → row [∅], geo columns

  Key        Columns
  21:00      all→1345   :00→45    :01→62     :02→87    ...
  22:00      all→3222   :00→22    :01→19     :02→105   ...
  ...
  UK         all→229    user01→2  user14→12  user99→7  ...
  US         all→354    user01→4  user04→8   user56→17 ...
  ...
  UK, 22:00  all→1905   ...
  ∅          all→87315  UK→239    US→354     ...
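Against such a table, each of the queries above becomes a single row or column read; a sketch with the slide's numbers hard-coded into an in-memory stand-in:

```python
# Hypothetical in-memory stand-in for the counter rows on the slide.
table = {
    "21:00": {"all": 1345, ":00": 45, ":01": 62, ":02": 87},
    "22:00": {"all": 3222, ":00": 22, ":01": 19, ":02": 105},
    "UK":    {"all": 229, "user01": 2, "user14": 12, "user99": 7},
    "∅":     {"all": 87315, "UK": 239, "US": 354},
}

# where time 21:00-22:00, count(*): one column read.
count_2100 = table["21:00"]["all"]

# where time 22:00-23:00, group by minute: one row read.
by_minute = {c: v for c, v in table["22:00"].items() if c != "all"}

# count all: the empty-grouping row.
total = table["∅"]["all"]
```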
23. Count Distinct
Plan A: keep a list of all the things you’ve seen; count them at query time
• Quick to update
... but at scale ...
• Takes lots of space
• Takes a long time to query
24. Approximate Distinct
Track the max # leading zeroes seen so far in the hash:
  item  hash             leading zeroes  max so far
  x     00101001110...   2               2
  y     11010100111...   0               2
  z     00011101011...   3               3
  ...
... to see a max of M takes about 2^M distinct items
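A minimal single-stream sketch of this idea (Flajolet–Martin style), using MD5 since Python's built-in `hash()` is salted per process; all names here are illustrative, not the deck's implementation.

```python
import hashlib

def hash32(item):
    # Stable 32-bit hash of a string.
    return int.from_bytes(hashlib.md5(item.encode()).digest()[:4], "big")

def leading_zeros(h, bits=32):
    # Count leading zero bits of a `bits`-bit value.
    for i in range(bits):
        if (h >> (bits - 1 - i)) & 1:
            return i
    return bits

def approx_distinct(items):
    # If the max run of leading zeroes seen is M, we have probably
    # seen on the order of 2^M distinct items.
    max_zeros = 0
    for item in items:
        max_zeros = max(max_zeros, leading_zeros(hash32(item)))
    return 2 ** max_zeros
```

The state is a single small integer, so updates are cheap, but a lone lucky hash can skew the estimate badly; that high variance is what the next slide's sub-streams address.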
25. Approximate Distinct
To reduce variance, average over m = 2^k sub-streams:
  item  hash             index, zeroes  max so far
  x     00101001110...   0, 0           0,0,0,0
  y     11010100111...   3, 1           0,0,0,1
  z     00011101011...   0, 1           1,0,0,1
  ...
then take the harmonic mean
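A sketch of the sub-stream version: the first k bits of the hash pick one of m = 2^k sub-streams, the remaining bits feed the leading-zero count, and the per-stream maxima are combined with a harmonic mean. This is the raw HyperLogLog estimate without bias correction; the constant k and the names are illustrative.

```python
import hashlib

K = 4             # hypothetical choice: m = 2^K sub-streams
M = 2 ** K

def hash32(item):
    return int.from_bytes(hashlib.md5(item.encode()).digest()[:4], "big")

def leading_zeros(h, bits):
    for i in range(bits):
        if (h >> (bits - 1 - i)) & 1:
            return i
    return bits

def approx_distinct(items):
    maxima = [0] * M
    for item in items:
        h = hash32(item)
        idx = h >> (32 - K)                  # first K bits: sub-stream
        rest = h & ((1 << (32 - K)) - 1)     # remaining bits
        maxima[idx] = max(maxima[idx], leading_zeros(rest, 32 - K))
    # m times the harmonic mean of the per-stream estimates 2^max.
    return M * M / sum(2.0 ** -x for x in maxima)
```

Like the single-stream version, the estimate depends only on the per-stream maxima, so duplicates never change the answer.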
27. Analytics
[diagram: click stream events, sensor data, etc → counter updates → Acunu Analytics]
• Aggregate incrementally, on the fly
• Store live + historical aggregates
30. “Up and running in about 4 hours”
“We found out a competitor was scraping our data”
“We keep discovering use cases we hadn’t thought of”
http://vimeo.com/54026096
31. “Quick, efficient and easy to get started”
“We’re still finding new and interesting use cases, which just aren’t possible with our current datastores.”