Realtime analytics is seen as a key approach to extracting value from the new wave of Big Data. In this talk we will discuss different approaches to realtime analytics, before focusing on how to build realtime analytics applications using Apache Cassandra. We will talk about some of the common use cases, how to model the data, and some more advanced topics, such as algorithms for approximate analytics.
Realtime analytics with Apache Cassandra - Tom Wilkie
1. Realtime Analytics with Apache Cassandra
Tom Wilkie
Founder & CTO, Acunu Ltd
@tom_wilkie
2. Combining “big” and “real-time” is hard
Live & historical aggregates...
Trends...
Drill downs and roll ups
2
Analytics
3. Solution Con
Scalability
$$$
Not realtime
Spartan query semantics =>
complex, DIY solutions
4. Example I
e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”
Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data
http://blog.twitter.com/2011/03/numbers.html
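The slide does not say how the ~4.2 TB figure was derived. One reading that reproduces it, assuming roughly 140 bytes of text per tweet (an assumption, not a number from the slide), is sketched below.

```python
# Back-of-the-envelope check of the batch workload size.
TWEETS = 30_000_000_000   # ~30 billion tweets, from the slide
BYTES_PER_TWEET = 140     # assumption: about one byte per character, 140 characters

total_bytes = TWEETS * BYTES_PER_TWEET
total_tb = total_bytes / 10**12   # decimal terabytes

print(f"~{total_tb:.1f} TB to scan for a single batch query")
```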
5. Okay, so how are we going to do it?
For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters
6. Preparing the data
Step 1: Get a feed of the tweets
    12:32:15 I like #trafficlights
    12:33:43 Nobody expects...
    12:33:49 I ate a #bee; woe is...
    12:34:04 Man, @acunu rocks!
Step 2: Tokenise the tweet
Step 3: Increment counters in time buckets for each token
    [1234, man] +1
    [1234, acunu] +1
    [1234, rock] +1
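The three steps above can be sketched in a few lines. This is a minimal in-memory illustration, not the talk's implementation: the `tokenise`, `minute_bucket` and `ingest` names are hypothetical, a Python `Counter` stands in for the counter store, and no stemming is done (so the token is `rocks`, not `rock` as on the slide).

```python
import re
from collections import Counter
from datetime import datetime

def tokenise(text):
    """Lowercase the tweet and split it into word tokens, dropping '#', '@' and punctuation."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def minute_bucket(ts):
    """Truncate a timestamp to its minute, e.g. 12:34:04 -> '1234'."""
    return ts.strftime("%H%M")

def ingest(counters, ts, text):
    """Add one to the counter for every [time bucket, token] pair in the tweet."""
    bucket = minute_bucket(ts)
    for token in tokenise(text):
        counters[(bucket, token)] += 1

counters = Counter()  # stand-in for the real counter store
ingest(counters, datetime(2011, 5, 1, 12, 34, 4), "Man, @acunu rocks!")
print(counters)
```

Because each tweet only triggers a handful of increments, the work is done at write time and no batch pass over the raw tweets is needed later.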
7. Querying
Step 1: Do a range query
    start: [01/05/11, acunu]
    end: [30/05/11, acunu]
Step 2: Result table
    Key                        #Mentions
    [01/05/11 00:01, acunu]    3
    [01/05/11 00:02, acunu]    5
    ...                        ...
Step 3: Plot pretty graph
    (graph: #mentions, 0 to 90, by month from May to Nov)
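A range query over sorted [time bucket, token] keys can be imitated in memory with binary search. The sketch below is illustrative only: the sorted list stands in for an ordered key range in the store, and `range_query` and `mentions_per_day` are hypothetical helper names. Since keys are ordered by time first, other tokens fall inside the range and have to be filtered out.

```python
from bisect import bisect_left, bisect_right
from collections import Counter
from datetime import datetime

def range_query(rows, start, end, token):
    """Return (key, count) pairs whose key lies in [(start, token), (end, token)]."""
    keys = [key for key, _ in rows]
    lo = bisect_left(keys, (start, token))
    hi = bisect_right(keys, (end, token))
    # Keys sort by time first, so drop other tokens that fall inside the range.
    return [(key, n) for key, n in rows[lo:hi] if key[1] == token]

def mentions_per_day(result):
    """Roll the per-minute counters up into one total per day, ready for plotting."""
    per_day = Counter()
    for (minute, _token), n in result:
        per_day[minute.date()] += n
    return dict(per_day)

# Hypothetical counter rows, kept sorted by key.
rows = sorted({
    (datetime(2011, 5, 1, 0, 1), "acunu"): 3,
    (datetime(2011, 5, 1, 0, 1), "bee"): 1,
    (datetime(2011, 5, 1, 0, 2), "acunu"): 5,
    (datetime(2011, 6, 1, 0, 0), "acunu"): 7,
}.items())

result = range_query(rows, datetime(2011, 5, 1), datetime(2011, 5, 30), "acunu")
print(mentions_per_day(result))
```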
8. Instead of this...
Key #Mentions
[01/05/11 00:01, acunu] 3
[01/05/11 00:02, acunu] 5
... ...
We do this
Key 00:01 00:02 ...
[01/05/11, acunu] 3 5 ...
[02/05/11, acunu] 12 4 ...
... ... ...
Row key is ‘big’ time bucket
Column key is ‘small’ time bucket
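The wide-row layout can be pictured as a map of maps: the row key holds the ‘big’ bucket (day) plus the token, and each column holds a ‘small’ bucket (minute). The sketch below uses nested Python dictionaries as a stand-in; `increment` and `mentions_on` are hypothetical names and the date format simply mirrors the slide.

```python
from collections import Counter, defaultdict
from datetime import datetime

def increment(table, ts, token):
    """Bump the minute column inside the row for (day, token)."""
    row_key = (ts.strftime("%d/%m/%y"), token)   # 'big' time bucket
    column_key = ts.strftime("%H:%M")            # 'small' time bucket
    table[row_key][column_key] += 1

def mentions_on(table, day, token):
    """Total for one day: read a single row and sum its columns."""
    return sum(table.get((day, token), {}).values())

table = defaultdict(Counter)
for _ in range(3):
    increment(table, datetime(2011, 5, 1, 0, 1), "acunu")
for _ in range(5):
    increment(table, datetime(2011, 5, 1, 0, 2), "acunu")

print(dict(table[("01/05/11", "acunu")]))
```

With this shape a per-day query touches one row per day instead of one key per minute.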
9. Towards a more general solution... (Example II)
10. count, count distinct (session), avg(duration), ...
grouped by ...
day, geography, browser, ...
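An aggregate such as avg(duration) grouped by browser can be maintained incrementally by keeping two running values per group, a sum and a count. The slides do not say how averages are stored, so the sketch below is one plausible approach with hypothetical names.

```python
from collections import defaultdict

sums = defaultdict(float)   # running total of durations per group
counts = defaultdict(int)   # running number of events per group

def record(group, duration):
    """Update both running values for the group on every event."""
    sums[group] += duration
    counts[group] += 1

def avg(group):
    """The average is derived at read time from the two running values."""
    return sums[group] / counts[group] if counts[group] else None

record("firefox", 10.0)
record("firefox", 20.0)
record("chrome", 5.0)
print(avg("firefox"), avg("chrome"))
```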
15. where time 21:00-22:00, count(*) => row 21:00, column all
where time 22:00-23:00, group by minute => row 22:00, columns :00, :01, :02, ...
Row key      Columns
21:00        all→1345   :00→45   :01→62   :02→87   ...
22:00        all→3222   :00→22   :01→19   :02→105  ...
...          ...
UK           all→229    user01→2   user14→12   user99→7   ...
US           all→354    user01→4   user04→8    user56→17  ...
...
UK, 22:00    all→1905   ...
∅            all→87315  UK→239   US→354   ...
16. where time 21:00-22:00, count(*) => row 21:00, column all
where time 22:00-23:00, group by minute => row 22:00, columns :00, :01, :02, ...
where geography=UK, group all by user => row UK, columns user01, user14, user99, ...
Row key      Columns
21:00        all→1345   :00→45   :01→62   :02→87   ...
22:00        all→3222   :00→22   :01→19   :02→105  ...
...          ...
UK           all→229    user01→2   user14→12   user99→7   ...
US           all→354    user01→4   user04→8    user56→17  ...
...
UK, 22:00    all→1905   ...
∅            all→87315  UK→239   US→354   ...
17. where time 21:00-22:00, count(*) => row 21:00, column all
where time 22:00-23:00, group by minute => row 22:00, columns :00, :01, :02, ...
where geography=UK, group all by user => row UK, columns user01, user14, user99, ...
count all => row ∅, column all
Row key      Columns
21:00        all→1345   :00→45   :01→62   :02→87   ...
22:00        all→3222   :00→22   :01→19   :02→105  ...
...          ...
UK           all→229    user01→2   user14→12   user99→7   ...
US           all→354    user01→4   user04→8    user56→17  ...
...
UK, 22:00    all→1905   ...
∅            all→87315  UK→239   US→354   ...
18. where time 21:00-22:00, count(*) => row 21:00, column all
where time 22:00-23:00, group by minute => row 22:00, columns :00, :01, :02, ...
where geography=UK, group all by user => row UK, columns user01, user14, user99, ...
count all => row ∅, column all
group all by geo => row ∅, columns UK, US, ...
Row key      Columns
21:00        all→1345   :00→45   :01→62   :02→87   ...
22:00        all→3222   :00→22   :01→19   :02→105  ...
...          ...
UK           all→229    user01→2   user14→12   user99→7   ...
US           all→354    user01→4   user04→8    user56→17  ...
...
UK, 22:00    all→1905   ...
∅            all→87315  UK→239   US→354   ...
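One way to read this slide: every event updates several counter rows at once, and each query is then answered by reading a single row. The sketch below assumes that reading; `record`, `group_by` and the `ALL` key (standing in for the ∅ row) are hypothetical names, and nested dictionaries stand in for the store.

```python
from collections import Counter, defaultdict

ALL = ()  # stands in for the ∅ row key on the slide

def record(rows, hour, minute, geo, user):
    """Fan one event out to every row that a supported query will read."""
    rows[hour]["all"] += 1
    rows[hour][minute] += 1
    rows[geo]["all"] += 1
    rows[geo][user] += 1
    rows[(geo, hour)]["all"] += 1
    rows[ALL]["all"] += 1
    rows[ALL][geo] += 1

def group_by(rows, key):
    """Return every column of a row except the running total."""
    return {col: n for col, n in rows[key].items() if col != "all"}

rows = defaultdict(Counter)
record(rows, "21:00", ":00", "UK", "user01")
record(rows, "22:00", ":01", "UK", "user14")
record(rows, "22:00", ":01", "US", "user01")

print(rows["21:00"]["all"])      # where time 21:00-22:00, count(*)
print(group_by(rows, "22:00"))   # where time 22:00-23:00, group by minute
print(group_by(rows, "UK"))      # where geography=UK, group all by user
print(rows[ALL]["all"])          # count all
print(group_by(rows, ALL))       # group all by geo
```

The trade-off is write amplification: each event costs one increment per row it touches, in exchange for queries that are single-row reads.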
21. Count Distinct
Plan A: keep a list of all the things you’ve seen
count them at query time
Quick to update
... but at scale ...
Takes lots of space
Takes a long time to query
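Plan A amounts to keeping a set per bucket and counting its members at query time. A minimal sketch with hypothetical names follows; the memory used grows with the number of distinct items, which is the scaling problem the slide points out.

```python
from collections import defaultdict

seen = defaultdict(set)   # bucket -> every distinct item seen so far

def record(bucket, item):
    """Quick to update: one set insertion per event."""
    seen[bucket].add(item)

def count_distinct(bucket):
    """Exact answer, but the whole set has to be stored and read."""
    return len(seen[bucket])

for session in ["s1", "s2", "s1", "s3", "s2"]:
    record("01/05/11", session)

print(count_distinct("01/05/11"))
```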
22. Approximate Distinct
max # leading zeroes seen so far
item   hash             leading zeroes   max so far
x      00101001110...   2                2
y      11010100111...   0                2
z      00011101011...   3                3
...
... to see a max of M takes about 2^M items
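The idea on this slide can be sketched with a single register: hash each item, count the leading zeroes, remember the maximum M, and guess roughly 2^M distinct items. The code below is an illustration under assumptions: SHA-1 truncated to 32 bits is an arbitrary hash choice, and the class and function names are hypothetical. A single register gives a very noisy estimate.

```python
import hashlib

HASH_BITS = 32  # assumption: width of the hash used for counting zeroes

def hash_item(item):
    """Map an item to a HASH_BITS-wide integer (any well-mixed hash would do)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def leading_zeroes(value, bits):
    """Number of zero bits before the first one bit in a bits-wide value."""
    return bits - value.bit_length()

class MaxZeroesSketch:
    """Keeps only the largest number of leading zeroes seen so far."""

    def __init__(self):
        self.max_zeroes = 0

    def add(self, item):
        zeroes = leading_zeroes(hash_item(item), HASH_BITS)
        self.max_zeroes = max(self.max_zeroes, zeroes)

    def estimate(self):
        # Seeing a maximum of M takes about 2^M distinct items.
        return 2 ** self.max_zeroes

sketch = MaxZeroesSketch()
for i in range(1000):
    sketch.add(f"session-{i}")
print(sketch.max_zeroes, sketch.estimate())
```

Duplicates hash to the same value, so adding an item twice never changes the register; that is what makes this a distinct count.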
23. Approximate Distinct
to reduce variance, average over m=2^k sub-streams
item   hash             index, zeroes   max so far
x      00101001110...   0, 0            0,0,0,0
y      11010100111...   3, 1            0,0,0,1
z      00011101011...   0, 1            1,0,0,1
...
take the harmonic mean
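The table above can be reproduced with k=2 (m=4 registers) and the 11-bit hash prefixes shown on the slide: the top k bits pick the register, and the leading zeroes of the remaining bits update it. The estimate below is the raw "m times the harmonic mean of 2^register" form without any bias correction, so treat it as a sketch of the idea rather than a full HyperLogLog implementation; function names are hypothetical.

```python
def split_hash(h, bits, k):
    """Top k bits select the sub-stream; count leading zeroes in the rest."""
    index = h >> (bits - k)
    rest = h & ((1 << (bits - k)) - 1)
    zeroes = (bits - k) - rest.bit_length()
    return index, zeroes

def update(registers, h, bits, k):
    """Keep the maximum number of zeroes seen per sub-stream."""
    index, zeroes = split_hash(h, bits, k)
    registers[index] = max(registers[index], zeroes)

def raw_estimate(registers):
    """m times the harmonic mean of 2^register (uncorrected, illustrative only)."""
    m = len(registers)
    harmonic_mean = m / sum(2.0 ** -r for r in registers)
    return m * harmonic_mean

BITS, K = 11, 2            # matches the truncated hashes on the slide
registers = [0] * (2 ** K)
for bits_str in ["00101001110", "11010100111", "00011101011"]:   # x, y, z
    update(registers, int(bits_str, 2), BITS, K)

print(registers, raw_estimate(registers))
```

Practical implementations add a correction constant and special handling for small and large counts; those details are outside what the slide shows.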
25. Acunu Analytics
Click stream events, sensor data, etc => counter updates => Acunu Analytics
• Aggregate incrementally, on the fly
• Store live + historical aggregates
28. “Up and running in about 4 hours”
“We found out a competitor was scraping our data”
“We keep discovering use cases we hadn’t thought of”
29. "Quick, efficient and easy to get started"
"We're still finding new and interesting use cases, which just aren't possible with our current datastores."