Talk for the Cassandra Seattle Meetup April 2013: http://www.meetup.com/cassandra-seattle/events/114988872/
Cassandra has properties that make it an ideal fit for building real-time analytics applications, but getting from atomic increments to live dashboards and streaming queries is quite a stretch. In this talk, Tim Moreton, CTO at Acunu, explains how and why they built Acunu Analytics, which adds rich SQL-like queries and a RESTful API on top of Cassandra, and shows how it keeps Cassandra's spirit of denormalization under the hood.
2. WE C*
• Scalable. No single point of {failure, bottleneck}
• Fast. Especially for writes
• Available. Effortless Multi-DC support
• Maturing fast. Lots of production deployments
Monday, 29 April 13
7. C*: Two uses
Real-time analytics
02:44:02 241.24.41 0.0.1 GET /index.html
• Many more writes than reads
• Almost all reads are to results
• Almost no writes are ‘updates’
• Distribute for availability, performance, capacity
Session storage
02:44:02 241.24.41 0.0.1 GET /index.html
• Many more reads than writes
• Updates to existing records (ideally, transactionally)
• Probably fits in RAM: distribute for availability
9. C*on
Supplement Cassandra with:
• Rich, SQL-like queries
• RESTful HTTP APIs, JSON-based
• Automated denormalization
• Update semantics (less critical for analytics)
16. Who uses Acunu?
• Location Data
• Web and Visitor
• Market/Tick Data
• Infrastructure
• Sensor Data
• Social Media
• Social Gaming
• Smart Grid
• Production Line
20. [Architecture diagram; labels: API, event stream, event store, roll-up cubes, Ingest, Processing, dashboard, queries, programmatic interface]
• Cassandra stores raw events and intermediate aggregates
• Acunu Analytics is a Cassandra client mapping new events, queries and schema changes to aggregate reads and writes
• Acunu Dashboards provides embeddable, custom data visualization using the HTTP API
21. (Loosely) Define a schema
CREATE TABLE APICalls (
    time TIME('PST', HOUR, MIN, SEC),
    path PATH(/),
    useragent STRING,
    latitude DOUBLE(0.1, 0.01),
    longitude DOUBLE(0.1, 0.01)
);
CREATE CUBE SELECT COUNT, AVG(respTime) FROM APICalls
WHERE time, path
GROUP BY time, path;
CREATE CUBE SELECT COUNT FROM APICalls
WHERE latitude, longitude
GROUP BY latitude, longitude;
• Tables have an HTTP endpoint and map to a set of ColumnFamilies
• Dimensions map keys in events and allow hierarchical aggregation
• Cubes define the dimensions and aggregates to maintain
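To make "hierarchical aggregation" concrete, here is a minimal Python sketch of how one event value might expand into several levels of a dimension. It is not Acunu's implementation: `time_levels` and `path_levels` are hypothetical helpers, and the sketch assumes that TIME(HOUR, MIN, SEC) means bucketing at each of those granularities and that PATH(/) means every prefix of a "/"-separated path.

```python
from datetime import datetime


def time_levels(ts: datetime) -> list[str]:
    """Bucket a timestamp at hour, minute and second granularity.

    Assumed reading of TIME(HOUR, MIN, SEC); time zone handling is omitted.
    """
    return [
        ts.strftime("%Y-%m-%d %H"),
        ts.strftime("%Y-%m-%d %H:%M"),
        ts.strftime("%Y-%m-%d %H:%M:%S"),
    ]


def path_levels(path: str, sep: str = "/") -> list[str]:
    """Return every prefix of a separated path, from the root down."""
    parts = [p for p in path.split(sep) if p]
    return [sep] + [sep + sep.join(parts[:i]) for i in range(1, len(parts) + 1)]


if __name__ == "__main__":
    print(time_levels(datetime(2013, 4, 29, 2, 44, 2)))
    print(path_levels("/usr/local/bin"))
```

Under these assumptions, an aggregate kept per level could answer a query at any granularity (per hour, or for everything under /usr) without scanning raw events.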
24. Aggregation
CREATE CUBE SELECT SUM(a) FROM t
WHERE x, y GROUP BY g, h, i;
[Diagram: cube with axes x, y and (g, h, i); cell value v. New event {A: v', X: x, Y: y, Z: z}: apply SUM(v, v') on this cell]
• Hierarchical dimensions cause multiple writes per event (that's OK: Cassandra's good at writes)
• Most aggregates result in atomic counter increments
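The update step on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Acunu's code: `SumCube` is a hypothetical in-memory stand-in for a cube, with nested dictionaries playing the role of Cassandra rows and counter columns, and hierarchical dimensions are left out (each level would simply trigger one more `apply`-style write).

```python
from collections import defaultdict


class SumCube:
    """Hypothetical in-memory cube: row key -> column key -> running SUM."""

    def __init__(self, where_dims, group_dims, measure):
        self.where_dims = where_dims    # dimensions named in WHERE
        self.group_dims = group_dims    # dimensions named in GROUP BY
        self.measure = measure          # field being summed
        self.cells = defaultdict(lambda: defaultdict(int))

    def apply(self, event):
        """Fold one event into the single cell it addresses."""
        row = tuple(event[d] for d in self.where_dims)
        col = tuple(event[d] for d in self.group_dims)
        # SUM(v, v'): add the event's value v' to the stored value v.
        self.cells[row][col] += event[self.measure]


if __name__ == "__main__":
    # CREATE CUBE SELECT SUM(a) FROM t WHERE x, y GROUP BY g, h, i;
    cube = SumCube(("x", "y"), ("g", "h", "i"), "a")
    cube.apply({"a": 3, "x": 1, "y": 2, "g": "p", "h": "q", "i": "r"})
    cube.apply({"a": 4, "x": 1, "y": 2, "g": "p", "h": "q", "i": "r"})
    print(cube.cells[(1, 2)][("p", "q", "r")])
```

Because the update is a pure increment, it needs no read before the write, which is why it maps naturally onto counter increments.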
26. Queries
SELECT SUM(a) FROM t
WHERE x = .. and y = .. GROUP BY g, h, i;
New query:
• Locate slice that matches WHERE
• Return all mappings from GROUP BY tuples to cell values
[Diagram: cube with axes x, y and (g, h, i); cell value v]
• WHEREs map to a Cassandra row and GROUP BY to a compound column key in that row (very roughly)
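A rough Python sketch of the read path described here, assuming the cube is laid out as a mapping from a row key built from the WHERE values to columns keyed by the GROUP BY tuple. The `cells` contents and the `query` helper are made up for illustration and are not part of any Acunu API.

```python
# Row key (x, y) -> {column key (g, h, i): SUM(a)}.
# All values below are invented sample data.
cells = {
    (1, 2): {("p", "q", "r"): 7, ("p", "q", "s"): 5},
    (1, 3): {("p", "q", "r"): 2},
}


def query(cells, x, y):
    """Locate the slice matching WHERE x = .. AND y = .. and return
    every GROUP BY tuple -> aggregate mapping stored in it."""
    return dict(cells.get((x, y), {}))


if __name__ == "__main__":
    print(query(cells, 1, 2))
```

The query does no aggregation of its own: the work was done at write time, so a read is a single row lookup.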
27. A concrete example
21:00      all→1345   :00→45    :01→62     :02→87    ...
22:00      all→3221   :00→22    :01→19     :02→104   ...
...        ...
UK         all→228    user01→1  user14→12  user99→7  ...
US         all→354    user01→4  user04→8   user56→17 ...
...
UK, 22:00  all→1904   ...
∅          all→87314  UK→238    US→354     ...
30. A concrete example
21:00      all→1345   :00→45    :01→62     :02→87    ...
22:00      all→3222   :00→22    :01→19     :02→105   ...
...        ...
UK         all→229    user01→2  user14→12  user99→7  ...
US         all→354    user01→4  user04→8   user56→17 ...
...
UK, 22:00  all→1905   ...
∅          all→87315  UK→239    US→355     ...
{
cust_id: user01,
session_id: 102,
geography: UK,
browser: IE,
time: 22:02,
}
Each event updates multiple aggregates:
• WHERE time IN (22:00,23:00) GROUP BY minute
• WHERE geography=US GROUP BY user
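The fan-out from one event to several counters can be replayed with a short Python sketch. It seeds only the cells that the UK event at 22:02 touches, using the "before" values from the earlier slide, and then applies the event. The `apply_event` helper and the way row and column keys are derived are assumptions made for illustration, not Acunu's actual key layout.

```python
from collections import Counter, defaultdict

# Row key -> column -> count, seeded with the "before" values of the
# cells this event touches.
rows = defaultdict(Counter)
rows["22:00"].update({"all": 3221, ":02": 104})
rows["UK"].update({"all": 228, "user01": 1})
rows["UK, 22:00"].update({"all": 1904})
rows["∅"].update({"all": 87314, "UK": 238})


def apply_event(rows, event):
    """Increment every aggregate cell that this event contributes to."""
    hour = event["time"][:2] + ":00"    # "22:02" -> "22:00"
    minute = ":" + event["time"][3:]    # "22:02" -> ":02"
    geo = event["geography"]
    touched = [
        (hour, "all"), (hour, minute),          # per-hour row
        (geo, "all"), (geo, event["cust_id"]),  # per-geography row
        (f"{geo}, {hour}", "all"),              # geography + hour row
        ("∅", "all"), ("∅", geo),               # unfiltered row
    ]
    for row, col in touched:
        rows[row][col] += 1                     # one counter increment each


if __name__ == "__main__":
    event = {"cust_id": "user01", "session_id": 102,
             "geography": "UK", "browser": "IE", "time": "22:02"}
    apply_event(rows, event)
    for key, cols in rows.items():
        print(key, dict(cols))
```

In this sketch a single event costs seven increments, which is the write amplification the talk accepts in exchange for cheap reads.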
31. Rich queries
Arithmetic expressions:
SELECT SUM(x)/(MAX(y) - MIN(y) + 0.5) AS 'spread'
FROM ...
Fast inner joins:
SELECT a - b AS lbound, a + b AS ubound
FROM
(SELECT AVG(score) AS a FROM scores
WHERE year = 2012)
JOIN
(SELECT STDDEV(score) AS b FROM scores)
USING (school)
Time zone support:
SELECT COUNT UNIQUE (visitors) GROUP
BY time(DAY('US/Pacific'))
Hierarchical aggregation:
SELECT SUM(size) FROM ..
WHERE path MATCHES /usr/*
Drill down to raw events:
SELECT DRILL FROM errors WHERE
category IN ("warn", "error")
Limits:
SELECT COUNT (items) FROM ..
GROUP BY category LIMIT 3, country
Query-time filtering:
... HAVING AVG(rating) < 2.0
AND COUNT >= 10
33. Thank You.
Tim Moreton, CTO
@timmoreton
Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.