The document discusses running batch analytics queries on Cassandra databases by using Spark and Shark to directly access the SSTables. Current solutions like running Hive on Cassandra have performance issues. The author's solution uses Spark workers running on Cassandra nodes to read SSTables directly, avoiding the filesystem cache and CQL interface. Performance tests show this approach is 2.5x faster than using the CQL interface and has lower and more predictable query latency, even under write load. The author calls for further development and contributions to the technique.
5. Batch and real-time analytics
* Wherever there is data there are unforeseeable queries
* Real-time databases are optimized for real-time queries
* Large queries may not be possible
* Or will impact your real-time SLA
#CASSANDRAEU
@richardalow
6. Example
* User accounts database
* Read-heavy
* Must be low latency
* Other tables on same database
* Some are write heavy
* A good fit for Cassandra!
7. Example data model
CREATE TABLE user_accounts (
userid uuid PRIMARY KEY,
username text,
email text,
password text,
last_visited timestamp,
country text
);
8. Example data model
SELECT * FROM user_accounts LIMIT 2;
 userid   | country | email               | last_visited        | password | username
----------+---------+---------------------+---------------------+----------+----------
 a03dcf03 | UK      | richard@wentnet.com | 2013-10-07 09:07:36 | td7rjxwp | rlow
 b3f1871e | FR      | jean@yahoo.com      | 2013-08-17 13:07:36 | moh7eksn | jean88
10. Ad-hoc query
“Please can you find all users from Brazil who haven’t
logged in since July and have an email @yahoo.com.
I need the answer by Monday.”
11. Ad-hoc query observations
* We have 500k users from Brazil
* 60MB of raw data
* No way to extract by country from data model
* It’s on unchanging data*
* Can take hours, not days
* No expectation this query will need rerunning
(*) Mostly. Some of the people who haven’t visited for a while may suddenly come back.
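A minimal sketch of why this request forces a full scan: the table is keyed by userid, so nothing narrows the search by country and every row has to be checked. The first two rows are the sample rows from the data model slide; the third row, the `matches` and `full_scan` names, and the 2013-08-01 cutoff (one reading of "since July") are assumptions made for illustration.

```python
from datetime import datetime

# Rows keyed by userid, mirroring user_accounts.
# The third row is a hypothetical Brazilian user added for the example.
ROWS = {
    "a03dcf03": {"country": "UK", "email": "richard@wentnet.com",
                 "last_visited": datetime(2013, 10, 7, 9, 7, 36)},
    "b3f1871e": {"country": "FR", "email": "jean@yahoo.com",
                 "last_visited": datetime(2013, 8, 17, 13, 7, 36)},
    "c0ffee00": {"country": "BR", "email": "ana@yahoo.com",
                 "last_visited": datetime(2013, 6, 2, 18, 0, 0)},
}

CUTOFF = datetime(2013, 8, 1)  # assumed reading of "since July"


def matches(row):
    """True for Brazilian users with a yahoo.com address not seen since CUTOFF."""
    return (row["country"] == "BR"
            and row["last_visited"] < CUTOFF
            and row["email"].endswith("@yahoo.com"))


def full_scan(rows):
    """Visit every row: the primary key (userid) gives no shortcut by country."""
    return sorted(uid for uid, row in rows.items() if matches(row))


result = full_scan(ROWS)
print(result)
```

On a real cluster the same predicate has to be evaluated against every record on every node, which is the cost the options below try to contain.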
12. Why?
* Underrepresented use case in plethora of tools
* Seen days of dev time wasted
* Want to see what can be done
15. Options
* Run Hive query on top of Cassandra
* Will compete with Cassandra for I/O, memory, CPU and network
* Will cause extra GC pressure on Cassandra
* Could flush filesystem cache
17. Options
* Write ETL script and load into another DB
* All custom code
* Single threaded
* Unreliable
* Will still flush cache on Cassandra nodes
19. Options
* Clone the cluster
* Worst possible network load
* Manual import each time
* No incremental update
* Need duplicate hardware
21. Options
* Add ‘batch analytics’ DC and run Hive there
* Initial copy slow and affects real-time performance
* Need duplicate hardware
* Will drop writes when really busy
23. Spark
* Developed by AMPLab
* Distributed computation, like Hadoop
* Designed for iterative algorithms
* Much faster for queries with working sets that fit in RAM
* Reliability from storing lineage rather than intermediate results
* Runs on Mesos or YARN
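The lineage point can be illustrated with a toy model. This is not the Spark API: the `Dataset` class and its methods are hypothetical, and only show the idea that a derived dataset remembers its parent and transformation, so a lost result can be recomputed instead of being restored from a stored copy.

```python
class Dataset:
    """Toy dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source   # base data, only set on the root dataset
        self.parent = parent   # dataset this one was derived from
        self.fn = fn           # transformation applied to the parent's items
        self._cache = None     # materialized result; may be lost

    def map(self, f):
        return Dataset(parent=self, fn=lambda items: [f(x) for x in items])

    def filter(self, pred):
        return Dataset(parent=self, fn=lambda items: [x for x in items if pred(x)])

    def collect(self):
        # Recompute from lineage whenever the materialized result is missing.
        if self._cache is None:
            if self.parent is None:
                self._cache = list(self.source)
            else:
                self._cache = self.fn(self.parent.collect())
        return self._cache

    def lose(self):
        """Simulate losing the in-memory result, e.g. after a worker failure."""
        self._cache = None


base = Dataset(source=range(10))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()
evens_squared.lose()              # the cached result disappears
second = evens_squared.collect()  # rebuilt from the recorded lineage
print(first, second)
```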
24. Spark is used by
Source: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
26. Shark
* Hive on Spark
* Completely compatible with Hive
* Same QL, UDFs and storage handlers
* Can cache tables
CREATE TABLE user_accounts_cached AS
SELECT * FROM user_accounts WHERE
country = 'BR';
28. Shark on Cassandra
* CqlStorageHandler
* Can use existing hive-cassandra storage handler
* Can work well - see Evan Chan’s talk (Ooyala) from #cassandra13
* But suffers from same problems as Hive+Hadoop on Cassandra
29. Shark on Cassandra direct
* SSTableStorageHandler
* Run spark workers on the Cassandra nodes
* Read directly from SSTables in separate JVM
* Limit CPU and memory through Spark/Mesos/YARN
* Limit I/O by rate limiting raw disk access
* Skip filesystem cache
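One way to picture "rate limiting raw disk access" is a reader that sleeps whenever it gets ahead of a byte budget. The slides do not describe how the handler throttles I/O, so the `ThrottledReader` class below is a hypothetical sketch of the general technique, not the author's implementation. The clock and sleep functions are injectable so the behaviour can be checked without waiting.

```python
import io
import time


class ThrottledReader:
    """Wraps a binary file object and caps average read throughput."""

    def __init__(self, raw, bytes_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.raw = raw
        self.rate = float(bytes_per_sec)
        self.clock = clock
        self.sleep = sleep
        self.start = clock()
        self.total = 0  # bytes read so far

    def read(self, n):
        data = self.raw.read(n)
        self.total += len(data)
        # Earliest time at which `total` bytes are allowed to have been read.
        due = self.start + self.total / self.rate
        delay = due - self.clock()
        if delay > 0:
            self.sleep(delay)
        return data


# Usage: read a 64 KiB in-memory "file" at no more than 1 MiB/s.
reader = ThrottledReader(io.BytesIO(b"\0" * 65536), bytes_per_sec=1024 * 1024)
while reader.read(8192):
    pass
print(reader.total)
```

Lowering `bytes_per_sec` trades batch query speed for less interference with real-time reads, which is the knob the later results rely on when the SSTable loader is slowed to the CQL rate.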
30. Cassandra on Spark: through CQL interface
[Diagram: SSTables -> FS cache -> Cassandra JVM (deserialize, merge, serialize) -> Spark worker JVM (deserialize, process). The remote client, served by the same Cassandra JVM, sees latency spikes.]
31. Cassandra on Spark: SSTables direct
[Diagram: SSTables -> Spark worker JVM (deserialize, process), bypassing the FS cache and the Cassandra JVM. The remote client is still served through the FS cache and the Cassandra JVM (deserialize, merge, serialize) and sees constant latency.]
32. Disadvantages
* Equivalent to CL.ONE
* Always runs task local with the data
* Doesn’t read data in memtables
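The memtable limitation can be sketched with a toy model. Cassandra reconciles versions of a cell by write timestamp (last write wins), so a reader that only sees flushed SSTables misses any newer value still held in a memtable. The data layout, the `merge_sstables` function, and the sample values (including old@example.com and the timestamps) are hypothetical. Tombstones are ignored, and the slides do not say how the SSTableStorageHandler merges rows.

```python
def merge_sstables(tables):
    """Merge rows from several tables; for each (key, column) keep the
    cell with the highest write timestamp (last write wins)."""
    merged = {}
    for table in tables:
        for key, columns in table.items():
            row = merged.setdefault(key, {})
            for name, (value, ts) in columns.items():
                if name not in row or ts > row[name][1]:
                    row[name] = (value, ts)
    return merged


# Two flushed SSTables plus a memtable holding a newer, unflushed write.
# Each cell is (value, write_timestamp).
sstable_1 = {"a03dcf03": {"country": ("UK", 100), "email": ("old@example.com", 100)}}
sstable_2 = {"a03dcf03": {"email": ("richard@wentnet.com", 200)}}
memtable = {"a03dcf03": {"country": ("BR", 300)}}

on_disk = merge_sstables([sstable_1, sstable_2])             # what a direct reader sees
full_view = merge_sstables([sstable_1, sstable_2, memtable])  # what Cassandra would serve

print(on_disk["a03dcf03"]["country"][0])    # value visible from SSTables only
print(full_view["a03dcf03"]["country"][0])  # value including the memtable
```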
35. Setup
* Cassandra 1.2.10
* 3 GB heap
* 256 tokens per node
* RF 3
* Preloaded 100M randomly generated records
* Each node started with 9GB of data
* No optimization or tuning
36. Tools
* codahale Metrics
* Ganglia
* Load generator using DataStax Java driver
* Google spreadsheet
37. Result 1
* No Cassandra load
* Run caching query:
CREATE TABLE user_accounts_cached AS
SELECT * FROM user_accounts WHERE
country = 'BR';
* Takes 33 mins through CQL
* Takes 13 mins through SSTables
* 130k records/s
* => SSTables is 2.5x faster
* Even better since CQL has access to both cores
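The headline numbers can be checked with a few lines of arithmetic. This assumes the query scans each of the 100M preloaded records exactly once, which the slides do not state explicitly but which is consistent with the quoted 130k records/s.

```python
cql_minutes = 33
sstable_minutes = 13
records = 100_000_000  # assumption: every preloaded record is scanned once

speedup = cql_minutes / sstable_minutes
records_per_sec = records / (sstable_minutes * 60)

print(f"speedup: {speedup:.2f}x")                         # speedup: 2.54x
print(f"throughput: {records_per_sec:,.0f} records/s")    # throughput: 128,205 records/s
```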
38. Using cached results
* Now have results cached, can run super fast queries
* No I/O or extra memory
* Bounded number of cores
SELECT count(*) FROM user_accounts_cached
WHERE unix_timestamp(last_visited)<
unix_timestamp('2013-08-01 00:00:00') AND
email LIKE '%@c9%';
* Took 18 seconds
39. Result 2
* Add read load
* Read-modify-write of accounts info
* 200 ops/s
* Measure latency
* Slow down SSTable loader to same rate as CQL
41. Analysis
* Average latency 17% lower
* Probably due to less CPU used by query
* Max 95th %ile latency 33% lower and much more predictable
* Possibly due to less GC pressure
* Still have a latency increase over base
* Probably due to I/O use
42. Result 3
* Keep read workload
* Measure same latency
* Add insert workload
* Insert into separate table
* 2500 ops/s
44. Analysis
* Latency is high, but it is high under this write load anyway
45. Performance wrap up
* 2.5x faster with less CPU
=> uses fewer resources to do the same thing
* Lower, more predictable latencies when at same speed
=> controlled resource usage lowers latency impact
* Could limit further to make impact unnoticeable
47. Summary
* Discussed analytics use case not well served by current tools
* Spark, Shark
* SSTableStorageHandler
* Performance results
48. Future
* Needs a name
* Github
* Speak to me if you want to use it
* Speak to me if you want to contribute