The document summarizes a talk on using Apache Cassandra for end-to-end analytics. It gives an overview of Cassandra's analytics capabilities, including Pig and Hive integration; recommends use cases such as trend analysis, detecting problems in data, and backpopulating historical data; and offers tips on data modeling, output formats, and integrating Cassandra with tools like Oozie, Pig, and Hadoop distributions.
3. Why Cassandra?
• Excellent Hadoop capabilities built-in
• Multi-datacenter support, load isolation
• Operationally, order of magnitude simpler
• DSE Analytics is all-in-one, simpler still
• Review your requirements, do homework
C* #cassandra12
4. Use Cases
• Trends, recommendations, reporting, etc.
• Detect and fix problems in data
• New realtime (or analytic) query pattern?
• Backpopulate new CF with historical data
5. Data Model
• Consider growth patterns and analytic query patterns for your data
• One-off inquiry or regular processing?
• Fast growing, want only small slices?
• Consider active/archive CFs
• Secondary indexes for small inputs
6. Miscellaneous Tips
• Don’t forget about tombstones
• BOP (ByteOrderedPartitioner) to enable range slices (rows with keys ‘A*’ to ‘F*’)
7. Cassandra + Oozie
• Workflows: cohesive, nestable, scheduled
• Web UI, CLI, web service
• Cassandra properties go in the Oozie job.properties and workflow.xml
• Writing out to Cassandra: set mapreduce.fileoutputcommitter.marksuccessfuljobs to false
• DSE Analytics works with Oozie 3.2.1+
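As a sketch of the last two bullets, the Cassandra connection info and the output-committer override can go in the map-reduce action's configuration in workflow.xml. The keyspace, property values, and action names below are illustrative, not from the talk:

```xml
<!-- workflow.xml fragment (illustrative values; adapt to your cluster) -->
<action name="load-into-cassandra">
  <map-reduce>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <!-- Cassandra connection info used by ColumnFamilyOutputFormat -->
      <property>
        <name>cassandra.output.keyspace</name>
        <value>analytics</value>
      </property>
      <property>
        <name>cassandra.output.thrift.address</name>
        <value>${cassandraHost}</value>
      </property>
      <!-- Cassandra output has no HDFS directory for a _SUCCESS marker -->
      <property>
        <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
        <value>false</value>
      </property>
    </configuration>
  </map-reduce>
  <ok to="end"/>
  <error to="fail"/>
</action>
```

Property names here follow the C* 1.1-era ConfigHelper settings; check your Cassandra version's ConfigHelper for the exact keys.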
8. Cassandra + Pig
• Data with validators
• Pig tuples (address.name, address.value)
• Data with default validator
• Bag of key, value pairs (tuples)
• Unmarshal with Pygmalion
• Select by regex (e.g. ‘1369*’, ‘link*’)
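A minimal load-side sketch of the above, assuming the Cassandra and Pygmalion jars are on the Pig classpath; the keyspace, column family, and column names are illustrative:

```pig
-- Load a column family and unmarshal named columns with Pygmalion
REGISTER 'pygmalion.jar';
DEFINE CassandraStorage org.apache.cassandra.hadoop.pig.CassandraStorage();
DEFINE FromCassandraBag org.pygmalion.udf.FromCassandraBag();

rows = LOAD 'cassandra://analytics/page_views' USING CassandraStorage();

-- Turn the bag of (name, value) column tuples into a flat tuple of fields
views = FOREACH rows GENERATE key,
          FLATTEN(FromCassandraBag('url,hits', columns))
          AS (url: chararray, hits: long);
```

With the default validator you instead work with the raw bag of (name, value) tuples directly; with validators set, the values arrive already typed.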
9. Cassandra + Pig
• Output to Cassandra
• Can output directly with (key, (name, value), (name, value)...) format
• For tabular data, format output with Pygmalion’s ToCassandraBag
• Use BulkOutputFormat (C* 1.1)
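A store-side sketch of the direct (key, (name, value)...) shape, with illustrative keyspace, column family, and file names; for wide tabular rows, Pygmalion's ToCassandraBag does this re-shaping for you:

```pig
-- Re-shape tabular data into (key, {(column_name, value), ...}) and store
DEFINE CassandraStorage org.apache.cassandra.hadoop.pig.CassandraStorage();

data = LOAD 'hits.tsv' AS (url: chararray, hits: long);

out = FOREACH data GENERATE url,
        TOBAG(TOTUPLE('hits', (chararray)hits));
STORE out INTO 'cassandra://analytics/url_counts' USING CassandraStorage();
```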
10. Cassandra + Pig
• Composite column support (C* 1.0.9+)
• Counter support (C* 1.0.9+)
• Secondary index support for relatively small slices (C* 1.1+)
• Wide row support (C* 1.1+)
• Composite key support (C* 1.1.3+)
11. Cassandra + Hadoop X
• Example: Cassandra + CDH3
• Start with Cassandra ring
• Add NameNode (NN), JobTracker (JT), and an Oozie server
• TaskTracker, DataNode on each node
• Jobs/launching point have Cassandra info
• Segregate out analytics with virtual DC
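A sketch of the virtual-DC idea on C* 1.x, using NetworkTopologyStrategy so analytic reads hit only the Analytics replicas. The keyspace and DC names are illustrative; DC names must match your snitch configuration (e.g. PropertyFileSnitch's cassandra-topology.properties):

```
create keyspace metrics
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {RT : 3, Analytics : 1};
```

Analytics jobs then read at LOCAL_QUORUM (or ONE) against the Analytics DC, isolating their load from the realtime DC.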
12. Current/Future Work
• Cassandra core Hive support (outside of Brisk) (CASSANDRA-4131)
13. For More Information
• Follow @CassandraHadoop on Twitter
• http://wiki.apache.org/cassandra/HadoopSupport
Speaker notes:

Wide row, secondary index (2I), and composite key support are in 1.1; composite column support is in 1.0.9+. Wide row support is in both Pig and Hive.

So I have all this data... All the normal use cases for Hadoop apply, plus some interesting use cases that are specific to things like Cassandra.

If you are going to seriously use Cassandra with Hadoop, your analytic query patterns also have to be considered. For active/archive CFs, denormalize by query, same as with realtime query patterns. You can have lots of CFs, so no worries there.

OOZIE-477: hardcoded goodness if you don’t want to use Oozie 3.2.1 (a tiny patch).

Dachis Group: in production with Cassandra + CDH3 going on 18 months. The launching point may be a server from which you test or submit jobs; on that node, just put the Cassandra information in the default Hadoop conf (mapred-site.xml). For tuning, realize that Cassandra and Hadoop will share each node’s resources, but you can scale one with the other. Both require memory, CPU, and IO.
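A sketch of the "Cassandra information in mapred-site.xml" note above, using the C* 1.1-era ConfigHelper input property names; the address value is illustrative, and the exact keys vary by Cassandra version:

```xml
<!-- mapred-site.xml fragment on the job-launching node -->
<property>
  <name>cassandra.input.thrift.address</name>
  <value>10.0.0.10</value>
</property>
<property>
  <name>cassandra.input.thrift.port</name>
  <value>9160</value>
</property>
<property>
  <name>cassandra.input.partitioner.class</name>
  <value>org.apache.cassandra.dht.RandomPartitioner</value>
</property>
```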