This document discusses using Apache Hadoop for analytics on data stored in Apache Cassandra. It describes how Cassandra is optimized for fast writes and tunable consistency, while Hadoop supports analytics through MapReduce and tools like Pig and Hive. The document provides a recipe for overlaying Hadoop on a Cassandra cluster to leverage data locality for analytics processing. Examples are given of using Hadoop streaming and Cassandra input/output formats to integrate the two systems.
2. BigTable + Dynamo
Semi-structured data model
Decentralized – no special roles
Ridiculously fast writes, fast reads
Tunably consistent
Cross-DC capable
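To make "tunably consistent" concrete, here is a minimal sketch of the usual replica-overlap arithmetic: with N replicas, a read that waits for R replicas is guaranteed to see a write acknowledged by W replicas when R + W > N. The function names are hypothetical and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas, e.g. 2 of 3."""
    return replication_factor // 2 + 1


def read_sees_latest_write(r: int, w: int, n: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n
```

Under these assumptions, quorum reads and quorum writes overlap at a replication factor of 3, while reading and writing at a single replica trades that guarantee for lower latency.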
3. You design your data model based on your query model
Real-time ad-hoc queries aren’t viable
Secondary indexes help (0.7)
What about analytics?
4. Hadoop has analytics
MapReduce
Pig/Hive and other tools built above MapReduce
Configurable data sources/destinations
Many already familiar with it
Active community
5. Always able to output to Cassandra directly
0.6
ColumnFamilyInputFormat
Pig support – Cassandra LoadFunc
0.7
ColumnFamilyOutputFormat
Hadoop Streaming Output
Streamlined configuration
6. Recipe
Overlay Hadoop on top of Cassandra
Separate server for name node and job tracker
Co-locate task trackers with Cassandra nodes
Add data nodes to taste
Voilà
Data locality
Analytics engine scales with data
Example
7. Cassandra specific InputFormat
Configuration – ConfigHelper, Hadoop variables
InputSplits over the data – tunable
Example usage in contrib/word_count
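As a rough illustration of how tunable input splits relate to data locality, the sketch below cuts a row range into chunks of a configurable size and lets a scheduler prefer a task tracker running on one of the hosts that holds a replica. All names (`Split`, `make_splits`, `pick_tracker`) are hypothetical; this models the idea only and is not the ColumnFamilyInputFormat implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Split:
    start: int            # first row index covered (inclusive)
    end: int              # last row index covered (exclusive)
    locations: List[str]  # hosts assumed to hold a replica of this range


def make_splits(total_rows: int, split_size: int, replicas: List[str]) -> List[Split]:
    """Cut [0, total_rows) into chunks of at most split_size rows."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    return [
        Split(start, min(start + split_size, total_rows), list(replicas))
        for start in range(0, total_rows, split_size)
    ]


def pick_tracker(split: Split, idle_trackers: List[str]) -> Optional[str]:
    """Prefer an idle tracker co-located with a replica, else any idle tracker."""
    for host in split.locations:
        if host in idle_trackers:
            return host
    return idle_trackers[0] if idle_trackers else None
```

A smaller `split_size` yields more, shorter map tasks; a larger one yields fewer, longer tasks.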
8. OutputFormat
Configuration – ConfigHelper, Hadoop variables
Batches output – tunable
Don’t have to use Cassandra api
Some optimizations (e.g. ConsistencyLevel.ONE)
Example usage in contrib/word_count
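The idea of batched, tunable output can be sketched as a writer that buffers mutations and sends them in groups. This is an assumption-laden illustration: `BatchingWriter`, its `send` callback, and the `batch_size` parameter are hypothetical names, not the ColumnFamilyOutputFormat API.

```python
from typing import Callable, Dict, List, Tuple

Mutation = Tuple[str, Dict[str, str]]  # (row key, {column name: value})


class BatchingWriter:
    """Buffers mutations and hands them to `send` one batch at a time."""

    def __init__(self, send: Callable[[List[Mutation]], None], batch_size: int = 2):
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self.send = send
        self.batch_size = batch_size
        self.buffer: List[Mutation] = []

    def write(self, key: str, columns: Dict[str, str]) -> None:
        self.buffer.append((key, dict(columns)))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []  # start a fresh list; the sent batch is left untouched

    def close(self) -> None:
        self.flush()  # push out any partial final batch
```

In a reducer, `write` would be called once per output row and `close` once at the end of the task.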
9. 60,000+ Documented UFO Sightings
Data set from http://infochimps.com
sighted_at | reported_at | location | shape | duration | description
19951009 | 19951009 | Iowa City, IA | | | Man repts. Witnessing “flash, followed by a classic UFO, w/ a tailfin at back.” …
19940801 | 19950220 | Renton, WA | | | Man repts. seeing 2x large ships hovering in night sky while using Russian-made night binoculars.
19970111 | 19970111 | St. Cloud, MN | pyramid | 2 min. | Summary : Right when me and my friend left my house we saw a bright green glowing object that looked like a 4 sided pyramid then after about 2 min it took off straight into the sky leaving a yellow trail behind it…
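A record from this data set could be parsed along the following lines. The sketch assumes the file is tab-separated with the six columns named on the slide and that missing trailing fields are simply absent; both are assumptions about the file layout, and `parse_sighting` is a hypothetical helper.

```python
from typing import Dict

FIELDS = ["sighted_at", "reported_at", "location", "shape", "duration", "description"]


def parse_sighting(line: str) -> Dict[str, str]:
    """Split one tab-separated line into a dict keyed by column name."""
    parts = line.rstrip("\n").split("\t")
    parts += [""] * (len(FIELDS) - len(parts))  # pad short rows with empty strings
    return dict(zip(FIELDS, (p.strip() for p in parts)))
```

Rows without a shape or duration come back with empty strings for those fields rather than raising an error.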
10. What about languages outside of Java?
Build on what Hadoop uses - Streaming
Output streaming in 0.7.0
Example in contrib/hadoop_streaming_output
Input streaming in progress, likely 0.7.1
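For readers new to Hadoop Streaming, the contract is that a mapper and a reducer read lines on stdin and write tab-separated key/value lines to stdout. The sketch below counts sightings per shape under that contract. It assumes tab-separated input with the shape in the fourth column; the function names are hypothetical, and the wiring that sends reducer output to Cassandra (as in contrib/hadoop_streaming_output) is not shown.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, TextIO


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Emit 'shape<TAB>1' per sighting; a missing or empty shape becomes 'unknown'."""
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        shape = cols[3].strip() if len(cols) > 3 and cols[3].strip() else "unknown"
        yield f"{shape}\t1"


def reduce_lines(lines: Iterable[str]) -> Iterator[str]:
    """Sum counts per key; relies on reducer input arriving sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for shape, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{shape}\t{sum(int(value) for _, value in group)}"


def main(argv: List[str], stdin: TextIO, stdout: TextIO) -> None:
    """Run the map step when called with 'map', otherwise the reduce step."""
    step = map_lines if argv[1:] == ["map"] else reduce_lines
    for out in step(stdin):
        stdout.write(out + "\n")
```

A script entry point would call `main(sys.argv, sys.stdin, sys.stdout)`, with the same file passed to Hadoop Streaming as both mapper and reducer.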
11. Developed at Yahoo!
PigLatin/Grunt shell
Powerful scripting language for analytics
Example usage in contrib/pig
Configuration – Hadoop/Env variables
12. Raptr.com
Home grown solution -> Cassandra + Hadoop
Query time: hours -> minutes
Pig obviated their need for multi-lingual MR
Speed and ease are enabling
Imagini/Visual DNA
US Government (Digital Reasoning)
See http://github.com/digitalreasoning/PyStratus
13. Hive support in progress (HIVE-1434)
Hadoop Input Streaming (likely 0.7.1)
Performance improvements
14. Hadoop analytics for Cassandra
Data locality for processing
Scales with the cluster
15. More information
http://cassandra.apache.org
http://wiki.apache.org/cassandra/HadoopSupport
Cassandra: The Definitive Guide
About me:
jeremy.hanna@rackspace.com
@jeromatron on Twitter
jeromatron on IRC in #cassandra
Editor's Notes
Talk a little about background of the theme – hippies, The Turtles, readability.
Mention Jeff Hodges, Johan, Stu, and Todd Lipcon.
Mention how InputSplit works and how it can choose among replicas – array of locations returned.
Highlight how this is the same extension point that is used with HDFS, HBase and any other data source/destination for MapReduce.
IOW, are people using this stuff in the real world? In production?
Put some notes in here about raptr and imagini’s use cases.