This document discusses using Pig with Cassandra to perform analytics and data processing tasks. Pig allows running queries over Cassandra data and storing intermediate results in HDFS or Cassandra. Example uses include analytics, data exploration, validation, and correction. Configuration involves splitting the cluster into virtual datacenters and setting properties. Future work includes improving data type handling and adding support for secondary indexes and wide rows.
2. Motivation
- What's our need? How do we get at data in Cassandra with ad-hoc queries?
- Don't reinvent the wheel
3. Enter Pig
- Pig was created at Yahoo! as an abstraction for MapReduce
- Designed to "eat anything"
- A LoadStoreFunc was created for Cassandra
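A minimal sketch of what that LoadStoreFunc looks like in practice — `CassandraStorage` is the class shipped with Cassandra for Pig integration, while the keyspace and column family names here are hypothetical:

```pig
-- Load all rows of a (hypothetical) column family through the Cassandra
-- LoadStoreFunc. Each row comes back as a key plus a bag of (name, value)
-- column tuples. Assumes the Cassandra and Pig jars are on the classpath.
rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily'
       USING org.apache.cassandra.hadoop.pig.CassandraStorage()
       AS (key, columns: bag {T: tuple(name, value)});
DUMP rows;
```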
4. How it works
- Perform queries over all rows in a column family or set of column families
- Intermediate results stored in HDFS or CFS
- Can mix and match inputs and outputs
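Mixing inputs and outputs might look like the following sketch (the keyspace, column family, and paths are hypothetical): read from Cassandra, park intermediate data in HDFS or CFS so it can be inspected, then pick it back up in a later step.

```pig
-- Read source rows out of Cassandra.
rows = LOAD 'cassandra://Shop/Items'
       USING org.apache.cassandra.hadoop.pig.CassandraStorage();

-- Park intermediate results in HDFS (or CFS) to verify between steps.
STORE rows INTO '/tmp/items_raw';

-- ...later, a separate script can resume from the intermediate data.
raw = LOAD '/tmp/items_raw';
```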
5. Uses
- Analytics
- Data exploration: How many items did I get from New Jersey?
- Data validation: How many items were missing a field, and when were they created?
- Data correction: company name correction over all data
- Expand the Cassandra data model: make a new column family for querying by US state and back-populate it with Pig
- Bootstrap a local dev environment
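The "expand the data model" use might be sketched like this, assuming a hypothetical `Items` column family with a `state` column and a target `ItemsByState` column family keyed by state. `CassandraStorage` expects output shaped as a key plus a bag of (name, value) tuples:

```pig
-- Back-populate a new column family keyed by US state (all names hypothetical).
items    = LOAD 'cassandra://Shop/Items'
           USING org.apache.cassandra.hadoop.pig.CassandraStorage()
           AS (key, columns: bag {T: tuple(name, value)});

-- Flatten the column bag and keep only the 'state' columns.
flat     = FOREACH items GENERATE key, FLATTEN(columns) AS (name, value);
states   = FILTER flat BY name == 'state';

-- Re-key by state: each output row is (state, {(original_key, marker)}).
by_state = FOREACH states GENERATE value AS newkey,
           TOBAG(TOTUPLE(key, 'present'));
STORE by_state INTO 'cassandra://Shop/ItemsByState'
      USING org.apache.cassandra.hadoop.pig.CassandraStorage();
```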
6. Pygmalion
- Figure in Greek mythology; sounds like "Pig"
- UDFs and example scripts for using Pig with Cassandra
- Used in production at The Dachis Group
- https://github.com/jeromatron/pygmalion/
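A sketch of how the Pygmalion UDFs are typically used — turning Cassandra's (name, value) column bags into flat, tabular tuples. The UDF class name comes from the project; the column family and field names here are hypothetical, so check the project's README for exact signatures:

```pig
-- Register the Pygmalion jar and define its bag-flattening UDF.
REGISTER 'pygmalion.jar';
DEFINE FromCassandraBag org.pygmalion.udf.FromCassandraBag();

rows    = LOAD 'cassandra://Shop/Items'
          USING org.apache.cassandra.hadoop.pig.CassandraStorage()
          AS (key, columns: bag {T: tuple(name, value)});

-- Pull named columns out of the bag into ordinary tabular fields.
tabular = FOREACH rows GENERATE key,
          FLATTEN(FromCassandraBag('company,state', columns))
          AS (company, state);
```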
8. Tips
- Develop incrementally
- Output intermediate data frequently to verify
- Validate data on input if possible
- Use Cassandra data type validation for inputs and outputs
- Pygmalion for tabular data
- Penny in Pig 0.9!
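The data type validation tip might look like the following cassandra-cli fragment (the column family name is hypothetical): declaring validators on the column family lets typed values flow through to Pig instead of raw bytearrays.

```
-- Hypothetical cassandra-cli definition: validators give the LoadStoreFunc
-- type information for keys, column names, and values.
create column family Items
  with comparator = UTF8Type
  and key_validation_class = UTF8Type
  and default_validation_class = UTF8Type;
```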
9. Cluster Configuration
- Split the cluster into virtual datacenters
- Brisk (built-in Pig support in 1.0 beta 2+)
- Task trackers on all analytic nodes
- With HDFS: separate namenode/jobtracker; data nodes on all analytic nodes; a few settings to bridge the two; start the server processes; distributed cache and intermediate data
- With Brisk: startup includes CFS, the job tracker, and task trackers
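Splitting into virtual datacenters can be sketched with a `cassandra-topology.properties` fragment for the PropertyFileSnitch (the IPs and datacenter names here are hypothetical): real-time nodes go in one datacenter, analytic (Hadoop) nodes in another, so MapReduce load stays off the serving side.

```
# Hypothetical cassandra-topology.properties for PropertyFileSnitch:
# DC1 = real-time serving nodes, DC2 = analytic nodes running task
# trackers (and data nodes, if using HDFS).
10.0.0.1=DC1:RAC1
10.0.0.2=DC1:RAC1
10.0.1.1=DC2:RAC1
10.0.1.2=DC2:RAC1
default=DC1:RAC1
```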
11. Configuration Priorities
- Data locality
- Data locality – no really, it's the biggest performance factor
- Memory needs: Cassandra requires lots of memory; Hadoop requires lots of memory; plan with your data model and analytics in mind
- CPU needs: Cassandra doesn't need a lot of CPU horsepower; Hadoop loves CPU cores
- Interconnect: analytic nodes need to be close to one another
13. Future Work
- Better data type handling (CASSANDRA-2777)
- MapReduce over subsets of rows (CASSANDRA-1600)
- MapReduce over secondary indexes (CASSANDRA-1600)
- Pig pushdown projection
- Pig pushdown filters
- HCatalog support for Cassandra
- Better Cassandra wide-row support (CASSANDRA-2688)
- Support for immutable/snapshot inputs (CASSANDRA-2527)
14. Questions
Contact info:
- Jeremy Hanna
- @jeromatron on Twitter
- jeremy.hanna1234 <at> gmail
- jeromatron on IRC (in #cassandra, #hadoop-pig, and #hadoop)
Editor's notes
Make this section interactive:
- How many are using Cassandra – find out why, and what types of data
- How many are using Hadoop – what types of data
- What would they like to get from that data?