End-to-end Analytics with Apache Cassandra


  1. End-to-end Analytics with Apache Cassandra (C* #cassandra12)
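  2. Basics
     • CFIF/CFOF/CFRR/CFRW (ColumnFamilyInputFormat, ColumnFamilyOutputFormat, ColumnFamilyRecordReader, ColumnFamilyRecordWriter)
     • BulkOutputFormat
     • Input locality (identical to HDFS)
     • Wide row, secondary index (2I), and composite support
     • Pig, Hive, Mahout, Sqoop, Oozie, DSE Analytics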
  3. Why Cassandra?
     • Excellent Hadoop capabilities built-in
     • Multi-datacenter support, load isolation
     • Operationally, an order of magnitude simpler
     • DSE Analytics is all-in-one, simpler still
     • Review your requirements, do your homework
  4. Use Cases
     • Trends, recommendations, reporting, etc.
     • Detect and fix problems in data
     • New realtime (or analytic) query pattern?
     • Backpopulate new CF with historical data
  5. Data Model
     • Consider growth patterns and analytic query patterns for your data
     • One-off inquiry or regular processing?
     • Fast growing, want only small slices?
     • Consider active/archive CFs
     • Secondary indexes for small inputs
  6. Miscellaneous Tips
     • Don’t forget about tombstones
     • Use BOP (ByteOrderedPartitioner) to enable range slices (rows with keys ‘A*’ to ‘F*’)
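
The range-slice tip above can be illustrated with a minimal sketch. It is plain Python, not Cassandra code: the key names and the `range_slice` helper are hypothetical, and MD5 stands in for the hash-based token ordering of a random partitioner. It only shows why a contiguous key range such as ‘A*’ to ‘F*’ is cheap to read when rows are laid out in key order, and why that is not the case when they are laid out by hash.

```python
import bisect
import hashlib

def range_slice(sorted_keys, start, end):
    """Return keys k with start <= k < end from a list sorted by raw key."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]

# Hypothetical row keys, for illustration only.
row_keys = ["Alice", "Bob", "Frank", "Greta", "Zed", "Carol"]

# Order-preserving placement: rows sit in key order, so keys starting
# with 'A' through 'F' form one contiguous run.
ordered = sorted(row_keys)
a_to_f = range_slice(ordered, "A", "G")
print(a_to_f)

# Hash-based placement: rows sit in token order (MD5 used as a stand-in),
# so the same keys are scattered and no contiguous run exists to scan.
hashed = sorted(row_keys, key=lambda k: hashlib.md5(k.encode()).hexdigest())
print(hashed)
```

The upper bound is `"G"` so that every key beginning with `F` falls inside the slice.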
  7. Cassandra + Oozie
     • Workflows: cohesive, nestable, scheduled
     • Web UI, CLI, web service
     • Cassandra properties go in the Oozie job properties and workflow.xml
     • Writing out to Cassandra: set mapreduce.fileoutputcommitter.marksuccessfuljobs to false
     • DSE Analytics works with Oozie 3.2.1+
  8. Cassandra + Pig
     • Data with validators
     • Pig tuples (address.name, address.value)
     • Data with default validator
     • Bag of key, value pairs (tuples)
     • Unmarshal with Pygmalion
     • Select by regex (e.g. ‘1369*’, ‘link*’)
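
As a rough illustration of the "bag of key, value pairs" shape and of selecting columns by pattern, here is a small sketch in plain Python. It is not Pig or Pygmalion code. The row contents and the `select_columns` helper are hypothetical, and it assumes the slide's patterns are meant as prefix matches, written here as the regexes `link.*` and `1369.*`.

```python
import re

def select_columns(columns, pattern):
    """Keep (name, value) tuples whose column name fully matches the regex."""
    compiled = re.compile(pattern)
    return [(name, value) for name, value in columns if compiled.fullmatch(name)]

# A row as (key, bag of (name, value) tuples); the data is made up.
row = ("user42", [("link_home", "http://example.com"),
                  ("link_blog", "http://example.org"),
                  ("1369000000", "login"),
                  ("name", "Ada")])
key, columns = row

links = select_columns(columns, r"link.*")    # columns named link...
events = select_columns(columns, r"1369.*")   # columns named 1369...
print(key, links, events)
```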
  9. Cassandra + Pig
     • Output to Cassandra
     • Can output directly with (key, (name, value), (name, value)...) format
     • For tabular data, format output with Pygmalion’s ToCassandraBag
     • Use BulkOutputFormat (C* 1.1)
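
The (key, (name, value), (name, value)...) output shape can be sketched as follows. This is plain Python that mirrors the shape only. It is not Pygmalion's ToCassandraBag, and the `to_cassandra_row` helper and sample record are hypothetical.

```python
def to_cassandra_row(key, record):
    """Shape a row key and a dict of fields as (key, (name, value), ...)."""
    return (key,) + tuple((name, value) for name, value in record.items())

# A tabular record: field names become column names, in insertion order.
out = to_cassandra_row("user42", {"name": "Ada", "city": "London"})
print(out)  # ('user42', ('name', 'Ada'), ('city', 'London'))
```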
  10. Cassandra + Pig
     • Composite column support (C* 1.0.9+)
     • Counter support (C* 1.0.9+)
     • Secondary index support for relatively small slices (C* 1.1+)
     • Wide row support (C* 1.1+)
     • Composite key support (C* 1.1.3+)
  11. Cassandra + Hadoop X
     • Example: Cassandra + CDH3
     • Start with Cassandra ring
     • Add NameNode (NN), JobTracker (JT), Oozie server
     • TaskTracker, DataNode on each node
     • Jobs/launching point have Cassandra info
     • Segregate out analytics with virtual DC
  12. Current/Future Work
     • Cassandra core Hive support (outside of Brisk) (CASSANDRA-4131)
  13. For More Information
     • Follow @CassandraHadoop on Twitter
     • http://wiki.apache.org/cassandra/HadoopSupport
  14. Questions?
