End-to-end Analytics with Apache Cassandra

Published in: Technology

  • Wide row, secondary index (2I), and composite key support are in Cassandra 1.1. Composite column support is in 1.0.9+. Wide row support is in both Pig and Hive.
  • So I have all this data... All the normal use cases for Hadoop apply, plus some interesting ones that are specific to systems like Cassandra.
  • If you are going to use Cassandra with Hadoop seriously, you also have to consider your analytic query patterns. For active/archive column families (CFs), denormalize by query, the same as with realtime (RT) query patterns. You can have lots of CFs, so no worries there.
  • OOZIE-477: if you don't want to use Oozie 3.2.1, there is a tiny patch with hardcoded goodness.
  • Dachis Group: in production with Cassandra + CDH3 for going on 18 months. The launching point may be a server from which you test or submit jobs. On that node, just put the Cassandra information in the default Hadoop conf (mapred-site.xml). For tuning, keep in mind that Cassandra and Hadoop share each node's resources, but that you can scale one with the other. Both require memory, CPU, and IO.

    1. End-to-end Analytics with Apache Cassandra (C* #cassandra12)
    2. Basics
       • CFIF/CFOF/CFRR/CFRW (ColumnFamilyInputFormat, ColumnFamilyOutputFormat, ColumnFamilyRecordReader, ColumnFamilyRecordWriter)
       • BulkOutputFormat
       • Input locality (identical to HDFS)
       • Wide row, secondary index (2I), and composite support
       • Pig, Hive, Mahout, Sqoop, Oozie, DSE Analytics
    3. Why Cassandra?
       • Excellent Hadoop capabilities built in
       • Multi-datacenter support, load isolation
       • Operationally an order of magnitude simpler
       • DSE Analytics is all-in-one, simpler still
       • Review your requirements, do your homework
    4. Use Cases
       • Trends, recommendations, reporting, etc.
       • Detect and fix problems in data
       • New realtime (or analytic) query pattern?
       • Backpopulate a new CF with historical data
    5. Data Model
       • Consider growth patterns and analytic query patterns for your data
       • One-off inquiry or regular processing?
       • Fast growing, want only small slices?
       • Consider active/archive CFs
       • Secondary indexes for small inputs
    6. Miscellaneous Tips
       • Don't forget about tombstones
       • Use ByteOrderedPartitioner (BOP) to enable range slices (rows with keys 'A*' to 'F*')
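Why an order-preserving partitioner makes a key range such as 'A*' to 'F*' sliceable can be sketched in a few lines of Python. This is a toy model, not Cassandra code: `byte_ordered_token`, `hashed_token`, and `range_slice` are hypothetical helpers, and the MD5-based function is only a simplified stand-in for a hash partitioner, not its exact token computation.

```python
import hashlib


def byte_ordered_token(key: str) -> bytes:
    """Order-preserving token: the raw key bytes (toy model of a byte-ordered partitioner)."""
    return key.encode("utf-8")


def hashed_token(key: str) -> int:
    """Simplified stand-in for a hash partitioner: the MD5 digest read as an integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def range_slice(keys, token_fn, start, end):
    """Return keys whose token falls in [token(start), token(end)), in token order."""
    lo, hi = token_fn(start), token_fn(end)
    return [k for k in sorted(keys, key=token_fn) if lo <= token_fn(k) < hi]


keys = ["Alice", "Bob", "Carol", "Eve", "Frank", "Grace", "Zed"]

# Byte order keeps the 'A*'..'F*' rows contiguous, so one token range covers them all.
bop_rows = range_slice(keys, byte_ordered_token, "A", "G")

# Hash order scatters the same rows, so the token range between hash("A") and
# hash("G") has no relationship to the key prefixes.
hashed_rows = range_slice(keys, hashed_token, "A", "G")
```

With byte ordering, `bop_rows` holds exactly the keys that start with A through F. Under hashing, `hashed_rows` is an arbitrary subset, which is why a prefix range is not meaningful there.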
    7. Cassandra + Oozie
       • Workflows: cohesive, nestable, scheduled
       • Web UI, CLI, web service
       • Cassandra properties go in the Oozie job properties and workflow.xml
       • Writing out to Cassandra: set mapreduce.fileoutputcommitter.marksuccessfuljobs to false
       • DSE Analytics works with Oozie 3.2.1+
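As a rough illustration of the Oozie slide, the Python sketch below renders a Hadoop-style `<configuration>` fragment of the kind that sits inside a workflow.xml action. Only `mapreduce.fileoutputcommitter.marksuccessfuljobs` comes from the slide. The `cassandra.output.*` property names are assumptions modeled on Cassandra's Hadoop configuration helper and should be checked against your Cassandra version. The keyspace, column family, and address values are hypothetical, and `build_configuration` is a made-up helper.

```python
import xml.etree.ElementTree as ET


def build_configuration(props: dict) -> str:
    """Render name/value pairs as a Hadoop-style <configuration> XML fragment."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


props = {
    # From the slide: disable the _SUCCESS marker when writing out to Cassandra.
    "mapreduce.fileoutputcommitter.marksuccessfuljobs": "false",
    # Assumed property names and hypothetical values; verify before use.
    "cassandra.output.keyspace": "analytics",
    "cassandra.output.columnfamily": "daily_trends",
    "cassandra.output.thrift.address": "localhost",
    "cassandra.output.thrift.port": "9160",
    "cassandra.output.partitioner.class": "org.apache.cassandra.dht.RandomPartitioner",
}

xml_text = build_configuration(props)
```

Generating the fragment from one dictionary keeps the job properties and the workflow action in sync, which is one way to avoid the two drifting apart.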
    8. Cassandra + Pig
       • Data with validators
       • Pig tuples (address.name, address.value)
       • Data with default validator
       • Bag of key, value pairs (tuples)
       • Unmarshal with Pygmalion
       • Select by regex (e.g. '1369*', 'link*')
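The row shape described on this slide, a key plus a bag of (name, value) tuples, and the idea of selecting columns by a pattern such as 'link*' can be mimicked in plain Python. This is not Pig or Pygmalion code: `select_columns` is a hypothetical helper, the sample row is invented, and treating the pattern as a shell-style wildcard is an assumption.

```python
from fnmatch import fnmatchcase

# A default-validator row modeled as (key, bag of (name, value) tuples).
row = (
    "user42",
    [
        ("link_a", "http://a.example"),
        ("link_b", "http://b.example"),
        ("1369_views", "10"),
        ("name", "Ada"),
    ],
)


def select_columns(row, pattern):
    """Keep only the columns whose name matches a wildcard pattern such as 'link*'."""
    key, columns = row
    return key, [(name, value) for name, value in columns if fnmatchcase(name, pattern)]


links = select_columns(row, "link*")
views = select_columns(row, "1369*")
```

Here `links` keeps the two `link_` columns and `views` keeps only `1369_views`, each still paired with the row key.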
    9. Cassandra + Pig
       • Output to Cassandra
       • Can output directly with (key, (name, value), (name, value)...) format
       • For tabular data, format output with Pygmalion's ToCassandraBag
       • Use BulkOutputFormat (C* 1.1)
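The output shape named on this slide, (key, (name, value), (name, value)...), can be illustrated by reshaping a tabular record in Python. This is a sketch of the data transformation only, not Pygmalion's implementation: `to_cassandra_row` is a hypothetical function, and taking the first field as the row key is an assumption about how tabular data would be mapped.

```python
def to_cassandra_row(field_names, record):
    """Reshape a flat record into (key, (name, value), (name, value), ...)."""
    if len(field_names) != len(record):
        raise ValueError("field_names and record must have the same length")
    key = record[0]  # assumption: the first field is the row key
    columns = tuple(zip(field_names[1:], record[1:]))
    return (key,) + columns


shaped = to_cassandra_row(["id", "name", "city"], ("u1", "Ada", "London"))
```

The remaining field names become column names, so `shaped` is `("u1", ("name", "Ada"), ("city", "London"))`.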
    10. Cassandra + Pig
       • Composite column support (C* 1.0.9+)
       • Counter support (C* 1.0.9+)
       • Secondary index support for relatively small slices (C* 1.1+)
       • Wide row support (C* 1.1+)
       • Composite key support (C* 1.1.3+)
    11. Cassandra + Hadoop X
       • Example: Cassandra + CDH3
       • Start with Cassandra ring
       • Add NameNode (NN), JobTracker (JT), Oozie server
       • TaskTracker, DataNode on each node
       • Jobs/launching point have Cassandra info
       • Segregate out analytics with a virtual datacenter (DC)
    12. Current/Future Work
       • Cassandra core Hive support (outside of Brisk) (CASSANDRA-4131)
    13. For More Information
       • Follow @CassandraHadoop on Twitter
       • http://wiki.apache.org/cassandra/HadoopSupport
    14. Questions?