SlideShare ist ein Scribd-Unternehmen logo
1 von 14
End-to-end Analytics
          with Apache Cassandra




C*                                #cassandra12
Basics
     • CFIF/CFOF/CFRR/CFRW
     • BulkOutputFormat
     • Input locality (identical to HDFS)
     • Wide row, 2I, composite support
     • Pig, Hive, Mahout, Sqoop, Oozie, DSE
       Analytics

C*                                  #cassandra12
Why Cassandra?

     • Excellent Hadoop capabilities built-in
     • Multi-datacenter support, load isolation
     • Operationally, order of magnitude simpler
     • DSE Analytics is all-in-one, simpler still
     • Review your requirements, do homework
C*                                       #cassandra12
Use Cases

     • Trends, recommendations, reporting, etc.
     • Detect and fix problems in data
     • New realtime (or analytic) query pattern?
      • Backpopulate new CF with historical data

C*                                    #cassandra12
Data Model
     • Consider growth patterns and analytic
       query patterns for your data
      • One-off inquiry or regular processing?
      • Fast growing, want only small slices?
     • Consider active/archive CFs
     • Secondary indexes for small inputs
C*                                      #cassandra12
Miscellaneous Tips


     • Don’t forget about tombstones
     • BOP to enable range slices (Rows with
       keys ‘A*’ to ‘F*’)




C*                                     #cassandra12
Cassandra + Oozie
     • Workflows: cohesive, nestable, scheduled
     • Web UI, CLI, web service
     • Cassandra properties in oozie job
       properties, and workflow.xml
     • Writing out to Cassandra:
       mapreduce.fileoutputcommitter.marksuccessfuljobs to false


     • DSE Analytics works with Oozie 3.2.1+
C*                                                            #cassandra12
Cassandra + Pig
     • Data with validators
      • Pig tuples (address.name, address.value)
     • Data with default validator
      • Bag of key, value pairs (tuples)
      • Unmarshal with Pygmalion
        • Select by regex (eg ‘1369*’, ‘link*’)
C*                                      #cassandra12
Cassandra + Pig
     • Output to Cassandra
      • Can output directly with (key, (name,
         value), (name, value)...) format
      • For tabular data, format output with
         Pygmalion’s ToCassandraBag
      • Use BulkOutputFormat (C* 1.1)
C*                                          #cassandra12
Cassandra + Pig
     • Composite column support (C* 1.0.9+)
     • Counter support (C* 1.0.9+)
     • Secondary Index support for relatively
       small slices (C* 1.1+)
     • Wide row support (C* 1.1+)
     • Composite key support (C* 1.1.3+)
C*                                     #cassandra12
Cassandra + Hadoop X
     • Example: Cassandra + CDH3
      • Start with Cassandra ring
      • Add NN, JT, Oozie server
      • TaskTracker, DataNode on each node
      • Jobs/launching point have Cassandra info
      • Segregate out analytics with virtual DC
C*                                     #cassandra12
Current/Future Work


     • Cassandra core Hive support (outside of
       Brisk) (CASSANDRA-4131)




C*                                     #cassandra12
For More Information


     • Follow @CassandraHadoop on Twitter
     • http://wiki.apache.org/cassandra/
       HadoopSupport




C*                                  #cassandra12
Questions?




C*                #cassandra12

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Nextag talk
Nextag talkNextag talk
Nextag talk
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-Delhi
 
מיכאל
מיכאלמיכאל
מיכאל
 
Hadoop sqoop
Hadoop sqoop Hadoop sqoop
Hadoop sqoop
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 
The Meta of Hadoop - COMAD 2012
The Meta of Hadoop - COMAD 2012The Meta of Hadoop - COMAD 2012
The Meta of Hadoop - COMAD 2012
 
Cloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLCloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQL
 
Amebaサービスのログ解析基盤
Amebaサービスのログ解析基盤Amebaサービスのログ解析基盤
Amebaサービスのログ解析基盤
 
Apache drill
Apache drillApache drill
Apache drill
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
 
Partners in Crime: Cassandra Analytics and ETL with Hadoop
Partners in Crime: Cassandra Analytics and ETL with HadoopPartners in Crime: Cassandra Analytics and ETL with Hadoop
Partners in Crime: Cassandra Analytics and ETL with Hadoop
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMS
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!
 
Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
 
Hadoop
HadoopHadoop
Hadoop
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Real Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeReal Time and Big Data – It’s About Time
Real Time and Big Data – It’s About Time
 

Andere mochten auch

Andere mochten auch (9)

Plagger はじめよう
Plagger はじめようPlagger はじめよう
Plagger はじめよう
 
Friends Don't Let Friends Clap on One and Three: a Backbeat Clapping Study
Friends Don't Let Friends Clap on One and Three: a Backbeat Clapping StudyFriends Don't Let Friends Clap on One and Three: a Backbeat Clapping Study
Friends Don't Let Friends Clap on One and Three: a Backbeat Clapping Study
 
Build Your Brand and Build Your Business
Build Your Brand and Build Your BusinessBuild Your Brand and Build Your Business
Build Your Brand and Build Your Business
 
ISUCONの勝ち方 YAPC::Asia Tokyo 2015
ISUCONの勝ち方 YAPC::Asia Tokyo 2015ISUCONの勝ち方 YAPC::Asia Tokyo 2015
ISUCONの勝ち方 YAPC::Asia Tokyo 2015
 
Design process
Design processDesign process
Design process
 
El Lenguaje Fotografico
El Lenguaje FotograficoEl Lenguaje Fotografico
El Lenguaje Fotografico
 
Enduring CSS
Enduring CSSEnduring CSS
Enduring CSS
 
120 Awesome Marketing Stats, Charts and Graphs
120 Awesome Marketing Stats, Charts and Graphs120 Awesome Marketing Stats, Charts and Graphs
120 Awesome Marketing Stats, Charts and Graphs
 
DISPLAY LUMAscape
DISPLAY LUMAscapeDISPLAY LUMAscape
DISPLAY LUMAscape
 

Ähnlich wie End-to-end Analytics with Apache Cassandra

"Real-time data processing with Spark & Cassandra", jDays 2015 Speaker: "Duy-...
"Real-time data processing with Spark & Cassandra", jDays 2015 Speaker: "Duy-..."Real-time data processing with Spark & Cassandra", jDays 2015 Speaker: "Duy-...
"Real-time data processing with Spark & Cassandra", jDays 2015 Speaker: "Duy-...
hamidsamadi
 
Developing with Cassandra
Developing with CassandraDeveloping with Cassandra
Developing with Cassandra
Sperasoft
 

Ähnlich wie End-to-end Analytics with Apache Cassandra (20)

PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
 
Introduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and HadoopIntroduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and Hadoop
 
Cassandra & Spark for IoT
Cassandra & Spark for IoTCassandra & Spark for IoT
Cassandra & Spark for IoT
 
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupDataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
 
Deep Dive into Cassandra
Deep Dive into CassandraDeep Dive into Cassandra
Deep Dive into Cassandra
 
NoSQL Intro with cassandra
NoSQL Intro with cassandraNoSQL Intro with cassandra
NoSQL Intro with cassandra
 
Spark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataSpark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational Data
 
Cassandra 2.0 (Introduction)
Cassandra 2.0 (Introduction)Cassandra 2.0 (Introduction)
Cassandra 2.0 (Introduction)
 
Apache Cassandra introduction
Apache Cassandra introductionApache Cassandra introduction
Apache Cassandra introduction
 
"Real-time data processing with Spark & Cassandra", jDays 2015 Speaker: "Duy-...
"Real-time data processing with Spark & Cassandra", jDays 2015 Speaker: "Duy-..."Real-time data processing with Spark & Cassandra", jDays 2015 Speaker: "Duy-...
"Real-time data processing with Spark & Cassandra", jDays 2015 Speaker: "Duy-...
 
Spark Introduction
Spark IntroductionSpark Introduction
Spark Introduction
 
Spring one2gx2010 spring-nonrelational_data
Spring one2gx2010 spring-nonrelational_dataSpring one2gx2010 spring-nonrelational_data
Spring one2gx2010 spring-nonrelational_data
 
Spark cassandra integration, theory and practice
Spark cassandra integration, theory and practiceSpark cassandra integration, theory and practice
Spark cassandra integration, theory and practice
 
Hindsight is 20/20: MySQL to Cassandra
Hindsight is 20/20: MySQL to CassandraHindsight is 20/20: MySQL to Cassandra
Hindsight is 20/20: MySQL to Cassandra
 
C* Summit 2013 - Hindsight is 20/20. MySQL to Cassandra by Michael Kjellman
C* Summit 2013 - Hindsight is 20/20. MySQL to Cassandra by Michael KjellmanC* Summit 2013 - Hindsight is 20/20. MySQL to Cassandra by Michael Kjellman
C* Summit 2013 - Hindsight is 20/20. MySQL to Cassandra by Michael Kjellman
 
Munich March 2015 - Cassandra + Spark Overview
Munich March 2015 -  Cassandra + Spark OverviewMunich March 2015 -  Cassandra + Spark Overview
Munich March 2015 - Cassandra + Spark Overview
 
Cassandra integrations
Cassandra integrationsCassandra integrations
Cassandra integrations
 
Trivadis TechEvent 2016 Big Data Cassandra, wieso brauche ich das? by Jan Ott
Trivadis TechEvent 2016 Big Data Cassandra, wieso brauche ich das? by Jan OttTrivadis TechEvent 2016 Big Data Cassandra, wieso brauche ich das? by Jan Ott
Trivadis TechEvent 2016 Big Data Cassandra, wieso brauche ich das? by Jan Ott
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
 
Developing with Cassandra
Developing with CassandraDeveloping with Cassandra
Developing with Cassandra
 

Mehr von Jeremy Hanna (9)

Göteborg Distributed: Eventual Consistency in Apache Cassandra
Göteborg Distributed: Eventual Consistency in Apache CassandraGöteborg Distributed: Eventual Consistency in Apache Cassandra
Göteborg Distributed: Eventual Consistency in Apache Cassandra
 
Apache Cassandra in the Real World
Apache Cassandra in the Real WorldApache Cassandra in the Real World
Apache Cassandra in the Real World
 
Apache Cassandra in the Real World
Apache Cassandra in the Real WorldApache Cassandra in the Real World
Apache Cassandra in the Real World
 
Modern Cassandra for Developers
Modern Cassandra for DevelopersModern Cassandra for Developers
Modern Cassandra for Developers
 
Troubleshooting Cassandra
Troubleshooting CassandraTroubleshooting Cassandra
Troubleshooting Cassandra
 
Cassandra + Hadoop: Analisi Batch con Apache Cassandra
Cassandra + Hadoop: Analisi Batch con Apache CassandraCassandra + Hadoop: Analisi Batch con Apache Cassandra
Cassandra + Hadoop: Analisi Batch con Apache Cassandra
 
Cassandra eu
Cassandra euCassandra eu
Cassandra eu
 
Cassandra + Hadoop @ApacheCon
Cassandra + Hadoop @ApacheCon Cassandra + Hadoop @ApacheCon
Cassandra + Hadoop @ApacheCon
 
Cassandra+Hadoop
Cassandra+HadoopCassandra+Hadoop
Cassandra+Hadoop
 

Kürzlich hochgeladen

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Kürzlich hochgeladen (20)

Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 

End-to-end Analytics with Apache Cassandra

  • 1. End-to-end Analytics with Apache Cassandra C* #cassandra12
  • 2. Basics • CFIF/CFOF/CFRR/CFRW • BulkOutputFormat • Input locality (identical to HDFS) • Wide row, 2I, composite support • Pig, Hive, Mahout, Sqoop, Oozie, DSE Analytics C* #cassandra12
  • 3. Why Cassandra? • Excellent Hadoop capabilities built-in • Multi-datacenter support, load isolation • Operationally, order of magnitude simpler • DSE Analytics is all-in-one, simpler still • Review your requirements, do homework C* #cassandra12
  • 4. Use Cases • Trends, recommendations, reporting, etc. • Detect and fix problems in data • New realtime (or analytic) query pattern? • Backpopulate new CF with historical data C* #cassandra12
  • 5. Data Model • Consider growth patterns and analytic query patterns for your data • One-off inquiry or regular processing? • Fast growing, want only small slices? • Consider active/archive CFs • Secondary indexes for small inputs C* #cassandra12
  • 6. Miscellaneous Tips • Don’t forget about tombstones • BOP to enable range slices (Rows with keys ‘A*’ to ‘F*’) C* #cassandra12
  • 7. Cassandra + Oozie • Workflows: cohesive, nestable, scheduled • Web UI, CLI, web service • Cassandra properties in oozie job properties, and workflow.xml • Writing out to Cassandra: mapreduce.fileoutputcommitter.marksuccessfuljobs to false • DSE Analytics works with Oozie 3.2.1+ C* #cassandra12
  • 8. Cassandra + Pig • Data with validators • Pig tuples (address.name, address.value) • Data with default validator • Bag of key, value pairs (tuples) • Unmarshal with Pygmalion • Select by regex (eg ‘1369*’, ‘link*’) C* #cassandra12
  • 9. Cassandra + Pig • Output to Cassandra • Can output directly with (key, (name, value), (name, value)...) format • For tabular data, format output with Pygmalion’s ToCassandraBag • Use BulkOutputFormat (C* 1.1) C* #cassandra12
  • 10. Cassandra + Pig • Composite column support (C* 1.0.9+) • Counter support (C* 1.0.9+) • Secondary Index support for relatively small slices (C* 1.1+) • Wide row support (C* 1.1+) • Composite key support (C* 1.1.3+) C* #cassandra12
  • 11. Cassandra + Hadoop X • Example: Cassandra + CDH3 • Start with Cassandra ring • Add NN, JT, Oozie server • TaskTracker, DataNode on each node • Jobs/launching point have Cassandra info • Segregate out analytics with virtual DC C* #cassandra12
  • 12. Current/Future Work • Cassandra core Hive support (outside of Brisk) (CASSANDRA-4131) C* #cassandra12
  • 13. For More Information • Follow @CassandraHadoop on Twitter • http://wiki.apache.org/cassandra/ HadoopSupport C* #cassandra12
  • 14. Questions? C* #cassandra12

Hinweis der Redaktion

  1. \n
  2. wide row, 2I and composite key support in 1.1.\ncomposite column support in 1.0.9+\nwide row support is in both pig and hive\n
  3. \n
  4. So I have all this data...\nAll the normal use cases for Hadoop, plus some interesting use cases that are specific to things like Cassandra.\n
  5. If you are going to seriously use Cassandra with Hadoop, your analytic query patterns also have to be considered.\nFor active/archive CFs, denormalize by query - same as with RT query patterns\nYou can have lots of CFs, so no worries there\n
  6. \n
  7. OOZIE-477: hardcoded goodness if don’t want to use Oozie 3.2.1 (tiny patch)\n
  8. \n
  9. \n
  10. \n
  11. Dachis Group: in production with Cassandra + CDH3 going on 18 months.\nLaunching point may be a server from which you test or submit jobs. On that node, just put the Cassandra information in the default Hadoop conf (mapred-site.xml).\nFor tuning, just realize that Cassandra and Hadoop will share each node’s resources, but that you can scale one with the other. Both require memory, CPU, and IO.\n
  12. \n
  13. \n
  14. \n