Cassandra+Hadoop

This document discusses using MapReduce with Cassandra. It describes how writing to Cassandra from MapReduce has always been possible, while reading was enabled starting with Cassandra 0.6.x. Using MapReduce with Cassandra provides analytics capabilities and avoids single points of failure compared to MapReduce with HBase. The document covers setup and configuration considerations like locality, and provides examples of a separate cluster approach and hybrid cluster approach. It also outlines future work like improving output to Cassandra and adding Hive support.

Two Aspects
MapReduce
Pig

MR + Cassandra - History
Writing to Cassandra - always been possible
Cassandra 0.6.x enables reading data
Uses its own InputSplit, InputFormat, RecordReader
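
The history slide names the three Hadoop extension points but not how they divide the work. The following is a minimal conceptual sketch in plain Python, not the actual Cassandra classes. It assumes the ring is cut into token ranges, that each split carries one range plus the hosts holding replicas of it, and that a reader yields the rows whose token falls in that range. Every name here (`Split`, `make_splits`, `read_split`), the simplified replica placement rule, and the toy tokens and rows are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class Split:
    """One unit of map work: a token range plus the hosts that store it."""
    start: int          # exclusive
    end: int            # inclusive
    hosts: Tuple[str, ...]


def make_splits(ring: Dict[int, str], replicas: int = 2) -> List[Split]:
    """Build one split per token range of a toy ring.

    `ring` maps each node's token to its host name. The range ending at a
    node's token is assumed to be stored on that node and on the next
    `replicas - 1` nodes clockwise (a simplified placement rule).
    """
    tokens = sorted(ring)
    splits = []
    for i, end in enumerate(tokens):
        start = tokens[i - 1]  # wraps around to the last token when i == 0
        hosts = tuple(ring[tokens[(i + k) % len(tokens)]] for k in range(replicas))
        splits.append(Split(start, end, hosts))
    return splits


def in_range(token: int, split: Split) -> bool:
    """True if `token` lies in (start, end], allowing for the wrap-around range."""
    if split.start < split.end:
        return split.start < token <= split.end
    return token > split.start or token <= split.end


def read_split(rows: Dict[str, dict], split: Split, token_of) -> Iterator[Tuple[str, dict]]:
    """Play the record reader's role: yield (row key, columns) for one split."""
    for key in sorted(rows):
        if in_range(token_of(key), split):
            yield key, rows[key]


# Hypothetical three-node ring and three rows with hand-picked tokens.
ring = {0: "node-a", 100: "node-b", 200: "node-c"}
rows = {"k1": {"text": "hello"}, "k2": {"text": "world"}, "k3": {"text": "again"}}
toy_tokens = {"k1": 50, "k2": 150, "k3": 250}

splits = make_splits(ring)
read = [key for s in splits for key, _ in read_split(rows, s, toy_tokens.get)]
```

In this toy model every row is read by exactly one split, and each split knows which hosts could serve it locally, which is the information a scheduler needs for data locality.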

Why MR + Cassandra?
Cassandra is a great data store, but what about analytics? MapReduce!
Arguable win over MapReduce + HBase: no single point of failure (SPOF)

Setup and Configuration
Job/Task Trackers
On already established cluster
Overlays Cassandra cluster
Hybrid
Locality
Gives data's host information to job tracker
Configure both topologies - Cassandra + Hadoop

A Separate Cluster

A Complete Overlay
Separate Job Tracker
Task Trackers collocated with Cassandra nodes
- Bonus - Data locality!

A Hybrid Cluster
Task Trackers on Cassandra nodes
- Bonus - Data locality
Integrate with cluster
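
The locality bonus in the topologies above can be illustrated with a small scheduling sketch. It assumes each split reports the hosts that store its data and that the scheduler prefers a task tracker running on one of those hosts, falling back to any tracker otherwise. This is a toy model and not Hadoop's actual scheduler; `assign_splits`, the host names, and the least-loaded tie-breaking rule are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def assign_splits(
    splits: Sequence[Tuple[str, Sequence[str]]],
    trackers: Sequence[str],
) -> Dict[str, List[Tuple[str, bool]]]:
    """Assign each split to a task tracker, preferring data-local hosts.

    `splits` is a list of (split id, hosts that store the split's data).
    Returns tracker -> list of (split id, is_local). Among the candidates,
    the least-loaded tracker wins; ties go to the first one listed.
    """
    plan: Dict[str, List[Tuple[str, bool]]] = {t: [] for t in trackers}
    for split_id, hosts in splits:
        local = [t for t in trackers if t in hosts]
        candidates = local or list(trackers)   # fall back to a remote read
        chosen = min(candidates, key=lambda t: len(plan[t]))
        plan[chosen].append((split_id, bool(local)))
    return plan


splits = [
    ("s0", ["node-a", "node-b"]),
    ("s1", ["node-b", "node-c"]),
    ("s2", ["node-c", "node-a"]),
]

# Overlay: a task tracker runs on every Cassandra node, so every split is local.
overlay = assign_splits(splits, ["node-a", "node-b", "node-c"])

# Separate cluster: no tracker shares a host with the data, so every read is remote.
separate = assign_splits(splits, ["hadoop-1", "hadoop-2"])
```

With trackers collocated on the Cassandra nodes, each map task can read from its own host. With a separate Hadoop cluster, the same job still runs but every row crosses the network.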

Tutorial
contrib/word_count example
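
The contrib/word_count example itself is not reproduced in the slides. As a rough illustration of the computation a word count job performs, here is a plain Python sketch: the map step tokenizes the text held in one column of each row and emits (word, 1), and the reduce step sums the counts per word. The row data, the `text` column name, and the function names are hypothetical and are not taken from the contrib code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_row(key: str, columns: Dict[str, str], column: str = "text") -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in the chosen column of one row."""
    for word in columns.get(column, "").lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by word, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_word(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts emitted for one word."""
    return word, sum(counts)


# Hypothetical rows, keyed by row key, each with a single text column.
rows = {
    "row1": {"text": "the quick brown fox"},
    "row2": {"text": "the lazy dog"},
    "row3": {"text": "The fox"},
}

emitted = [pair for key, cols in rows.items() for pair in map_row(key, cols)]
word_counts = dict(reduce_word(w, c) for w, c in shuffle(emitted).items())
```

In a real job the map calls would run in parallel, one task per input split, and the framework would perform the grouping that `shuffle` imitates here.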

Pig + Cassandra
contrib/pig - a Cassandra-specific storage backing
Requires latest Pig - 0.7

Future Work
Simple output to Cassandra - CASSANDRA-1101
OutputFormat, OutputReducer, OutputWriter
Hive support - CASSANDRA-913
Optimizations for start/end row - CASSANDRA-1125
Other refinements based on feedback
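
Simple output to Cassandra is listed as future work, so nothing below reflects its eventual design. The sketch only illustrates the general idea behind an output writer: buffer the reducer's results and hand them to the store in batches rather than one write per record. `BatchingWriter`, the `Mutation` tuple, the batch size, and `sink` are all hypothetical; `sink` stands in for whatever client call would perform a batch write.

```python
from typing import Callable, List, Tuple

Mutation = Tuple[str, str, str]  # (row key, column name, value)


class BatchingWriter:
    """Buffer mutations and hand them to `sink` in fixed-size batches."""

    def __init__(self, sink: Callable[[List[Mutation]], None], batch_size: int = 2):
        self.sink = sink
        self.batch_size = batch_size
        self.buffer: List[Mutation] = []

    def write(self, key: str, column: str, value: str) -> None:
        """Queue one mutation and flush once the batch is full."""
        self.buffer.append((key, column, value))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send the buffered mutations, if any, as one batch."""
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()

    def close(self) -> None:
        """Flush whatever is left when the reduce task finishes."""
        self.flush()


# Collect batches in a list instead of writing to a real cluster.
batches: List[List[Mutation]] = []
writer = BatchingWriter(batches.append)
for word, count in [("the", 3), ("fox", 2), ("dog", 1)]:
    writer.write(word, "count", str(count))
writer.close()
```

Three results with a batch size of two produce one full batch and one final partial batch on close.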

Questions...
jeromatron on Twitter
jeromatron on #cassandra channel on freenode IRC
jeremy (dot) hanna (at) rackspace (dot) com
