Fast Analytics for Cassandra with Hadoop

This document discusses using Apache Hadoop for analytics on data stored in Apache Cassandra. It describes how Cassandra is optimized for fast writes and tunable consistency, while Hadoop supports analytics through MapReduce and tools like Pig and Hive. The document provides a recipe for overlaying Hadoop on a Cassandra cluster to leverage data locality for analytics processing. Examples are given of using Hadoop streaming and Cassandra input/output formats to integrate the two systems.

 BigTable + Dynamo
 Semi-structured data model
 Decentralized – no special roles
 Ridiculously fast writes, fast reads
 Tunably consistent
 Cross-DC capable

 You design your data model based on your query model
 Real-time ad-hoc queries aren’t viable
 Secondary indexes help (0.7)
 What about analytics?

 Hadoop has analytics
 MapReduce
 Pig/Hive and other tools built above MapReduce
 Configurable data sources/destinations
 Many already familiar with it
 Active community

 Always able to output to Cassandra directly
 0.6
 ColumnFamilyInputFormat
 Pig support – Cassandra LoadFunc
 0.7
 ColumnFamilyOutputFormat
 Hadoop Streaming Output
 Streamlined configuration

 Recipe
 Overlay Hadoop on top of Cassandra
 Separate server for name node and job tracker
 Co-locate task trackers with Cassandra nodes
 Add data nodes to taste
 Voilà
 Data locality
 Analytics engine scales with data
 Example
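
The data-locality idea in this recipe can be sketched in a few lines. This is an illustrative simulation only, not Hadoop's actual scheduler: the `Split` type, the `assign_splits` function, and the host names are all hypothetical. It assumes each input split reports the Cassandra nodes that hold a replica of its data, and that a task tracker runs on each of those nodes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Split:
    """A chunk of input plus the hosts holding a replica of it (hypothetical type)."""
    name: str
    locations: List[str]


def assign_splits(splits: List[Split], task_trackers: List[str]) -> Dict[str, str]:
    """Map each split name to a task tracker, preferring one on a replica host."""
    load = {host: 0 for host in task_trackers}
    assignment: Dict[str, str] = {}
    for split in splits:
        # Trackers co-located with a replica can read the data without a network hop.
        local = [host for host in split.locations if host in load]
        # Fall back to any tracker when no replica host runs a tracker.
        candidates = local or list(load)
        # Among the candidates, pick the least loaded tracker.
        chosen = min(candidates, key=lambda host: load[host])
        load[chosen] += 1
        assignment[split.name] = chosen
    return assignment


if __name__ == "__main__":
    demo = [Split("s1", ["cass1", "cass2"]), Split("s2", ["cass1", "cass3"])]
    print(assign_splits(demo, ["cass1", "cass2", "cass3"]))
```

Because every Cassandra node also runs a task tracker in this layout, adding nodes adds both storage and processing capacity, which is the sense in which the analytics engine scales with the data.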

 Cassandra specific InputFormat
 Configuration – ConfigHelper, Hadoop variables
 InputSplits over the data – tunable
 Example usage in contrib/word_count
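
As a rough illustration of what a tunable split size means, the sketch below cuts a row count into ranges of at most `split_size` rows. It is a simplification under stated assumptions: `make_splits` is a hypothetical helper, and the real InputFormat works over the cluster's token ranges rather than plain row indexes.

```python
from typing import List, Tuple


def make_splits(row_count: int, split_size: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) row ranges, each covering at most split_size rows."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    return [
        (start, min(start + split_size, row_count))
        for start in range(0, row_count, split_size)
    ]


if __name__ == "__main__":
    # Ten rows with a split size of four yields three map tasks' worth of input.
    print(make_splits(10, 4))
```

A smaller split size produces more, shorter map tasks; a larger one produces fewer, longer tasks. Which is better depends on row size and cluster capacity.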

 OutputFormat
 Configuration – ConfigHelper, Hadoop variables
 Batches output – tunable
 Don’t have to use the Cassandra API
 Some optimizations (e.g. ConsistencyLevel.ONE)
 Example usage in contrib/word_count
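
The batching behaviour can be pictured with a small sketch. This is not the ColumnFamilyOutputFormat code: `BatchingWriter`, its `send_batch` callback, and the `Mutation` tuple are hypothetical stand-ins, and the batch size of 3 is an arbitrary default chosen for illustration.

```python
from typing import Callable, List, Tuple

Mutation = Tuple[str, str, str]  # (row key, column name, value); hypothetical shape


class BatchingWriter:
    """Buffers writes and hands them to send_batch in groups of batch_size."""

    def __init__(self, send_batch: Callable[[List[Mutation]], None], batch_size: int = 3):
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self._send_batch = send_batch
        self._batch_size = batch_size
        self._pending: List[Mutation] = []

    def write(self, key: str, column: str, value: str) -> None:
        self._pending.append((key, column, value))
        # Flush as soon as the buffer reaches the configured batch size.
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            self._send_batch(self._pending)
            self._pending = []

    def close(self) -> None:
        # Send whatever is left so no write is lost at the end of the task.
        self.flush()
```

Larger batches mean fewer round trips per reduce task, at the cost of more buffered data in memory.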

 60,000+ Documented UFO Sightings
 Data set from http://infochimps.com
sighted_at | reported_at | location | shape | duration | description
19951009 | 19951009 | Iowa City, IA | | | Man repts. Witnessing “flash, followed by a classic UFO, w/ a tailfin at back.” …
19940801 | 19950220 | Renton, WA | | | Man repts. seeing 2x large ships hovering in night sky while using Russian-made night binoculars.
19970111 | 19970111 | St. Cloud, MN | pyramid | 2 min. | Summary : Right when me and my friend left my house we saw a bright green glowing object that looked like a 4 sided pyramid then after about 2 min it took off straight into the sky leaving a yellow trail behind it…
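
To make the record layout concrete, here is a minimal parsing sketch. It assumes the data set is tab-separated with the six fields in the order of the header above and dates in YYYYMMDD form; that layout, the `parse_sighting` and `sightings_per_year` helpers, and the shortened `SAMPLE` descriptions are assumptions for illustration, not taken from the slides.

```python
from collections import Counter
from typing import Dict, Iterable

FIELDS = ("sighted_at", "reported_at", "location", "shape", "duration", "description")

# Three records modelled on the rows shown above, with descriptions shortened.
SAMPLE = [
    "19951009\t19951009\tIowa City, IA\t\t\tMan repts. witnessing flash\n",
    "19940801\t19950220\tRenton, WA\t\t\tMan repts. seeing 2x large ships\n",
    "19970111\t19970111\tSt. Cloud, MN\tpyramid\t2 min.\tBright green glowing object\n",
]


def parse_sighting(line: str) -> Dict[str, str]:
    """Split one tab-separated record into a dict keyed by field name."""
    # Limit the split so any stray tabs stay inside the description field.
    parts = line.rstrip("\n").split("\t", len(FIELDS) - 1)
    # Pad short records so every field is present, possibly empty.
    parts += [""] * (len(FIELDS) - len(parts))
    return dict(zip(FIELDS, (part.strip() for part in parts)))


def sightings_per_year(lines: Iterable[str]) -> Counter:
    """Count records by the year portion of sighted_at."""
    counts: Counter = Counter()
    for line in lines:
        if line.strip():
            counts[parse_sighting(line)["sighted_at"][:4]] += 1
    return counts


if __name__ == "__main__":
    print(sightings_per_year(SAMPLE))
```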

 What about languages outside of Java?
 Build on what Hadoop uses - Streaming
 Output streaming in 0.7.0
 Example in contrib/hadoop_streaming_output
 Input streaming in progress, likely 0.7.1
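
For readers new to Hadoop Streaming, the sketch below shows its general contract in Python: a mapper reads lines and emits tab-separated key/value pairs, and a reducer receives those pairs sorted by key. It does not show the Cassandra-specific wiring from contrib/hadoop_streaming_output. The `map_lines` and `reduce_lines` names and the position of the shape field (fourth tab-separated column) are assumptions for illustration.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Emit 'shape<TAB>1' for each tab-separated sighting record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4:
            shape = fields[3].strip() or "unknown"
            yield f"{shape}\t1"


def reduce_lines(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts per key; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    # Run as "script.py map" or "script.py reduce", reading from standard input.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Because the contract is only lines on standard input and output, the same two functions can be tested locally by sorting the mapper output before handing it to the reducer.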

 Developed at Yahoo!
 PigLatin/Grunt shell
 Powerful scripting language for analytics
 Example usage in contrib/pig
 Configuration – Hadoop/Env variables

 Raptr.com
 Home grown solution -> Cassandra + Hadoop
 Query time: hours -> minutes
 Pig obviated their need for multi-lingual MR
 Speed and ease are enabling
 Imagini/Visual DNA
 US Government (Digital Reasoning)
 See http://github.com/digitalreasoning/PyStratus

 Hive support in progress (HIVE-1434)
 Hadoop Input Streaming (likely 0.7.1)
 Performance improvements

 Hadoop analytics for Cassandra
 Data locality for processing
 Scales with the cluster

 More information
 http://cassandra.apache.org
 http://wiki.apache.org/cassandra/HadoopSupport
 Cassandra: The Definitive Guide
 About me:
 jeremy.hanna@rackspace.com
 @jeromatron on Twitter
 jeromatron on IRC in #cassandra

Title slide: So Happy Together

Editor's Notes

Talk a little about background of the theme – hippies, The Turtles, readability.
Mention Jeff Hodges, Johan, Stu, and Todd Lipcon.
Mention how InputSplit works and how it can choose among replicas – array of locations returned.
Highlight how this is the same extension point that is used with HDFS, HBase and any other data source/destination for MapReduce.
In other words, are people using this stuff in the real world? In production? Put some notes in here about Raptr’s and Imagini’s use cases.
