This document discusses using Pig with Cassandra to perform analytics and data processing tasks. Pig allows running queries over Cassandra data and storing intermediate results in HDFS or Cassandra. Example uses include analytics, data exploration, validation, and correction. Configuration involves splitting the cluster into virtual datacenters and setting properties. Future work includes improving data type handling and adding support for secondary indexes and wide rows.
2. Motivation
- What's our need? How do we get at data in Cassandra with ad-hoc queries?
- Don't reinvent the wheel
3. Enter Pig
- Pig was created at Yahoo! as an abstraction for MapReduce
- Designed to "eat anything"
- A LoadStoreFunc was created for Cassandra
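A minimal sketch of what that LoadStoreFunc looks like in practice — `CassandraStorage` is the class shipped with Cassandra for Pig integration, while the keyspace and column family names here are hypothetical:

```pig
-- Load all rows of a (hypothetical) column family through the Cassandra
-- LoadStoreFunc. Each row comes back as a key plus a bag of (name, value)
-- column tuples. Assumes the Cassandra and Pig jars are on the classpath.
rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily'
       USING org.apache.cassandra.hadoop.pig.CassandraStorage()
       AS (key, columns: bag {T: tuple(name, value)});
DUMP rows;
```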
4. How it works
- Perform queries over all rows in a column family or set of column families
- Intermediate results stored in HDFS or CFS
- Can mix and match inputs and outputs
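Mixing inputs and outputs might look like the following sketch (the keyspace, column family, and paths are hypothetical): read from Cassandra, park intermediate data in HDFS or CFS so it can be inspected, then pick it back up in a later step.

```pig
-- Read source rows out of Cassandra.
rows = LOAD 'cassandra://Shop/Items'
       USING org.apache.cassandra.hadoop.pig.CassandraStorage();

-- Park intermediate results in HDFS (or CFS) to verify between steps.
STORE rows INTO '/tmp/items_raw';

-- ...later, a separate script can resume from the intermediate data.
raw = LOAD '/tmp/items_raw';
```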
5. Uses
- Analytics
- Data exploration: How many items did I get from New Jersey?
- Data validation: How many items were missing a field, and when were they created?
- Data correction: company name correction over all data
- Expand the Cassandra data model: make a new column family for querying by US state and back-populate it with Pig
- Bootstrap a local dev environment
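The "expand the data model" use might be sketched like this, assuming a hypothetical `Items` column family with a `state` column and a target `ItemsByState` column family keyed by state. `CassandraStorage` expects output shaped as a key plus a bag of (name, value) tuples:

```pig
-- Back-populate a new column family keyed by US state (all names hypothetical).
items    = LOAD 'cassandra://Shop/Items'
           USING org.apache.cassandra.hadoop.pig.CassandraStorage()
           AS (key, columns: bag {T: tuple(name, value)});

-- Flatten the column bag and keep only the 'state' columns.
flat     = FOREACH items GENERATE key, FLATTEN(columns) AS (name, value);
states   = FILTER flat BY name == 'state';

-- Re-key by state: each output row is (state, {(original_key, marker)}).
by_state = FOREACH states GENERATE value AS newkey,
           TOBAG(TOTUPLE(key, 'present'));
STORE by_state INTO 'cassandra://Shop/ItemsByState'
      USING org.apache.cassandra.hadoop.pig.CassandraStorage();
```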
6. Pygmalion
- Figure in Greek mythology; sounds like "Pig"
- UDFs and example scripts for using Pig with Cassandra
- Used in production at The Dachis Group
- https://github.com/jeromatron/pygmalion/
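A sketch of how the Pygmalion UDFs are typically used — turning Cassandra's (name, value) column bags into flat, tabular tuples. The UDF class name comes from the project; the column family and field names here are hypothetical, so check the project's README for exact signatures:

```pig
-- Register the Pygmalion jar and define its bag-flattening UDF.
REGISTER 'pygmalion.jar';
DEFINE FromCassandraBag org.pygmalion.udf.FromCassandraBag();

rows    = LOAD 'cassandra://Shop/Items'
          USING org.apache.cassandra.hadoop.pig.CassandraStorage()
          AS (key, columns: bag {T: tuple(name, value)});

-- Pull named columns out of the bag into ordinary tabular fields.
tabular = FOREACH rows GENERATE key,
          FLATTEN(FromCassandraBag('company,state', columns))
          AS (company, state);
```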
8. Tips
- Develop incrementally
- Output intermediate data frequently to verify
- Validate data on input if possible
- Use Cassandra data type validation for inputs and outputs
- Pygmalion for tabular data
- Penny in Pig 0.9!
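The data type validation tip might look like the following cassandra-cli fragment (the column family name is hypothetical): declaring validators on the column family lets typed values flow through to Pig instead of raw bytearrays.

```
-- Hypothetical cassandra-cli definition: validators give the LoadStoreFunc
-- type information for keys, column names, and values.
create column family Items
  with comparator = UTF8Type
  and key_validation_class = UTF8Type
  and default_validation_class = UTF8Type;
```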
9. Cluster Configuration
- Split the cluster into virtual datacenters
- Brisk (built-in Pig support in 1.0 beta 2+)
- Task trackers on all analytic nodes
- With HDFS: separate namenode/jobtracker; data nodes on all analytic nodes; a few settings to bridge the two; start the server processes; distributed cache and intermediate data
- With Brisk: startup includes CFS, the job tracker, and task trackers
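Splitting into virtual datacenters can be sketched with a `cassandra-topology.properties` fragment for the PropertyFileSnitch (the IPs and datacenter names here are hypothetical): real-time nodes go in one datacenter, analytic (Hadoop) nodes in another, so MapReduce load stays off the serving side.

```
# Hypothetical cassandra-topology.properties for PropertyFileSnitch:
# DC1 = real-time serving nodes, DC2 = analytic nodes running task
# trackers (and data nodes, if using HDFS).
10.0.0.1=DC1:RAC1
10.0.0.2=DC1:RAC1
10.0.1.1=DC2:RAC1
10.0.1.2=DC2:RAC1
default=DC1:RAC1
```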
11. Configuration Priorities
- Data locality
- Data locality – no really, it's the biggest performance factor
- Memory needs: Cassandra requires lots of memory; Hadoop requires lots of memory; plan with your data model and analytics in mind
- CPU needs: Cassandra doesn't need a lot of CPU horsepower; Hadoop loves CPU cores
- Interconnect: analytic nodes need to be close to one another
13. Future Work
- Better data type handling (CASSANDRA-2777)
- MapReduce over subsets of rows (CASSANDRA-1600)
- MapReduce over secondary indexes (CASSANDRA-1600)
- Pig pushdown projection
- Pig pushdown filters
- HCatalog support for Cassandra
- Better Cassandra wide-row support (CASSANDRA-2688)
- Support for immutable/snapshot inputs (CASSANDRA-2527)
14. Questions
Contact info:
- Jeremy Hanna
- @jeromatron on Twitter
- jeremy.hanna1234 <at> gmail
- jeromatron on IRC (in #cassandra, #hadoop-pig, and #hadoop)
Editor's notes
Make this section interactive:
- How many are using Cassandra – find out why, and what types of data
- How many are using Hadoop – what types of data
- What would they like to get from that data?