SlideShare ist ein Scribd-Unternehmen logo
1 von 15
Downloaden Sie, um offline zu lesen
Demonstration
Outline
● Some comments on what we're trying to
  show
  ○ high level cluster configuration
  ○ an example application that might use this config
    ■ based on a Gowalla data set
● Launch cluster nodes on EC2
● Launch/configure Cassandra on cluster
● Demonstrate use of Cassandra
  ○ cassandra-cli, pycassa scripts to interact with db
● Demonstrate use of Hadoop
● Demonstrate use of Pig on the real data
Cluster configuration
● Four EC2 nodes
  ○ m1.medium instances
    ■ realistically a bit small for real world
● 3 nodes part of Cassandra
  ○ data can be input dynamically into db via Thrift API
● All nodes run Hadoop Tasktracker
● MapReduce runs close to (Cassandra) data
● JobTracker on separate node
Cluster config


        Job Tracker                           Cassandra

                                             Task Tracker




                               Cassandra                     Cassandra

                              Task Tracker                  Task Tracker



All nodes m1.small for demo
Let's get the cluster up...
       ...over to Lamine!
Let's get Cassandra
      running...
  ...and show the basic cli...
Application data
● Used Gowalla data in this test application
● Gowalla provide anonymized data for
  test/research purposes:
  ○ Graph of UID connections
  ○ List of checkins - UID, LocID
● Size of data set:
  ○ 400MB checkins
    ■ 6.4m checkins
  ○ ~200k users
● Also generated simpler variant of this data
  for demonstration
  ○ more real user information
  ○ more real location information
Application data - User Graph




 Simple graph structure -
 unidirectional graph with
 UIDs as nodes
Application Data - Checkin info
How this data can be used
● Application interested in:
   ○   my checkins
   ○   list my friends
   ○   checkins at given location
   ○   my friends checkins
● Analytics:
   ○ top ten most active users - most checkins
   ○ aggregate checkins per week
   ○ aggregate checkins per week per city
Cassandra data models
● The following data models were used:
  ○ User
  ○ Location
  ○ Checkin
  ○ FriendRels
    ■ graph of friend relationships
  ○ UserCheckins
    ■ checkins by user
  ○ LocationCheckins
    ■ checkins by location
  ○ FriendCheckins
    ■ checkins by friends
Cassandra data models
● Use of valueless columns
  ○ FriendRels, UserCheckins, LocationCheckins,
    FriendCheckins are just sets of valueless columns
● FriendRel:
  ○ row_key: {friendid1: '', friendid2: '', friendid3: '', ...}
    ■ row_key is a uid
● UserCheckins:
  ○ row_key: {checkinid1: '', checkinid2: '', ...}
    ■ row_key is uid
● LocationCheckins use LocID as row key
● FriendCheckins use my UID to get my
  friend's checkins
Let's import the data into
       Cassandra...
You deserve a coffee...
Using Hadoop and Pig
 ...and we can do some analytics...

Weitere ähnliche Inhalte

Was ist angesagt?

Elasticsearch avoiding hotspots
Elasticsearch  avoiding hotspotsElasticsearch  avoiding hotspots
Elasticsearch avoiding hotspotsChristophe Marchal
 
Object multifunctional indexing with an open API
Object multifunctional indexing with an open API Object multifunctional indexing with an open API
Object multifunctional indexing with an open API akvalex
 
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...NoSQLmatters
 
Streaming data to s3 using akka streams
Streaming data to s3 using akka streamsStreaming data to s3 using akka streams
Streaming data to s3 using akka streamsMikhail Girkin
 
ACM DEBS Grand Challenge: Continuous Analytics on Geospatial Data Streams wit...
ACM DEBS Grand Challenge: Continuous Analytics on Geospatial Data Streams wit...ACM DEBS Grand Challenge: Continuous Analytics on Geospatial Data Streams wit...
ACM DEBS Grand Challenge: Continuous Analytics on Geospatial Data Streams wit...Srinath Perera
 
DB reading group may 16, 2018
DB reading group may 16, 2018DB reading group may 16, 2018
DB reading group may 16, 2018Keisuke Suzuki
 
Locality Sensitive Hashing By Spark
Locality Sensitive Hashing By SparkLocality Sensitive Hashing By Spark
Locality Sensitive Hashing By SparkSpark Summit
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBRaghunath A
 
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...Rob Skillington
 
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding -  patterns & antipatterns, Константин Осипов, Алексей РыбакSharding -  patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding - patterns & antipatterns, Константин Осипов, Алексей РыбакOntico
 
Data Step Hash Object vs SQL Join
Data Step Hash Object vs SQL JoinData Step Hash Object vs SQL Join
Data Step Hash Object vs SQL JoinGeoff Ness
 
Pain points with M3, some things to address them and how replication works
Pain points with M3, some things to address them and how replication worksPain points with M3, some things to address them and how replication works
Pain points with M3, some things to address them and how replication worksRob Skillington
 
Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On
Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-OnApache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On
Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-OnApache Flink Taiwan User Group
 
Mongo nyc nyt + mongodb
Mongo nyc nyt + mongodbMongo nyc nyt + mongodb
Mongo nyc nyt + mongodbDeep Kapadia
 
MongoDB Workshop Universidad de Huelva
MongoDB Workshop Universidad de HuelvaMongoDB Workshop Universidad de Huelva
MongoDB Workshop Universidad de HuelvaJuan Antonio Roy Couto
 
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...NETWAYS
 
Amazon Web Services lection 4
Amazon Web Services lection 4  Amazon Web Services lection 4
Amazon Web Services lection 4 Binary Studio
 
Data Lessons Learned at Scale
Data Lessons Learned at ScaleData Lessons Learned at Scale
Data Lessons Learned at ScaleCharlie Reverte
 

Was ist angesagt? (20)

Elasticsearch avoiding hotspots
Elasticsearch  avoiding hotspotsElasticsearch  avoiding hotspots
Elasticsearch avoiding hotspots
 
Object multifunctional indexing with an open API
Object multifunctional indexing with an open API Object multifunctional indexing with an open API
Object multifunctional indexing with an open API
 
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
Ted Dunning – Very High Bandwidth Time Series Database Implementation - NoSQL...
 
Streaming data to s3 using akka streams
Streaming data to s3 using akka streamsStreaming data to s3 using akka streams
Streaming data to s3 using akka streams
 
NBITSearch. Features.
NBITSearch. Features.NBITSearch. Features.
NBITSearch. Features.
 
ACM DEBS Grand Challenge: Continuous Analytics on Geospatial Data Streams wit...
ACM DEBS Grand Challenge: Continuous Analytics on Geospatial Data Streams wit...ACM DEBS Grand Challenge: Continuous Analytics on Geospatial Data Streams wit...
ACM DEBS Grand Challenge: Continuous Analytics on Geospatial Data Streams wit...
 
DB reading group may 16, 2018
DB reading group may 16, 2018DB reading group may 16, 2018
DB reading group may 16, 2018
 
Locality Sensitive Hashing By Spark
Locality Sensitive Hashing By SparkLocality Sensitive Hashing By Spark
Locality Sensitive Hashing By Spark
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
 
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding -  patterns & antipatterns, Константин Осипов, Алексей РыбакSharding -  patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
 
Data Step Hash Object vs SQL Join
Data Step Hash Object vs SQL JoinData Step Hash Object vs SQL Join
Data Step Hash Object vs SQL Join
 
Pain points with M3, some things to address them and how replication works
Pain points with M3, some things to address them and how replication worksPain points with M3, some things to address them and how replication works
Pain points with M3, some things to address them and how replication works
 
Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On
Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-OnApache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On
Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On
 
Mongo nyc nyt + mongodb
Mongo nyc nyt + mongodbMongo nyc nyt + mongodb
Mongo nyc nyt + mongodb
 
MongoDB Workshop Universidad de Huelva
MongoDB Workshop Universidad de HuelvaMongoDB Workshop Universidad de Huelva
MongoDB Workshop Universidad de Huelva
 
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
 
Amazon Web Services lection 4
Amazon Web Services lection 4  Amazon Web Services lection 4
Amazon Web Services lection 4
 
R user group 2011 09
R user group 2011 09R user group 2011 09
R user group 2011 09
 
Data Lessons Learned at Scale
Data Lessons Learned at ScaleData Lessons Learned at Scale
Data Lessons Learned at Scale
 

Andere mochten auch

No sql course introduction
No sql course   introductionNo sql course   introduction
No sql course introductionSean Murphy
 
Rss announcements
Rss announcementsRss announcements
Rss announcementsSean Murphy
 
Overview of no sql
Overview of no sqlOverview of no sql
Overview of no sqlSean Murphy
 
Introduction to cassandra
Introduction to cassandraIntroduction to cassandra
Introduction to cassandraTarun Garg
 

Andere mochten auch (7)

No sql course introduction
No sql course   introductionNo sql course   introduction
No sql course introduction
 
Rocco pres-v1
Rocco pres-v1Rocco pres-v1
Rocco pres-v1
 
Rss announcements
Rss announcementsRss announcements
Rss announcements
 
Rss talk
Rss talkRss talk
Rss talk
 
Overview of no sql
Overview of no sqlOverview of no sql
Overview of no sql
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Introduction to cassandra
Introduction to cassandraIntroduction to cassandra
Introduction to cassandra
 

Ähnlich wie Demonstrate Cassandra, Hadoop and Pig on EC2 Cluster

Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerMichael Spector
 
Native container monitoring
Native container monitoringNative container monitoring
Native container monitoringRohit Jnagal
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in RetailHari Shreedharan
 
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari ShreedharanAnalytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari ShreedharanDatabricks
 
Spark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaSpark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaJose Mº Muñoz
 
OpenSearch.pdf
OpenSearch.pdfOpenSearch.pdf
OpenSearch.pdfAbhi Jain
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...DataStax
 
Avoiding Pitfalls for Cassandra.pdf
Avoiding Pitfalls for Cassandra.pdfAvoiding Pitfalls for Cassandra.pdf
Avoiding Pitfalls for Cassandra.pdfCédrick Lunven
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at NightHow Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at NightScyllaDB
 
Peer sim (p2p network)
Peer sim (p2p network)Peer sim (p2p network)
Peer sim (p2p network)Hein Min Htike
 
Mongo db improve the performance of your application codemotion2016
Mongo db improve the performance of your application codemotion2016Mongo db improve the performance of your application codemotion2016
Mongo db improve the performance of your application codemotion2016Juan Antonio Roy Couto
 

Ähnlich wie Demonstrate Cassandra, Hadoop and Pig on EC2 Cluster (20)

Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 
Native container monitoring
Native container monitoringNative container monitoring
Native container monitoring
 
Native Container Monitoring
Native Container MonitoringNative Container Monitoring
Native Container Monitoring
 
Druid
DruidDruid
Druid
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Cassandra NoSQL Tutorial
Cassandra NoSQL TutorialCassandra NoSQL Tutorial
Cassandra NoSQL Tutorial
 
Running Cassandra in AWS
Running Cassandra in AWSRunning Cassandra in AWS
Running Cassandra in AWS
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in Retail
 
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari ShreedharanAnalytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
 
Spark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaSpark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest Córdoba
 
Presentation
PresentationPresentation
Presentation
 
OpenSearch.pdf
OpenSearch.pdfOpenSearch.pdf
OpenSearch.pdf
 
MongoDB FabLab León
MongoDB FabLab LeónMongoDB FabLab León
MongoDB FabLab León
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
 
Avoiding Pitfalls for Cassandra.pdf
Avoiding Pitfalls for Cassandra.pdfAvoiding Pitfalls for Cassandra.pdf
Avoiding Pitfalls for Cassandra.pdf
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at NightHow Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
 
Peer sim (p2p network)
Peer sim (p2p network)Peer sim (p2p network)
Peer sim (p2p network)
 
Mongo db improve the performance of your application codemotion2016
Mongo db improve the performance of your application codemotion2016Mongo db improve the performance of your application codemotion2016
Mongo db improve the performance of your application codemotion2016
 

Kürzlich hochgeladen

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 

Kürzlich hochgeladen (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

Demonstrate Cassandra, Hadoop and Pig on EC2 Cluster

  • 2. Outline ● Some comments on what we're trying to show ○ high level cluster configuration ○ an example application that might use this config ■ based on a Gowalla data set ● Launch cluster nodes on EC2 ● Launch/configure Cassandra on cluster ● Demonstrate use of Cassandra ○ cassandra-cli, pycassa scripts to interact with db ● Demonstrate use of Hadoop ● Demonstrate use of Pig on the real data
  • 3. Cluster configuration ● Four EC2 nodes ○ m1.medium instances ■ realistically a bit small for real world ● 3 nodes part of Cassandra ○ data can be input dynamically into db via Thrift API ● All nodes run Hadoop Tasktracker ● MapReduce runs close to (Cassandra) data ● JobTracker on separate node
  • 4. Cluster config Job Tracker Cassandra Task Tracker Cassandra Cassandra Task Tracker Task Tracker All nodes m1.small for demo
  • 5. Let's get the cluster up... ...over to Lamine!
  • 6. Let's get Cassandra running... ...and show the basic cli...
  • 7. Application data ● Used Gowalla data in this test application ● Gowalla provide anonymized data for test/research purposes: ○ Graph of UID connections ○ List of checkins - UID, LocID ● Size of data set: ○ 400MB checkins ■ 6.4m checkins ○ ~200k users ● Also generated simpler variant of this data for demonstration ○ more real user information ○ more real location information
  • 8. Application data - User Graph Simple graph structure - unidirectional graph with UIDs as nodes
  • 9. Application Data - Checkin info
  • 10. How this data can be used ● Application interested in: ○ my checkins ○ list my friends ○ checkins at given location ○ my friends checkins ● Analytics: ○ top ten most active users - most checkins ○ aggregate checkins per week ○ aggregate checkins per week per city
  • 11. Cassandra data models ● The following data models were used: ○ User ○ Location ○ Checkin ○ FriendRels ■ graph of friend relationships ○ UserCheckins ■ checkins by user ○ LocationCheckins ■ checkins by location ○ FriendCheckins ■ checkins by friends
  • 12. Cassandra data models ● Use of valueless columns ○ FriendRels, UserCheckins, LocationCheckins, FriendCheckins are just sets of valueless columns ● FriendRel: ○ row_key: {friendid1: '', friendid2: '', friendid3: '', ...} ■ row_key is a uid ● UserCheckins: ○ row_key: {checkinid1: '', checkinid2: '', ...} ■ row_key is uid ● LocationCheckins use LocID as row key ● FriendCheckins use my UID to get my friend's checkins
  • 13. Let's import the data into Cassandra...
  • 14. You deserve a coffee...
  • 15. Using Hadoop and Pig ...and we can do some analytics...