SlideShare ist ein Scribd-Unternehmen logo
1 von 26
Downloaden Sie, um offline zu lesen
Twitter
                    stream processing




                                        Colin Surprenant
Big Data Montreal                       @colinsurprenant
April 2012                              Lead ninja
Twitter - Spring 2012

● 350 million   tweets/day
● 140 million   active users
● >1 million    applications using API
Daily Twitter @ Needium



50 000 000 processed tweets
     5 000 opportunities
       500 messages sent

   100GB data
Anatomy of a tweet
What's in a tweet?
                                  timestamp
                 user




                        message
  avatar


Anything else?
How to get the tweets?

● Streaming API
  Subscribe to realtime feeds moving forward

● REST Search API
  Search request on past data (1 week)
Streaming API
public statuses from all users


● status/filter
    track/location/follow
     ○ 5000 follow user ids
     ○ 400 track keywords
     ○ 25 location boxes
     ○ rate limited

● status/sample
    ○   1% of all public statuses (message id mod 100)
    ○   two status/sample streams will result in same data
Streaming API
per user streams




● User Streams
    ○   all data required to update a user's display
    ○   requires user's OAuth token
    ○   statuses from followings, direct messages, mentions
    ○   cannot open large number of user streams from same host


● Site Streams
    ○   multiplexing of multiple User Streams
Streaming API
Firehose


need more/full data?           only through partners

● gnip.com
● datasift.com

●   filtering/tracking
●
                                     $$$
    partial to full Firehose


What's the catch?
                                       lots of it
Search API

● REST API (http request/response)
  ○   search query
  ○   geocode (lat, long, radius)
  ○   result type (mixed/recent/popular)
  ○   since id


● max 100 rpp and 1500 results
● rate limited (~1 request/sec)
Storm
Distributed and fault-tolerant realtime computation

                         https://github.com/nathanmarz/storm
Storm
                     The promise

●   Guaranteed data processing
●   Horizontal scalability
●   Fault-tolerance
●   No intermediate message brokers
●   Higher level abstraction than message passing
●   Just work
RedStorm
Storm: Java + Closure
RedStorm: JRuby integration & DSL for Storm


                                     +
Ruby + Storm on JVM




              https://github.com/colinsurprenant/redstorm
Storm
Typical use cases



    Stream          Continuous    Distributed
  processing        computation      RPC
Storm
Concepts

                       Streams


  Tuple        Tuple    Tuple    Tuple   Tuple




           Unbounded sequence of tuples
Storm
Concepts

                           Spouts
                                            Tuple
                                    Tuple
                            Tuple
                   Tuple
           Tuple



           Tuple
                   Tuple
                            Tuple
                                    Tuple
                                            Tuple


                   Source of streams
Storm
Concepts

                                          Bolts

  Tuple
          Tuple
                  Tuple
                          Tuple
                                  Tuple


                                                  Tuple   Tuple   Tuple   Tuple   Tuple


                                  Tuple
                          Tuple
                  Tuple
          Tuple
  Tuple




   Processes input streams and produce new streams
Storm
Concepts

                  Topology




           Network of spouts and bolts
Storm
Concepts

   bolt A   Shuffle       bolt B                           All
                                              bolt A                     bolt B




   random+fair distribution                   replication to all tasks
   across tasks

                                   Grouping
   bolt A    Fields       bolt B              bolt A     Global          bolt B
              field X


              field Y


   partition on specified field               entire stream to single task
Storm
                   What Storm does

●   Distributes code
●   Robust process management
●   Monitors topologies and reassigns failed tasks
●   Provides reliability by tracking tuple trees
●   Routing and partitioning of stream
Tweitgeist
Live top 10 trending hashtags on Twitter




      DEMO


                  https://github.com/colinsurprenant/tweitgeist
                  Live demo: http://tweitgeist.needium.com
Tweitgeist
Architecture




                      Partitioned    Partitioned
                      counting       ranking
         Parallel
         extraction                                 Merging


                      Partitioned    Partitioned
                      counting       ranking


          Queue
                                                     Queue
                      N partitions   N partitions



         Twitter
         stream                                     Frontend
Tweitgeist
Architecture




● Parallel computation across partitions of the stream
● Scalable architecture
● Work for full Twitter firehose with very little mods
Tweitgeist
Storm Topology


               TwitterStream             ExtractMessage             ExtractHashtag                   RollingCount                   Rank            Merge
               Spout                     Bolt                       Bolt                             Bolt                           Bolt            Bolt
  Twitter




                                                                                                                    field hashtag
                                                                                     field hashtag




                                                                                                                                                               UI
                                                                                                                                           global
                               shuffle




                                                          shuffle
 Redis                                                                                                                                                        Redis
 queue                                                                                                                                                        queue




               stream                    message hashtag                                              rolling ranking                               merging
               reader                    extract extract                                              counter


            Shuffle grouping:                    Tuples are randomly distributed across the bolt's tasks
            Fields grouping:                     The stream is partitioned by the fields specified in the grouping
            Global grouping:                     The entire stream goes to a single one of the bolt's tasks
Tweitgeist
Topology definition
Tweitgeist
Where



                          Code
        https://github.com/colinsurprenant/tweitgeist

                     Live demo
               http://tweitgeist.needium.com

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
 
Amazon Aurora and AWS Database Migration Service
Amazon Aurora and AWS Database Migration ServiceAmazon Aurora and AWS Database Migration Service
Amazon Aurora and AWS Database Migration Service
 
[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
 
Mining a Large Web Corpus
Mining a Large Web CorpusMining a Large Web Corpus
Mining a Large Web Corpus
 
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron SchildkroutKafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
 
High Performance Data Streaming with Amazon Kinesis: Best Practices (ANT322-R...
High Performance Data Streaming with Amazon Kinesis: Best Practices (ANT322-R...High Performance Data Streaming with Amazon Kinesis: Best Practices (ANT322-R...
High Performance Data Streaming with Amazon Kinesis: Best Practices (ANT322-R...
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Apache kafka performance(latency)_benchmark_v0.3
Apache kafka performance(latency)_benchmark_v0.3Apache kafka performance(latency)_benchmark_v0.3
Apache kafka performance(latency)_benchmark_v0.3
 
Next generation intelligent data lakes, powered by GraphQL & AWS AppSync - MA...
Next generation intelligent data lakes, powered by GraphQL & AWS AppSync - MA...Next generation intelligent data lakes, powered by GraphQL & AWS AppSync - MA...
Next generation intelligent data lakes, powered by GraphQL & AWS AppSync - MA...
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Fast analytics kudu to druid
Fast analytics  kudu to druidFast analytics  kudu to druid
Fast analytics kudu to druid
 
Deep Dive and Best Practices for Real Time Streaming Applications
Deep Dive and Best Practices for Real Time Streaming ApplicationsDeep Dive and Best Practices for Real Time Streaming Applications
Deep Dive and Best Practices for Real Time Streaming Applications
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
 

Andere mochten auch

Apache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integrationApache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integration
Uday Vakalapudi
 

Andere mochten auch (10)

Apache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integrationApache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integration
 
StormWars - when the data stream shrinks
StormWars - when the data stream shrinksStormWars - when the data stream shrinks
StormWars - when the data stream shrinks
 
Real-Time Analytics with Apache Storm
Real-Time Analytics with Apache StormReal-Time Analytics with Apache Storm
Real-Time Analytics with Apache Storm
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
 
Real time and reliable processing with Apache Storm
Real time and reliable processing with Apache StormReal time and reliable processing with Apache Storm
Real time and reliable processing with Apache Storm
 
Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Storm
 
Kafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtimeKafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtime
 

Ähnlich wie Twitter Stream Processing

Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
nathanmarz
 

Ähnlich wie Twitter Stream Processing (20)

Twitter Big Data
Twitter Big DataTwitter Big Data
Twitter Big Data
 
Bigdata roundtable-storm
Bigdata roundtable-stormBigdata roundtable-storm
Bigdata roundtable-storm
 
Storm - The Real-Time Layer Your Big Data's Been Missing
Storm - The Real-Time Layer Your Big Data's Been MissingStorm - The Real-Time Layer Your Big Data's Been Missing
Storm - The Real-Time Layer Your Big Data's Been Missing
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012
 
Incremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation systemIncremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation system
 
Realtime Computation with Storm
Realtime Computation with StormRealtime Computation with Storm
Realtime Computation with Storm
 
stream processing engine
stream processing enginestream processing engine
stream processing engine
 
Storm
StormStorm
Storm
 
Realtime processing with storm presentation
Realtime processing with storm presentationRealtime processing with storm presentation
Realtime processing with storm presentation
 
Storm
StormStorm
Storm
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Mobile Monday Athens launch, Konstantinos Papamiltiadis, Taptu
Mobile Monday Athens launch, Konstantinos Papamiltiadis, TaptuMobile Monday Athens launch, Konstantinos Papamiltiadis, Taptu
Mobile Monday Athens launch, Konstantinos Papamiltiadis, Taptu
 
Realtime Computation with Storm
Realtime Computation with StormRealtime Computation with Storm
Realtime Computation with Storm
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
Streaming 101: Hello World
Streaming 101:  Hello WorldStreaming 101:  Hello World
Streaming 101: Hello World
 
Stream Processing in the Cloud - Athens Kubernetes Meetup 16.07.2019
Stream Processing in the Cloud - Athens Kubernetes Meetup 16.07.2019Stream Processing in the Cloud - Athens Kubernetes Meetup 16.07.2019
Stream Processing in the Cloud - Athens Kubernetes Meetup 16.07.2019
 
Message Queues : A Primer - International PHP Conference Fall 2012
Message Queues : A Primer - International PHP Conference Fall 2012Message Queues : A Primer - International PHP Conference Fall 2012
Message Queues : A Primer - International PHP Conference Fall 2012
 
Jan 2012 HUG: Storm
Jan 2012 HUG: StormJan 2012 HUG: Storm
Jan 2012 HUG: Storm
 
BWB Meetup: Storm - distributed realtime computation system
BWB Meetup: Storm - distributed realtime computation systemBWB Meetup: Storm - distributed realtime computation system
BWB Meetup: Storm - distributed realtime computation system
 

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 

Twitter Stream Processing

  • 1. Twitter stream processing Colin Surprenant Big Data Montreal @colinsurprenant April 2012 Lead ninja
  • 2. Twitter - Spring 2012 ● 350 million tweets/day ● 140 million active users ● >1 million applications using API
  • 3. Daily Twitter @ Needium 50 000 000 processed tweets 5 000 opportunities 500 messages sent 100GB data
  • 4. Anatomy of a tweet What's in a tweet? timestamp user message avatar Anything else?
  • 5.
  • 6. How to get the tweets? ● Streaming API Subscribe to realtime feeds moving forward ● REST Search API Search request on past data (1 week)
  • 7. Streaming API public statuses from all users ● status/filter track/location/follow ○ 5000 follow user ids ○ 400 track keywords ○ 25 location boxes ○ rate limited ● status/sample ○ 1% of all public statuses (message id mod 100) ○ two status/sample streams will result in same data
  • 8. Streaming API per user streams ● User Streams ○ all data required to update a user's display ○ requires user's OAuth token ○ statuses from followings, direct messages, mentions ○ cannot open large number of user streams from same host ● Site Streams ○ multiplexing of multiple User Streams
  • 9. Streaming API Firehose need more/full data? only through partners ● gnip.com ● datasift.com ● filtering/tracking ● $$$ partial to full Firehose What's the catch? lots of it
  • 10. Search API ● REST API (http request/response) ○ search query ○ geocode (lat, long, radius) ○ result type (mixed/recent/popular) ○ since id ● max 100 rpp and 1500 results ● rate limited (~1 request/sec)
  • 11. Storm Distributed and fault-tolerant realtime computation https://github.com/nathanmarz/storm
  • 12. Storm The promise ● Guaranteed data processing ● Horizontal scalability ● Fault-tolerance ● No intermediate message brokers ● Higher level abstraction than message passing ● Just work
  • 13. RedStorm Storm: Java + Closure RedStorm: JRuby integration & DSL for Storm + Ruby + Storm on JVM https://github.com/colinsurprenant/redstorm
  • 14. Storm Typical use cases Stream Continuous Distributed processing computation RPC
  • 15. Storm Concepts Streams Tuple Tuple Tuple Tuple Tuple Unbounded sequence of tuples
  • 16. Storm Concepts Spouts Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Source of streams
  • 17. Storm Concepts Bolts Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Processes input streams and produce new streams
  • 18. Storm Concepts Topology Network of spouts and bolts
  • 19. Storm Concepts bolt A Shuffle bolt B All bolt A bolt B random+fair distribution replication to all tasks across tasks Grouping bolt A Fields bolt B bolt A Global bolt B field X field Y partition on specified field entire stream to single task
  • 20. Storm What Storm does ● Distributes code ● Robust process management ● Monitors topologies and reassigns failed tasks ● Provides reliability by tracking tuple trees ● Routing and partitioning of stream
  • 21. Tweitgeist Live top 10 trending hashtags on Twitter DEMO https://github.com/colinsurprenant/tweitgeist Live demo: http://tweitgeist.needium.com
  • 22. Tweitgeist Architecture Partitioned Partitioned counting ranking Parallel extraction Merging Partitioned Partitioned counting ranking Queue Queue N partitions N partitions Twitter stream Frontend
  • 23. Tweitgeist Architecture ● Parallel computation across partitions of the stream ● Scalable architecture ● Work for full Twitter firehose with very little mods
  • 24. Tweitgeist Storm Topology TwitterStream ExtractMessage ExtractHashtag RollingCount Rank Merge Spout Bolt Bolt Bolt Bolt Bolt Twitter field hashtag field hashtag UI global shuffle shuffle Redis Redis queue queue stream message hashtag rolling ranking merging reader extract extract counter Shuffle grouping: Tuples are randomly distributed across the bolt's tasks Fields grouping: The stream is partitioned by the fields specified in the grouping Global grouping: The entire stream goes to a single one of the bolt's tasks
  • 26. Tweitgeist Where Code https://github.com/colinsurprenant/tweitgeist Live demo http://tweitgeist.needium.com