SlideShare ist ein Scribd-Unternehmen logo
1 von 26
Downloaden Sie, um offline zu lesen
Twitter
                    stream processing




                                        Colin Surprenant
Big Data Montreal                       @colinsurprenant
April 2012                              Lead ninja
Twitter - Spring 2012

● 350 million   tweets/day
● 140 million   active users
● >1 million    applications using API
Daily Twitter @ Needium



50 000 000 processed tweets
     5 000 opportunities
       500 messages sent

   100GB data
Anatomy of a tweet
What's in a tweet?
                                  timestamp
                 user




                        message
  avatar


Anything else?
How to get the tweets?

● Streaming API
  Subscribe to realtime feeds moving forward

● REST Search API
  Search request on past data (1 week)
Streaming API
public statuses from all users


● status/filter
    track/location/follow
     ○ 5000 follow user ids
     ○ 400 track keywords
     ○ 25 location boxes
     ○ rate limited

● status/sample
    ○   1% of all public statuses (message id mod 100)
    ○   two status/sample streams will result in same data
Streaming API
per user streams




● User Streams
    ○   all data required to update a user's display
    ○   requires user's OAuth token
    ○   statuses from followings, direct messages, mentions
    ○   cannot open large number of user streams from same host


● Site Streams
    ○   multiplexing of multiple User Streams
Streaming API
Firehose


need more/full data?           only through partners

● gnip.com
● datasift.com

●   filtering/tracking
●
                                     $$$
    partial to full Firehose


What's the catch?
                                       lots of it
Search API

● REST API (http request/response)
  ○   search query
  ○   geocode (lat, long, radius)
  ○   result type (mixed/recent/popular)
  ○   since id


● max 100 rpp and 1500 results
● rate limited (~1 request/sec)
Storm
Distributed and fault-tolerant realtime computation

                         https://github.com/nathanmarz/storm
Storm
                     The promise

●   Guaranteed data processing
●   Horizontal scalability
●   Fault-tolerance
●   No intermediate message brokers
●   Higher level abstraction than message passing
●   Just work
RedStorm
Storm: Java + Closure
RedStorm: JRuby integration & DSL for Storm


                                     +
Ruby + Storm on JVM




              https://github.com/colinsurprenant/redstorm
Storm
Typical use cases



    Stream          Continuous    Distributed
  processing        computation      RPC
Storm
Concepts

                       Streams


  Tuple        Tuple    Tuple    Tuple   Tuple




           Unbounded sequence of tuples
Storm
Concepts

                           Spouts
                                            Tuple
                                    Tuple
                            Tuple
                   Tuple
           Tuple



           Tuple
                   Tuple
                            Tuple
                                    Tuple
                                            Tuple


                   Source of streams
Storm
Concepts

                                          Bolts

  Tuple
          Tuple
                  Tuple
                          Tuple
                                  Tuple


                                                  Tuple   Tuple   Tuple   Tuple   Tuple


                                  Tuple
                          Tuple
                  Tuple
          Tuple
  Tuple




   Processes input streams and produce new streams
Storm
Concepts

                  Topology




           Network of spouts and bolts
Storm
Concepts

   bolt A   Shuffle       bolt B                           All
                                              bolt A                     bolt B




   random+fair distribution                   replication to all tasks
   across tasks

                                   Grouping
   bolt A    Fields       bolt B              bolt A     Global          bolt B
              field X


              field Y


   partition on specified field               entire stream to single task
Storm
                   What Storm does

●   Distributes code
●   Robust process management
●   Monitors topologies and reassigns failed tasks
●   Provides reliability by tracking tuple trees
●   Routing and partitioning of stream
Tweitgeist
Live top 10 trending hashtags on Twitter




      DEMO


                  https://github.com/colinsurprenant/tweitgeist
                  Live demo: http://tweitgeist.needium.com
Tweitgeist
Architecture




                      Partitioned    Partitioned
                      counting       ranking
         Parallel
         extraction                                 Merging


                      Partitioned    Partitioned
                      counting       ranking


          Queue
                                                     Queue
                      N partitions   N partitions



         Twitter
         stream                                     Frontend
Tweitgeist
Architecture




● Parallel computation across partitions of the stream
● Scalable architecture
● Work for full Twitter firehose with very little mods
Tweitgeist
Storm Topology


               TwitterStream             ExtractMessage             ExtractHashtag                   RollingCount                   Rank            Merge
               Spout                     Bolt                       Bolt                             Bolt                           Bolt            Bolt
  Twitter




                                                                                                                    field hashtag
                                                                                     field hashtag




                                                                                                                                                               UI
                                                                                                                                           global
                               shuffle




                                                          shuffle
 Redis                                                                                                                                                        Redis
 queue                                                                                                                                                        queue




               stream                    message hashtag                                              rolling ranking                               merging
               reader                    extract extract                                              counter


            Shuffle grouping:                    Tuples are randomly distributed across the bolt's tasks
            Fields grouping:                     The stream is partitioned by the fields specified in the grouping
            Global grouping:                     The entire stream goes to a single one of the bolt's tasks
Tweitgeist
Topology definition
Tweitgeist
Where



                          Code
        https://github.com/colinsurprenant/tweitgeist

                     Live demo
               http://tweitgeist.needium.com

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explainedconfluent
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotXiang Fu
 
Deep Learning at Extreme Scale (in the Cloud) 
with the Apache Kafka Open Sou...
Deep Learning at Extreme Scale (in the Cloud) 
with the Apache Kafka Open Sou...Deep Learning at Extreme Scale (in the Cloud) 
with the Apache Kafka Open Sou...
Deep Learning at Extreme Scale (in the Cloud) 
with the Apache Kafka Open Sou...Kai Wähner
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Apache kafka performance(throughput) - without data loss and guaranteeing dat...
Apache kafka performance(throughput) - without data loss and guaranteeing dat...Apache kafka performance(throughput) - without data loss and guaranteeing dat...
Apache kafka performance(throughput) - without data loss and guaranteeing dat...SANG WON PARK
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapKostas Tzoumas
 
Disaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache KafkaDisaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache Kafkaconfluent
 
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
High Performance Data Lake with Apache Hudi and Alluxio at T3GoHigh Performance Data Lake with Apache Hudi and Alluxio at T3Go
High Performance Data Lake with Apache Hudi and Alluxio at T3GoAlluxio, Inc.
 
Building a Dynamic Rules Engine with Kafka Streams
Building a Dynamic Rules Engine with Kafka StreamsBuilding a Dynamic Rules Engine with Kafka Streams
Building a Dynamic Rules Engine with Kafka StreamsHostedbyConfluent
 
Introduction to Micronaut - JBCNConf 2019
Introduction to Micronaut - JBCNConf 2019Introduction to Micronaut - JBCNConf 2019
Introduction to Micronaut - JBCNConf 2019graemerocher
 
When NOT to use Apache Kafka?
When NOT to use Apache Kafka?When NOT to use Apache Kafka?
When NOT to use Apache Kafka?Kai Wähner
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...DataWorks Summit
 
Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics Araf Karsh Hamid
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar
 
codecentric AG: CQRS and Event Sourcing Applications with Cassandra
codecentric AG: CQRS and Event Sourcing Applications with Cassandracodecentric AG: CQRS and Event Sourcing Applications with Cassandra
codecentric AG: CQRS and Event Sourcing Applications with CassandraDataStax Academy
 

Was ist angesagt? (20)

Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Deep Learning at Extreme Scale (in the Cloud) 
with the Apache Kafka Open Sou...
Deep Learning at Extreme Scale (in the Cloud) 
with the Apache Kafka Open Sou...Deep Learning at Extreme Scale (in the Cloud) 
with the Apache Kafka Open Sou...
Deep Learning at Extreme Scale (in the Cloud) 
with the Apache Kafka Open Sou...
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Apache kafka performance(throughput) - without data loss and guaranteeing dat...
Apache kafka performance(throughput) - without data loss and guaranteeing dat...Apache kafka performance(throughput) - without data loss and guaranteeing dat...
Apache kafka performance(throughput) - without data loss and guaranteeing dat...
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedIn
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmap
 
Disaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache KafkaDisaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache Kafka
 
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
High Performance Data Lake with Apache Hudi and Alluxio at T3GoHigh Performance Data Lake with Apache Hudi and Alluxio at T3Go
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
 
Core Banking System on Apache Kafka
Core Banking System on Apache KafkaCore Banking System on Apache Kafka
Core Banking System on Apache Kafka
 
Building a Dynamic Rules Engine with Kafka Streams
Building a Dynamic Rules Engine with Kafka StreamsBuilding a Dynamic Rules Engine with Kafka Streams
Building a Dynamic Rules Engine with Kafka Streams
 
Introduction to Micronaut - JBCNConf 2019
Introduction to Micronaut - JBCNConf 2019Introduction to Micronaut - JBCNConf 2019
Introduction to Micronaut - JBCNConf 2019
 
When NOT to use Apache Kafka?
When NOT to use Apache Kafka?When NOT to use Apache Kafka?
When NOT to use Apache Kafka?
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
 
Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
 
codecentric AG: CQRS and Event Sourcing Applications with Cassandra
codecentric AG: CQRS and Event Sourcing Applications with Cassandracodecentric AG: CQRS and Event Sourcing Applications with Cassandra
codecentric AG: CQRS and Event Sourcing Applications with Cassandra
 
LMAX Architecture
LMAX ArchitectureLMAX Architecture
LMAX Architecture
 
Track code quality with SonarQube
Track code quality with SonarQubeTrack code quality with SonarQube
Track code quality with SonarQube
 

Andere mochten auch

Apache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integrationApache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integrationUday Vakalapudi
 
StormWars - when the data stream shrinks
StormWars - when the data stream shrinksStormWars - when the data stream shrinks
StormWars - when the data stream shrinksvishnu rao
 
Real-Time Analytics with Apache Storm
Real-Time Analytics with Apache StormReal-Time Analytics with Apache Storm
Real-Time Analytics with Apache StormTaewoo Kim
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014StampedeCon
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormEugene Dvorkin
 
Apache Storm
Apache StormApache Storm
Apache StormEdureka!
 
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Adrianos Dadis
 
Real time and reliable processing with Apache Storm
Real time and reliable processing with Apache StormReal time and reliable processing with Apache Storm
Real time and reliable processing with Apache StormAndrea Iacono
 
Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Stormviirya
 
Kafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtimeKafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtimeGuido Schmutz
 

Andere mochten auch (10)

Apache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integrationApache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integration
 
StormWars - when the data stream shrinks
StormWars - when the data stream shrinksStormWars - when the data stream shrinks
StormWars - when the data stream shrinks
 
Real-Time Analytics with Apache Storm
Real-Time Analytics with Apache StormReal-Time Analytics with Apache Storm
Real-Time Analytics with Apache Storm
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
 
Real time and reliable processing with Apache Storm
Real time and reliable processing with Apache StormReal time and reliable processing with Apache Storm
Real time and reliable processing with Apache Storm
 
Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Storm
 
Kafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtimeKafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtime
 

Ähnlich wie Twitter Stream Processing

Storm - The Real-Time Layer Your Big Data's Been Missing
Storm - The Real-Time Layer Your Big Data's Been MissingStorm - The Real-Time Layer Your Big Data's Been Missing
Storm - The Real-Time Layer Your Big Data's Been MissingFullContact
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012Dan Lynn
 
Incremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation systemIncremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation systemIstvan Rath
 
Realtime Computation with Storm
Realtime Computation with StormRealtime Computation with Storm
Realtime Computation with Stormboorad
 
stream processing engine
stream processing enginestream processing engine
stream processing enginetiana528
 
Realtime processing with storm presentation
Realtime processing with storm presentationRealtime processing with storm presentation
Realtime processing with storm presentationGabriel Eisbruch
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureP. Taylor Goetz
 
Mobile Monday Athens launch, Konstantinos Papamiltiadis, Taptu
Mobile Monday Athens launch, Konstantinos Papamiltiadis, TaptuMobile Monday Athens launch, Konstantinos Papamiltiadis, Taptu
Mobile Monday Athens launch, Konstantinos Papamiltiadis, TaptuMobile Monday Athens
 
Realtime Computation with Storm
Realtime Computation with StormRealtime Computation with Storm
Realtime Computation with Stormboorad
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationnathanmarz
 
Streaming 101: Hello World
Streaming 101:  Hello WorldStreaming 101:  Hello World
Streaming 101: Hello WorldJosh Fischer
 
Stream Processing in the Cloud - Athens Kubernetes Meetup 16.07.2019
Stream Processing in the Cloud - Athens Kubernetes Meetup 16.07.2019Stream Processing in the Cloud - Athens Kubernetes Meetup 16.07.2019
Stream Processing in the Cloud - Athens Kubernetes Meetup 16.07.2019Rafał Leszko
 
Message Queues : A Primer - International PHP Conference Fall 2012
Message Queues : A Primer - International PHP Conference Fall 2012Message Queues : A Primer - International PHP Conference Fall 2012
Message Queues : A Primer - International PHP Conference Fall 2012Mike Willbanks
 
BWB Meetup: Storm - distributed realtime computation system
BWB Meetup: Storm - distributed realtime computation systemBWB Meetup: Storm - distributed realtime computation system
BWB Meetup: Storm - distributed realtime computation systemAndrii Gakhov
 

Ähnlich wie Twitter Stream Processing (20)

Twitter Big Data
Twitter Big DataTwitter Big Data
Twitter Big Data
 
Bigdata roundtable-storm
Bigdata roundtable-stormBigdata roundtable-storm
Bigdata roundtable-storm
 
Storm - The Real-Time Layer Your Big Data's Been Missing
Storm - The Real-Time Layer Your Big Data's Been MissingStorm - The Real-Time Layer Your Big Data's Been Missing
Storm - The Real-Time Layer Your Big Data's Been Missing
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012
 
Incremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation systemIncremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation system
 
Realtime Computation with Storm
Realtime Computation with StormRealtime Computation with Storm
Realtime Computation with Storm
 
stream processing engine
stream processing enginestream processing engine
stream processing engine
 
Storm
StormStorm
Storm
 
Realtime processing with storm presentation
Realtime processing with storm presentationRealtime processing with storm presentation
Realtime processing with storm presentation
 
Storm
StormStorm
Storm
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Mobile Monday Athens launch, Konstantinos Papamiltiadis, Taptu
Mobile Monday Athens launch, Konstantinos Papamiltiadis, TaptuMobile Monday Athens launch, Konstantinos Papamiltiadis, Taptu
Mobile Monday Athens launch, Konstantinos Papamiltiadis, Taptu
 
Realtime Computation with Storm
Realtime Computation with StormRealtime Computation with Storm
Realtime Computation with Storm
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
Streaming 101: Hello World
Streaming 101:  Hello WorldStreaming 101:  Hello World
Streaming 101: Hello World
 
Stream Processing in the Cloud - Athens Kubernetes Meetup 16.07.2019
Stream Processing in the Cloud - Athens Kubernetes Meetup 16.07.2019Stream Processing in the Cloud - Athens Kubernetes Meetup 16.07.2019
Stream Processing in the Cloud - Athens Kubernetes Meetup 16.07.2019
 
Message Queues : A Primer - International PHP Conference Fall 2012
Message Queues : A Primer - International PHP Conference Fall 2012Message Queues : A Primer - International PHP Conference Fall 2012
Message Queues : A Primer - International PHP Conference Fall 2012
 
Jan 2012 HUG: Storm
Jan 2012 HUG: StormJan 2012 HUG: Storm
Jan 2012 HUG: Storm
 
BWB Meetup: Storm - distributed realtime computation system
BWB Meetup: Storm - distributed realtime computation systemBWB Meetup: Storm - distributed realtime computation system
BWB Meetup: Storm - distributed realtime computation system
 

Kürzlich hochgeladen

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 

Kürzlich hochgeladen (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Twitter Stream Processing

  • 1. Twitter stream processing Colin Surprenant Big Data Montreal @colinsurprenant April 2012 Lead ninja
  • 2. Twitter - Spring 2012 ● 350 million tweets/day ● 140 million active users ● >1 million applications using API
  • 3. Daily Twitter @ Needium 50 000 000 processed tweets 5 000 opportunities 500 messages sent 100GB data
  • 4. Anatomy of a tweet What's in a tweet? timestamp user message avatar Anything else?
  • 5.
  • 6. How to get the tweets? ● Streaming API Subscribe to realtime feeds moving forward ● REST Search API Search request on past data (1 week)
  • 7. Streaming API public statuses from all users ● status/filter track/location/follow ○ 5000 follow user ids ○ 400 track keywords ○ 25 location boxes ○ rate limited ● status/sample ○ 1% of all public statuses (message id mod 100) ○ two status/sample streams will result in same data
  • 8. Streaming API per user streams ● User Streams ○ all data required to update a user's display ○ requires user's OAuth token ○ statuses from followings, direct messages, mentions ○ cannot open large number of user streams from same host ● Site Streams ○ multiplexing of multiple User Streams
  • 9. Streaming API Firehose need more/full data? only through partners ● gnip.com ● datasift.com ● filtering/tracking ● $$$ partial to full Firehose What's the catch? lots of it
  • 10. Search API ● REST API (http request/response) ○ search query ○ geocode (lat, long, radius) ○ result type (mixed/recent/popular) ○ since id ● max 100 rpp and 1500 results ● rate limited (~1 request/sec)
  • 11. Storm Distributed and fault-tolerant realtime computation https://github.com/nathanmarz/storm
  • 12. Storm The promise ● Guaranteed data processing ● Horizontal scalability ● Fault-tolerance ● No intermediate message brokers ● Higher level abstraction than message passing ● Just work
  • 13. RedStorm Storm: Java + Closure RedStorm: JRuby integration & DSL for Storm + Ruby + Storm on JVM https://github.com/colinsurprenant/redstorm
  • 14. Storm Typical use cases Stream Continuous Distributed processing computation RPC
  • 15. Storm Concepts Streams Tuple Tuple Tuple Tuple Tuple Unbounded sequence of tuples
  • 16. Storm Concepts Spouts Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Source of streams
  • 17. Storm Concepts Bolts Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Processes input streams and produce new streams
  • 18. Storm Concepts Topology Network of spouts and bolts
  • 19. Storm Concepts bolt A Shuffle bolt B All bolt A bolt B random+fair distribution replication to all tasks across tasks Grouping bolt A Fields bolt B bolt A Global bolt B field X field Y partition on specified field entire stream to single task
  • 20. Storm What Storm does ● Distributes code ● Robust process management ● Monitors topologies and reassigns failed tasks ● Provides reliability by tracking tuple trees ● Routing and partitioning of stream
  • 21. Tweitgeist Live top 10 trending hashtags on Twitter DEMO https://github.com/colinsurprenant/tweitgeist Live demo: http://tweitgeist.needium.com
  • 22. Tweitgeist Architecture Partitioned Partitioned counting ranking Parallel extraction Merging Partitioned Partitioned counting ranking Queue Queue N partitions N partitions Twitter stream Frontend
  • 23. Tweitgeist Architecture ● Parallel computation across partitions of the stream ● Scalable architecture ● Work for full Twitter firehose with very little mods
  • 24. Tweitgeist Storm Topology TwitterStream ExtractMessage ExtractHashtag RollingCount Rank Merge Spout Bolt Bolt Bolt Bolt Bolt Twitter field hashtag field hashtag UI global shuffle shuffle Redis Redis queue queue stream message hashtag rolling ranking merging reader extract extract counter Shuffle grouping: Tuples are randomly distributed across the bolt's tasks Fields grouping: The stream is partitioned by the fields specified in the grouping Global grouping: The entire stream goes to a single one of the bolt's tasks
  • 26. Tweitgeist Where Code https://github.com/colinsurprenant/tweitgeist Live demo http://tweitgeist.needium.com