Storm - pipes and filters on steroids

      Andre Sprenger


    BigData Roundtable
   Hamburg 30. Nov 2011
My background
•   info@andresprenger.de

•   Studied Computer Science and Economics

•   Background: banking, ecommerce, online advertising

•   Freelancer

•   Java, Scala, Ruby, Rails

•   Hadoop, Pig, Hive, Cassandra
“Next click” problem
Raymie Stata (CTO, Yahoo):

“With the paths that go through Hadoop [at Yahoo!], the
latency is about fifteen minutes. … [I]t will never be true
real-time. It will never be what we call “next click,” where
I click and by the time the page loads, the semantic
implication of my decision is reflected in the page.”
“Next click” problem
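[Diagram: a timeline with two layers. Web server: HTTP Request, HTTP Response, (next) HTTP Request, HTTP Response, with a max latency of 80 ms between each request and its response. Real time layer: collect data, then process data. The first response is labeled "realtime response", the response to the next request "near realtime response".]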
Example problems
•   Realtime statistics - counting, trends, moving average

•   Read Twitter stream and output images that are
    trending in the last 10 minutes

•   CTR calculation - read ad clicks/ad impressions and
    calculate new click through rate

•   ETL - transform format, filter duplicates / bot traffic,
    enrich from static data, persist

•   Search advertising
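
As a rough illustration of the CTR example above, the following Python sketch keeps click and impression events in a sliding time window and recomputes the click-through rate on every event. The class name `SlidingCtr`, the event format, and the window length are assumptions made for illustration; this is not Storm code, and it assumes events arrive in timestamp order.

```python
from collections import deque


class SlidingCtr:
    """Click-through rate over the last `window_seconds` of events (hypothetical helper)."""

    def __init__(self, window_seconds=600):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, kind) pairs, oldest first
        self.counts = {"click": 0, "impression": 0}

    def add(self, timestamp, kind):
        """Record one event ("click" or "impression") and return the updated CTR."""
        self.events.append((timestamp, kind))
        self.counts[kind] += 1
        # Evict events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_kind = self.events.popleft()
            self.counts[old_kind] -= 1
        return self.ctr()

    def ctr(self):
        """Clicks divided by impressions, or 0.0 when there are no impressions."""
        impressions = self.counts["impression"]
        return self.counts["click"] / impressions if impressions else 0.0
```

Each call to `add` does a constant amount of work per evicted event, which is the property a low-latency counting Bolt would need.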
Pick your framework...
•   S4 - Yahoo, “real time map reduce”, actor model

•   Storm - Twitter

•   MapReduce Online - Yahoo

•   Cloud Map Reduce - Accenture

•   HStreaming - Startup, based on Hadoop

•   Brisk - DataStax, Cassandra
System requirements
•   Fault tolerance - system keeps running when a node
    fails

•   Horizontal scalability - should be easy, just add a
    node

•   Low latency

•   Reliable - does not lose data

•   High availability - well, if it’s down for an hour it’s not
    realtime
Storm in a nutshell
•   Written by BackType (acquired by Twitter)

•   Open Source, GitHub

•   Runs on JVM

•   Clojure, Python, Zookeeper, ZeroMQ

•   Currently used by Twitter for real time statistics
Programming model
•   Tuple - name/value list

•   Stream - unbounded sequence of Tuples

•   Spout - source of Streams

•   Bolt - consumer / producer of Streams

•   Topology - network of Streams, Spouts and Bolts
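
The five terms above can be mimicked with plain Python generators. This is only a conceptual sketch, not the Storm API: the function names (`number_spout`, `even_filter_bolt`, `square_bolt`, `build_topology`) and the use of dicts as Tuples are assumptions chosen for illustration.

```python
def number_spout(limit):
    """Spout: a source of Tuples. A real Stream is unbounded; `limit` keeps the demo finite."""
    for n in range(limit):
        yield {"number": n}  # Tuple as a name/value mapping


def even_filter_bolt(stream):
    """Bolt: consumes a Stream and emits only Tuples holding an even number."""
    for tup in stream:
        if tup["number"] % 2 == 0:
            yield tup


def square_bolt(stream):
    """Bolt: consumes a Stream and emits a transformed Tuple."""
    for tup in stream:
        yield {"number": tup["number"], "square": tup["number"] ** 2}


def build_topology(limit):
    """Topology: the wiring of one Spout and two Bolts."""
    return square_bolt(even_filter_bolt(number_spout(limit)))


results = list(build_topology(6))
```

Because generators are lazy, each Tuple flows through the whole chain before the next one is produced, which loosely resembles stream processing rather than batch processing.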
Spout
[Diagram: two Spouts, each emitting a stream of tuples.]
Bolt
   Processes streams and generates new streams.



[Diagram: two input streams of tuples enter a Bolt, which emits one new stream of tuples.]
Bolt
•   filtering

•   transformation

•   split / aggregate streams

•   counting, statistics

•   read from / write to database
Topology
Network of Streams, Spouts and Bolts

[Diagram: two Spouts connected through Streams to a network of five Bolts.]
Task
Parallel processor inside Spouts and Bolts.
Each Spout / Bolt has a fixed number of Tasks.


[Diagram: a Spout containing three Tasks and a Bolt containing two Tasks.]
Stream grouping
Which Task does a Tuple go to?

•   shuffle grouping - distribute randomly

•   fields grouping - partition by field value

•   all grouping - send to all Tasks

•   custom grouping - implement your own logic
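
To make the question "which Task does a Tuple go to?" concrete, here is a small Python sketch of three of the groupings listed above. Each function returns the list of Task indices that should receive a Tuple. The function names and the choice of CRC32 as the partitioning hash are assumptions for illustration, not Storm's implementation.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Shuffle grouping: pick one Task at random."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, field):
    """Partition by field value: equal values always land on the same Task."""
    key = str(tup[field]).encode("utf-8")
    return [zlib.crc32(key) % num_tasks]


def all_grouping(tup, num_tasks):
    """All grouping: every Task receives a copy of the Tuple."""
    return list(range(num_tasks))
```

Partitioning by field value is what makes a counting Bolt correct when it runs as several Tasks: every occurrence of the same word reaches the same Task, so no two Tasks hold partial counts for it.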
Word count example

Spout → Sentence Splitter Bolt → Word Count Bolt

Spout emits:                   (“a b c a b d”)
Sentence Splitter Bolt emits:  (“a”) (“b”) (“c”) (“a”) (“b”) (“d”)
Word Count Bolt emits:         (“a”, 2) (“b”, 2) (“c”, 1) (“d”, 1)
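
The word count example can be reproduced outside Storm with three Python generators. This is a sketch under assumptions: the function names are invented here, Tuples are modeled as Python tuples, and the counting stage emits a running count for every incoming word (the slide only shows the final counts).

```python
from collections import Counter


def sentence_spout():
    """Spout: emits the single sentence used on the slide."""
    yield ("a b c a b d",)


def sentence_splitter_bolt(sentences):
    """Bolt: splits each sentence Tuple into one Tuple per word."""
    for (sentence,) in sentences:
        for word in sentence.split():
            yield (word,)


def word_count_bolt(words):
    """Bolt: keeps a running count and emits (word, count) after every word."""
    counts = Counter()
    for (word,) in words:
        counts[word] += 1
        yield (word, counts[word])


emitted = list(word_count_bolt(sentence_splitter_bolt(sentence_spout())))
final_counts = dict(emitted)  # the last emitted count per word wins
```

Here `final_counts` holds the last count seen for each word, which corresponds to the pairs shown at the output of the Word Count Bolt.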
Guaranteed processing
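[Diagram: tuple tree. The Spout tuple (“a b c a b d”) is the root; it branches into the word tuples (“a”) (“b”) (“c”) (“a”) (“b”) (“d”), which lead to the count tuples (“a”, 2) (“b”, 2) (“c”, 1) (“d”, 1).]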

Topology has a timeout for processing of the tuple tree
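
One way to picture the timeout on the tuple tree is a tracker that counts, per Spout tuple, how many tuples in its tree are still unprocessed, and reports trees that do not complete in time. The Python sketch below is a simplified illustration with invented names (`TupleTreeTracker`, `emit_root`, `anchor`, `ack`, `expire`); Storm's own bookkeeping works differently in detail.

```python
class TupleTreeTracker:
    """Tracks outstanding tuples per Spout tuple (hypothetical, simplified)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.pending = {}  # root_id -> [outstanding_count, start_time]

    def emit_root(self, root_id, now):
        """The Spout emitted a new tuple; its tree has one outstanding tuple."""
        self.pending[root_id] = [1, now]

    def anchor(self, root_id, children):
        """A Bolt emitted `children` tuples derived from this root.

        Must be called before the parent tuple is acked, otherwise the
        count could reach zero while work is still in flight.
        """
        if root_id in self.pending:
            self.pending[root_id][0] += children

    def ack(self, root_id):
        """One tuple of the tree was processed. Returns True once the whole tree is done."""
        entry = self.pending.get(root_id)
        if entry is None:
            return False
        entry[0] -= 1
        if entry[0] == 0:
            del self.pending[root_id]
            return True
        return False

    def expire(self, now):
        """Return the roots whose trees timed out; a Spout could replay these."""
        expired = [rid for rid, (_, start) in self.pending.items()
                   if now - start >= self.timeout_seconds]
        for rid in expired:
            del self.pending[rid]
        return expired
```

For the sentence on the slide, the splitter would call `anchor(root, 6)` for the six words and then `ack(root)` for the sentence itself; the tree completes only after all six word tuples are acked as well.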
Runtime view
Reliability
•   Nimbus / Supervisor are SPOF

•   both are stateless, easy to restart without data loss

•   Failure of master node (?)

•   Running Topologies should not be affected!

•   Failed Workers are restarted

•   Guaranteed message processing
Administration

•   Nimbus / Supervisor / Zookeeper need monitoring
    and process supervision (e.g. Monit)

•   Cluster nodes can be added at runtime

•   But: existing Topologies are not rebalanced (there is a
    ticket)

•   Administration web GUI
Community
•   Source is on GitHub - https://github.com/nathanmarz/storm.git

•   Wiki - https://github.com/nathanmarz/storm/wiki

•   Nice documentation

•   Google Group

•   People start to build add-ons: JRuby integration,
    adapters for JMS, AMQP
Storm summary
•   Nice programming model

•   Easy to deploy new topologies

•   Horizontal scalability

•   Low latency

•   Fault tolerance

•   Easy to set up on EC2
Questions?
