
Learning Stream Processing with Apache Storm


Over the last couple of years, Apache Storm has become a de facto standard for developing real-time analytics and complex event processing applications. Storm lets you tackle real-time data processing challenges the same way Hadoop enables batch processing of Big Data: it gives companies "Fast Data" alongside "Big Data". Typical use cases include fraud detection, operational intelligence, machine learning, ETL, and analytics.

In this meetup, Eugene Dvorkin, Architect at WebMD and NYC Storm User Group organizer, will teach Apache Storm and stream processing fundamentals. While this meeting is geared toward new Storm users, experienced users may find something interesting as well.

The following topics will be covered:

• Why use Apache Storm?

• Common use cases

• Storm Architecture - components, concepts, topology

• Building a simple Storm topology with Java and Groovy

• Trident and micro-batch processing

• Fault tolerance and guaranteed message delivery

• Running and monitoring Storm in production

• Kafka

• Storm at WebMD

• Resources



Learning Stream Processing with Apache Storm

  1. (title slide)
  2. CONTACT ME @edvorkin
  3. (agenda)
  4. (section divider)
  5. 5. real-time medical news from curated Twitter feed
  6. Every second, on average, around 6,000 tweets are tweeted on Twitter, which corresponds to over 350,000 tweets per minute and 500 million tweets per day. 350,000 × 1% = 3,500
  7. • How to scale • How to deal with failures • What to do with failed messages • A lot of infrastructure concerns • Complexity • Tedious coding *Image credit: Nathan Marz, slideshare: storm
  8. 8. Inherently BATCH-Oriented System
  9. • Exponential rise in real-time data • New business opportunity • Economics of OSS and commodity hardware. Stream processing has emerged as a key use case* *Source: Discover HDP 2.1: Apache Storm for Stream Data Processing. Hortonworks, 2014
  10. • Detecting fraud while someone is swiping a credit card • Placing an ad on a website while someone is reading a specific article • Alerts on application and machine failures • Using stream processing in a batch-oriented fashion
  11. (section divider)
  12. (section divider)
  13. Created by Nathan Marz. Acquired by Twitter. Open sourced; Apache Incubator project; now a top-level Apache project. Part of the Hortonworks HDP2 platform.
  14. 14. Most mature, widely adopted framework Source: http://storm.incubator.apache.org/
  15. Process endless streams of data: 1M+ messages/sec on a 10-15 node cluster
  16. Guaranteed message processing
  17. Tuples, Streams, Spouts, Bolts and Topologies
  18. TUPLE — Storm data type: immutable list of key/value pairs of any data type. Example: word: "Hello", count: 25, frequency: 0.25
  19. STREAM — unbounded sequence of tuples between nodes
  20. SPOUT — the source of the stream
  21. Spout responsibilities: read from streams of data – queues, web logs, API calls, databases
  22. BOLT ⚡
  23. • Process tuples and perform actions: calculations, API calls, DB calls • Produce new output streams based on computations. Bolt ⚡ F(x)
  24. • A topology is a network of spouts and bolts • Defines data flow
  25. • May have multiple spouts
  26. • Each spout and bolt may have many instances that perform all the processing in parallel
  27. How tuples are sent between instances of spouts and bolts: shuffle grouping – random distribution; fields grouping – routes tuples to a bolt based on the value of a field, so the same values always route to the same bolt; all grouping – replicates the tuple stream across all the bolt tasks, each task receives a copy; global grouping – routes all tuples in the stream to a single task, should be used with caution
  28. (section divider)
  29. Gradle: compile 'org.apache.storm:storm-core:0.9.2' / Maven: <dependency> <groupId>org.apache.storm</groupId> <artifactId>storm-core</artifactId> <version>0.9.2</version> </dependency>
  30. "Two households, both alike in dignity" → the sentence is split into words, each word emitted with count 1 (Two 1, Households 1, Both 1, Alike 1, In 1, Dignity 1); final counts after aggregation: Two 20, Households 24, Both 22, Alike 1, In 1, Dignity 10
  31. 31. Data Source
  32. 32. SplitSentenceBolt Resource initialization
  33. 33. WordCountBolt
  34. 34. PrinterBolt
  35. 35. Linking it all together
  36. How to scale stream processing
  37. Storm main components: Nodes – machines in a Storm cluster; Workers – JVM processes running on a node, one or more per node; Executors – Java threads running within a worker JVM process; Tasks – instances of spouts and bolts
  38. (diagram)
  39. (diagram)
  40. How tuples are sent between instances of spouts and bolts
  41. (section divider)
  42. 42. Tuple tree Reliable vs unreliable topologies
  43. 43. Methods from ISpout interface
  44. 44. Reliability in Bolts Anchoring Ack Fail
  45. Unit testing Storm components
  46. 46. BDD style of testing
  47. 47. Extending OutputCollector
  48. 48. Extending OutputCollector
  49. (section divider)
  50. Physical view
  51. Deploying a topology to a cluster: storm jar wordcount-1.0.jar com.demo.storm.WordCountTopology word-count-topology
  52. Monitoring and performance tuning
  53. (section divider)
  54. 54. Run under supervision: Monit, supervisord
  55. Nimbus moves work to another node
  56. 56. Supervisor will restart worker
  57. Micro-Batch Stream Processing
  58. Trident: functions, filters, aggregations, joins, grouping. Ordered batches of tuples; batches can be partitioned. Similar to Pig or Cascading. Transactional spouts. Trident has a first-class abstraction for reading from and writing to stateful sources.
  59. Stream processed in small batches • Each batch has a unique ID which is always the same on each replay • If one tuple fails, the whole batch is reprocessed • Higher throughput than core Storm, but higher latency as well
  60. How does Trident provide exactly-once semantics?
  61. Store the count along with the batch ID: count 100 at batchId 1 → count 110 at batchId 2 (10 more tuples with batchId 2). On failure, batch 2 is replayed with the same batchId (2). • The spout should replay a batch exactly as it was played before • The Trident API hides the complexity of dealing with batchIds
  62. 62. Word count with trident
  63. 63. Word count with Trident
  64. 64. Word count with Trident
  65. Style of computation
  66. By styles of computation
  67. (section divider)
  68. Enhancing the Twitter feed with lead image and title • Readability enhancements • Image scaling • Remove duplicates • Custom business logic
  69. 69. Writing twitter spout
  70. 70. Status
  71. Use the Twitter4J Java library
  72. Use an existing spout from the storm-contrib project on GitHub. Spouts exist for: Twitter, Kafka, JMS, RabbitMQ, Amazon SQS, Kinesis, MongoDB…
  73. 73. •Storm takes care of scalability and fault-tolerance •What happens if there is burst in traffic?
  74. Introducing a queuing layer with Kafka
  75. (diagram)
  76. 76. Solr Indexing
  77. 77. Processing Groovy Rules (DSL) on a scale in real-time
  78. (section divider)
  79. 79. Statsd and Storm Metrics API http://www.michael-noll.com/blog/2013/11/06/sending-metrics-from-storm-to-graphite/
  80. 80. •Use cache if you can: for example Google Guava caching utilities •In memory DB •Tick tuples (for batch updates)
  81. Trident-ML: • Linear classification (Perceptron, Passive-Aggressive, Winnow, AROW) • Linear regression (Perceptron, Passive-Aggressive) • Clustering (KMeans) • Feature scaling (standardization, normalization) • Text feature extraction • Stream statistics (mean, variance) • Pre-trained Twitter sentiment classifier
  82. http://www.michael-noll.com http://www.bigdata-cookbook.com/post/72320512609/storm-metrics-how-to http://svendvanderveken.wordpress.com/
  83. 83. edvorkin/Storm_Demo_Spring2GX
  84. 84. Go ahead. Ask away.
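The tuple described on slide 18 is just an immutable list of named values. A minimal plain-Java sketch of the idea (no Storm dependency; `TupleSketch` and its varargs constructor are hypothetical illustrations, not Storm's API):

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a Storm-style tuple: an immutable list of
// key/value pairs of any data type, looked up by field name.
public class TupleSketch {
    private final Map<String, Object> fields;

    // Takes alternating field-name / value arguments,
    // e.g. new TupleSketch("word", "Hello", "count", 25)
    public TupleSketch(Object... fieldValuePairs) {
        Map<String, Object> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < fieldValuePairs.length; i += 2) {
            m.put((String) fieldValuePairs[i], fieldValuePairs[i + 1]);
        }
        // Wrap so the tuple cannot be mutated after creation.
        this.fields = Collections.unmodifiableMap(m);
    }

    public Object getValueByField(String name) {
        return fields.get(name);
    }
}
```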
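The fields grouping on slide 27 can be illustrated with a small routing sketch: hashing the grouping field modulo the number of bolt tasks means the same value always lands on the same task. This is illustrative only, not Storm's exact internal hashing:

```java
// Sketch of fields-grouping routing: the hash of the grouping field,
// modulo the number of bolt tasks, picks the destination task, so
// tuples with the same field value always route to the same task.
public class FieldsGroupingSketch {
    private final int numTasks;

    public FieldsGroupingSketch(int numTasks) {
        this.numTasks = numTasks;
    }

    public int taskFor(Object fieldValue) {
        // floorMod keeps the result non-negative for negative hashes
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }
}
```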
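The word-count flow on slides 30-35 (SplitSentenceBolt feeding WordCountBolt) can be modeled in plain Java without Storm; the class below is a hypothetical stand-in for the topology's data flow, not the deck's actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the word-count topology's data flow:
// a "split sentence" step feeds a running word count, mirroring
// SplitSentenceBolt -> WordCountBolt.
public class WordCountSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    // Models SplitSentenceBolt: one sentence tuple in,
    // one (word, 1) tuple per word out, folded into the count.
    public void processSentence(String sentence) {
        for (String word : sentence.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum); // models WordCountBolt
            }
        }
    }

    public int countOf(String word) {
        return counts.getOrDefault(word, 0);
    }
}
```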
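The anchoring/ack/fail contract from slides 42-44 can be sketched as a spout-side pending map keyed by message id: acked tuples are dropped, failed ones remain available for replay. A hypothetical sketch, not Storm's implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of spout-side reliability: emitted tuples are tracked by
// message id until acked; a fail leaves the message available for replay.
public class ReliableSpoutSketch {
    private final Map<Long, String> pending = new HashMap<>();

    public void emit(long msgId, String payload) {
        pending.put(msgId, payload); // tracked until acked or failed
    }

    public void ack(long msgId) {
        pending.remove(msgId); // tuple tree fully processed
    }

    public String fail(long msgId) {
        return pending.get(msgId); // candidate for replay
    }

    public int pendingCount() {
        return pending.size();
    }
}
```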
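Slide 61's exactly-once trick (storing the count together with the batch id so a replayed batch is applied only once) can be sketched as follows; `BatchCountState` is a hypothetical stand-in for a Trident state, not Trident's API:

```java
// Sketch of Trident-style exactly-once counting: the state stores the
// count together with the id of the last applied batch, so replaying
// a batch with the same id is idempotent.
public class BatchCountState {
    private long count = 0;
    private long lastBatchId = -1;

    public void applyBatch(long batchId, int tupleCount) {
        if (batchId == lastBatchId) {
            return; // replay of an already-applied batch: skip it
        }
        count += tupleCount;
        lastBatchId = batchId;
    }

    public long getCount() {
        return count;
    }
}
```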
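The tick-tuple tip on slide 80 amounts to buffering updates per tuple and flushing the whole batch when a periodic tick arrives, instead of hitting the database on every tuple. A hypothetical sketch of the pattern:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the tick-tuple batching pattern: buffer per-tuple updates
// and flush them all at once when a periodic tick tuple arrives.
public class TickBatchSketch {
    private final List<String> buffer = new ArrayList<>();
    private int flushes = 0;

    public void onTuple(String update) {
        buffer.add(update); // cheap in-memory buffering per tuple
    }

    // Called when a tick tuple arrives; returns how many updates
    // were flushed (e.g. in one bulk DB write).
    public int onTick() {
        int flushed = buffer.size();
        buffer.clear();
        flushes++;
        return flushed;
    }

    public int getFlushes() {
        return flushes;
    }
}
```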

Editor's notes

  • Recently, we at WebMD had to create an application that processes data from Twitter
  • infrastructure
  • Infrastructure investment
    Administration cost
    Steep learning curve
    Huge ecosystem: pig, hive, ambari, cascading, flume ….
  • Social media sentiment, machine sensors, internet of things, interconnected devices, logs, clickstream
    CEP and stream-processing solutions existed before, but were very costly
  • Pause
  • Ready for the enterprise – not only for Twitter or LinkedIn
  • Pause
    Meaning – fault tolerant
  • Workers, spout, slow down on basics
  • A bolt processes any number of input streams and produces any number of new output streams.
    Most of the logic of a computation goes into bolts, such as functions, filters, streaming joins, streaming aggregations, talking to databases, and so on.
  • A bolt processes any number of input streams and produces any number of new output streams. Most of the logic of a computation goes into bolts, such as functions, filters, streaming joins, streaming aggregations, talking to databases, and so on.
  • pause
    A topology is a network of spouts and bolts, with each edge in the network representing a bolt subscribing to the output stream of some other spout or bolt.
    DAG
  • pause
    A topology is a network of spouts and bolts, with each edge in the network representing a bolt subscribing to the output stream of some other spout or bolt.
    DAG
  • pause
    A topology is a network of spouts and bolts, with each edge in the network representing a bolt subscribing to the output stream of some other spout or bolt.
    DAG
  • pause
  • Like driver in Hadoop
  • pause
  • pause
  • Storm considers a tuple coming off a spout fully processed when every message in the tree has been processed. A tuple is considered failed when its tree of messages fails to be fully processed within a configurable timeout. The default is 30 seconds.
  • Storm considers a tuple coming off a spout fully processed when every message in the tree has been processed. A tuple is considered failed when its tree of messages fails to be fully processed within a configurable timeout. The default is 30 seconds.
    When emitting a tuple, the spout provides a "message id" that will be used to identify the tuple later
  • Storm considers a tuple coming off a spout fully processed when every message in the tree has been processed. A tuple is considered failed when its tree of messages fails to be fully processed within a configurable timeout. The default is 30 seconds.

    Link between incoming and derived tuple.

  • Master and worker nodes
    Nimbus – similar to the JobTracker in Hadoop
    Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures
    Each worker node runs a daemon called the "Supervisor". The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it.
    All coordination between Nimbus and the Supervisors is done through a ZooKeeper cluster.
  • Master and worker node
    Nimbus- responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures
    Each worker node runs a daemon called the "Supervisor". The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it.
    All coordination between Nimbus and the Supervisors is done through a Zookeeper cluster.
  • Capacity – percentage of time a bolt was busy executing a particular task
  • Processing will continue, but topology lifecycle operations and the reassignment facility are lost.
    Run under system supervision
  • Trident topologies get converted into Storm topologies with spouts/tuples
  • Higher throughput than core Storm, but higher latency as well
  • The spout should replay a batch exactly as it was played before
    Kafka spout
    The Trident API hides the complexity of dealing with batchIds
  • Java fluent API
    Write functions or filters instead of bolts
  • Fire and forget
  • A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients
  • Same code, just different topologies and original sources
    Lambda architecture
  • Groovy Script engine
