6. Streams
• A stream is an unbounded sequence of tuples.
• Streams are defined with a schema that names
the fields in the stream’s tuples.
7. Spouts
• Spouts - a spout is a source of streams for a
given topology.
• It will read data from an external source and emit
them into the topology as tuples.
8. Bolts
• A bolt is the processing element in the topology.
• Bolts can do simple stream transformations like:
filtering, aggregations, functions, joins, etc.
9. Topologies
• A topology contains all the logic for the realtime
application.
• A topology is a graph of spouts and bolts that are
connected by stream groupings.
10. Tasks
• Each spout or bolt executes as many tasks
across the cluster.
• Each task corresponds to one thread of
execution.
• Stream groupings define how to send tuples from
one set of tasks to another set of tasks.
11. Stream Groupings
• A stream grouping defines for a given bolt which
streams it should receive as input.
• A stream grouping also defines how the stream’s
tuples are partitioned among the bolt tasks.
12. Shuffle Grouping
• Tuples are randomly distributed across the bolt's
tasks in a way such that each bolt is guaranteed
to get an equal number of tuples.
13. Fields Grouping
• The stream is partitioned by the fields specified in
the grouping.
• If the stream is grouped by the "user-id" field,
tuples with the same "user-id" will always go to
the same task.
14. Global Grouping
• The entire stream goes to a single one of the
bolt's tasks. Specifically, it goes to the task with
the lowest id.
15. Workers
• Topologies execute across one or more worker
processes.
• Each worker process is a physical JVM and
executes a subset of all the tasks for the
topology.
• If the combined parallelism of the topology is 300
and 50 workers are allocated, then each worker
will execute 6 tasks