C-MR enables continuous execution of MapReduce workflows on data streams. It uses windows to subdivide each stream into finite batches and a pull-based model to schedule work. It provides a programming interface for defining MapReduce jobs on input/output streams and for coordinating workflows. The evaluation shows that C-MR can process streams with lower latency than batch systems by incrementally sharing computation across windows and by using hybrid scheduling policies that prioritize the oldest data first while also optimizing memory usage.
2. Problem
• Stream applications are often time-critical
• Enabling stream support for MapReduce
jobs
– Simple for the Map operations
– Hard for the Reduce operations
• Continuously executing MapReduce
workflows requires a great deal of
coordination
3. C-MR Workflow
• Windows: temporal subdivisions of a stream
described by
– size (the amount of the stream each window spans)
– slide (the interval between windows)
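To make size and slide concrete, here is a minimal Python sketch. It assumes integer timestamps starting at 0 and half-open windows [start, start + size); the function name `windows` and the sample events are hypothetical and not part of C-MR.

```python
def windows(events, size, slide):
    """Group (timestamp, value) pairs into windows [start, start + size).

    Assumption: timestamps are non-negative integers and the first
    window starts at 0. A new window opens every `slide` time units.
    """
    if not events:
        return []
    last = max(t for t, _ in events)
    result = []
    start = 0
    while start <= last:
        members = [v for t, v in events if start <= t < start + size]
        result.append((start, start + size, members))
        start += slide
    return result


events = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e")]
for window in windows(events, size=4, slide=2):
    print(window)
```

With size 4 and slide 2, consecutive windows overlap by two time units, so the same tuple can belong to more than one window.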
7. C-MR vs. MapReduce
• MapReduce computing nodes receive a set of
Map or Reduce tasks and each node must wait
for all other nodes to complete their tasks
before being allocated additional tasks.
• C-MR uses pull-based data acquisition allowing
computing nodes to execute any Map or
Reduce workload as they are able. Thus,
straggling nodes will not hinder the progress of
the other nodes if there is data available to
process elsewhere in the workflow.
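The pull-based idea can be illustrated with a small thread-based sketch. This is only an analogy under stated assumptions, not C-MR's scheduler: the shared queue, the `run_pull_based` function, and the per-node delays are all hypothetical.

```python
import queue
import threading
import time


def run_pull_based(tasks, delays):
    """Let each worker pull the next task whenever it is free.

    `delays` holds one simulated per-task processing time per worker,
    so a large delay models a straggling node.
    """
    work = queue.Queue()
    for task in tasks:
        work.put(task)
    done = []
    lock = threading.Lock()

    def worker(name, delay):
        while True:
            try:
                task = work.get_nowait()  # pull; never wait on other workers
            except queue.Empty:
                return
            time.sleep(delay)
            with lock:
                done.append((name, task))

    threads = [
        threading.Thread(target=worker, args=(f"node{i}", delay))
        for i, delay in enumerate(delays)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return done


# node1 is a straggler; node0 keeps pulling tasks instead of idling.
completed = run_pull_based(range(20), delays=[0.0, 0.05])
```

Because no worker is handed a fixed share up front, a slow node only delays the tasks it has pulled itself.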
9. Stream and Window Management
• The merged output streams are not
guaranteed to retain their original
orderings.
• Solution: Replicating window-bounding
punctuations
10. Stream and Window Management (cont.1)
A node consumes the punctuation from the sorted input
stream-buffer
11. Stream and Window Management (cont.2)
Replicate that punctuation to the other nodes
12. Stream and Window Management (cont.3)
After all replicas are received at the intermediate buffer,
collect data whose timestamps fall into the applicable
interval and materialize them as a window
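The three steps above can be sketched roughly as follows. This assumes each node forwards one replica of a window-bounding punctuation to a shared intermediate buffer; the class `IntermediateBuffer` and its methods are hypothetical names, not C-MR's interface.

```python
from collections import defaultdict


class IntermediateBuffer:
    """Collects possibly out-of-order data plus punctuation replicas."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.data = []                    # (timestamp, value) in arrival order
        self.replicas = defaultdict(set)  # (start, end) -> node ids seen so far

    def add_data(self, timestamp, value):
        self.data.append((timestamp, value))

    def add_punctuation(self, node_id, start, end):
        """Record one replica; return the window once all replicas arrived."""
        seen = self.replicas[(start, end)]
        seen.add(node_id)
        if len(seen) < self.num_nodes:
            return None  # some node may still emit data for this interval
        del self.replicas[(start, end)]
        # Materialize: select by timestamp and restore timestamp order.
        members = [(t, v) for t, v in self.data if start <= t < end]
        return sorted(members, key=lambda pair: pair[0])


buf = IntermediateBuffer(num_nodes=2)
buf.add_data(3, "c")
buf.add_data(1, "a")
buf.add_data(5, "x")
first = buf.add_punctuation(0, start=0, end=4)   # one replica still missing
second = buf.add_punctuation(1, start=0, end=4)  # window is materialized
```

Data is deliberately not evicted in this sketch, since overlapping windows may need the same tuples again.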
13. Operator Scheduling
• Scheduling framework
– Execute multiple policies simultaneously
– Transition between policies based on
resource availability
• Scheduling policies
20. Two Properties of Streams
• Unbounded
• Accessed sequentially
Hard to handle with a traditional DBMS
21. Query Operators
• Unbounded stateful operators
– maintain state with no upper bound in size
– may run out of memory
• Blocking operators
– read an entire input before emitting a
single output
– might never produce a result
• Either never use them, or
• Use them only after refactoring
22. Punctuations
• Mark the end of substreams
– allowing us to view an infinite stream as a
mixture of finite streams
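A minimal sketch of this idea, assuming a punctuation is represented by a sentinel object in the stream (the names `PUNCT` and `substreams` are hypothetical):

```python
PUNCT = object()  # hypothetical marker for "end of substream"


def substreams(stream):
    """Yield one finite list per punctuation seen in the stream."""
    batch = []
    for item in stream:
        if item is PUNCT:
            yield batch
            batch = []
        else:
            batch.append(item)
    # Items after the last punctuation are not emitted: their
    # substream has not been closed yet.


batches = list(substreams([1, 2, PUNCT, 3, PUNCT, 4]))
```

A blocking operator such as a sum can then run on each finite batch instead of waiting for the end of an unbounded input.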
Editor's notes
Repeatedly invoking a Phoenix++ MapReduce job over a stream results in many redundant computations (at both Map and Reduce operations). C-MR allows data to be processed only once by Map and the inclusion of the Combine operator significantly decreases redundant work performed at the Reduce operator.
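One way to picture the saving is a word count over overlapping windows. The sketch below assumes the stream is cut into slide-sized sub-windows ("panes") whose partial counts are combined per window; the pane layout and the function `windowed_counts` are illustrative assumptions, not necessarily how C-MR implements Combine.

```python
from collections import Counter


def windowed_counts(panes, panes_per_window):
    """Count items per window while touching each pane's data only once."""
    # Map + Combine: one partial count per pane, computed exactly once.
    partials = [Counter(pane) for pane in panes]
    results = []
    # Reduce: each window merges small partial counts, not raw tuples.
    for i in range(len(partials) - panes_per_window + 1):
        total = Counter()
        for partial in partials[i:i + panes_per_window]:
            total.update(partial)
        results.append(dict(total))
    return results


panes = [["a", "b"], ["a"], ["b", "b"]]
counts = windowed_counts(panes, panes_per_window=2)
```

Here the middle pane contributes to both windows, but its raw items are scanned only once.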
1. Data is often generated from a source that can potentially produce an unbounded stream. 2. A stream's contents can only be accessed sequentially. Traditional queries are composed of relational operators that assume a finite data source that can be accessed randomly.