Continuous dataflows complement scientific workflows
by allowing composition of real-time data ingest and analytics
pipelines to process data streams from pervasive sensors and
“always-on” scientific instruments. Such dataflows are mission-critical
applications that cannot suffer downtime, need to operate
consistently, and are long running, but may need to be updated to
fix bugs or add features. This poses the problem: how do we update
a continuous dataflow application with minimal disruption? In
this paper, we formalize different types of dataflow update models
for continuous dataflow applications, and identify the qualitative
and quantitative metrics to be considered when choosing an
update strategy. We propose five dataflow update strategies,
and analytically characterize their performance trade-offs. We
validate one of these consistent, low-latency update strategies
using the Floe dataflow engine for an eEngineering application
from the Smart Power Grid domain, and show its
5. • Mission-critical dataflows cannot suffer downtime
– How do we update continuous dataflow applications with minimal disruption?
• Evaluating dynamic updates:
– Performance impact
• Throughput, latency
– Consistency
• Data loss
• Reproducibility
6. • Formalize different types of dataflow update needs
• Identify qualitative and quantitative metrics to be considered when designing update strategies
• Introduce five different dataflow update strategies and analytically characterize their performance metrics
• Implement a consistent, low-latency update strategy in the Floe continuous dataflow engine and evaluate it against a simple update strategy for a motivating application from the Los Angeles power grid project
7. • A continuous dataflow τ(P, C) is a directed graph
– P: set of processors
– C: set of directed edges (channels) connecting processors
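The graph definition above can be sketched as a minimal data structure (illustrative only; the class and method names are hypothetical and not part of Floe's API):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Channel:
    # A directed edge carrying messages from one processor to another.
    src: str
    dst: str

@dataclass
class Dataflow:
    # Continuous dataflow tau(P, C): a set of processors P and channels C.
    processors: set = field(default_factory=set)
    channels: set = field(default_factory=set)

    def add_channel(self, src, dst):
        # A channel may only connect processors already in P.
        assert src in self.processors and dst in self.processors
        self.channels.add(Channel(src, dst))

# Example: a small pipeline P1 -> P2 -> P3
df = Dataflow(processors={"P1", "P2", "P3"})
df.add_channel("P1", "P2")
df.add_channel("P2", "P3")
```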
11. [Figure: dataflow graph P1–P6, with processor P2 updated in place to P2++]
• Updates to one or more processors and channels
– No change in the number of processors
– Channel connectivity changes, channel additions/removals
13. • Quantitative
– Refresh latency
– Lag latency
– Throughput
– Message loss
• Qualitative
– Consistency
– Interleaved vs. delineated
14. • Refresh latency
– Time between the update start and the first message emitted from the new dataflow component
• Lag latency
– Time between the update start and the time at which the last message from the old dataflow is emitted
• Throughput
– Drop in message throughput at update time
• Message loss
– Is there message loss? How many messages are lost?
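The quantitative metrics above can be computed from timestamps observed around an update; a minimal sketch (the function and its instrumentation arguments are hypothetical, not part of the paper's formalism):

```python
def update_metrics(update_start, first_new_emit, last_old_emit,
                   msgs_in, msgs_out):
    # Refresh latency: update start -> first message from the new dataflow.
    refresh_latency = first_new_emit - update_start
    # Lag latency: update start -> last message emitted by the old dataflow.
    lag_latency = last_old_emit - update_start
    # Message loss: messages accepted at the source but never emitted.
    message_loss = msgs_in - msgs_out
    return refresh_latency, lag_latency, message_loss

# e.g. update requested at t=100s; first new output at t=105s;
# last old output at t=102s; 1000 messages in, 990 out.
print(update_metrics(100.0, 105.0, 102.0, 1000, 990))  # (5.0, 2.0, 10)
```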
15. • Consistency
– Is each message consistently processed through a single version of the dataflow?
• Interleaved vs. delineated
– Let tf be the first message processed and emitted from τs+1 and tl be the last message processed and emitted from τs
– The update is delineated if tf > tl
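The delineation condition tf > tl can be checked directly, assuming we have the emit timestamps from both dataflow versions (a minimal sketch; the helper name is hypothetical):

```python
def is_delineated(old_emit_times, new_emit_times):
    # Delineated iff the first emit from the new version tau_{s+1} (tf)
    # comes strictly after the last emit from the old version tau_s (tl).
    if not old_emit_times or not new_emit_times:
        return True  # one side emitted nothing, so no interleaving
    tf = min(new_emit_times)
    tl = max(old_emit_times)
    return tf > tl

print(is_delineated([1, 2, 3], [4, 5]))  # True: tf=4 > tl=3
print(is_delineated([1, 2, 5], [4, 6]))  # False: tf=4 <= tl=5, interleaved
```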
17. [Figure: dataflow P1–P6 with the input stream paused at source P1]
• Pause the input stream, terminate the dataflow, deploy the new dataflow, resume the dataflow
– Consistent
– Delineated
– Lag latency = 0
– Refresh latency = deployment time + min(wave head time)
– Throughput = 0, starting at the update start time for a duration of the refresh latency
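The steps of this strategy could be sketched as follows (a hypothetical driver function, not Floe's actual API; the returned lost messages illustrate why the strategy is consistent and delineated but lossy):

```python
def pause_terminate_update(in_flight, deploy_new_flow):
    # 1. Pause the input stream (assumed done by the caller).
    # 2. Terminate the old dataflow immediately: all in-flight messages
    #    are discarded, so lag latency is 0 but those messages are lost.
    lost = list(in_flight)
    in_flight.clear()
    # 3. Deploy the new dataflow; refresh latency is the deployment time
    #    plus the time for the first wave of messages to traverse it.
    new_flow = deploy_new_flow()
    # 4. Resume the input stream against the new dataflow.
    return new_flow, lost

# Two messages were on the fly when the update was requested.
flow, lost = pause_terminate_update(["m7", "m8"], lambda: "tau_s+1")
print(flow, lost)  # tau_s+1 ['m7', 'm8']
```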
18. [Figure: dataflow P1–P6 with the input stream paused and in-flight messages flushed]
• Pause the input stream, flush the on-the-fly messages (TTLold), terminate the dataflow, deploy the new dataflow, resume the dataflow
– Consistent
– Delineated
– Refresh latency = deployment time + TTLold + min(wave head time)
– Lag latency = TTLold
– No message loss
– Throughput drops to 0
19. [Figure: dataflow P1–P6 updated in place while messages continue to flow]
• Perform in-place updates upon request
– Inconsistent
– Interleaved messages
– Low latencies (bounds are derived per update type)
20. Update current version
[Figure: dataflow P1–P6 with old and new versions of processor P4 deployed side by side]
• Tag messages at the sources with the dataflow version
• Message versions are used to route each message to the correct processor/channel/sub-graph
– Consistent
– Interleaved
21. • Extension of MVC
• Each message is tagged with the current path it took
• Dispatch messages to the new version either if they were processed through the new version or if they were processed only through components present in both the new and old versions of the dataflow
– Consistent
– Interleaved
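The dispatch rule above can be sketched as a predicate over the message's path tag (illustrative only; the helper name and processor IDs are hypothetical, not Floe's API):

```python
def dispatch_to_new(msg_path, new_components):
    # msg_path: ordered list of processor IDs the message has visited.
    # The message may continue in the new dataflow only if every processor
    # it has passed through also exists in the new dataflow, i.e. it has
    # seen only new-version components or components shared by both versions.
    return all(p in new_components for p in msg_path)

new = {"P1", "P2", "P3x", "P4", "P5", "P6"}  # P3 replaced by P3x
print(dispatch_to_new(["P1", "P2"], new))  # True: only shared components visited
print(dispatch_to_new(["P1", "P3"], new))  # False: P3 exists only in the old version
```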
22. • Implemented MVC in the Floe [1] continuous dataflow engine
• Compared MVC against the Naïve Consistent Lossy (NCL) update
• Used the message context as the carrier of the dataflow version
• Floe message structure: Key, Properties<K,V>, Payload
[1] https://github.com/usc-cloud/floe
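A sketch of the message layout above, with the dataflow version carried in the context/properties map (field and method names are illustrative, not Floe's actual classes):

```python
from dataclasses import dataclass, field

@dataclass
class FloeMessage:
    # Mirrors the slide's Key / Properties<K,V> / Payload layout.
    key: str
    properties: dict = field(default_factory=dict)
    payload: bytes = b""

    def tag_version(self, version):
        # The message context (properties map) carries the dataflow
        # version, so routers can pick the matching dataflow version.
        self.properties["dataflow.version"] = version

msg = FloeMessage(key="sensor-42", payload=b"reading")
msg.tag_version(7)
print(msg.properties["dataflow.version"])  # 7
```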
25. • Online updates to mission-critical continuous dataflows are an important problem space
• Formalized and analyzed
– Update models
– Evaluation metrics
– Update strategies and their trade-offs
• Empirically evaluated the MVC and NCL update strategies
The story behind this paper is somewhat interesting. It started last winter break, when I began implementing a dynamic update for our in-house continuous dataflow engine. After the winter break, I met with Yogesh and one of my colleagues in the lab who maintained Floe at the time, and we had a disagreement about the consistency provided by my implementation. In that discussion we realized that there can be different dynamic update types for distributed continuous dataflows, with different trade-offs. This paper is the result of looking into this problem in detail.
I will cover the motivation behind the need for dynamic updates to continuous dataflows, the different possible types of updates to dataflows, the metrics for evaluating dynamic dataflow updates, and a set of update strategies that offer different performance/quality trade-offs for different update types. I will then present empirical evaluation results for some of the update strategies, and finish the presentation with a conclusion.
Big data: high-volume, high-variety, high-velocity data assets. Initial efforts focused on batch processing systems, but cyber-physical systems, sensor networks, and social network streams need data stream processing systems. This is where continuous dataflows come into the picture.
Power grids are transforming into smart grids. Smart meters allow electricity consumption events to be transferred in near real time, enabling intelligent management: demand-response optimization to predict and forecast power grid demand and take corrective measures when demand exceeds supply. USC acts as a micro-grid test bed to evaluate forecasting models and curtailment techniques, processing data streams from over 100 buildings and 50k sensors that measure power usage, equipment status, ambient temperature, etc. Continuous dataflows read the data, parse it, extract and validate readings, annotate the data, and insert it into an RDF store used by the smart grid web portal; the parsed data is also directed to an analytics model that does energy forecasting and triggers corrective actions.
These dataflows cannot suffer downtime, yet they need to be updated to improve them or fix issues; for example, the parsing/annotating logic needs to change when sensor streams change or get updated. Users need: no message loss, ordering between new and old messages, and reproducibility of data (an update should not affect the reproducibility property). Some applications might accept delay (e.g., the web portal) while others (e.g., facility management) might not. This motivates on-the-fly updates versus shutting down the dataflow and restarting it.
I will not bore you by going through all the formalism and theoretical bounds. Rather, I will go through examples and give an intuition that will be useful when you are reading the paper.
A continuous dataflow is a directed graph with processor nodes and channel edges. Processors do the data processing, while channels connect the processors and carry data between them.
We define four types of updates that can be performed on a continuous dataflow.
A processor update is defined as an update to one or more processors without changing the number of processors, the number of channels, or the channel connectivity.
A channel update changes either the number of channels in the dataflow or the channel connectivity.
A composite update is a combination of the previous update types: update one or more processors and channels, with no change in the number of processors.
In a connected sub-graph update, the update replaces a connected sub-graph of the dataflow with another connected sub-graph.
We identify and formalize different quantitative and qualitative metrics that can be used to evaluate dynamic dataflow updates. We define refresh latency, lag latency, throughput, and message loss as quantitative metrics, and consistency and interleaved vs. delineated as qualitative metrics.
Refresh latency: how fast do we see the effect of the update? Lag latency: how long does it take to flush messages from the old dataflow? Throughput: is there an impact on throughput? Message loss: does this update strategy cause message loss?
Consistency: is the data reproducible? Delineated: can we draw a line between new messages and old messages such that no old message is emitted from the system after we see the first new message?
We introduce five different dataflow update strategies that can be implemented with different trade-offs, and we analyze them against the evaluation metrics.
Pause the input stream, terminate the dataflow immediately, deploy the new dataflow, and then resume the dataflow. Consistent: we terminate the old dataflow immediately upon the request, and the messages being processed in the dataflow are lost, so every message is processed entirely by either the old version or the new version of the dataflow. Delineated: since we terminated the dataflow along with the on-the-fly messages, no old messages are emitted from the dataflow after the deployment of the new dataflow. Lag latency is zero since we terminate the dataflow immediately after the update request. Refresh latency