The 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN13) talk slides. June 27th, 2013. Full text here: http://www.nicta.com.au/pub?doc=7031
Modelling and Analysing Operation Processes for Dependability
1. NICTA Copyright 2012 From imagination to impact
Xiwei Xu, Liming Zhu, Jim Li, Len Bass,
Qinghua Lu, Min Fu
Software Systems Research Group
NICTA (National ICT Australia)
DSN13, Budapest
Motivation
• Cloud applications fail due to operation issues
– Gartner report: 80% of outages are caused by people/process issues
• Sporadic activities: replication/failover, auto-scaling, upgrade…
– The concern is not that dependability issues trigger mitigating
operations, but the converse:
• dependability is affected, often unexpectedly, by these mitigating
activities and other sporadic activities
• Many examples in cloud outage reports
– Lessons from our own cloud DR product: Yuruware.com
• Complex interleaving “sporadic” processes/activities
– Scripts (e.g. Chef/Shell), tools, and humans
– Activities auto-triggered by policies, monitoring and analysis
– Logs/Events often lack the “process-context”
Our Process-Oriented Approach
• Rather than artifact-oriented and state-based
– Log analysis linking back to issues in source code
– Configuration analysis and constraint checking
– SPN-based system-level models
• We explicitly model an operation as a set of steps
– Executed by fault-prone agents (scripts/tools/human)
– Requiring various fault-prone resources (computing/nodes/environment)
– Faults at one step may surface later at another step
– Exception handling: error diagnosis, undo/redo, fixing, tolerating…
• Modelling language choices
– Little-JIL for process itself: Abstraction, recursion, resource modelling
– Impact of processes on system availability: SRN models [Cloud’13]
“Incorporating Uncertainty into in-Cloud Application Deployment Decisions for Availability”, IEEE Cloud 13
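The idea of an operation as a set of steps, each with a fault-prone agent, required resources, and an exception handler, can be sketched in a few lines. This is plain Python for illustration, not Little-JIL; the `Step` class, the `leaves` helper, and all step names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Step:
    name: str
    agent: str                                   # "script", "tool", or "human"
    resources: List[str] = field(default_factory=list)
    substeps: List["Step"] = field(default_factory=list)
    handler: Optional[str] = None                # e.g. "undo", "redo", "tolerate"


def leaves(step: Step) -> Iterator[Step]:
    """Yield the executable leaf steps in depth-first execution order."""
    if not step.substeps:
        yield step
    else:
        for sub in step.substeps:
            yield from leaves(sub)


# Hypothetical deployment operation, decomposed into three leaf steps.
deploy = Step("deploy cluster", agent="human", substeps=[
    Step("launch nodes", "script", ["cloud API"], handler="redo"),
    Step("install packages", "script", ["package repository"], handler="redo"),
    Step("start services", "tool", ["nodes"], handler="undo"),
])

for s in leaves(deploy):
    print(f"{s.name}: agent={s.agent}, needs={s.resources}, on error={s.handler}")
```

Nesting steps inside steps mirrors the abstraction and recursion named above; resources are kept as plain labels here, whereas a real process language would model them explicitly.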
HBase/Hadoop Deployment Process
Models and Analysis
• Scope and sources of models
– cloud application operations, consumer-perspective
– Runbooks, scripts (Chef/Shell), management tools (Cloudera/Whirr)
• Models are highly analysis/use-driven, e.g.:
– Error diagnosis and root cause analysis
• Models: group traceable events/logs into activities (steps)
– Use for Undo [HotDep’12] and recovery
• Models: checkpoint/undo or not, incremental deployment “validation”…
– Impact on system availability analysis
• Models: computing impact or component freeze
• Also analysable for operation “process quality”
– Time to completion, SPOF, probability of successful completion,
bottlenecks, undoability …
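Two of the process quality measures listed above, probability of successful completion and time to completion, can be estimated analytically for a simple sequential process. The sketch below assumes independent steps, a fixed per-attempt success probability, and a bounded number of redo attempts; every number and name in it is hypothetical.

```python
def step_success(p: float, attempts: int) -> float:
    """Probability that a step succeeds within the allowed attempts."""
    return 1 - (1 - p) ** attempts


def expected_attempts(p: float, attempts: int) -> float:
    """Expected number of attempts made (attempt i+1 happens only if the first i failed)."""
    return sum((1 - p) ** i for i in range(attempts))


def analyse(steps, attempts: int):
    """Return (probability of completing all steps, expected minutes spent).

    steps: list of (name, per-attempt success probability, minutes per attempt).
    The process aborts as soon as one step exhausts its attempts.
    """
    reach = 1.0      # probability of reaching the current step
    minutes = 0.0
    for _name, p, duration in steps:
        minutes += reach * duration * expected_attempts(p, attempts)
        reach *= step_success(p, attempts)
    return reach, minutes


# Hypothetical figures for a three-step deployment.
example = [("launch nodes", 0.95, 5.0),
           ("install packages", 0.90, 10.0),
           ("start services", 0.98, 3.0)]
prob, minutes = analyse(example, attempts=2)
print(f"P(success)={prob:.3f}, expected time={minutes:.1f} min")
```

Steps with a low `step_success` value or a large share of the expected time are candidate bottlenecks in this simplified view.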
“Automatic undo for cloud management via AI planning”, HotDep12
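Grouping traceable events and logs into activities, as mentioned for error diagnosis, could look roughly like the sketch below. The regular expressions, step names, and log lines are all made up for illustration and are not taken from real Hadoop or HBase logs.

```python
import re
from collections import defaultdict

# Hypothetical patterns tying log lines to process steps.
STEP_PATTERNS = {
    "launch nodes": re.compile(r"instance .* (launched|running)", re.I),
    "install packages": re.compile(r"(installing|installed) package", re.I),
    "start services": re.compile(r"(starting|started) (namenode|regionserver|hmaster)", re.I),
}


def group_by_step(lines):
    """Assign each log line to the first step whose pattern matches it."""
    groups = defaultdict(list)
    for line in lines:
        step = next((name for name, pat in STEP_PATTERNS.items() if pat.search(line)),
                    "unmatched")
        groups[step].append(line)
    return dict(groups)


# Made-up log lines.
sample = [
    "instance i-1 launched",
    "Installing package hbase",
    "Starting NameNode",
    "disk warning",
]
for step, lines in group_by_step(sample).items():
    print(step, "->", lines)
```

Lines that land in `unmatched` are the ones lacking process context; an error that appears under one step can then be traced back through the steps that ran before it.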
Preliminary Results and Observations
• HBase/Hadoop deployment and upgrade
– Empirical data from several HBase/Hadoop projects
– Fault injection using literature-based fault models
• Simulation observations
– Critical steps: error diagnosis (upper-layer software in the Hadoop
ecosystem), undo/redo time…
– Product improvement suggestions [LISA’13][RELENG’13]
• Feedback from a Hadoop team in a major bank
– Useful for estimating operation dependability
– Need to model pre/post conditions better around restart/redo
– Need to inform the choice between waiting and starting exception handling now
“Challenges to Error Diagnosis in Hadoop Ecosystems”, USENIX LISA13
“Eliciting operations requirements for applications”, RELENG Workshop at ICSE13
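A fault-injection simulation of the kind described above can be sketched as a small Monte Carlo loop. The sketch assumes that each step injects a fault with a fixed probability, that a fault may surface at a later step, and that surfacing costs a fixed diagnosis time plus a fault-free redo of the affected steps. Step names, durations, and probabilities are hypothetical, not the empirical data used in the study.

```python
import random

# (name, duration in minutes, fault probability, index of the step where the fault surfaces)
STEPS = [
    ("launch nodes", 5.0, 0.05, 0),
    ("configure HDFS", 10.0, 0.10, 2),   # assumed to surface only when HBase starts
    ("start HBase", 3.0, 0.05, 2),
]
DIAGNOSIS_MIN = 20.0                     # hypothetical fixed error-diagnosis time


def simulate_once(steps, rng):
    """Return the total minutes for one simulated run of the process."""
    surfacing = {}                       # step index -> origins of faults that surface there
    total = 0.0
    for i, (_name, duration, p_fault, surfaces_at) in enumerate(steps):
        total += duration
        if rng.random() < p_fault:
            surfacing.setdefault(surfaces_at, []).append(i)
        for origin in surfacing.pop(i, []):
            # Diagnose, then redo everything from the faulty step up to this one.
            total += DIAGNOSIS_MIN + sum(d for _, d, _, _ in steps[origin:i + 1])
    return total


def mean_time(steps, runs=10_000, seed=0):
    """Average completion time over many simulated runs."""
    rng = random.Random(seed)
    return sum(simulate_once(steps, rng) for _ in range(runs)) / runs


print(f"mean completion time: {mean_time(STEPS):.1f} min")
```

Varying `DIAGNOSIS_MIN` or a step's redo span shows how strongly diagnosis and undo/redo time drive the total, which is the kind of sensitivity the simulation observations point to.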
Conclusion and Future Work
• Process-oriented approach for dependable “operations”
– Focus on sporadic activities
– Model, analyse, and simulate an operation and its exception handling as a
process model
• Provide “process context” to error diagnosis and recovery
• Enable process quality analysis
• Future work
– Better runtime monitoring of operation progress
– Process mining and script analysis for generating process models
Liming.Zhu@nicta.com.au
Slides available at http://www.slideshare.net/LimingZhu/