2. Availability
Availability is about system failure and its consequences.
Faults & Failures:
Faults become failures if not corrected or masked.
A failure is observable by the system user; a fault is not.
Areas of concern:
Fault detection and frequency
Reduced operations
Recovery and Prevention
Availability = MTBF / (MTBF + MTTR)
where MTBF is the mean time between failures and MTTR is the mean time to repair.
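As a worked example of the availability ratio, the sketch below computes availability and the implied downtime per year. The function names and the sample figures (999 hours between failures, 1 hour to repair) are illustrative assumptions, not values from the course.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected downtime in a 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24


# Hypothetical figures: a failure every 999 hours, 1 hour to repair.
a = availability(mtbf_hours=999.0, mttr_hours=1.0)
print(f"availability = {a:.3f}")
print(f"downtime     = {downtime_hours_per_year(a):.2f} hours/year")
```

Shortening the repair time raises availability just as lengthening the time between failures does, which is why both detection and recovery tactics matter.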
Vakgroep Informatietechnologie – Onderzoeksgroep IBCN p. 2
4. Availability generic scenario (1/4)
Source of stimulus (who or what):
We differentiate between internal and external indications of faults or
failure since the desired system response may be different.
Stimulus (does something):
A fault of one of the following classes occurs.
Omission. A component fails to respond to an input.
Crash. The component repeatedly suffers omission faults.
Timing. A component responds but the response is early or late.
Bad response. A component responds with an incorrect value.
Artifact (to the system or part of it):
This specifies the resource that is required to be highly available:
Processor,
Communication channel,
Process,
Storage.
5. Availability generic scenario (2/4)
Environment (under certain conditions):
The state of the system affects the desired system response.
Normal mode: if this is the first fault observed, some degradation of
response time or function may be preferred
Degraded mode: if the system has already seen some faults it may
be desirable to shut it down totally.
Overload mode:
6. Availability generic scenario (3/4)
Response (how the system reacts):
The system should detect the event and do one or more of the following:
Record it
Notify appropriate parties, including the user and other systems
Disable sources of events that cause fault or failure according to defined rules
Be unavailable for a specified interval, where the interval depends on the criticality of the system
Continue to operate in normal or degraded mode
7. Availability generic scenario (4/4)
Response measure (how you can measure this):
Time interval when the system must be available
Availability time
Time interval in which system can be in degraded mode
Repair time
8. Availability Specific Scenario
“An unanticipated external message (DoS attack) is
received by a process during normal operation. The
process logs the receipt of the message, notifies the
operator and continues with no downtime.”
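The six parts of a scenario can be captured in a small record, which makes it easy to check that none of them has been left out. The sketch below is only one possible representation; the class name `QualityAttributeScenario`, its `describe` method and the wording of the field values are assumptions made for illustration, filled in from the specific scenario above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class QualityAttributeScenario:
    """The six parts of a quality attribute scenario."""
    source: str
    stimulus: str
    artifact: str
    environment: str
    response: str
    measure: str

    def describe(self) -> str:
        """Render the scenario as one 'PART: value' line per part."""
        return "\n".join(
            f"{f.name.upper()}: {getattr(self, f.name)}" for f in fields(self)
        )


# The specific availability scenario, split into its six parts.
dos_scenario = QualityAttributeScenario(
    source="external to the system",
    stimulus="unanticipated message (DoS attack)",
    artifact="process",
    environment="normal operation",
    response="log the message, notify the operator, continue to operate",
    measure="no downtime",
)
print(dos_scenario.describe())
```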
9. Case: Digital Signage – Public Transport
Availability QAS (quality attribute scenario):
SOURCE (who or what): a random event
STIMULUS (does something): causes a failure
ARTIFACT (to the system or part of it): to the communication system
ENVIRONMENT (under certain conditions): during normal operations
RESPONSE (how the system reacts): all displays must start showing scheduled arrival times for all buses
MEASURE (how you can measure this): within 30 seconds of failure detection
Q: What is the architectural impact of this requirement?
10. Availability Tactics
Fault -> Tactics to control availability -> Fault masked or repaired
Fault Detection
Echo
Heartbeat
Exceptions
Fault Recovery
Preparing for recovery
Accomplishing the recovery
Fault Prevention
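Of the detection tactics listed, heartbeat is the easiest to sketch: each component periodically reports that it is alive, and a monitor flags any component whose last report is older than a timeout. The class `HeartbeatMonitor`, the component names and the 3-second timeout below are hypothetical; the clock is injected so that the example runs without real waiting.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Flags components whose last heartbeat is older than timeout_s."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def beat(self, component: str) -> None:
        """Record a heartbeat from a component at the current time."""
        self._last_seen[component] = self._clock()

    def failed_components(self) -> List[str]:
        """Return the components that have been silent for too long."""
        now = self._clock()
        return sorted(name for name, seen in self._last_seen.items()
                      if now - seen > self.timeout_s)


# Simulated time, so that no real sleeping is needed.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: fake_now[0])
monitor.beat("display-1")
monitor.beat("display-2")
fake_now[0] = 2.0
monitor.beat("display-1")   # display-2 stays silent from here on
fake_now[0] = 4.0
print(monitor.failed_components())
```

The timeout trades detection speed against false alarms: a short timeout detects faults sooner but may flag a component that is merely slow under load.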
11. Fault Recovery Tactics (1/4)
Voting Tactic:
Processes running on redundant processors each take the
input, compute and report the results to the “vote-counter.”
The vote-counter can apply majority rules or a preferred component.
Voting corrects faulty operation of components, algorithms or
processors.
The more severe the consequences of failures the more stringent
the effort to ensure that the redundancy is independent.
– Separate processors, separate implementation teams, … dissimilar
platforms
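A minimal sketch of the vote-counter follows. `majority_vote` implements majority rules; `vote_with_preference` is one possible reading of the preferred-component rule (fall back to a designated component when there is no majority) and is an assumption, not a definition taken from the slides.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority, or None if there is none."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None


def vote_with_preference(results: Sequence[Hashable],
                         preferred_index: int) -> Hashable:
    """Majority rules; without a majority, trust the preferred component.

    Assumption: 'preferred component' is used here as a tie-breaker only.
    """
    winner = majority_vote(results)
    return winner if winner is not None else results[preferred_index]


# Three redundant processors report a result; one of them is faulty.
print(majority_vote([42, 42, 41]))
# No two components agree, so the preferred component (index 2) decides.
print(vote_with_preference([1, 2, 3], preferred_index=2))
```

Voting only masks a fault if the redundant components do not share it, which is why independent processors, teams and platforms are called for when failures are severe.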
12. Fault Recovery Tactics (2/4)
Active redundancy (hot restart):
All redundant components respond to events in parallel
Redundant components are synchronized at start; the first
response to return is used as the answer.
This covers some faults, since a faulty processor will be
slower to respond.
When a failure occurs the downtime is usually only
milliseconds (switching to another component).
Often used in client-server applications involving
back-end databases.
In high availability for LANs the redundancy may be
separate paths so that failure of a bridge or router is not
fatal. Note the synchronization demands here.
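Active redundancy can be sketched with a thread pool: every replica handles the same event in parallel and the first successful response is used. The function `first_response` and the three toy replicas are hypothetical; a real system would also have to keep replica state synchronized, which this sketch leaves out.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def first_response(replicas, event):
    """Send the event to all replicas in parallel; return the first successful answer."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    try:
        pending = {pool.submit(replica, event) for replica in replicas}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                if future.exception() is None:
                    return future.result()
        raise RuntimeError("all replicas failed")
    finally:
        # Do not block on slower replicas; their answers are simply ignored.
        pool.shutdown(wait=False)


def fast_replica(event):
    return f"handled {event}"


def slow_replica(event):
    time.sleep(0.2)  # simulates a degraded processor
    return f"handled {event} (slow)"


def crashed_replica(event):
    raise ConnectionError("replica down")


print(first_response([slow_replica, fast_replica, crashed_replica], "e1"))
```

The caller never waits for the slow or crashed replica, which is where the millisecond-scale switchover comes from.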
13. Fault Recovery Tactics (3/4)
Passive Redundancy:
One component responds to events and informs the standbys
of state updates.
Upon failure the system must:
Ensure that the backup is sufficiently fresh.
Restart points, checkpoints, log points ???
Remap the system to switch which system is the active
component.
Often used in control systems
Example: air traffic control
Chapter 6: Air Traffic Control: A Case Study in
Designing for High Availability
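The passive scheme can be sketched as follows: only the primary handles events, it pushes each state update to the standbys, and on failure the freshest standby is promoted. The classes `Replica` and `PassiveRedundancyGroup`, the sequence counter `last_applied` and the `online` flag are illustrative assumptions; real systems typically rely on checkpoints and logs rather than in-memory copies.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.state = {}
        self.last_applied = 0   # sequence number of the last state update seen
        self.online = True


class PassiveRedundancyGroup:
    def __init__(self, primary, standbys):
        self.primary = primary
        self.standbys = list(standbys)

    def handle(self, key, value):
        """Only the primary responds; it then informs the reachable standbys."""
        self.primary.last_applied += 1
        self.primary.state[key] = value
        for standby in self.standbys:
            if standby.online:
                standby.state[key] = value
                standby.last_applied = self.primary.last_applied

    def fail_over(self):
        """Promote the freshest reachable standby to primary."""
        candidates = [s for s in self.standbys if s.online]
        if not candidates:
            raise RuntimeError("no standby available")
        freshest = max(candidates, key=lambda s: s.last_applied)
        self.standbys.remove(freshest)
        self.primary = freshest
        return freshest


group = PassiveRedundancyGroup(Replica("p"), [Replica("s1"), Replica("s2")])
group.handle("x", 1)
group.standbys[1].online = False   # s2 misses the next update
group.handle("x", 2)
group.standbys[1].online = True    # s2 is back, but stale
new_primary = group.fail_over()
print(new_primary.name, new_primary.state)
```

Comparing `last_applied` is a stand-in for the freshness check: a stale standby must not take over without first being resynchronized.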
14. Fault Recovery Tactics (4/4)
Switchovers:
happen upon failure or periodically.
Synchronization:
is the responsibility of the primary component, which broadcasts
synchronization signals to the redundant components.
15. Fault Prevention Tactics
Removal from service
To perform some preventive actions, e.g.,
rebooting to prevent slow memory leaks from
causing problems
Transactions
The bundling of a sequence of steps so that
the whole bundle can be done, or undone, at once
Process monitor
Once a fault in a process is detected, the monitor
removes the process, re-instantiates it and
re-initializes its state
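The transactions tactic can be illustrated with Python's standard `sqlite3` module, where using the connection as a context manager commits the bundled steps on success and rolls them back if any step raises. The `account` table, the `transfer` and `balances` helpers and the amounts are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('a', 100), ('b', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Credit dst and debit src as one unit; return False if it was rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account ORDER BY name"))


ok = transfer(conn, "a", "b", 30)       # both steps succeed and are committed
failed = transfer(conn, "a", "b", 500)  # the debit violates the CHECK; the credit is undone
print(ok, failed, balances(conn))
```

In the failed transfer the credit to `b` has already been applied when the debit is rejected; the rollback is what keeps the half-finished bundle from becoming visible.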
16. Availability Tactics Hierarchy
Fault arrives -> Availability tactics -> Fault masked or repaired
Fault detection: ping/echo, heartbeat, exception
Recovery (preparation and repair): voting, active redundancy, passive redundancy, spare
Recovery (reintroduction): shadow, state resynchronization, rollback
Prevention: removal from service, transactions, process monitor
Editor's notes
Issues with ping/echo and heartbeat: they measure "are you alive?".
The functionality is simple, but:
1) Response time under high load?
2) Capacity of the ping server
3) Availability of the communication channel
Complexity: trade-off with performance (period, data content)