Understanding High Availability - Introducing the Theory and Concepts of High Availability


Version Number: 1.04

Status: Final

Author: Paul Moore, HA Infrastructure Architect

Date Published: 21 March 2012 (V1.0), 2 September 2012

File Name: Understanding High Availability v1.04.docx

Copyright: © 2012 Paul Moore, Astute Systems

License: Creative Commons Attribution 3.0 License


Acknowledgements:
Name                                  Contributions
Brenton Carbins, Socius in Veritas    Review, Footnote 4, The Resilient Enterprise
Debbie Moore, The Picky Proofreader   Review, Proofreading
iStockPhoto                           Cover Photo

Reviewers:
Role                       Name              Review Date
Infrastructure Architect   Paul Moore        21-Mar-2012
Infrastructure Architect   Brenton Carbins   16-Mar-2012

Reference Documents:
Title                      Author                               Version          Location
The Resilient Enterprise   Richard Barker, Veritas (Symantec)   Published 2002   Commercially Available

Contents
1 Introduction
2 Definition
3 Costs and Benefits of Availability
4 Prediction
5 Sufficient Understanding
6 A Systems Approach
7 High Availability Calculation
8 Determining Dependencies
9 Architectural Requirements
10 Logical Requirements
11 High Availability Assumptions
12 Architectural Decisions
13 Glossary

Figures
Figure 1: A Conceptual Graph of Availability versus Cost
Figure 2: An example of a Black Box system
Figure 3: An example of Black Box system recursion
Figure 4: System Availability Calculation
Figure 5: Sub-systems in an IT System



1 Introduction
To develop any high availability infrastructure it is essential to first understand what high
availability is and is not. This document attempts to communicate High Availability concepts
in a concise and efficient manner.

2 Definition
Disaster Recovery and High Availability are related, yet different concepts. They can be
summarised as follows:

Side note: HA focuses on minimising the chance of a service failure.

• High Availability is an approach to minimise the probability of a failure to provide
an operational service.

High Availability is the automatic continuation or resumption of service after a
predictable interruption.

Example: Disk mirroring continues to provide data in the event of a disk failure.
(It does not, however, guarantee that the highly available data is uncorrupted.)

• Disaster Recovery is an approach to restoring operational service after a failure
to provide it due to a predictable or non-predictable event.

Disaster Recovery is a system enabling the recovery of services after an
interruption due to events not mitigated by an HA system, or due to the failure of an
HA system.

Example: Backups enable recovery from a service failure due to a data loss.

3 Costs and Benefits of Availability
Side note: Higher availability increases complexity and cost at an accelerating rate.

As the level of service availability increases, the cost of providing it increases at an
accelerating rate (roughly exponentially, as the logarithmic cost axis of Figure 1 suggests)
due to the increasing architectural complexity and resource use, and ends with the
impossibility of providing any increased availability using currently known technology. As
a result, an appropriate balance must be achieved between the costs of implementing
availability and the costs of non-availability.

[Graph: Unit $ Cost on a logarithmic vertical axis (0.0001 to 10000) plotted against
Availability on the horizontal axis (90.000% to 99.998%).]
Figure 1: A Conceptual Graph of Availability versus Cost
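
To make the availability percentages in this section more tangible, the following minimal sketch converts an availability level into the downtime it permits per year. The function name `allowed_downtime_hours` and the 365-day year are assumptions made for illustration; they do not come from the original document.

```python
# Illustrative sketch only: converts an availability percentage into the
# downtime it permits per year. A 365-day year is assumed (leap years ignored).
HOURS_PER_YEAR = 365 * 24  # 8760 hours


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year permitted by an availability level."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # Each extra "nine" divides the permitted downtime by ten.
    for level in (90.0, 99.0, 99.9, 99.99):
        print(f"{level}% allows {allowed_downtime_hours(level):.2f} hours/year")
```

Under these assumptions, 99% availability permits about 87.6 hours of downtime per year and 99.9% about 8.76 hours, which illustrates why each further increment is harder and more costly to achieve.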



4 Prediction
Side note: A prediction of the statistical likelihood of future events, but availability
is a historic measure, so there are no guarantees.

The architecture of a high availability service requires an assessment and prediction of the
most likely and frequent causes of potential service interruption and a resultant design to
enable the service to continue operating when the predicted event occurs.

These assessments and predictions will invariably differ from the actual occurrence of events
observed during future service operation and as a result the actual performance can never
be guaranteed through the use of any particular architecture, design or implementation. The
actual future availability of the service will, by definition, be a historic statistical measurement
over a set period of time.

A high availability architecture seeks to provide higher functional service level by designing
systems capable of withstanding a range of conceivable failure scenarios, however a perfect
service will never be possible due to limitations imposed by hardware, software,
communications, policies, cost and the inherent limited ability to predict the likelihood of
future events and their consequences.
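
The idea that availability is a historic statistical measurement over a set period can be sketched as follows. This is an illustrative calculation only: the function name `measured_availability`, the representation of outages as (start, end) pairs and the assumption that outages do not overlap are all hypothetical choices, not part of the original document.

```python
from datetime import datetime, timedelta


def measured_availability(period_start, period_end, outages):
    """Return the fraction of the period during which the service was available.

    outages is an iterable of (start, end) datetime pairs. The pairs are
    assumed not to overlap one another (an assumption of this sketch).
    """
    period_seconds = (period_end - period_start).total_seconds()
    if period_seconds <= 0:
        raise ValueError("period_end must be later than period_start")
    downtime_seconds = 0.0
    for start, end in outages:
        # Count only the part of each outage that falls inside the period.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            downtime_seconds += (clipped_end - clipped_start).total_seconds()
    return 1.0 - downtime_seconds / period_seconds


if __name__ == "__main__":
    start = datetime(2012, 1, 1)
    end = start + timedelta(days=100)
    one_day_outage = [(start + timedelta(days=10), start + timedelta(days=11))]
    print(measured_availability(start, end, one_day_outage))
```

A single one-day outage in a 100-day period gives a measured availability of 99%, regardless of what the design predicted beforehand.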

5 Sufficient Understanding
Side note: The devil is always in the detail … so eliminate all “magic”.

To design a highly available system, a thorough understanding of its components is required
to the degree that all significant availability risks to the system are understood and managed.

British writer and scientist, Arthur C. Clarke, stated in his third law of prediction:

“Any sufficiently advanced technology is indistinguishable from magic.”

Adopting the above terminology, all magic[1] must be eliminated from the system through
enquiry and investigation.

Several tools can assist in gaining this understanding. Where a system contains complexity
and where there is a logical layering of component sub-systems, a systems approach is one
of the most useful. This approach is outlined in the following section.

[1] See “Magic” in the Glossary.



6 A Systems Approach
Side note: A ‘black box’ model.

In determining a system's level of availability it can be useful to implement a black-box[2]
approach. This maximises flexibility by enabling arbitrary boundaries to be drawn to best suit
any particular scenario, enforces a rigorous and disciplined focus on the functional
requirements of the system and eliminates consideration of unnecessary details which might
otherwise complicate the assessment.

This system approach and the types of information necessary to use this approach can be
best demonstrated using a simple example. The example system takes a two dimensional
shape of a particular colour as an input, changes any blue to green, changes any green to
red and changes any red to blue, duplicates the shape and vertically flips one of the shapes
around its centre of gravity and sends the result to the output.

[Diagram: the “2D Shape Transformer System” drawn as a single black box.
Input properties: 2D Shape, Colour.
Function: RGB Colour B→G, G→R, R→B; Duplicate Shape; Vertical Flip One Shape.
Output properties: 2D Shape, Colour.]
Figure 2: An example of a Black Box system

How the system implements its internal functions is unknown and need not be known
because all behaviour is fully defined. Consequently the black-box can be used without
internal investigation to ease analysis.

Side note: When must the ‘black box’ be opened?

Investigation of the internal working of the system is required in a number of circumstances,
including
• when the system input, output or function is not fully known,
• when the system behaviour must be validated,
• when the system must be assessed for potential failure vulnerabilities,

… with the last being most important when determining or validating system availability.

An investigation can be performed by breaking the original black box system into its various
functional components, with each of these in turn being considered as individual black box
sub-systems as shown in the diagram below.

[Diagram: the “2D Shape Transformer System” opened up into four black box sub-systems,
each with its own input and output: Colour Translator System, Object Cloning System,
Vertical Flipping System and Shape Merge System.
Overall input properties: 2D Shape, Colour.
Overall function: RGB Colour B→G, G→R, R→B; Duplicate Shape; Vertical Flip One Shape.
Overall output properties: 2D Shape, Colour.]
Figure 3: An example of Black Box system recursion

In the event that an investigation of one or more of the individual sub-systems is required, an
additional level of recursion[3] can be performed on each of them by using the same criteria
and method as used for the system as a whole.
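
The decomposition described in this section can be sketched in code, with each sub-system written as a small function whose internals are irrelevant to its callers. This is a hypothetical illustration: the `Shape` type, the function names, the order in which the sub-systems are chained, and the use of the mean vertex height as the centre of gravity are all assumptions, since the original specifies only the overall behaviour.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Shape:
    """Hypothetical representation: a colour name plus a tuple of (x, y) vertices."""
    colour: str
    points: Tuple[Point, ...]


# Colour Translator System: blue -> green, green -> red, red -> blue.
COLOUR_MAP = {"blue": "green", "green": "red", "red": "blue"}


def translate_colour(shape: Shape) -> Shape:
    # Colours outside the map are passed through unchanged (an assumption).
    return Shape(COLOUR_MAP.get(shape.colour, shape.colour), shape.points)


def clone(shape: Shape) -> Tuple[Shape, Shape]:
    """Object Cloning System: one shape in, two identical shapes out."""
    return shape, shape


def flip_vertically(shape: Shape) -> Shape:
    """Vertical Flipping System: mirror the shape about its centre height.

    The mean of the vertex y values stands in for the centre of gravity.
    """
    centre_y = sum(y for _, y in shape.points) / len(shape.points)
    return Shape(shape.colour, tuple((x, 2 * centre_y - y) for x, y in shape.points))


def merge(first: Shape, second: Shape) -> Tuple[Shape, Shape]:
    """Shape Merge System: combine both shapes into the single system output."""
    return first, second


def shape_transformer(shape: Shape) -> Tuple[Shape, Shape]:
    """The outer black box, expressed purely in terms of its sub-systems."""
    original, copy = clone(translate_colour(shape))
    return merge(original, flip_vertically(copy))
```

A caller of `shape_transformer` needs to know nothing about the four inner functions, and any one of them could itself be opened up and decomposed in the same way.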

[2] See “Black Box” in the Glossary.
[3] See “Recursion” in the Glossary.



7 High Availability Calculation
Having established the boundaries of various sub-systems using the systems approach
outlined in the previous section, it is now desirable to determine the availability properties of
the larger system.

Side note: Required sub-systems decrease availability.

The availability calculation for a system relies on a statistical treatment of the likelihood of
failures in sub-systems and an assessment of their direct and indirect consequences. Where
any single sub-system is required for system operation, the availability of the system cannot
be higher than the availability of that sub-system.

Side note: Redundant sub-systems increase availability.

Conversely, where any single sub-system has other redundant systems that allow it to fail
without causing the system to fail, the availability of the system is higher than it would be in
the case where the sub-system was non-redundant. These observations and the associated
availability equations can be seen in the diagram below.

“The availability of a system is the product of the availability of every serial
sub-system upon which that system depends, multiplied by the availability derived
from the product of the unavailability of each member of a group of redundant
parallel sub-systems where the system depends on the availability of that group.”

[Diagram: SYSTEM, made up of the following dependencies in series:
Sub-System 1 (Serial);
Sub-System 2 (Serial), which contains SS2 Component 5 in series with the parallel group
SS2 Component 1, SS2 Component 2, SS2 Component 3 and SS2 Component 4;
the parallel group Sub-System 3 and Sub-System 4;
Sub-System 5 (Serial), which contains SS5 Component 1 (Serial);
the parallel group Sub-System 6, Sub-System 7 and Sub-System 8.]

Avail_S = The availability of system “S” as a percentage of a defined time period.
Avail_S = Avail_SS1 * Avail_SS2 * Avail_SS(3,4) * Avail_SS5 * Avail_SS(6,7,8)
Where:
Avail_SS(3,4) = 1 – (1 – Avail_SS3) * (1 – Avail_SS4)
Avail_SS(6,7,8) = 1 – (1 – Avail_SS6) * (1 – Avail_SS7) * (1 – Avail_SS8)
Avail_SS2 = Avail_SS2C5 * Avail_SS2C(1,2,3,4)
Where:
Avail_SS2C(1,2,3,4) = 1 – (1 – Avail_SS2C1) * (1 – Avail_SS2C2) * (1 – Avail_SS2C3) * (1 – Avail_SS2C4)

Figure 4: System Availability Calculation

The above diagram demonstrates the availability calculation for a system by recursively
using the calculation formula for each black box sub-system.
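
The equations above can be expressed as a short calculation. The sketch below is illustrative only: the helper names `serial` and `parallel` and the sample availability values are assumptions, availabilities are given as fractions between 0 and 1 rather than percentages, and the formulas treat sub-system failures as statistically independent.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Availability of sub-systems that are all required: the product."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Availability of a redundant group: one minus the product of unavailabilities."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def system_availability(a: dict) -> float:
    """Apply the System Availability Calculation to a mapping of availabilities.

    Keys follow the names used in the equations (SS1, SS2C1, ... SS8).
    """
    ss2 = serial(a["SS2C5"], parallel(a["SS2C1"], a["SS2C2"], a["SS2C3"], a["SS2C4"]))
    return serial(
        a["SS1"],
        ss2,
        parallel(a["SS3"], a["SS4"]),
        a["SS5"],
        parallel(a["SS6"], a["SS7"], a["SS8"]),
    )


if __name__ == "__main__":
    # Hypothetical input: every sub-system and component is 99% available.
    names = ["SS1", "SS2C1", "SS2C2", "SS2C3", "SS2C4", "SS2C5",
             "SS3", "SS4", "SS5", "SS6", "SS7", "SS8"]
    sample = {name: 0.99 for name in names}
    print(f"System availability: {system_availability(sample):.6f}")
```

With every element at the same availability, the result is dominated by the three purely serial elements (Sub-System 1, SS2 Component 5 and Sub-System 5), because the redundant groups contribute almost no unavailability.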
Side note: Either working or failed, and any human intervention means failed.

For architectural purposes and in the context of information technology, a system is
considered to be in either a failed or working state, with the system in the failed state when
non-routine staff intervention is required. Nevertheless, from both an external service
availability and management perspective, the staff that intervene to repair the system in the
event of sub-system failure could be conceived to be part of the system.

In practice, High Availability design consists of determining optimal sub-system boundaries
that make both the understanding and implementation of a system as simple as possible
without compromising either the requirements or functionality.



8 Determining Dependencies
An IT system is composed of a number of sub-systems, most of which are essential to the
system function, and so should be considered as serial dependencies for high availability
architecture purposes. Due to the number of unavoidable serial dependencies in the IT
system, the availability of each sub-system must be maximised through the addition of
redundant components within each sub-system. These sub-systems are shown in the
diagram below.

Side note: An IT system can fail at any layer, at any time and for many different reasons.

[Diagram: a layered model of an IT system with axes labelled Increasing Dependency and
Increasing Abstraction, headed High Level Function.
Serial layers named in the diagram: Application, Sub Application, Core Application,
Data Storage, Operating System, Communications, Hardware, Electrical, Temperature,
Mechanical and Location.
Parallel sub-systems attached to the layers: Disaster Recovery, Monitoring, Security,
Support and Testing.
Influences: Financial, Technical, Business, Political and Legal.
Other labels: CAPACITY, OPERATIONAL, DOCUMENTATION, MANAGEMENT, MAINTENANCE,
PERFORMANCE, PREVENTION, REGULATORY, EXPENDITURE, EXTERNAL, INTERNAL, FUNCTION,
TRAINING, CONTRACTUAL, DETECTION and PLANNING.]
Figure 5: Sub-systems in an IT System

Side note: Well designed environments need less HA work at lower levels of the stack.

The most dependent layers of the model drive the requirements for those layers upon which
they depend. For example, if the application instance is able to use an alternate instance of a
core application, the availability of a specific instance of that core application is less critical.
Conversely, when the application instance cannot use an alternate instance of the core
application, that application instance logically cannot have a higher availability than that of
the core application instance upon which it depends, and consequently, that core application
instance is critical to the operation of the application instance.

Side note: Data storage is often critical due to the need for a single source of truth.

In many IT Systems the most critical component is the data storage sub-system, since there
is often a requirement for a single source of authoritative data upon which to operate. This
contrasts with other sub-systems such as location, electrical, communications and hardware
which can often be made redundant to form highly available sub-systems.
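
The effect of stacking many serial dependencies, and of adding redundancy within each layer, can be illustrated with a short calculation. This is a simplified sketch under stated assumptions: every layer is given the same availability, every layer has the same number of redundant instances, failures are independent, and the function name `stack_availability` is hypothetical.

```python
def stack_availability(layer_availability: float, layers: int, redundancy: int = 1) -> float:
    """Availability of a stack of serial layers, each built from redundant instances.

    layer_availability is a fraction between 0 and 1 for a single instance.
    redundancy is the number of parallel instances within each layer.
    """
    if not 0.0 <= layer_availability <= 1.0:
        raise ValueError("layer_availability must be between 0 and 1")
    if layers < 1 or redundancy < 1:
        raise ValueError("layers and redundancy must be at least 1")
    # A layer fails only if all of its redundant instances fail.
    layer = 1.0 - (1.0 - layer_availability) ** redundancy
    # The stack works only if every layer works.
    return layer ** layers


if __name__ == "__main__":
    for redundancy in (1, 2):
        result = stack_availability(0.999, layers=10, redundancy=redundancy)
        print(f"10 layers, redundancy {redundancy}: {result:.6f}")
```

For example, ten serial layers that are each 99.9% available give a stack that is only about 99.0% available, which is why redundancy within each layer matters.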



9 Architectural Requirements
• The solution must be analysed for single points of failure which must be addressed.
• The solution must be analysed for dependencies which must be addressed.
• An estimated availability for the solution must be determined to ensure that this
availability figure is consistent with the availability requirements.

10 Logical Requirements
Side note: A system requires documentation, training, availability of configuration and
configuration auditability.

As seen in Figure 5, the following logical requirements must be met in
order to provide a system capable of meeting the business requirements.
The system must be sufficiently documented so as to be supportable and maintainable.
There must be sufficient training available for support staff to be able to maintain the system
in a timely manner.
The availability of the system must be measurable for function and responsiveness. This will
require the retention of specific metrics.
Configuration details of system components must be available in a timely manner so that a
failure of a hardware system will not result in the loss of unique configuration information.
Configuration details of system components must be auditable so that erroneous
administrative configuration changes can be restored in a timely manner.

11 High Availability Assumptions
That a failure of a single system must be mitigated, and that the failure of multiple
systems will be considered to be a failure in the larger system.
That a system failure during the critical time window will require automated mitigation and
that there will be insufficient time for support staff to be notified, respond, analyse and
perform reliable mitigation to restore service.
That a failure of a system component can occur at any logical level of the IT solution and can
include human mistakes.
That the system will scale appropriately and that there is sufficient time during any required
time window for the systems to perform all required operations. (That is, the system
availability is not required to exceed 100%.)

12 Architectural Decisions
By parallelising sub-systems, no single sub-system instance represents a single
point of failure and the availability of the system as a whole is increased.
Parallelising sub-systems enables the performance of most maintenance activities
on individual sub-system instances without the system ceasing to function.
Decision: Implement the parallelisation of sub-systems where possible.

By distributing load between parallel sub-systems the throughput of the group of
parallel sub-system instances is higher than it would be for a single sub-system
instance.
Distributing the load between parallel sub-system instances leverages the
investment in hardware and software.
Decision: Distribute load between parallel sub-systems where possible.

By monitoring the responsiveness of parallel sub-systems, traffic can be directed to
responsive instances and away from unresponsive or failed ones.
When traffic is routed centrally, effective service delivery is maximised by minimising
the duration that traffic is routed to unresponsive or failed parallel sub-system
instances.
Decision: Monitor the responsiveness of parallel sub-systems where possible.
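
The two decisions above, distributing load and monitoring responsiveness, can be sketched together as a central router that hands out requests in turn but skips any instance that fails its responsiveness check. The `Router` class, its method names and the `is_responsive` callback are hypothetical; a real deployment would typically rely on a load balancer rather than code like this.

```python
from typing import Callable, List, Optional


class Router:
    """Round-robin selection across parallel instances, skipping unresponsive ones."""

    def __init__(self, instances: List[str], is_responsive: Callable[[str], bool]):
        if not instances:
            raise ValueError("at least one instance is required")
        self._instances = list(instances)
        self._is_responsive = is_responsive  # hypothetical health-check callback
        self._next = 0  # index of the instance to try first on the next request

    def choose(self) -> Optional[str]:
        """Return the next responsive instance, or None if none are responsive."""
        count = len(self._instances)
        for offset in range(count):
            index = (self._next + offset) % count
            candidate = self._instances[index]
            if self._is_responsive(candidate):
                # Resume after the chosen instance so that load is spread evenly.
                self._next = (index + 1) % count
                return candidate
        return None


if __name__ == "__main__":
    health = {"app-1": True, "app-2": False, "app-3": True}
    router = Router(list(health), lambda name: health[name])
    print([router.choose() for _ in range(4)])
```

Because the check runs on every selection, traffic stops flowing to a failed instance as soon as the check reports it, and resumes once the instance recovers.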

By using a clustered file system for all sub-system configurations, configuration files
can be more easily managed. In the event of the failure of a sub-system, the unique
sub-system instance's configuration files are less likely to be lost and are more
rapidly available when a new sub-system instance is deployed as part of a disaster
recovery plan.
The clustered configuration file system serves as a highly available single source of
critical configuration data which cannot be stored in the database. The clustered
configuration file system can also be used to enable rapid and automated recovery
for active/standby sub-systems that maintain state information outside of the
database.
Decision: Implement a shared file system across all sub-systems for configuration management.

By using a clustered file system for all sub-system configurations, configuration files
can be more easily managed[4]. As human configuration mistakes are a common
cause of IT system failure, managing configuration files in a simple, logical,
centralised, auditable and consistent manner is one way to increase availability by
decreasing the chances of mistakes and decreasing the time taken to recover from
them.
Decision: Implement a version control system for all sub-system configuration files.

As sub-systems may fail for unknown reasons, availability can be maximised by
restarting failed sub-system processes on the same or on alternate machines.
Cluster management software, such as Veritas Cluster Server, can automate the
execution of these pre-planned mitigation decisions.
The clustering software will provide a global view of the availability and status of all
services running on both primary and disaster recovery sites. An administrator must
be able to easily fail-over sub-systems or the entire system from the primary to the
disaster recovery site and back again.
The use of cluster management software is most critical for non-parallelisable
sub-systems upon which the entire system is dependent.
Decision: Implement cluster management software to automatically restart failed sub-systems.
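
The pre-planned mitigation described above, restarting a failed sub-system on the same machine and then on alternates, can be sketched as simple decision logic. This is a hypothetical illustration and not the interface of Veritas Cluster Server or of any other cluster product: the function `recover`, the `start_on` callback and the machine names are all assumptions.

```python
from typing import Callable, Optional, Sequence


def recover(
    start_on: Callable[[str], bool],
    machines: Sequence[str],
    attempts_per_machine: int = 2,
) -> Optional[str]:
    """Try to restart a failed sub-system and report where it ended up.

    machines lists the original machine first, followed by alternates in
    order of preference. start_on(machine) is a hypothetical callback that
    attempts a start and returns True on success.
    """
    if attempts_per_machine < 1:
        raise ValueError("attempts_per_machine must be at least 1")
    for machine in machines:
        for _ in range(attempts_per_machine):
            if start_on(machine):
                return machine
    # Every attempt failed: automated mitigation is exhausted at this point.
    return None


if __name__ == "__main__":
    def start(machine: str) -> bool:
        # Pretend the original machine is broken and the alternate is healthy.
        return machine == "node-b"

    print(recover(start, ["node-a", "node-b"]))
```

When the function returns None, the automated options have run out and the system is in the failed state that requires staff intervention.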
Inter-site data replication is necessary to provide a remote copy of data for disaster
recovery and high availability. This can be performed by a hardware solution or on a
file system level.
Decision: Implement inter-site data replication.

[4] This is always a cause of contentious ‘camps’ in architectural discussions. One position is that ‘running
configuration’ instances should use identical configuration files, while ‘individual configuration file’
proponents maintain that shared configuration files make upgrades more difficult. A possible compromise is
the use of snapshots and/or altered mount details only during upgrade procedures.



13 Glossary
Active-Standby, Active-Active, Hot-Warm, Hot-Hot
Hot/Active: Actively processing data. Warm/Standby: Processing capability on standby.
Active-Standby or Hot-Warm is defined as a model where the production application
instance or facility (Active or Hot) will provide operational services in a business as usual
state while a disaster recovery application instance or facility (Standby or Warm) is available
to take over service provision in the event of a failure in production.

Black Box
In science and engineering, a black box is a device, system or object which can be viewed
solely in terms of its input, output and transfer characteristics without any knowledge of its
internal workings, that is, its implementation is "opaque" (black). (Wikipedia)

Database Replication
Database replication can be used on many database management systems, usually with an
active-standby relationship between the original and the copies. The active logs the updates,
which then ripple through to the standby copies.
The standby acknowledges that it has received the update successfully, thus allowing the
sending (and potentially re-sending until successfully applied) of subsequent updates.
Database replication provides a higher level of reporting than log shipping; but does not lock
passive databases from user changes and so is unsuitable for failover. (Wikipedia)

Disaster Recovery
Disaster Recovery is a system enabling the recovery of services after an interruption due to
events not mitigated by a High Availability system, or due to High Availability system failure.

High Availability
High Availability is the automatic continuation or resumption of service after a predictable
interruption.

Log Shipping
Log shipping is the process of automating the backup of a database and transaction log files
on a primary database server, and then restoring them onto a standby server.
Similar to Database Replication, the primary purpose of log shipping is to increase database
availability by maintaining a backup server to quickly replace the primary server.
Log Shipping locks the standby database from user changes and is often chosen for its low
cost in human and server resources and ease of implementation. Failover between primary
and standby servers is manual and limited reporting capabilities are possible. (Wikipedia)

Magic
In the context of programming, Magic is an informal term for the use of code that handles
complex tasks while hiding that complexity to present a simple interface. (Wikipedia)
In computer system design, Magic is used as an informal term to describe gaps in
understanding the process of interaction between one system and another.

Oracle Streams
Oracle Streams is available on Enterprise Edition systems only and enables propagation of
information within and between Oracle and other databases. Oracle announced Streams
deprecation and now encourages usage of GoldenGate (acquired by Oracle in July 2009).

Recursion
Recursion is the process of repeating items in a self-similar way. For instance, when the
surfaces of two mirrors are exactly parallel with each other the nested images that occur are
a form of infinite recursion. The term has a variety of meanings specific to a variety of
disciplines ranging from linguistics to logic.
The most common application of recursion is in mathematics and computer science, in
which it refers to a method of defining functions in which the function being defined is
applied within its own definition.
Specifically this defines an infinite number of instances (function values), using a finite
expression that for some instances may refer to other instances, but in such a way that no
loop or infinite chain of references can occur. The term is also used more generally to
describe a process of repeating objects in a self-similar way. (Wikipedia)

The Resilient Enterprise
The Resilient Enterprise is a well-known reference book on high availability and disaster
recovery published by Veritas Software (now Symantec) in 2002.

Veritas Cluster Server
Veritas Cluster Server is high-availability cluster software for Unix, Linux and Microsoft
Windows computer systems, created by Veritas Software (now part of Symantec). It
provides application cluster capabilities to systems running databases, file sharing on a
network, electronic commerce websites or other applications.
Veritas Cluster Server is one of the few products in the industry that provides both high
availability and disaster recovery across all major operating systems while supporting 40+
major application / replication technologies out of the box.
Similar products include Fujitsu PRIMECLUSTER, IBM HACMP, HP Serviceguard, IBM
Tivoli System Automation for Multiplatforms, Linux-HA, Microsoft Cluster Server, NEC
ExpressCluster, Red Hat Cluster Suite, SteelEye LifeKeeper and Sun Cluster. (Wikipedia)

