3. High Availability is usually achieved through a redundant set-up of each
component, such that any single point of failure is avoided. Special
attention needs to be paid to the x-shaped inter-connectivity of components
A and B so that every single failure of component A or B can be
bypassed without loss of functionality. High Availability roughly doubles
TCO compared to non-HA systems. A resiliency set-up avoids this TCO
doubling but needs more investment in built-in error recovery
mechanisms, see text.
The second image is for “Resiliency”, which is the ability to recover from
temporary failures through explicit error handling and error
correction. As before, in the 99% availability case only a small
number of steps will fail on average when performing a business
scenario: out of 500 “Things”, on average 495 would pass successfully and
only 5 would go wrong.
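The 495-versus-5 arithmetic above can be sketched in a few lines of Python; the 500-step scenario and the 99% per-step figure come from the text, while the function name is illustrative:

```python
# Expected outcome when each step succeeds independently with a given
# probability. 500 steps and 99% match the figures in the text above.

def expected_outcomes(steps: int, availability: float) -> tuple[float, float]:
    """Return (expected successes, expected failures) for independent steps."""
    successes = steps * availability
    return successes, steps - successes

ok, failed = expected_outcomes(500, 0.99)
print(ok, failed)  # 495.0 successes, 5.0 failures on average
```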
4. Resilience: It is the ability to recover quickly. That is, if Site 1 goes down, Site 2
immediately becomes operational. Or if a disk drive fails, a spare disk drive
is quickly added to the storage pool. System resilience includes eliminating single
points of failure in the design of critical systems.
Quality of Service (QoS): It is a technology that enables specified services to
receive a higher quality of service than other specified services. Therefore, service
providers need to determine which service has the highest priority among the
services they provide to their customers. For example, Voice over Internet Protocol
(VoIP) systems typically are prioritized to ensure sufficient network bandwidth is
always available to avoid any traffic delay or degradation of voice quality. Other
services (such as web browsing) will be prioritized at a lower level. Why? Because
they are not sensitive to delays. The repeal of net neutrality rules gives ISPs the
right to provide a higher quality of service to a specified set of customers or for a
specified service on the internet.
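As a rough illustration of strict prioritization, a toy scheduler can drain a voice queue before a web queue. The traffic classes and priority values below are illustrative, not taken from any real QoS standard:

```python
import heapq

# Toy strict-priority scheduler: lower number = higher priority.
# Class names and priorities are illustrative only.
PRIORITY = {"voip": 0, "video": 1, "web": 2}

def schedule(packets):
    """Drain packets in priority order; FIFO within the same class."""
    queue = []
    for seq, (traffic_class, payload) in enumerate(packets):
        heapq.heappush(queue, (PRIORITY[traffic_class], seq, payload))
    return [heapq.heappop(queue)[2] for _ in range(len(queue))]

sent = schedule([("web", "page-1"), ("voip", "call-1"),
                 ("web", "page-2"), ("voip", "call-2")])
print(sent)  # voice frames leave first: ['call-1', 'call-2', 'page-1', 'page-2']
```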
5. High Availability: It is about having multiple redundant systems that
enable zero downtime or degradation on a single failure. High
availability is usually implemented in cluster systems, which have
two modes: 1- Active-active mode: both systems are running and
immediately available. 2- Active-passive mode: one system is active, while
the other is on standby but can become active, usually within a matter
of seconds.
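The active-passive mode described above can be sketched as a minimal failover routine; the node names are hypothetical:

```python
# Minimal active-passive failover sketch (illustrative names only).

class Cluster:
    def __init__(self, active: str, standby: str):
        self.active, self.standby = active, standby

    def handle_failure(self, failed_node: str) -> str:
        """Promote the standby if the active node fails; return the new active."""
        if failed_node == self.active:
            self.active, self.standby = self.standby, self.active
        return self.active

cluster = Cluster(active="node-a", standby="node-b")
print(cluster.handle_failure("node-a"))  # node-b takes over
```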
Fault Tolerance: It is the ability of a system to suffer a fault but
continue to operate. How can a system gain this capability? By
adding redundant components such as additional disks within a
redundant array of inexpensive disks (RAID), multiple power
supplies, multiple network interface cards (NICs), or additional servers
within a failover cluster configuration.
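The RAID idea can be illustrated with XOR parity, the mechanism behind RAID 5: a lost disk's block is rebuilt from the surviving blocks plus the parity block. The data values here are made up:

```python
from functools import reduce

# RAID-style XOR parity: a lost disk's block can be rebuilt from the
# surviving blocks plus the parity block. Data values are illustrative.

def parity(blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data = [b"\x01\x02", b"\x0f\x00", b"\x10\xff"]
p = parity(data)

# Simulate losing disk 1 and rebuilding it from the rest plus parity.
rebuilt = parity([data[0], data[2], p])
print(rebuilt == data[1])  # True
```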
6. Resiliency is not the same thing as high-availability. Resiliency is the network’s ability
to handle failures. This includes HA but also includes factors like rate-limiting,
security, management, and monitoring.
Network-Level Resiliency
Network-level resiliency includes redundancy in the topology (including physical),
and control plane resiliency. This means using the hardware for failure detection,
prevention, and recovery. For example, using stacking, multiple links, and so on.
This is where to use a Defence in Depth approach. This means using several layers of
resilience. As an example, you may have many ECMP routed links. You may also
enable UDLD on the links to detect layer-1 failures.
Use a modular design in the control plane. One example of this is to use route
summarization. Throttling can prevent overwhelming the control plane. The goal is to
isolate failures to a single area.
7. System-Level Resiliency
This is providing resiliency at the device level. This includes dual power
supplies, dual supervisors, SSO/NSF, and so on.
It also includes software resilience, including security features and
control plane hardening. Overlooking this can result in high CPU load,
TCAM starvation, and similar errors.
Consider using Control Plane Policing (CoPP), limiting flooding, and
hardening spanning-tree. Also consider using QoS and Storm Control to
prevent overwhelming the data plane.
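CoPP and Storm Control both come down to rate limiting; a token bucket is the classic mechanism behind such policers. A minimal sketch, with illustrative rates:

```python
# Token-bucket rate limiter, the mechanism behind CoPP-style policing.
# Rate and burst values are illustrative only.

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst, self.tokens, self.last = rate, burst, burst, 0.0

    def allow(self, now: float) -> bool:
        """Refill tokens for elapsed time, then spend one if available."""
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2.0, burst=2.0)  # e.g. 2 packets/second toward the CPU
results = [bucket.allow(t) for t in (0.0, 0.1, 0.2, 1.5)]
print(results)  # the burst passes, the third packet is policed, refill admits the fourth
```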
8. Operational Resiliency
This is about how you manage the network. In particular, think about
change management and change windows.
Software updates also fall into this category. Some platforms support ISSU
ISSU (In-Service Software Upgrade) or similar for non-disruptive updates.
9. Availability can simply be understood as system
uptime, i.e., the percentage of time the storage
system is available and operational, allowing
data to be accessed. Highly available systems are
designed to minimize downtime and avoid loss of
service. All organizations expect to achieve high
availability for their applications and business
services. This is not achieved by a single IT
component alone. High availability depends on
many IT infrastructure components including the
storage hardware and software to work in concert
as expected, minimizing downtime by quickly
restoring essential services in the event of a
failure.
10. Availability is typically expressed as a number of
9s.
1 nine = 90% availability, 2 nines = 99%
availability, 3 nines = 99.9 % availability, 4
nines = 99.99% availability, and so on. The
converse of availability is downtime. So, if a
storage system has an annual SLA of 7 nines
availability (99.99999%), it would suffer just 3.15
seconds of downtime in a year. You need to fully
understand your business requirements and the costs
involved to be able to determine and set your
availability goals. Service providers, too, offer
availability SLAs as part of their contracts.
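The nines-to-downtime conversion above is simple arithmetic; a small sketch, using a 365-day year as in the 3.15-second figure:

```python
# Downtime per year implied by an availability SLA expressed in "nines".
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

def downtime_seconds(nines: int) -> float:
    availability = 1 - 10 ** -nines
    return SECONDS_PER_YEAR * (1 - availability)

for n in (2, 3, 7):
    print(n, "nines:", round(downtime_seconds(n), 2), "seconds/year")
# 7 nines comes out to about 3.15 seconds of downtime per year,
# matching the figure above.
```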
To improve availability, organizations generally
use replication techniques that create redundant
copies of data.
12. Resiliency describes the ability of a storage
system to self-heal, recover, and continue
operating after encountering failure, outage,
security incidents, etc. High resiliency doesn’t
mean there is high data availability. It just means
that the storage infrastructure is equipped enough
that the storage infrastructure is equipped enough
to overcome disruptions. Resiliency is not a
standalone metric; it spans business continuity,
incident response, and recovery techniques to
reduce the magnitude and duration of disruptive
events.
Resiliency of a storage system can be improved
through redundancy and failover and by building in
software-defined intelligence to automatically
detect issues and self-heal in a short span of
time.
13. Fault tolerance is similar to the concept of
availability, but it goes one step further
to guarantee zero downtime. While a highly
available storage system may have minimal
interruption, a fault-tolerant system will have no
service interruption. Because of its more complex
design, a fault-tolerant system is typically quite
expensive to maintain: it involves running
active-active copies of data all the time, with the
necessary automation to fail over whenever a
component of the storage system fails and would
otherwise cause downtime. This failover is non-
disruptive in the sense that applications and data
access are not impacted at all, and the business
continues to function as expected.
14. Durability refers to the continued
persistence of data. Businesses will have
long-term data retention goals. This is
achieved by improving durability of the data
and the storage infrastructure preserving it.
Especially in the context of object
storage where data is archived and preserved
for longer terms, it is important to achieve
higher durability. A high level of durability
ensures that the data does not suffer from
bit rot, degradation, or any form of
corruption or data loss.
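One common building block for durability is checksumming: store a digest with each object and verify it on read to detect bit rot. A minimal sketch (real object stores add erasure coding and periodic scrubbing on top of this idea):

```python
import hashlib

# Bit-rot detection sketch: keep a checksum with each stored object and
# verify it whenever the object is read back.

def store(data: bytes) -> tuple[bytes, str]:
    return data, hashlib.sha256(data).hexdigest()

def is_intact(data: bytes, checksum: str) -> bool:
    return hashlib.sha256(data).hexdigest() == checksum

blob, digest = store(b"archived record")
print(is_intact(blob, digest))        # True
corrupted = b"archived recorc"        # a single corrupted byte
print(is_intact(corrupted, digest))   # False: bit rot detected
```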
15. Reliability is typically associated with the
infrastructure storing the data. It refers to the
probability that the storage system will work as
expected. A storage system may be available for a
certain period of time, but it may not work as
expected. In that case, the reliability will be
low. Various factors contribute to increasing
reliability of a system. It’s not easy to measure
reliability. One common metric that is used to
indicate reliability is mean time between failures
(MTBF). MTBF is the predicted elapsed time between
inherent failures of a storage system during normal
operations. If MTBF is high, it is an indicator
that reliability is high.
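A standard way to relate MTBF to availability uses the mean time to repair (MTTR): availability = MTBF / (MTBF + MTTR). A small sketch with illustrative hour values:

```python
# Availability from MTBF and MTTR: availability = MTBF / (MTBF + MTTR).
# The hour values below are illustrative only.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A system that fails every 9,999 hours and takes 1 hour to repair
# achieves four nines of availability.
print(round(availability(mtbf_hours=9999, mttr_hours=1), 4))  # 0.9999
```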