Enabling Carrier-Grade Availability Within a Cloud Infrastructure

Aaron Smith, Red Hat; Pasi Vaananen, Red Hat

Carrier-Grade Cloud Infrastructure (Aaron Smith, Pasi Vaananen, Red Hat): The move from vertically integrated hardware and software to distributed execution in a cloud complicates the delivery of highly available services. In a vertically integrated system, every layer involved in supporting service availability was under the control of a single vendor. With NFV, the cloud philosophy of decoupling infrastructure from applications requires new open interfaces to support the necessary flow of information between layers, and a clear separation of fault and availability management responsibilities between the infrastructure and application software subsystems. Even in the cloud environment, traditional availability concepts such as fast detection, correlation, and fault notification still apply. A fast, low-latency fault management platform will be presented that allows cloud-based services to achieve five-nines (99.999%) availability and service continuity. Performance measurements from a prototype of the system will be presented, along with a demo of a service requiring 50 ms fault remediation.

Enabling Carrier-Grade Availability Within a Cloud Infrastructure

  1. Enabling Carrier-Grade Availability Within a Cloud Infrastructure (Aaron Smith, Red Hat; Pasi Vaananen, Red Hat)
  2. Agenda • Introduction • Problem and goals? • Fault management cycle and timeline • Relative impact to Service Availability • Proof of concept • PoC results • What's next?
  3. Problem • The move to NFV and a cloud infrastructure complicates the delivery of highly-available services – no longer a vertically integrated hardware/software stack; stack components are provided by different vendors • The same availability requirements apply (50 ms … 1000 ms, increasing by "layer") • For a cloud infrastructure, the network impacts availability more than individual compute hosts, and detection/protection strategies must adjust accordingly
  4. Goals • Produce a monitoring and event detection framework that distributes fault information to various listeners with low latency (< 10s of milliseconds) • Provide a hierarchy of remediation controllers that can react quickly (< 10s of milliseconds) to faults • Provide FM mechanisms for both current virtualization environments and future containerization environments orchestrated by Kubernetes, etc.
  5. Fault Management Cycle (cycle diagram): Detection (Prediction) → Localization → Isolation → Remediation → Recovery → Diagnosis → Re-pool or Repair, depending on whether the suspect HW tests good or bad
  6. Fault Management Cycle Phases • Detection – requires low-latency, low-overhead mechanisms • Localization – mapping physical/virtualized resources to their resource consumer(s) within the context of fault trees • Isolation – removing the ability of the failed component to affect service state • Remediation – service restoration through failover to a redundant resource/component, or component restart • Recovery – restoration of the service redundancy configuration
  7. FM Cycle Timeline (timeline diagram): a failure event takes the service from "up, redundant" to "down, remediation"; the potential outage or degradation window TUA = TDET + TREM is the quantity to minimize. Once the service is recovered it is "up, recovering" until redundancy is restored: for pooled resources after TREC (second-failure exposure typically ~2 minutes, MTTREC), with repair uncoupled and deferred; for non-pooled resources the service stays "up, repair pending" until repair completes (TREP, second-failure exposure typically 4+ hours, MTTREP), so repair is coupled and critical. The first fault indication marks the start of the FM cycle.
  8. Fault Management Cycle Timeline • TDET + TNOT + TREM < 50 ms (lowest "layers", typically network) • TDET – detection time • TNOT – notification time • TREM – remediation time • Remediation is often the longest phase, so TDET + TNOT should be made as small as possible (for example, if detection and notification each take 5 ms, roughly 40 ms remains for remediation)
  9. Automated Service Recovery Survey (Heavy Reading NFV operator survey of 128 service providers, "Telco Requirements for NFVI", November 2016): within 1 second: 40% • within 50 ms: 39% • within 5 seconds: 20% • automated recovery not important: 1%
  10. Relative Impact to Service Availability • Different infrastructure components have different impact potential on application-level Service Availability, e.g.: • Network switch faults have a very high impact potential on SA (they can affect all associated nodes/services) • Compute node faults can only affect the VMs/containers running on them • Impact ordering: Spine > Leaf > Network Nodes > Storage Nodes > Control Nodes > Control Node (Specific Service) > Compute Nodes > Compute Nodes (Critical Services) > Compute Node (Specific VM/Container)
  11. Service Relative Criticality (cont'd) • Focus monitoring/remediation efforts with respect to the relative impact potential, e.g.: – a switch failure affects 10s of hosts (100s of services) – so switch failures need fast detection and remediation
  12. Proof of Concept • Demonstrate that events can be detected in < 10 ms • Node network interfaces • Kernel fault conditions • Complete node failure (and differentiate host vs. switch) • Demonstrate that event messages can be delivered to subscribed components with consistently low latency (99.999% of latency values < 10 ms)
  13. Proof of Concept (cont'd) • Applications can be enhanced to include the subscription and reception of events • Ensure that the collectd framework is suitable for event monitoring (detection latency & overhead; an illustrative collectd plugin sketch follows the slide list) • Prototype integration with OpenStack services • Prototype a node/switch monitoring system that provides quick detection without adding significant overhead
  14. Node Monitoring (PoC) (architecture diagram): a collectd core runs on each node with ingress plugins covering kernel, net, cpu, mem, hardware, syslog, /proc, pid and interface sources plus libvirt, cAdvisor, MCE and kubelet/process data, and egress plugins publishing events and telemetry over Kafka/AMQP; a local agent with a rules/action engine, driven by policies/topology and collectd configuration, takes local corrective actions; events and telemetry feed OpenStack services (Ceilometer, Gnocchi, Aodh, Keystone), visualization, and MANO components (G-VNFM, NFVO/E2EO, RTMD). An illustrative event-publishing sketch follows the slide list.
  15. Proof of Concept Results • Demonstrate that events can be detected in < 10 ms • Node network interfaces – dependent upon the driver, but achievable • Kernel fault conditions – verified monitoring of syslog output • Complete node failure (and differentiating host vs. switch) – 802.1ag connectivity fault management
  16. Proof of Concept Results (cont'd) • Demonstrate that event messages can be delivered to subscribed services with consistently low latency (99.999% of latency values < 10 ms) – mixed results with Kafka: with simulated metrics from 700 nodes, average latency is below 10 ms, but the cumulative latency distribution has a long tail with values out to 200 ms (a subscriber-side latency-measurement sketch follows the slide list) • Applications can be enhanced to include the subscription and reception of events
  17. Proof of Concept Results (cont'd) • Telco and enterprise applications can be enhanced to include the subscription and reception of events – in progress; low-latency delivery of messages is achievable, but issues of scale and multi-tenancy/security still need to be addressed
  18. What's Next? • Common Object Model for Events and Telemetry • Inclusion of the Object and Event model in TOSCA • Event interfaces towards G-VNFM and other MANO subsystems
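
The following is a minimal sketch of what a node-side detector from slide 13 could look like as a collectd Python plugin. It assumes collectd's python plugin is loaded and sub-second read intervals are available; the interface name, the plugin name "linkwatch", and the 50 ms polling interval are illustrative choices, not part of the PoC. It polls one interface's operational state and dispatches a collectd notification on a state change, so that the configured egress plugins can fan the event out.

```python
# linkwatch.py -- hypothetical collectd Python plugin (sketch only, not the PoC code).
# Polls the operational state of one network interface and dispatches a collectd
# notification when it changes, so that egress plugins (e.g. Kafka/AMQP) can
# forward the event to subscribers.
import collectd

IFACE = "eth0"          # assumption: the interface to watch
_last = {"state": None}

def read_link_state():
    """Read callback, invoked by collectd at the interval registered below."""
    try:
        with open("/sys/class/net/%s/operstate" % IFACE) as f:
            state = f.read().strip()
    except IOError:
        state = "unknown"

    if _last["state"] is not None and state != _last["state"]:
        # Emit an event through collectd's notification path rather than a metric.
        n = collectd.Notification()
        n.plugin = "linkwatch"
        n.plugin_instance = IFACE
        n.type = "gauge"
        n.severity = collectd.NOTIF_OKAY if state == "up" else collectd.NOTIF_FAILURE
        n.message = "interface %s changed state to %s" % (IFACE, state)
        n.dispatch()
    _last["state"] = state

# 50 ms polling is only illustrative; the PoC's < 10 ms detection target would
# favour an interrupt/netlink-driven source over polling.
collectd.register_read(read_link_state, 0.05)
```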
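Slide 14's architecture publishes events and telemetry through collectd egress plugins onto Kafka/AMQP. As a stand-in for that egress path (the PoC's actual plugin and message schema are not shown in the deck), the sketch below publishes a timestamped fault event with kafka-python; the broker address, the topic name "fault-events", and the event fields are assumptions made for illustration.

```python
# publish_fault.py -- illustrative stand-in for the Kafka egress path in slide 14.
# Assumptions: kafka-python is installed, a broker is reachable at localhost:9092,
# and the topic name and event fields are invented for this sketch.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "source": "compute-07",      # hypothetical node name
    "resource": "eth0",
    "condition": "link-down",
    "severity": "failure",
    "detected_at": time.time(),  # detection timestamp, used downstream to compute delivery latency
}

producer.send("fault-events", event)
producer.flush()
```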
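On the receiving side, slides 12 and 16 evaluate whether subscribed services see 99.999% of deliveries under 10 ms. The sketch below is a hypothetical subscriber that consumes the events produced above and tracks the delivery-latency distribution with a nearest-rank percentile. It assumes the same broker and topic as the producer sketch, and that producer and consumer clocks are synchronized (e.g. via PTP/NTP), since latency is computed from wall-clock timestamps.

```python
# subscribe_faults.py -- hypothetical application-side subscriber (slides 12/16/17).
# Consumes fault events and tracks the delivery-latency distribution; the PoC
# acceptance criterion is 99.999% of latencies below 10 ms.
import json
import time

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "fault-events",                      # same illustrative topic as the producer sketch
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="latest",
)

latencies_ms = []

def percentile(samples, pct):
    """Nearest-rank percentile of the collected samples."""
    ordered = sorted(samples)
    rank = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[rank]

for msg in consumer:
    event = msg.value
    # Only meaningful if producer and consumer clocks are synchronized (PTP/NTP).
    latencies_ms.append((time.time() - event["detected_at"]) * 1000.0)

    if event.get("severity") == "failure":
        pass  # placeholder for a local remediation hook (e.g. fail over to a standby path)

    if len(latencies_ms) % 1000 == 0:
        print("p50=%.2f ms  p99=%.2f ms  p99.999=%.2f ms  (n=%d)" % (
            percentile(latencies_ms, 50),
            percentile(latencies_ms, 99),
            percentile(latencies_ms, 99.999),
            len(latencies_ms),
        ))
```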
