
Enabling Carrier-Grade Availability Within a Cloud Infrastructure


Aaron Smith, Red Hat; Pasi Vaananen, Red Hat

Carrier-Grade Cloud Infrastructure (Aaron Smith, Pasi Vaananen, Red Hat): The move from vertically integrated hardware and software to distributed execution in a cloud complicates the delivery of highly available services. In vertically integrated systems, every layer involved in communicating and supporting service availability was under the control of a single system vendor. With NFV, the cloud philosophy of decoupling infrastructure from applications requires new open interfaces to support the necessary flow of information between layers, along with a clear separation of fault and availability management responsibilities between the infrastructure and application software subsystems. Even in the cloud environment, traditional availability concepts such as fast detection, correlation, and fault notification still apply. A fast, low-latency fault management platform will be presented that allows cloud-based services to achieve five-nines (99.999%) availability and service continuity. Performance measurements from a prototype of the system will be presented, along with a demo of a service requiring 50 ms fault remediation.

Published in: Software


  1. Enabling Carrier-Grade Availability Within a Cloud Infrastructure
     Aaron Smith, Red Hat; Pasi Vaananen, Red Hat
  2. Agenda
     • Introduction
     • Problem and goals
     • Fault management cycle and timeline
     • Relative impact on Service Availability
     • Proof of concept
     • PoC results
     • What's next?
  3. Problem
     • The move to NFV and a cloud infrastructure complicates the delivery of highly available services
       – No longer a vertically integrated hardware/software stack
       – Stack components are provided by different vendors
     • The same availability requirements still apply (50 ms … 1000 ms, increasing by "layer")
     • In a cloud infrastructure, the network impacts availability more than individual compute hosts, and detection/protection strategies must adjust accordingly
  4. Goals
     • Produce a monitoring and event-detection framework that distributes fault information to various listeners with low latency (< tens of milliseconds)
     • Provide a hierarchy of remediation controllers that can react quickly (< tens of milliseconds) to faults
     • Provide fault management (FM) mechanisms for both current virtualization environments and future containerized environments orchestrated by Kubernetes and similar systems
  5. Fault Management Cycle
     [Cycle diagram: Detection (Prediction) → Localization → Isolation → Remediation → Recovery. Suspect HW passes through Diagnosis and is either re-pooled (good) or sent for repair (bad).]
  6. Fault Management Cycle Phases
     • Detection – requires low-latency, low-overhead mechanisms
     • Localization – mapping physical/virtualized resources to their resource consumer(s) within the context of fault trees
     • Isolation – removing the failed component's ability to affect service state
     • Remediation – service restoration through failover to a redundant resource/component, or through component restart
     • Recovery – restoration of the service's redundancy configuration
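The five phases above can be sketched as a simple pipeline. This is a minimal illustration, not the PoC implementation; all names (`run_fm_cycle`, `PHASES`, the handler shape) are hypothetical. Timestamping each phase lets per-phase durations such as detection and remediation time be derived from the trace afterwards.

```python
import time

# Phase names from the fault management cycle, in execution order.
PHASES = ["detection", "localization", "isolation", "remediation", "recovery"]

def run_fm_cycle(fault, handlers):
    """Walk one fault through the five FM phases.

    handlers maps a phase name to a callable(fault); returns a
    [(phase, completion_timestamp), ...] trace in phase order.
    """
    trace = []
    for phase in PHASES:
        handlers[phase](fault)               # e.g. fence the component, fail over
        trace.append((phase, time.monotonic()))
    return trace

# Usage: no-op handlers stand in for real detection/remediation logic.
noop_handlers = {p: (lambda fault: None) for p in PHASES}
trace = run_fm_cycle({"source": "eth0@node-3"}, noop_handlers)
assert [p for p, _ in trace] == PHASES
```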
  7. FM Cycle Timeline
     [Timeline diagram: service states run Up (redundant) → Down (remediation) → Up (recovering) → Up (repair pending) → Up (redundant). The unavailability window T_UA = T_DET + T_REM starts at the failure event (potential outage or degradation) and should be minimized; the first fault indication starts the FM cycle. After service recovery, pooled resources restore redundancy quickly (T_REC; second-failure exposure typically ~2 minutes; repairs are uncoupled and deferred), while non-pooled resources stay exposed until repair completes (T_REP; second-failure exposure typically 4+ hours; repair is coupled and critical).]
  8. Fault Management Cycle Timeline
     • T_DET + T_NOT + T_REM < 50 ms (lowest "layers", typically the network)
     • T_DET – detection time
     • T_NOT – notification time
     • T_REM – remediation time; often the longest phase, so T_DET + T_NOT should be made as small as possible
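The budget arithmetic above can be made concrete. The latency figures below are hypothetical, chosen only to illustrate why remediation dominates the 50 ms budget and why detection and notification must be squeezed as small as possible.

```python
# Hypothetical per-phase latencies (ms) for one network-layer failover.
t_det = 4.0    # T_DET: detection
t_not = 1.5    # T_NOT: notification to the remediation controller
t_rem = 38.0   # T_REM: remediation (failover) -- usually the longest phase

t_ua = t_det + t_not + t_rem   # unavailability window: T_DET + T_NOT + T_REM
budget_ms = 50.0               # lowest-layer (network) requirement

print(f"T_UA = {t_ua:.1f} ms, headroom = {budget_ms - t_ua:.1f} ms")
assert t_ua < budget_ms        # 43.5 ms: within the 50 ms budget
```

With remediation alone consuming 38 of the 50 ms, only 12 ms remain for detection plus notification, which motivates the < 10 ms targets in the PoC.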
  9. Automated Service Recovery Survey
     [Bar chart: Within 1 second – 40%; Within 50 ms – 39%; Within 5 seconds – 20%; Automated recovery not important – 1%]
     Source: Heavy Reading NFV operator survey of 128 service providers, "Telco Requirements for NFVI", November 2016
  10. Relative Impact on Service Availability
     • Different infrastructure components have different impact potential on application-level Service Availability, e.g.:
       – Network switch faults have a very high impact potential on SA (they can affect all associated nodes/services)
       – Compute node faults can only affect the VMs/containers running on them
     • Impact ordering: Spine > Leaf > Network Nodes > Storage Nodes > Control Nodes > Control Node (Specific Service) > Compute Nodes > Compute Nodes (Critical Services) > Compute Node (Specific VM/Container)
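The impact ordering above can be encoded so that monitoring and remediation effort can be prioritized programmatically. This is an illustrative sketch only; `IMPACT_ORDER` and `impact_rank` are hypothetical names, and the ranking simply transcribes the slide's ordering.

```python
# Blast-radius ranking from the slide: a fault in an earlier-listed component
# can affect more services, so it deserves faster detection/remediation.
IMPACT_ORDER = [
    "spine switch",
    "leaf switch",
    "network node",
    "storage node",
    "control node",
    "control node (specific service)",
    "compute node",
    "compute node (critical services)",
    "compute node (specific VM/container)",
]

def impact_rank(component: str) -> int:
    """Lower rank = larger potential Service Availability impact."""
    return IMPACT_ORDER.index(component)

# A spine switch fault outranks a single compute-node fault.
assert impact_rank("spine switch") < impact_rank("compute node")
```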
  11. Service Relative Criticality (cont'd)
     • Focus monitoring/remediation efforts according to relative impact potential, e.g.:
       – A switch failure affects tens of hosts (hundreds of services)
       – Switch failures therefore need fast detection and remediation
  12. Proof of Concept
     • Demonstrate that events can be detected in < 10 ms:
       – Node network interfaces
       – Kernel fault conditions
       – Complete node failure (differentiating host vs. switch failures)
     • Demonstrate that event messages can be delivered to subscribed components with consistently low latency (99.999% of latency values < 10 ms)
  13. Proof of Concept (cont'd)
     • Show that applications can be enhanced to subscribe to and receive events
     • Ensure that the collectd framework is suitable for event monitoring (detection latency and overhead)
     • Prototype integration with OpenStack services
     • Prototype a node/switch monitoring system that provides quick detection without adding significant overhead
  14. Node Monitoring (PoC)
     [Architecture diagram: a collectd core on each node with ingress plugins (kernel, network, CPU, memory, hardware/MCE, syslog, /proc, pid, interface, libvirt, cAdvisor) and egress plugins publishing events and telemetry over Kafka/AMQP. A local agent, configured with policies and topology, runs a rules/action engine for local corrective actions; the kubelet appears alongside the monitored processes. Consumers include Gnocchi, Ceilometer, Aodh, Keystone, visualization, and MANO components (G-VNFM, NFVO/E2EO, RTMD).]
  15. Proof of Concept Results
     • Demonstrate that events can be detected in < 10 ms:
       – Node network interfaces – dependent on the driver, but achievable
       – Kernel fault conditions – verified via monitoring of syslog output
       – Complete node failure (differentiating host vs. switch) – 802.1ag (Connectivity Fault Management)
  16. Proof of Concept Results (cont'd)
     • Demonstrate that event messages can be delivered to subscribed services with consistently low latency (99.999% of latency values < 10 ms)
       – Mixed results with Kafka: with simulated metrics from 700 nodes, average latency is below 10 ms, but the cumulative latency distribution has a long tail with values out to 200 ms
     • Applications can be enhanced to subscribe to and receive events
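The gap between "average below 10 ms" and "99.999% below 10 ms" is the crux of this result, and a small simulation makes it visible. The latency values below are synthetic (not the PoC's measurements), shaped only to mimic the reported behavior: a fast bulk with a sparse tail out to 200 ms.

```python
import random

random.seed(7)

# Synthetic delivery latencies (ms): a fast bulk plus a sparse long tail,
# mimicking the Kafka behavior reported in the PoC.
latencies = [random.uniform(1, 8) for _ in range(100_000)]
latencies += [random.uniform(50, 200) for _ in range(20)]   # tail events

def percentile(values, p):
    """Nearest-rank percentile (p in 0..100)."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

mean = sum(latencies) / len(latencies)
p5nines = percentile(latencies, 99.999)

print(f"mean = {mean:.2f} ms, 99.999th percentile = {p5nines:.1f} ms")
# The mean comfortably meets the < 10 ms goal, yet a handful of tail events
# (20 out of ~100k) is enough to blow the 99.999th-percentile target.
assert mean < 10
assert p5nines > 10
```

This is why a five-nines latency SLO forces attention on the distribution's tail rather than its average.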
  17. Proof of Concept Results (cont'd)
     • Telco and enterprise applications can be enhanced to subscribe to and receive events – in progress; low-latency delivery of messages is achievable, but issues of scale and multi-tenancy/security still need to be addressed
  18. What's Next?
     • A common object model for events and telemetry
     • Inclusion of the object and event model in TOSCA
     • Event interfaces towards the G-VNFM and other MANO subsystems