
Optimising Service Deployment and Infrastructure Resource Configuration

RECAP Project
8 May 2018

  1. Alec Leckey, Intel Labs
  2. Data Centre challenges. • Decrease operational costs. • Maintain consistent performance. • Increase scale. • Innovate, and deliver more value. [Chart: TCO and performance drivers — over-provisioning, utilization, energy, allocation, management, availability; 2x application growth every 2 years; 2x data volume every 18 months; 50% reduction in compute costs every 2 years; 8x more resources; 2x increase in management & administration; operational costs every 8 years. Sources: IDC Directions '14; Worldwide and Regional Public IT Cloud Services 2013–2017 Forecast, IDC (August 2013), http://www.idc.com/getdoc.jsp?containerId=242464]
  3. Why we need to understand infrastructure. • The T-Nova* project demonstrates a 10x performance improvement when a Network Traffic Analyzer is landed onto a machine that is SR-IOV enabled. • However, it is not feasible to place workloads manually at scale. • How can we automatically match workloads to suitable infrastructure? Matching workload types to hardware features can improve performance. [Chart: VNFC performance, bytes per second of total traffic, standard vs. enhanced deployment] * http://www.t-nova.eu/
  4. Infrastructure Landscape. Goal: support setup and run-time orchestration for optimised service delivery by defining and maintaining a layered landscape: • Physical • Virtual • Service. Nodes in each layer are enriched by telemetry.
  5. Landscaper Overview. A graph representation of the physical, virtual, and service layers of the infrastructure landscape. • Landscape nodes have a category: compute, storage, or network. • Landscape history: edges have a 'from time' and a 'to time'. • Landscape state: landscape nodes have state nodes. • Data is gathered by collectors. • Data is exported via a RESTful API as a JSON string (networkx). [Diagram labels: Xeon E5, Xeon Phi, Atom, AES-NI, SSD, NVM, 10Gb, virtual storage, virtual network, virtual machine, object store, video transcode, WordPress, ERP]
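The layered graph described above lends itself to a networkx model: nodes carry a layer and category, and edges carry a validity interval so the landscape history can be replayed. A minimal sketch under assumed node names and attribute keys (this is illustrative, not the Landscaper's actual schema):

```python
import networkx as nx

# Build a small three-layer landscape: service -> virtual -> physical.
g = nx.DiGraph()
g.add_node("machine-1", layer="physical", category="compute")
g.add_node("vm-1", layer="virtual", category="compute")
g.add_node("wordpress", layer="service", category="compute")

# Edges carry 'from time' / 'to time' so landscape history can be queried.
g.add_edge("vm-1", "machine-1", from_time=1525770000, to_time=None)
g.add_edge("wordpress", "vm-1", from_time=1525770100, to_time=None)

def physical_host(graph, service):
    """Walk down the layers to find the physical machine hosting a service."""
    for node in nx.descendants(graph, service):
        if graph.nodes[node]["layer"] == "physical":
            return node

print(physical_host(g, "wordpress"))  # machine-1
```

Serialising such a graph to a JSON string (e.g. with networkx's node-link format) is what allows the RESTful export mentioned above.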
  6. Different Landscape Views
  7. Landscaper Collectors. • Plugin architecture. • Can detect and update based on events. • Current collectors: HwLoc (internal components) and CPUinfo (enriches the core/pu attributes); OpenStack Heat; OpenStack Nova; OpenStack Neutron; OpenStack Cinder; Docker Swarm; OpenDaylight; Importer (.csv).
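A plugin architecture of this kind typically defines one interface that every collector implements, with one method for the initial landscape build and one for event-driven updates. A hypothetical sketch (the class and method names are assumptions, not the Landscaper's actual API):

```python
from abc import ABC, abstractmethod

class Collector(ABC):
    """Base interface for landscape collectors (e.g. Nova, Neutron, hwloc)."""

    @abstractmethod
    def init_graph(self):
        """Return the nodes/edges for the initial landscape build."""

    @abstractmethod
    def update_graph(self, event):
        """Apply an infrastructure event (create/update/delete)."""

class CSVImporter(Collector):
    """Minimal importer that reads 'source,target' edge pairs from CSV text."""

    def __init__(self, text):
        self.text = text

    def init_graph(self):
        edges = []
        for line in self.text.strip().splitlines():
            src, dst = line.split(",")
            edges.append((src.strip(), dst.strip()))
        return edges

    def update_graph(self, event):
        pass  # a static import has no event stream

importer = CSVImporter("vm-1,machine-1\nwordpress,vm-1")
print(importer.init_graph())
```

Event-driven collectors (e.g. listening to OpenStack notifications) would implement `update_graph` to patch the landscape incrementally rather than rebuilding it.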
  8. Service Stack (1x view)
  9. Service Stack (10x view). [Diagram: service layer (stack), virtual layer (vm, vnic, network, subnet, volume), and physical layer (machine, numanode, socket/package, core, pu, caches — L3, L2, L1 data, L1 instruction — pcidev, bridge, switch, osdev_storage, osdev_network), grouped into compute, network, and storage categories; data sources: Heat, Nova, Neutron, Cinder, OpenDaylight, hwloc + cpuinfo]
  10. Compute Meta-Data: Physical Layer. • machine: ID, NAME, CATEGORY, LAYER, ARCHITECTURE, OS_NAME, OS_VERSION, OS_RELEASE, OS_INDEX, ALLOCATION, PROCESS_NAME, HW_LOC_VERSION, DMI_BOARD_VENDOR, DMI_BOARD_NAME, DMI_BOARD_SERIAL, DMI_BOARD_VERSION, DMI_BIOS_DATE, DMI_BIOS_VENDOR, DMI_BIOS_VERSION, DMI_SYS_VENDOR, DMI_CHASSIS_VENDOR, DMI_CHASSIS_TYPE, DMI_CHASSIS_ASSET_TAG, DMI_CHASSIS_SERIAL, DMI_PRODUCT_NAME, DMI_PRODUCT_UUID, DMI_PRODUCT_VERSION, LINUX_GROUP, BACKEND, NODESET, COMPLETE_NODESET, ALLOWED_NODESET, CPUSET, COMPLETE_CPUSET, ALLOWED_CPUSET, ONLINE_CPUSET, COSTS. • numanode: ID, NAME, CATEGORY, LAYER, OS_INDEX, ALLOCATION, LOCAL_MEMORY, NODESET, COMPLETE_NODESET, ALLOWED_NODESET, CPUSET, COMPLETE_CPUSET, ALLOWED_CPUSET, ONLINE_CPUSET. • bridge: ID, NAME, CATEGORY, LAYER, OS_INDEX, ALLOCATION, BRIDGE_PCI, BRIDGE_TYPE, PCI_LINK_SPEED, PCI_BUS_ID, PCI_TYPE, DEPTH. • pcidev: ID, NAME, CATEGORY, LAYER, OS_INDEX, ALLOCATION, PCI_SLOT, PCI_LINK_SPEED, PCI_BUS_ID, PCI_TYPE.
  11. Compute Meta-Data: Physical Layer (continued). • package: ID, NAME, CATEGORY, LAYER, OS_INDEX, ALLOCATION, CPU_FAMILY_NUMBER, CPU_VENDOR, CPU_MODEL_NUMBER, CPU_MODEL, CPU_STEPPING, NODESET, COMPLETE_NODESET, ALLOWED_NODESET, CPUSET, COMPLETE_CPUSET, ALLOWED_CPUSET, ONLINE_CPUSET. • cache: ID, NAME, CATEGORY, LAYER, ALLOCATION, CACHE_SIZE, CACHE_LINESIZE, CACHE_ASSOCIATIVITY, NODESET, COMPLETE_NODESET, ALLOWED_NODESET, CPUSET, COMPLETE_CPUSET, ALLOWED_CPUSET, ONLINE_CPUSET. • core: ID, NAME, CATEGORY, LAYER, OS_INDEX, ALLOCATION, NODESET, COMPLETE_NODESET, ALLOWED_NODESET, CPUSET, COMPLETE_CPUSET, ALLOWED_CPUSET, ONLINE_CPUSET. • pu: ID, NAME, CATEGORY, LAYER, OS_INDEX, WP, ALLOCATION, CPUID_LEVEL, CPU_CORES, CORE_ID, CPU_MHZ, MICROCODE, VENDOR_ID, CPU_FAMILY, APICID, INITIAL_APICID, SIBLINGS, ADDRESS_SIZES, MODEL, MODEL_NAME, STEPPING, CACHE_SIZE, CACHE_ALIGNMENT, NODESET, COMPLETE_NODESET, ALLOWED_NODESET, CPUSET, COMPLETE_CPUSET, ALLOWED_CPUSET, ONLINE_CPUSET, PHYSICAL_ID, FPU, FLAGS, BOGOMIPS, CLF_FLUSHSIZE.
  12. Enrichment Through Telemetry. Snap: a lightweight, modular, programmable telemetry system. • Unified namespace; configurable at run time; dynamically derived metrics. • Integration of diverse data for analysis. • Calculation of generic node metrics across the stack (e.g. utilization and saturation). [Pipeline: instrumentation and logs → capture → store → transform & prepare → access]
  13. Snap - architecture. • Full stack: motherboards, CPUs, memory, disks, operating systems, hypervisor, guest operating systems, hosted applications. • Performant. Scalable. Dynamically reconfigurable. Secure. Extensible. Manageable.
  14. Snap - telemetry. Pipeline: Collect → Process → Publish. Plugin catalogue on GitHub. $ go get github.com/intelsdi-x/snap — http://snap-telemetry.io/
  15. Adaptive telemetry — an anomaly-detection approach. Challenge: sending all data all the time overflows the system with "redundant" information. Goal: reduce data transfer while preserving the essence. Approach and findings: • Pluggable anomaly-detection algorithm. • Transmission rate is increased around outliers only. • Transmissions are typically reduced by more than 10x. [Chart: CPU utilization (%) over elapsed time (seconds) for machines 1–3]
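The adaptive scheme above amounts to a gate between collection and publishing: samples are forwarded at full rate only while they look anomalous, with occasional keepalives otherwise. A minimal sketch using a rolling mean/standard-deviation detector — the window, threshold, and keepalive values are assumptions, not the project's actual pluggable algorithm:

```python
from collections import deque
from statistics import mean, stdev

def adaptive_publish(samples, window=10, k=3.0, min_delta=5.0, keepalive=10):
    """Publish a sample only if it is an outlier relative to the recent
    window, or as a periodic keepalive; drop in-range 'redundant' samples."""
    history = deque(maxlen=window)
    published = []
    for i, x in enumerate(samples):
        if len(history) == window:
            threshold = max(k * stdev(history), min_delta)
            outlier = abs(x - mean(history)) > threshold
        else:
            outlier = False  # not enough history yet
        if outlier or i % keepalive == 0:
            published.append((i, x))
        history.append(x)
    return published

# Steady CPU utilisation with one spike: only the spike (index 30)
# and the periodic keepalives are transmitted.
data = [50.0] * 30 + [95.0] + [50.0] * 30
sent = adaptive_publish(data)
print(len(sent), "of", len(data), "samples published")
```

On this toy trace, 7 of 61 samples are forwarded, which is roughly the order of reduction the slide reports for steady workloads.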
  16. Contextual Information. • Automatic application of the USE methodology. • Ranking and cost functions. • Supports comparison of service configurations and generation of deployment templates for specific workloads. [Figure: representation of an SDI sub-graph including performance]
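Ranking placement candidates with a cost function over USE-style node metrics could be sketched as follows; the metric names and weights here are illustrative assumptions, not the project's actual cost model:

```python
def cost(node, w_util=1.0, w_sat=2.0):
    """Lower is better: prefer nodes with headroom, and penalise
    saturation more heavily than plain utilization (USE-style)."""
    return w_util * node["utilization"] + w_sat * node["saturation"]

def rank_candidates(nodes):
    """Order candidate placement targets by ascending cost."""
    return sorted(nodes, key=cost)

candidates = [
    {"name": "machine-1", "utilization": 0.80, "saturation": 0.10},
    {"name": "machine-2", "utilization": 0.40, "saturation": 0.00},
    {"name": "machine-3", "utilization": 0.30, "saturation": 0.50},
]
best = rank_candidates(candidates)[0]
print(best["name"])  # machine-2
```

Note that machine-3 is the least utilised but still ranks last because it is saturated — exactly the distinction the USE methodology draws.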
  17. Application to large-scale systems. Using the landscape data it is possible to develop models for: • Optimization of initial placement • Re-balancing actuations • Troubleshooting • Accounting • Security • Capacity planning
  18. Network Model for vCDN deployments. Technical challenges: • Performance of virtualisation technologies, especially virtualised storage. • Orchestration of a multi-tenant vCDN service and infrastructure. • Optimisation of placement and scaling of the vCDN system. • Monitoring and repair of the vCDN system. • Detection and mitigation of the impact of "noisy neighbours".
  19. Optimization approaches: 1. Load to capacity-requirement mapping. 2. Load to telemetry mapping. 3. Infrastructure configuration optimization. [Diagram, repeated per step: a BT workload on infrastructure with resources A and B, telemetry for each resource, KPI 1, KPI 2, and cost]
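The three mappings above can be composed into a simple selection loop: given a load level, each candidate configuration predicts its KPI (the load-to-telemetry mapping), and the cheapest configuration that still meets the KPI target wins. A hypothetical sketch with invented vCDN profiles and a response-time KPI:

```python
def optimise_configuration(load, profiles, kpi_target):
    """Pick the cheapest resource configuration whose predicted KPI
    (from a load -> telemetry mapping) still meets the target."""
    feasible = [p for p in profiles if p["predict_kpi"](load) <= kpi_target]
    if not feasible:
        return None  # no configuration meets the KPI at this load
    return min(feasible, key=lambda p: p["cost"])

# Hypothetical vCDN profiles: predicted response time (ms) as a linear
# function of request load, plus a relative cost per hour.
profiles = [
    {"name": "small", "cost": 1.0, "predict_kpi": lambda load: 20 + 0.10 * load},
    {"name": "large", "cost": 4.0, "predict_kpi": lambda load: 10 + 0.01 * load},
]
choice = optimise_configuration(load=500, profiles=profiles, kpi_target=50)
print(choice["name"])  # large
```

At low load the cheaper "small" profile is selected instead; a real system would learn the `predict_kpi` mappings from landscape telemetry rather than hard-coding them.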
  20. Utility Theory approach
  21. BT as Infrastructure Provider. [Diagram: end user requesting content from a Content Operator over BT infrastructure; Part A: provider vs. customer; Part B: provider vs. customer]
  22. Landscape Model. [Diagram legend: UK Exchange Point, Core Sites, Metro Sites, Multi-Service Access Points; network switches, physical servers, virtual machines, service stacks]
  23. [Diagram: content delivered from the Content Provider (1) through the Core Exchange, Metro Site, and MSAN (2) to consumers (3), with associated costs]
  24. Success Criteria. Create a system to: • model the performance of VNFs prior to deployment; • learn the configuration of existing networks and predict the impact of topology, application, and infrastructure changes; • improve the placement decisions of orchestration systems to raise infrastructure utilization while guaranteeing performance and availability SLAs; • put remediation rules in place a priori, before failures happen, ensuring rapid protection with a minimum of additional resources; • automate the remediation of unexpected/unpredicted failures in a timely fashion (within several minutes).