SlideShare ist ein Scribd-Unternehmen logo
1 von 13
Downloaden Sie, um offline zu lesen
CloudStack Scalability

     By Alex Huang
Current Status
• 10k resources managed per management server
  node
• Scales out horizontally (must disable stats
  collector)
• Real production deployment of tens of thousands
  of resources
• Internal testing with software simulators up to
  30k physical resources with 30k VMs managed by
  4 management server nodes
• We believe we can at least double that scale per
  management server node
Balancing Incoming Requests
• Each management server has two worker thread pools for incoming
  requests: effectively two servers in one.
   – Executor threads provided by tomcat
   – Job threads waiting on job queue
• All incoming requests that requires mostly DB operations are short
  in duration and are executed by executor threads because incoming
  requests are already load balanced by the load balancer
• All incoming requests needing resources, which often have long
  running durations, are checked against ACL by the executor threads
  and then queued and picked up by job threads.
• # of job threads are scaled to the # of DB connections available to
  the management server
• Requests may take a long time depending on the constraint of the
  resources but they don’t fail.
The Much Harder Problem
• CloudStack performs a number of tasks on behalf of
  the users and those tasks increases with the number of
  virtual and physical resources available
   –   VM Sync
   –   SG Sync
   –   Hardware capacity monitoring
   –   Virtual resource usage statistics collection
   –   More to come
• When done in number of hundreds, no big deal.
• As numbers increase, this problem magnifies.
• How to scale this horizontally across management
  servers?
Comparison of two Approaches
• Stats Collector – collects capacity statistics
   – Fires every five minutes to collect stats about host CPU and
     memory capacity
   – Smart server and dumb client model: Resource only
     collects info and management server processes
   – Runs the same way on every management server
• VM Sync
   – Fires every minute
   – Peer to peer model: Resource does a full sync on
     connection and delta syncs thereafter. Management
     server trusts on resource for correct information.
   – Only runs against resources connected to the management
     server node
Numbers
•   Assume 10k hosts and 500k VMs (50 VMs per host)
•   Stats Collector
     – Fires off 10k requests every 5 minutes or 33 requests a second.
     – Bad but not too bad: Occupies 33 threads every second.
     – But just wait:
          •   2 management servers: 66 requests
          •   3 management servers: 99 requests
     – It gets worse as # of management servers increase because it did not auto-balance across
       management servers
     – Oh but it gets worse still: Because the 10k hosts is now spread across 3 management servers.
       While it’s 99 requests generated, the number of threads involved is three-fold because
       requests need to be routed to the right management server.
     – It keeps the management server at 20% busy even at no load from incoming requests
•   VM Sync
     – Fires off 1 request at resource connection to sync about 50 VMs
     – Then, push from resource as resource knows what it has pushed before and only pushes
       changes that are out-of-band.
     – So essentially no threads occupied for a much larger data set.
What’s the Down Side?
• Resources must reconcile between VM states
  caused by management server commands and
  VM states it collects from the physical
  hardware so it requires more CPU
• Resources must use more memory to keep
  track of what amounts to a journal of changes
  since the last sync point.
• But data centers are full of these two
  resources.
Resource Load Balancing
• As management server is added into the cluster, resources are rebalanced
  seamlessly.
    –   MS2 signals to MS1 to hand over a resource
    –   MS1 wait for the commands on the resources to finish
    –   MS1 holds further commands in a queue
    –   MS1 signals to MS2 to take over
    –   MS2 connects
    –   MS2 signals to MS1 to complete transfer
    –   MS1 discards its resource and flows the commands being held to MS2
• Listeners are provided to business logic to listen on connection status and
  adjusts work based on who’s connected.
• By only working on resources that are connected to the management
  server the process is on, work is auto-balanced between management
  servers.
• Also reduces the message routing between the management servers.
Designing for Scalability
• Take advantage of the most abundant resources in a data center
  (CPU, RAM)
• Auto-scale to the least abundant resource (DB)
• Do not hold DB connections/Transactions across resource calls.
   – Use lock table implementation (Merovingian2 or
     GenericDao.acquireLockInTable() call) over database row locks in this
     situation.
   – Database row locks are still fine quick short lock outs.
• Balance the resource intensive tasks as # of management server
  nodes increases and decreases
   – Use job queues to balance long running processes across management
     servers
   – Make use of resource rebalancing in CloudStack to auto-balance your
     world load.
Reliability

By Alex Huang
The Five W’s of Unreliability
• What is unreliable? Everything
• Who is unreliable? Developers & administrators
• When does unreliability happen? 3:04 a.m. no
  matter which time zone… Any time.
• Where does unreliability happen? In carefully
  planned, everything has been considered data
  centers.
• How does unreliability happen? Rather
  nonchalantly
Dealing with Unreliability
•   Don’t assume!
•   Don’t bang your head against the wall!
•   Know when you don’t know any better.
•   Ask for help!
Designs against Unreliability
• Management Servers keeps an heartbeat with the DB. One
  ping a minute.
• Management Servers self-fences if it cannot write the
  heartbeat
• Other management servers wait to make sure the down
  management server is no longer writing to the heartbeat
  and then signal interested software to recover
• Check points at every call to a resource and code to deal
  with recovering from those check points
• Database records are not actually deleted to help with
  manual recovery when needed
• Write code that is idempotent
• Respect modularity when writing your code

Más contenido relacionado

Was ist angesagt?

Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...confluent
 
Server load balancer ppt
Server load balancer pptServer load balancer ppt
Server load balancer pptShilpi Tandon
 
Architecting for Failure in a Containerized World
Architecting for Failure in a Containerized WorldArchitecting for Failure in a Containerized World
Architecting for Failure in a Containerized WorldTom Faulhaber
 
Load Balancing from the Cloud - Layer 7 Aware Solution
Load Balancing from the Cloud - Layer 7 Aware SolutionLoad Balancing from the Cloud - Layer 7 Aware Solution
Load Balancing from the Cloud - Layer 7 Aware SolutionImperva Incapsula
 
Building your own Distributed System The easy way - Cassandra Summit EU 2014
Building your own Distributed System The easy way - Cassandra Summit EU 2014Building your own Distributed System The easy way - Cassandra Summit EU 2014
Building your own Distributed System The easy way - Cassandra Summit EU 2014Kévin LOVATO
 
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...confluent
 
Russell spring one2gx_messaging_india
Russell spring one2gx_messaging_indiaRussell spring one2gx_messaging_india
Russell spring one2gx_messaging_indiaGaryPRussell
 
Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...
Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...
Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...Continuent
 
Load Balancing
Load BalancingLoad Balancing
Load Balancingnashniv
 
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera ClusterWebinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera ClusterContinuent
 
Database , 13 Replication
Database , 13 ReplicationDatabase , 13 Replication
Database , 13 ReplicationAli Usman
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
 
Select Stars: A DBA's Guide to Azure Cosmos DB (Chicago Suburban SQL Server U...
Select Stars: A DBA's Guide to Azure Cosmos DB (Chicago Suburban SQL Server U...Select Stars: A DBA's Guide to Azure Cosmos DB (Chicago Suburban SQL Server U...
Select Stars: A DBA's Guide to Azure Cosmos DB (Chicago Suburban SQL Server U...Bob Pusateri
 
Client Centric Consistency Model
Client Centric Consistency ModelClient Centric Consistency Model
Client Centric Consistency ModelRajat Kumar
 
Achieving Zero Downtime for SQL
Achieving Zero Downtime for SQLAchieving Zero Downtime for SQL
Achieving Zero Downtime for SQLScaleArc
 
Decoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaDecoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaGrant Henke
 
clustering and load balancing
clustering and load balancingclustering and load balancing
clustering and load balancingPrabhat gangwar
 

Was ist angesagt? (20)

Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
 
Server load balancer ppt
Server load balancer pptServer load balancer ppt
Server load balancer ppt
 
Architecting for Failure in a Containerized World
Architecting for Failure in a Containerized WorldArchitecting for Failure in a Containerized World
Architecting for Failure in a Containerized World
 
Load balancing
Load balancingLoad balancing
Load balancing
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Load Balancing Server
Load Balancing ServerLoad Balancing Server
Load Balancing Server
 
Load Balancing from the Cloud - Layer 7 Aware Solution
Load Balancing from the Cloud - Layer 7 Aware SolutionLoad Balancing from the Cloud - Layer 7 Aware Solution
Load Balancing from the Cloud - Layer 7 Aware Solution
 
Building your own Distributed System The easy way - Cassandra Summit EU 2014
Building your own Distributed System The easy way - Cassandra Summit EU 2014Building your own Distributed System The easy way - Cassandra Summit EU 2014
Building your own Distributed System The easy way - Cassandra Summit EU 2014
 
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...
 
Russell spring one2gx_messaging_india
Russell spring one2gx_messaging_indiaRussell spring one2gx_messaging_india
Russell spring one2gx_messaging_india
 
Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...
Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...
Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...
 
Load Balancing
Load BalancingLoad Balancing
Load Balancing
 
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera ClusterWebinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
 
Database , 13 Replication
Database , 13 ReplicationDatabase , 13 Replication
Database , 13 Replication
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
Select Stars: A DBA's Guide to Azure Cosmos DB (Chicago Suburban SQL Server U...
Select Stars: A DBA's Guide to Azure Cosmos DB (Chicago Suburban SQL Server U...Select Stars: A DBA's Guide to Azure Cosmos DB (Chicago Suburban SQL Server U...
Select Stars: A DBA's Guide to Azure Cosmos DB (Chicago Suburban SQL Server U...
 
Client Centric Consistency Model
Client Centric Consistency ModelClient Centric Consistency Model
Client Centric Consistency Model
 
Achieving Zero Downtime for SQL
Achieving Zero Downtime for SQLAchieving Zero Downtime for SQL
Achieving Zero Downtime for SQL
 
Decoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaDecoupling Decisions with Apache Kafka
Decoupling Decisions with Apache Kafka
 
clustering and load balancing
clustering and load balancingclustering and load balancing
clustering and load balancing
 

Ähnlich wie CloudStack Scalability

Architecting for the cloud elasticity security
Architecting for the cloud elasticity securityArchitecting for the cloud elasticity security
Architecting for the cloud elasticity securityLen Bass
 
vCloud Automation Center 6.0 -My Notes on Architecture
vCloud Automation Center 6.0 -My Notes on ArchitecturevCloud Automation Center 6.0 -My Notes on Architecture
vCloud Automation Center 6.0 -My Notes on Architecturetechstarts
 
Cosmos DB at VLDB 2019
Cosmos DB at VLDB 2019Cosmos DB at VLDB 2019
Cosmos DB at VLDB 2019Dharma Shukla
 
Transforming Legacy Applications Into Dynamically Scalable Web Services
Transforming Legacy Applications Into Dynamically Scalable Web ServicesTransforming Legacy Applications Into Dynamically Scalable Web Services
Transforming Legacy Applications Into Dynamically Scalable Web ServicesAdam Takvam
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...ScyllaDB
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analyticsamesar0
 
The impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves GoelevenThe impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves GoelevenParticular Software
 
Unit 5 Advanced Computer Architecture
Unit 5 Advanced Computer ArchitectureUnit 5 Advanced Computer Architecture
Unit 5 Advanced Computer ArchitectureBalaji Vignesh
 
adap-stability-202310.pptx
adap-stability-202310.pptxadap-stability-202310.pptx
adap-stability-202310.pptxMichael Ming Lei
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Spark Summit
 
Membase Intro from Membase Meetup San Francisco
Membase Intro from Membase Meetup San FranciscoMembase Intro from Membase Meetup San Francisco
Membase Intro from Membase Meetup San FranciscoMembase
 
Probabilistic consolidation of virtual machines in self organizing cloud data...
Probabilistic consolidation of virtual machines in self organizing cloud data...Probabilistic consolidation of virtual machines in self organizing cloud data...
Probabilistic consolidation of virtual machines in self organizing cloud data...WMLab,NCU
 
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...Amazon Web Services
 
(ATS4-PLAT08) Server Pool Management
(ATS4-PLAT08) Server Pool Management(ATS4-PLAT08) Server Pool Management
(ATS4-PLAT08) Server Pool ManagementBIOVIA
 
Case Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of DataCase Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of DataSchubert Zhang
 
3 the cloud
3 the cloud 3 the cloud
3 the cloud Len Bass
 
FlurryDB: A Dynamically Scalable Relational Database with Virtual Machine Clo...
FlurryDB: A Dynamically Scalable Relational Database with Virtual Machine Clo...FlurryDB: A Dynamically Scalable Relational Database with Virtual Machine Clo...
FlurryDB: A Dynamically Scalable Relational Database with Virtual Machine Clo...Michael Mior
 
05. performance-concepts
05. performance-concepts05. performance-concepts
05. performance-conceptsMuhammad Ahad
 

Ähnlich wie CloudStack Scalability (20)

Architecting for the cloud elasticity security
Architecting for the cloud elasticity securityArchitecting for the cloud elasticity security
Architecting for the cloud elasticity security
 
vCloud Automation Center 6.0 -My Notes on Architecture
vCloud Automation Center 6.0 -My Notes on ArchitecturevCloud Automation Center 6.0 -My Notes on Architecture
vCloud Automation Center 6.0 -My Notes on Architecture
 
Database Management System - 2a
Database Management System - 2aDatabase Management System - 2a
Database Management System - 2a
 
Cosmos DB at VLDB 2019
Cosmos DB at VLDB 2019Cosmos DB at VLDB 2019
Cosmos DB at VLDB 2019
 
Transforming Legacy Applications Into Dynamically Scalable Web Services
Transforming Legacy Applications Into Dynamically Scalable Web ServicesTransforming Legacy Applications Into Dynamically Scalable Web Services
Transforming Legacy Applications Into Dynamically Scalable Web Services
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analytics
 
The impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves GoelevenThe impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves Goeleven
 
Unit 5 Advanced Computer Architecture
Unit 5 Advanced Computer ArchitectureUnit 5 Advanced Computer Architecture
Unit 5 Advanced Computer Architecture
 
adap-stability-202310.pptx
adap-stability-202310.pptxadap-stability-202310.pptx
adap-stability-202310.pptx
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
 
Membase Intro from Membase Meetup San Francisco
Membase Intro from Membase Meetup San FranciscoMembase Intro from Membase Meetup San Francisco
Membase Intro from Membase Meetup San Francisco
 
Probabilistic consolidation of virtual machines in self organizing cloud data...
Probabilistic consolidation of virtual machines in self organizing cloud data...Probabilistic consolidation of virtual machines in self organizing cloud data...
Probabilistic consolidation of virtual machines in self organizing cloud data...
 
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
 
Release it! - Takeaways
Release it! - TakeawaysRelease it! - Takeaways
Release it! - Takeaways
 
(ATS4-PLAT08) Server Pool Management
(ATS4-PLAT08) Server Pool Management(ATS4-PLAT08) Server Pool Management
(ATS4-PLAT08) Server Pool Management
 
Case Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of DataCase Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of Data
 
3 the cloud
3 the cloud 3 the cloud
3 the cloud
 
FlurryDB: A Dynamically Scalable Relational Database with Virtual Machine Clo...
FlurryDB: A Dynamically Scalable Relational Database with Virtual Machine Clo...FlurryDB: A Dynamically Scalable Relational Database with Virtual Machine Clo...
FlurryDB: A Dynamically Scalable Relational Database with Virtual Machine Clo...
 
05. performance-concepts
05. performance-concepts05. performance-concepts
05. performance-concepts
 

Mehr von CloudStack - Open Source Cloud Computing Project

Mehr von CloudStack - Open Source Cloud Computing Project (20)

Apache CloudStack from API to UI
Apache CloudStack from API to UIApache CloudStack from API to UI
Apache CloudStack from API to UI
 
CloudStack Hyderabad Meetup: How the Apache community works
CloudStack Hyderabad Meetup: How the Apache community worksCloudStack Hyderabad Meetup: How the Apache community works
CloudStack Hyderabad Meetup: How the Apache community works
 
CloudStack Hyderabad Meetup: Migrating applications to IaaS clouds
CloudStack Hyderabad Meetup: Migrating applications to IaaS cloudsCloudStack Hyderabad Meetup: Migrating applications to IaaS clouds
CloudStack Hyderabad Meetup: Migrating applications to IaaS clouds
 
CloudStack Hyderabad Meetup: Using CloudStack to build IaaS clouds
CloudStack Hyderabad Meetup: Using CloudStack to build IaaS cloudsCloudStack Hyderabad Meetup: Using CloudStack to build IaaS clouds
CloudStack Hyderabad Meetup: Using CloudStack to build IaaS clouds
 
CloudStack technical overview
CloudStack technical overviewCloudStack technical overview
CloudStack technical overview
 
Introduction to CloudStack: How to Deploy and Manage Infrastructure-as-a-Serv...
Introduction to CloudStack: How to Deploy and Manage Infrastructure-as-a-Serv...Introduction to CloudStack: How to Deploy and Manage Infrastructure-as-a-Serv...
Introduction to CloudStack: How to Deploy and Manage Infrastructure-as-a-Serv...
 
vBACD July 2012 - Apache Hadoop, Now and Beyond
vBACD July 2012 - Apache Hadoop, Now and BeyondvBACD July 2012 - Apache Hadoop, Now and Beyond
vBACD July 2012 - Apache Hadoop, Now and Beyond
 
vBACD July 2012 - Scaling Storage with Ceph
vBACD July 2012 - Scaling Storage with CephvBACD July 2012 - Scaling Storage with Ceph
vBACD July 2012 - Scaling Storage with Ceph
 
vBACD July 2012 - Deploying Private PaaS with ActiveState Stackato
vBACD July 2012 - Deploying Private PaaS with ActiveState StackatovBACD July 2012 - Deploying Private PaaS with ActiveState Stackato
vBACD July 2012 - Deploying Private PaaS with ActiveState Stackato
 
vBACD July 2012 - Xen Cloud Platform
vBACD July 2012 - Xen Cloud PlatformvBACD July 2012 - Xen Cloud Platform
vBACD July 2012 - Xen Cloud Platform
 
vBACD- July 2012 - Crash Course in Open Source Cloud Computing
vBACD- July 2012 - Crash Course in Open Source Cloud ComputingvBACD- July 2012 - Crash Course in Open Source Cloud Computing
vBACD- July 2012 - Crash Course in Open Source Cloud Computing
 
Virtualization in the cloud
Virtualization in the cloudVirtualization in the cloud
Virtualization in the cloud
 
Build a Cloud Day San Francisco - Ubuntu Cloud
Build a Cloud Day San Francisco - Ubuntu CloudBuild a Cloud Day San Francisco - Ubuntu Cloud
Build a Cloud Day San Francisco - Ubuntu Cloud
 
Cloudstack UI Customization
Cloudstack UI CustomizationCloudstack UI Customization
Cloudstack UI Customization
 
CloudStack Networking
CloudStack NetworkingCloudStack Networking
CloudStack Networking
 
CloudStack Architecture
CloudStack ArchitectureCloudStack Architecture
CloudStack Architecture
 
Management server internals
Management server internalsManagement server internals
Management server internals
 
Introduction to CloudStack
Introduction to CloudStack Introduction to CloudStack
Introduction to CloudStack
 
vBACD - Introduction to Puppet, Configuration Management and IT Automation So...
vBACD - Introduction to Puppet, Configuration Management and IT Automation So...vBACD - Introduction to Puppet, Configuration Management and IT Automation So...
vBACD - Introduction to Puppet, Configuration Management and IT Automation So...
 
vBACD - Distributed Petabyte-Scale Cloud Storage with GlusterFS - 2/28
vBACD - Distributed Petabyte-Scale Cloud Storage with GlusterFS - 2/28vBACD - Distributed Petabyte-Scale Cloud Storage with GlusterFS - 2/28
vBACD - Distributed Petabyte-Scale Cloud Storage with GlusterFS - 2/28
 

Último

UiPath Studio Web workshop series - Day 4
UiPath Studio Web workshop series - Day 4UiPath Studio Web workshop series - Day 4
UiPath Studio Web workshop series - Day 4DianaGray10
 
Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Muhammad Tiham Siddiqui
 
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdfQ4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdfTejal81
 
How to release an Open Source Dataweave Library
How to release an Open Source Dataweave LibraryHow to release an Open Source Dataweave Library
How to release an Open Source Dataweave Libraryshyamraj55
 
Where developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is goingWhere developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is goingFrancesco Corti
 
UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2DianaGray10
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationKnoldus Inc.
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kitJamie (Taka) Wang
 
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTSIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTxtailishbaloch
 
Patch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updatePatch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updateadam112203
 
CyberSecurity - Computers In Libraries 2024
CyberSecurity - Computers In Libraries 2024CyberSecurity - Computers In Libraries 2024
CyberSecurity - Computers In Libraries 2024Brian Pichman
 
The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)codyslingerland1
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024Brian Pichman
 
2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdfThe Good Food Institute
 
IT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingIT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingMAGNIntelligence
 
March Patch Tuesday
March Patch TuesdayMarch Patch Tuesday
March Patch TuesdayIvanti
 
Extra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfExtra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfInfopole1
 

Último (20)

UiPath Studio Web workshop series - Day 4
UiPath Studio Web workshop series - Day 4UiPath Studio Web workshop series - Day 4
UiPath Studio Web workshop series - Day 4
 
Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)
 
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdfQ4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
 
How to release an Open Source Dataweave Library
How to release an Open Source Dataweave LibraryHow to release an Open Source Dataweave Library
How to release an Open Source Dataweave Library
 
Where developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is goingWhere developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is going
 
SheDev 2024
SheDev 2024SheDev 2024
SheDev 2024
 
UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its application
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kit
 
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTSIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
 
Patch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updatePatch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 update
 
CyberSecurity - Computers In Libraries 2024
CyberSecurity - Computers In Libraries 2024CyberSecurity - Computers In Libraries 2024
CyberSecurity - Computers In Libraries 2024
 
The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024
 
2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf
 
IT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingIT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced Computing
 
March Patch Tuesday
March Patch TuesdayMarch Patch Tuesday
March Patch Tuesday
 
Extra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfExtra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdf
 

CloudStack Scalability

  • 1. CloudStack Scalability By Alex Huang
  • 2. Current Status • 10k resources managed per management server node • Scales out horizontally (must disable stats collector) • Real production deployment of tens of thousands of resources • Internal testing with software simulators up to 30k physical resources with 30k VMs managed by 4 management server nodes • We believe we can at least double that scale per management server node
  • 3. Balancing Incoming Requests • Each management server has two worker thread pools for incoming requests: effectively two servers in one. – Executor threads provided by tomcat – Job threads waiting on job queue • All incoming requests that requires mostly DB operations are short in duration and are executed by executor threads because incoming requests are already load balanced by the load balancer • All incoming requests needing resources, which often have long running durations, are checked against ACL by the executor threads and then queued and picked up by job threads. • # of job threads are scaled to the # of DB connections available to the management server • Requests may take a long time depending on the constraint of the resources but they don’t fail.
  • 4. The Much Harder Problem • CloudStack performs a number of tasks on behalf of the users and those tasks increases with the number of virtual and physical resources available – VM Sync – SG Sync – Hardware capacity monitoring – Virtual resource usage statistics collection – More to come • When done in number of hundreds, no big deal. • As numbers increase, this problem magnifies. • How to scale this horizontally across management servers?
  • 5. Comparison of two Approaches • Stats Collector – collects capacity statistics – Fires every five minutes to collect stats about host CPU and memory capacity – Smart server and dumb client model: Resource only collects info and management server processes – Runs the same way on every management server • VM Sync – Fires every minute – Peer to peer model: Resource does a full sync on connection and delta syncs thereafter. Management server trusts on resource for correct information. – Only runs against resources connected to the management server node
  • 6. Numbers • Assume 10k hosts and 500k VMs (50 VMs per host) • Stats Collector – Fires off 10k requests every 5 minutes or 33 requests a second. – Bad but not too bad: Occupies 33 threads every second. – But just wait: • 2 management servers: 66 requests • 3 management servers: 99 requests – It gets worse as # of management servers increase because it did not auto-balance across management servers – Oh but it gets worse still: Because the 10k hosts is now spread across 3 management servers. While it’s 99 requests generated, the number of threads involved is three-fold because requests need to be routed to the right management server. – It keeps the management server at 20% busy even at no load from incoming requests • VM Sync – Fires off 1 request at resource connection to sync about 50 VMs – Then, push from resource as resource knows what it has pushed before and only pushes changes that are out-of-band. – So essentially no threads occupied for a much larger data set.
  • 7. What’s the Down Side? • Resources must reconcile between VM states caused by management server commands and VM states it collects from the physical hardware so it requires more CPU • Resources must use more memory to keep track of what amounts to a journal of changes since the last sync point. • But data centers are full of these two resources.
  • 8. Resource Load Balancing • As management server is added into the cluster, resources are rebalanced seamlessly. – MS2 signals to MS1 to hand over a resource – MS1 wait for the commands on the resources to finish – MS1 holds further commands in a queue – MS1 signals to MS2 to take over – MS2 connects – MS2 signals to MS1 to complete transfer – MS1 discards its resource and flows the commands being held to MS2 • Listeners are provided to business logic to listen on connection status and adjusts work based on who’s connected. • By only working on resources that are connected to the management server the process is on, work is auto-balanced between management servers. • Also reduces the message routing between the management servers.
  • 9. Designing for Scalability • Take advantage of the most abundant resources in a data center (CPU, RAM) • Auto-scale to the least abundant resource (DB) • Do not hold DB connections/Transactions across resource calls. – Use lock table implementation (Merovingian2 or GenericDao.acquireLockInTable() call) over database row locks in this situation. – Database row locks are still fine quick short lock outs. • Balance the resource intensive tasks as # of management server nodes increases and decreases – Use job queues to balance long running processes across management servers – Make use of resource rebalancing in CloudStack to auto-balance your world load.
  • 11. The Five W’s of Unreliability • What is unreliable? Everything • Who is unreliable? Developers & administrators • When does unreliability happen? 3:04 a.m. no matter which time zone… Any time. • Where does unreliability happen? In carefully planned, everything has been considered data centers. • How does unreliability happen? Rather nonchalantly
  • 12. Dealing with Unreliability • Don’t assume! • Don’t bang your head against the wall! • Know when you don’t know any better. • Ask for help!
  • 13. Designs against Unreliability • Management Servers keeps an heartbeat with the DB. One ping a minute. • Management Servers self-fences if it cannot write the heartbeat • Other management servers wait to make sure the down management server is no longer writing to the heartbeat and then signal interested software to recover • Check points at every call to a resource and code to deal with recovering from those check points • Database records are not actually deleted to help with manual recovery when needed • Write code that is idempotent • Respect modularity when writing your code