2. Current Status
• 10k resources managed per management server node
• Scales out horizontally (must disable the stats collector)
• Real production deployments of tens of thousands of resources
• Internal testing with software simulators: up to 30k physical resources with 30k VMs, managed by 4 management server nodes
• We believe we can at least double that scale per management server node
3. Balancing Incoming Requests
• Each management server has two worker thread pools for incoming requests: effectively two servers in one (a minimal sketch follows this list).
– Executor threads provided by Tomcat
– Job threads waiting on a job queue
• Incoming requests that require mostly DB operations are short in duration and are executed directly by the executor threads, since incoming requests are already spread across nodes by the load balancer.
• Incoming requests that need resources, which are often long-running, are checked against the ACL by the executor threads and then queued and picked up by job threads.
• The # of job threads is scaled to the # of DB connections available to the management server.
• Requests may take a long time depending on resource constraints, but they do not fail.
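A minimal sketch of this two-pool split, assuming a hypothetical RequestDispatcher (the class and method names are illustrative, not CloudStack's actual API):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch of the two-pool model; names are hypothetical.
public class RequestDispatcher {
    // Job threads are sized to the DB connections available to this
    // management server, so queued jobs cannot starve the DB pool.
    private final int dbConnections = 40; // assumed connection-pool size
    private final ExecutorService jobThreads =
            Executors.newFixedThreadPool(dbConnections);

    // Called on a Tomcat executor thread. Short DB-bound requests run
    // inline; resource-bound requests are ACL-checked, then queued.
    public Object handle(Request req) throws Exception {
        if (req.isDbOnly()) {
            return req.execute();          // short, runs on the executor thread
        }
        checkAcl(req);                     // fail fast while still on the executor thread
        Future<Object> job = jobThreads.submit(req::execute);
        return job;                        // caller waits on the queued job
    }

    private void checkAcl(Request req) { /* permission check elided */ }

    public interface Request {
        boolean isDbOnly();
        Object execute() throws Exception;
    }
}
```

Sizing the job pool to the DB connection pool means a burst of long-running requests queues up rather than exhausting connections, which is why requests slow down under resource constraint but do not fail.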
4. The Much Harder Problem
• CloudStack performs a number of tasks on behalf of the users, and those tasks increase with the number of virtual and physical resources available
– VM Sync
– SG Sync
– Hardware capacity monitoring
– Virtual resource usage statistics collection
– More to come
• When done in the hundreds, no big deal.
• As the numbers increase, the problem magnifies.
• How do we scale this horizontally across management servers?
5. Comparison of Two Approaches
• Stats Collector – collects capacity statistics
– Fires every five minutes to collect stats about host CPU and memory capacity
– Smart server, dumb client model: the resource only collects info; the management server processes it
– Runs the same way on every management server
• VM Sync (sketched below)
– Fires every minute
– Peer-to-peer model: the resource does a full sync on connection and delta syncs thereafter. The management server trusts the resource for correct information.
– Only runs against resources connected to the management server node
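A minimal sketch of the full-sync-then-delta model, assuming hypothetical VmSyncAgent and ManagementServer types (not the actual CloudStack agent API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of full-sync-on-connect plus delta pushes; names are illustrative.
public class VmSyncAgent {
    // Last state reported per VM, kept on the resource side so only
    // changes need to be pushed after the initial full sync.
    private final Map<String, String> reported = new HashMap<>();

    // On (re)connection: push everything once.
    public void onConnect(ManagementServer ms, Map<String, String> currentVmStates) {
        reported.clear();
        reported.putAll(currentVmStates);
        ms.receiveFullSync(currentVmStates);
    }

    // Periodically (e.g. every minute): push only out-of-band changes.
    public void deltaSync(ManagementServer ms, Map<String, String> currentVmStates) {
        Map<String, String> delta = new HashMap<>();
        for (Map.Entry<String, String> e : currentVmStates.entrySet()) {
            if (!e.getValue().equals(reported.get(e.getKey()))) {
                delta.put(e.getKey(), e.getValue());
            }
        }
        if (!delta.isEmpty()) {
            reported.putAll(delta);
            ms.receiveDelta(delta);
        }
    }

    public interface ManagementServer {
        void receiveFullSync(Map<String, String> vmStates);
        void receiveDelta(Map<String, String> changed);
    }
}
```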
6. Numbers
• Assume 10k hosts and 500k VMs (50 VMs per host)
• Stats Collector (the arithmetic is sketched below)
– Fires off 10k requests every 5 minutes, or about 33 requests a second.
– Bad but not too bad: occupies 33 threads every second.
– But just wait:
• 2 management servers: 66 requests
• 3 management servers: 99 requests
– It gets worse as the # of management servers increases, because the collector does not auto-balance across management servers.
– Oh, but it gets worse still: the 10k hosts are now spread across 3 management servers. While 99 requests are generated, the number of threads involved is three-fold, because requests must be routed to the management server that owns the host.
– It keeps the management server at 20% busy even with no load from incoming requests.
• VM Sync
– Fires off 1 request at resource connection to sync about 50 VMs.
– After that, the resource pushes: it knows what it has reported before and only pushes changes that happen out-of-band.
– So essentially no threads are occupied for a much larger data set.
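A back-of-envelope check of the stats-collector numbers above, using the figures from this slide:

```java
// Back-of-envelope check of the stats-collector load described above.
public class StatsLoad {
    public static void main(String[] args) {
        int hosts = 10_000;
        int intervalSeconds = 5 * 60;             // fires every 5 minutes
        int perServer = hosts / intervalSeconds;  // ~33 requests/s (integer math, as on the slide)
        // Every management server runs the same collector, so the
        // cluster-wide load multiplies with node count instead of
        // being divided across it.
        for (int ms = 1; ms <= 3; ms++) {
            System.out.printf("%d MS: %d requests/s cluster-wide%n",
                    ms, ms * perServer);          // prints 33, 66, 99
        }
    }
}
```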
7. What’s the Downside?
• Resources must reconcile the VM states caused by management server commands with the VM states they collect from the physical hardware, which requires more CPU.
• Resources must use more memory to keep track of what amounts to a journal of changes since the last sync point (sketched below).
• But data centers are full of these two resources (CPU and memory).
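A minimal sketch of such a journal, assuming a hypothetical ChangeJournal class (not CloudStack's actual data structure):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the "journal of changes" idea: the resource records VM
// state changes since the last acknowledged sync point, trading RAM
// for fewer sync round trips. All names are illustrative.
public class ChangeJournal {
    public record Change(long seq, String vmId, String newState) {}

    private final Deque<Change> pending = new ArrayDeque<>();
    private long nextSeq = 0;

    public synchronized void record(String vmId, String newState) {
        pending.addLast(new Change(nextSeq++, vmId, newState));
    }

    // Drop entries once the management server acknowledges a sync point.
    public synchronized void ackUpTo(long seq) {
        while (!pending.isEmpty() && pending.peekFirst().seq() <= seq) {
            pending.removeFirst();
        }
    }
}
```

Memory grows with the number of unacknowledged changes, which is the RAM cost this slide refers to.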
8. Resource Load Balancing
• As a management server is added to the cluster, resources are rebalanced seamlessly (the handshake is sketched below):
– MS2 signals MS1 to hand over a resource
– MS1 waits for the in-flight commands on the resource to finish
– MS1 holds further commands in a queue
– MS1 signals MS2 to take over
– MS2 connects
– MS2 signals MS1 to complete the transfer
– MS1 discards the resource and forwards the held commands to MS2
• Listeners are provided so business logic can listen on connection status and adjust work based on who’s connected.
• By working only on resources connected to the local management server, work is auto-balanced between management servers.
• This also reduces message routing between the management servers.
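The handshake above, written out as a sketch; the ManagementServer interface and its methods are hypothetical, not CloudStack's actual cluster API:

```java
// Sketch of the rebalance handshake listed above; all names are illustrative.
public class ResourceHandover {
    public interface ManagementServer {
        void requestHandover(String resourceId);       // ask the owner to give it up
        void drainInFlightCommands(String resourceId); // wait for running commands
        void holdNewCommands(String resourceId);       // queue new commands meanwhile
        void connectTo(String resourceId);             // new owner attaches
        void forwardHeldCommands(String resourceId, ManagementServer to);
    }

    // Moves one resource from ms1 to ms2 without dropping commands.
    public void rebalance(ManagementServer ms1, ManagementServer ms2, String resourceId) {
        ms1.requestHandover(resourceId);        // MS2 signals MS1 to hand over
        ms1.drainInFlightCommands(resourceId);  // MS1 waits for outstanding work
        ms1.holdNewCommands(resourceId);        // further commands are queued
        ms2.connectTo(resourceId);              // MS2 takes over the connection
        ms1.forwardHeldCommands(resourceId, ms2); // MS1 discards the resource and
                                                  // flows the held commands to MS2
    }
}
```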
9. Designing for Scalability
• Take advantage of the most abundant resources in a data center (CPU, RAM)
• Auto-scale to the least abundant resource (DB)
• Do not hold DB connections/transactions across resource calls (see the sketch after this list).
– Use a lock table implementation (Merovingian2 or the GenericDao.acquireLockInTable() call) instead of database row locks in this situation.
– Database row locks are still fine for quick, short lockouts.
• Balance the resource-intensive tasks as the # of management server nodes increases and decreases
– Use job queues to balance long-running processes across management servers
– Make use of resource rebalancing in CloudStack to auto-balance your workload.
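A sketch of the lock-table pattern, modeled on the GenericDao.acquireLockInTable() call named above; the exact signatures and the VolumeDao/Resource types here are assumptions, not verified CloudStack API:

```java
// Sketch: take a lock-table lock (not a DB row lock) around a slow
// resource call, so no DB connection or transaction is held while the
// call is in flight. All types and signatures here are assumed.
public class VolumeResizer {
    public interface VolumeDao {
        Object acquireLockInTable(long id, int timeoutSeconds); // null if not acquired
        void releaseFromLockTable(long id);
        void updateState(long id, String state); // short transaction inside
    }
    public interface Resource {
        void sendResizeCommand(long volumeId); // may take minutes
    }

    public boolean resize(VolumeDao dao, Resource resource, long volumeId) {
        Object lock = dao.acquireLockInTable(volumeId, 30);
        if (lock == null) {
            return false; // another server is working on this volume
        }
        try {
            // No DB transaction is open here; the resource call can be
            // slow without pinning a connection from the pool.
            resource.sendResizeCommand(volumeId);
            dao.updateState(volumeId, "Resized"); // quick transaction afterwards
            return true;
        } finally {
            dao.releaseFromLockTable(volumeId);
        }
    }
}
```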
11. The Five W’s of Unreliability
• What is unreliable? Everything.
• Who is unreliable? Developers & administrators.
• When does unreliability happen? 3:04 a.m., no matter which time zone… Any time.
• Where does unreliability happen? In carefully planned, everything-has-been-considered data centers.
• How does unreliability happen? Rather nonchalantly.
12. Dealing with Unreliability
• Don’t assume!
• Don’t bang your head against the wall!
• Know when you don’t know any better.
• Ask for help!
13. Designs against Unreliability
• Management servers keep a heartbeat with the DB: one ping a minute (sketched below).
• A management server self-fences if it cannot write its heartbeat.
• Other management servers wait to make sure the down management server is no longer writing its heartbeat, then signal interested software to recover.
• Checkpoints at every call to a resource, and code to recover from those checkpoints.
• Database records are not actually deleted, to help with manual recovery when needed.
• Write code that is idempotent.
• Respect modularity when writing your code.
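A minimal sketch of the heartbeat-and-fence loop, assuming a hypothetical HeartbeatDao (not CloudStack's actual cluster manager):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the DB heartbeat plus self-fencing described above.
public class ManagementServerHeartbeat {
    public interface HeartbeatDao {
        boolean writeHeartbeat(long msId); // false if the write fails
    }

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start(HeartbeatDao dao, long msId) {
        // One ping a minute, as described above.
        scheduler.scheduleAtFixedRate(() -> {
            if (!dao.writeHeartbeat(msId)) {
                // Self-fence: if we cannot prove liveness in the DB,
                // assume we are partitioned and stop doing work before
                // another server starts recovering our resources.
                fence();
            }
        }, 0, 60, TimeUnit.SECONDS);
    }

    private void fence() {
        System.err.println("Heartbeat write failed; self-fencing");
        Runtime.getRuntime().halt(1); // hard exit, no cleanup
    }
}
```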