Silicon Valley Code Camp: The Self-Healing Elastic Runtime that is Cloud Foundry.
While this session was mostly demo, these slides set a bit of context first. They also cover the four levels of HA in Cloud Foundry.
The rate and pace of change in our industry only continues to rise. We’re undergoing a transition to a whole new set of platforms and a whole new set of concerns. Cloud platforms, big data, real-time analytics, mobile delivery, systems of engagement and more are all bringing in waves of change and innovation.
The old model of committee-based collaboration to create a standard specification that can then be implemented by multiple vendors is just too slow for this world. Not only is it too slow, but it also does not produce solutions that are as good as those created by refining open source projects in the fire of real user feedback. So the industry has mostly replaced a standards-first approach with an open-source-first approach. In many cases, with an open-source *only* approach, but there are also examples of open standards being created behind the bow-wave of open source, ratifying what have become de facto standards. [This recently happened with the Spring Batch project, for example, where the lag between the first open source version being used in production and the Java EE standard being complete was about 5 years! That’s an eternity in today’s world of IT.]
In the beginning people used to say that open source couldn’t innovate. That it was only good for commoditizing existing capabilities. That it was a race to the bottom which would destroy industry value.
Bullshit.
We never believed that. Still don’t. Open source is the best way to innovate because of the short feedback cycles we talked about previously. In the last few weeks, we’ve seen a subset of the vendors in the Hadoop space, which is itself just one of the over 140 projects at the Apache Software Foundation, achieve a combined market valuation of $3Bn (Cloudera = $2Bn, Hortonworks = $1Bn). That’s a whole lot of industry value being created through open source. What happened?
First, open source became a means of overcoming proprietary lock-in. Then it replaced standards at the leading edge of industry adoption, fueled by a rate of innovation that standards could never keep up with. Then it became a strategic asset and an integral part of corporate strategy.
Largest open source PaaS ecosystem. Billions committed to the technology.
Global SPs are “all in”: AT&T, Verizon, NTT, Swisscom, CenturyLink, Telus, etc.
WHY
Remember the good old days when you had a separate chunk of plastic to take live video, make phone calls, listen to music, snap a picture with friends, get instant messages from co-workers, check the time and use that newfangled World Wide Web? Can you imagine swapping your smartphone for 8 pieces of gear that barely fit into a duffel bag?
We are on the cusp of a similar transition in the datacenter. You shouldn’t need to work with different vendors when running applications. You shouldn’t need a separate vendor for your operating system, middleware, load balancer, system provisioning and policy management.
Why would you use these devices in a world where a single platform displaces a myriad of unnecessary and expensive products?
WHAT
Pivotal CF is next-generation middleware that delivers 9 capabilities that are typically delivered via point software products.
We provision operating systems and middleware.
We deliver workload density without compromising application performance.
We ensure that applications have appropriate network security safeguards to prevent security threats.
We support application connections to external sources including databases and legacy middleware.
We provide 4 levels of HA, with built-in load balancing for scale in/out.
We support multi-tenant environments so that each line of business can operate with a discrete quota and isolated system access.
We provision next-generation data services including NoSQL databases, traditional databases and Hadoop clusters.
We provide horizontal and vertical scaling for the underlying IaaS so that you can scale your infrastructure in lockstep with your business.
We provide a built-in log aggregation service, built-in APM metrics and utilization-based auto-scaling so that you can monitor the health of your applications and scale out without human or third-party tool intervention.
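To give a feel for the last capability, utilization-based auto-scaling boils down to a simple control rule: resize the instance pool so that average utilization trends toward a target. The sketch below is purely illustrative; the function name, thresholds and the proportional policy are assumptions, not the PCF implementation.

```python
import math

def desired_instances(current, avg_cpu_pct, target_pct=60, min_n=1, max_n=20):
    """Illustrative proportional auto-scaling rule (not the PCF code):
    resize the pool so average CPU moves toward target_pct."""
    want = math.ceil(current * avg_cpu_pct / target_pct)
    return max(min_n, min(max_n, want))  # clamp to the tenant's quota

# 4 instances running hot at 90% CPU: scale out toward the 60% target.
print(desired_instances(4, 90))  # -> 6
# 6 instances idling at 30% CPU: scale back in.
print(desired_instances(6, 30))  # -> 3
```

Running this rule on fresh metrics every sampling interval is what lets the platform scale out and back in without human intervention.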
I am going to cover each of these 9 capabilities in more detail, but it’s important to note the impact of this collection of capabilities. The following slides will include information on CAPEX and OPEX reduction. We will also discuss how you can deliver faster time to value while holding the line on infrastructure cost.
A zone could be a rack, for example.
The elastic runtime will keep the number of instances you’ve requested running by:
DEAs constantly reporting their state
Health manager constantly updating actual state model across all DEAs
HM periodically requests desired state from the cloud controller
When a difference is found, HM advises CC
CC initiates deployment of a new instance
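The steps above amount to a reconciliation loop: periodically compare desired state against the actual state reported by the DEAs, and ask the Cloud Controller to fill any gap. A minimal sketch, with hypothetical names standing in for the real Health Manager and Cloud Controller interactions:

```python
# Hypothetical sketch of the Health Manager's reconciliation step.
# desired/actual map app name -> instance count; start_instance stands
# in for the Cloud Controller call that deploys a new instance.

def reconcile(desired, actual, start_instance):
    """Request a new instance for every missing one."""
    started = []
    for app, want in desired.items():
        have = actual.get(app, 0)            # DEAs' reported state
        for _ in range(max(0, want - have)):
            start_instance(app)              # CC initiates deployment
            started.append(app)
    return started

# Example: "web" wants 3 instances but only 2 report in.
desired = {"web": 3, "worker": 1}
actual = {"web": 2, "worker": 1}
missing = reconcile(desired, actual, lambda app: None)
# -> one new "web" instance requested
```

The real components exchange this state over the message bus rather than as in-process dictionaries, but the comparison logic is the essence of level-one self-healing.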
A static version of the previous slide.
The processes running on the virtual machines (e.g. the Health Manager) are started with monit, a nice little utility that keeps an eye on processes and responds when one dies. If a process dies, monit will do two things: first, it will try to restart the process, and second, whether the restart is successful or not, it will tell the BOSH agent about the failure. Recall that the BOSH agent is there to communicate with Operations Manager, and in this case it will relay this failure information to the Operations Manager Health Monitor (not to be confused with the Health Manager of the elastic runtime that was discussed above) – we’ll abbreviate it OMHM. The OMHM takes this alert and passes it through a list of responders that can be configured to do things like send emails, page administrators and display alerts in operations dashboards. There’s a good chance that monit will already have recovered the process, but we also want there to be an opportunity for a human to respond.
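The two-part behavior described above (restart first, report regardless of the outcome) can be mirrored in a few lines. This is only a sketch of the supervision logic; monit itself is configured declaratively, and the function names here are assumptions:

```python
# Illustrative monit-style supervision logic (monit is really configured
# declaratively; this just mirrors the restart-then-report behavior).

def supervise(is_alive, restart, notify_agent):
    """Check a process; if dead, restart it and always tell the agent."""
    if is_alive():
        return "ok"
    restarted = restart()       # first: attempt the restart
    notify_agent(restarted)     # second: report to the BOSH agent either way
    return "restarted" if restarted else "failed"

# Simulate a dead process whose restart succeeds; the agent is still told.
events = []
state = supervise(lambda: False, lambda: True, events.append)
```

The key point the sketch captures is that the notification is unconditional, which is what gives a human the chance to respond even when monit has already healed the process.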
Of course, the BOSH agent on a VM can only communicate back to the Operations Manager if the VM is there, so let’s talk about what happens when a VM disappears. The first thing to understand is that by “disappear” I mean that the BOSH agent is not functional; the VM could still be there, but Ops Manager no longer knows what it is up to, so for all intents and purposes it’s “gone”. How does Ops Manager know? One of the things a BOSH agent is responsible for is sending out heartbeat messages, and by default it does so every 60 seconds. The OMHM is constantly listening for those heartbeats, and when it finds that one is missing it will itself produce an alert and pass that through the list of responders. Just as described above, this could result in emails, pages and operations dashboard alerts, but in this case there is one more responder that kicks in – the Resurrector. The Resurrector will communicate with the IaaS over which PCF is running and ask that the failed VM be replaced. Of course, it will be replaced with a VM running the appropriate part of the elastic runtime – i.e. a Health Manager or DEA, etc. That’s right: Operations Manager will restart failed cluster components.
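Heartbeat-based failure detection plus a responder chain can be sketched as follows. The names, the grace period of two missed intervals, and the timestamp bookkeeping are all illustrative assumptions, not the OMHM implementation:

```python
# Illustrative sketch of OMHM-style heartbeat monitoring. The source
# mentions a 60-second default heartbeat interval; the two-interval
# grace period and all names here are assumptions for the example.

HEARTBEAT_INTERVAL = 60  # seconds

def find_failed(last_heartbeat, now, grace=2 * HEARTBEAT_INTERVAL):
    """Return VMs whose last heartbeat is older than the grace period."""
    return [vm for vm, ts in last_heartbeat.items() if now - ts > grace]

def respond(failed_vms, responders):
    """Pass each failure through the responder chain
    (email, pager, dashboard, Resurrector...)."""
    for vm in failed_vms:
        for responder in responders:
            responder(vm)

# One VM has gone quiet for 150s; the "resurrector" responder would
# ask the IaaS to recreate it.
recreated = []
last = {"dea-0": 1000.0, "health-manager-0": 850.0}
failed = find_failed(last, now=1000.0)
respond(failed, [recreated.append])
```

Because the Resurrector is just one more responder in the chain, human-facing alerts and automatic VM replacement fall naturally out of the same detection path.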