IT organizations have a wealth of Service Management and Service Delivery tools, processes and metrics that typically exist in relative isolation. This session will present detailed real-life examples of how existing tools and metrics can be brought together using big data techniques to optimize costs and performance of IT environments.
Our IT shop is huge. We have tens of thousands of servers, all platforms, all types, all kinds. We manage dozens of data centers and have one of the largest mainframe installations in North America. We manage thousands upon thousands of VMs, VDIs, and Citrix. Of course we have more storage than seems humanly possible: petabytes of storage.
We are huge and complex. We have huge amounts of data and don’t want to duplicate it. We need all of this data integrated into a single analytical tool. Performance Surveyor provides us with seamless Capacity Management across all of our data collection toolsets and platforms. We have many tools for configuration information, change records, trouble ticketing, etc. We also need to track our capacity risks from discovery through remediation. Performance Surveyor helps us pull that information together into a single, analytical report.
Some highlights:
We used Performance Surveyor to make the transition between two different collection tools without changing a single report.
As you could guess, we have terabytes and terabytes of performance/capacity data. We cannot afford to duplicate this data; the SAN storage and management costs would be astronomical. Performance Surveyor easily integrates this data into our reports without data duplication.
We have integrated many tools and data sets into Performance Surveyor: BMC Performance Assurance, HP OpenView, VMware Virtual Center, EMC Control Center, Remedy, and our own capacity risk registry.
Performance Surveyor understands all of these things. It is smart enough to adjust the reporting to properly reflect the differences between platforms, data types, and metrics. For example, the same report that analyzes a physical server’s CPU activity can also analyze a VMware virtual server’s CPU activity. Performance Surveyor knows the difference, and the reports can make use of it.
How about an example?
We all need enterprise-wide reporting. We want it automated and application specific. When we built ours, we asked ourselves three questions: How can we detect and track capacity/performance risks for an application? How can we report these risks to the application support teams? How can we reduce the report to only include actionable information?
Our Performance Surveyor monthly application report automatically identifies capacity risks for the application. The automation makes it highly repeatable while reducing errors. It even makes our junior staff effective by removing guesswork and their reliance on senior staff. The report contains only the actionable information, not pages and pages of unnecessary details.
Let me show you an example.
This is our Monthly Health Check Report for the Brokers Data Warehouse application. Notice that the Introduction section of the report includes the What, Why, and How. This helps our customers fully understand what we are doing and why.
The most important part of the report is the “Ratings For Month” section. This section is for the analyst to summarize the health of this application. For this report, the analyst is concerned about the predicted lack of memory due to a single-node failure and has rated all servers as YELLOW in the dashboard. Please note that this is the only manual entry in the report. Just these two paragraphs. No other editing is required.
Our Monthly Health Check also includes Usage Patterns for each tier of the application. A business-hours-only chart is on the left, and the critical batch window is on the right. Notice that this application’s usage pattern is highly repeatable except for the week of October 24th on the left. This chart helped us understand the effect of a business event on this application and was used as the “before picture” for tuning efforts. These types of period-over-period charts have become very valuable in our daily analysis tasks. And the good news is that they are extremely easy to create in Performance Surveyor.
Our Monthly Health Check report includes Asset information alongside our risk registry items. This particular server’s history contains three previous capacity issues: CPU, memory, and file system space. Our homegrown risk registry is used to track these items from identification through remediation. We track the date opened, why it was opened, our notes, and the closure reason. Notice that two of the issues were closed based on feedback from the Application Owner, while the memory issue was resolved by tuning Oracle’s SGA. This history is invaluable for our analysis, and it provides historical context for the application owner. We have too many applications and servers to track this by hand. We had to have a tracking tool, and it had to be integrated into our reporting tool. This was easily done with Performance Surveyor.
Our Monthly Health Check also analyzes the resource consumption of the application to detect possible capacity issues. In this report, Performance Surveyor detected a possible CPU problem. Notice that only the relevant charts/tables are produced! No historical charts/tables for CPU and RunQ are displayed because there are no historical issues with CPU and RunQ. The report only included the chart for the broken rule (System Wait time), a CPU Utilization chart for context, and detailed charts to help understand what processes may have caused the problem.
Why waste paper on irrelevant information? Why distract the reader? Thankfully, Performance Surveyor automatically removes the extraneous information so I don’t have to pay someone to do it.
I only want actionable information in my reports.
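The idea of reporting only on broken rules can be sketched in a few lines. This is an illustration under stated assumptions, not how Performance Surveyor is implemented: the metric names, the thresholds, and the mapping from a broken rule to its context chart are all hypothetical.

```python
# Hypothetical rule thresholds: a rule is "broken" when the observed peak exceeds it.
RULES = {
    "cpu_util_pct": 80.0,
    "run_queue": 4.0,
    "system_wait_pct": 20.0,
}

# Hypothetical context charts to show next to a broken rule.
CONTEXT = {
    "system_wait_pct": ["cpu_util_pct"],
    "run_queue": ["cpu_util_pct"],
}


def broken_rules(peaks, rules=RULES):
    """Return the metrics whose observed peak exceeds the rule threshold."""
    return [metric for metric, limit in rules.items() if peaks.get(metric, 0.0) > limit]


def report_sections(peaks):
    """Return only the sections worth printing: broken rules plus their context."""
    broken = broken_rules(peaks)
    if not broken:
        return ["No Problems Detected"]
    sections = []
    for metric in broken:
        for section in [metric] + CONTEXT.get(metric, []):
            if section not in sections:
                sections.append(section)
    return sections


# Hypothetical monthly peaks for one server.
peaks = {"cpu_util_pct": 55.0, "run_queue": 1.5, "system_wait_pct": 35.0}
print(report_sections(peaks))
```

With these sample values only the System Wait rule is broken, so the report would carry the System Wait chart and the CPU Utilization chart for context, and nothing for RunQ.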
Our Monthly Health Check also attempts to detect future capacity risks based on historical data. In this report, Performance Surveyor detected a possible disk space problem during the next 30 days. It also detected possible disk space and CPU problems over the next 180 days. Notice that a summary table is created that includes the projected problem date and metric value. Again, only the actionable information is included: only the relevant charts for each risk. When there are no detected problems, we make sure to say so. Notice the automatically generated green “No Projected Processor Problems” message on the left.
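The projection step can be illustrated with a simple straight-line trend. This is a minimal sketch and not Performance Surveyor's actual method, which the talk does not describe. It assumes one sample per day, fits a least-squares line, and reports how many days remain until a threshold is crossed. The function names, the 90% threshold, and the sample data are hypothetical; only the 30-day and 180-day horizons come from the report described above.

```python
def linear_fit(ys):
    """Least-squares slope and intercept for ys sampled at x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(ys, threshold):
    """Days after the last sample until the trend line reaches threshold.

    Returns 0.0 if the trend is already at or above the threshold and
    None if the trend is flat or falling.
    """
    slope, intercept = linear_fit(ys)
    last_x = len(ys) - 1
    if slope * last_x + intercept >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - intercept) / slope - last_x


def classify(days, horizons=(30, 180)):
    """Map a days-to-threshold value onto the nearest reporting horizon."""
    if days is not None:
        for horizon in horizons:
            if days <= horizon:
                return f"projected problem within {horizon} days"
    return "no projected problem"


# Hypothetical daily disk-usage percentages for the last 30 days.
fast_growth = [70 + 0.5 * d for d in range(30)]
slow_growth = [50 + 0.2 * d for d in range(30)]
flat = [40.0] * 30

for name, series in [("fast", fast_growth), ("slow", slow_growth), ("flat", flat)]:
    print(name, classify(days_until_threshold(series, threshold=90.0)))
```

A straight line is the simplest possible trend model. Real capacity data often has weekly or monthly seasonality, so a production rule would likely need something more robust.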
In business terms:
Fixed costs are business expenses that are not dependent on the level of goods or services produced by the business. These are overhead costs like salaries and real estate.
Variable costs are expenses that change in proportion to the activity of a business. These costs are based on volume, such as hamburger buns, shipping costs, or the number of cashiers.
In technology terms:
Fixed costs are defined as the resource utilization required to keep the system running regardless of changes to the user/application load. System monitoring is an example of fixed costs.
Variable costs are defined as the resource utilization consumed as application load changes. An example would be "each bond trading transaction requires 3 seconds of CPU time; 100 bond trading transactions need 300 seconds".
For our purposes:
Fixed costs are analogous to the minimum utilization during business hours.
Variable costs are analogous to the difference between the peak and minimum utilization during business hours.
Notice that for our OnLine Transaction Processing (OLTP) system, the Transactions Per Second provides a curve very similar to the CPU utilization curve. This is usually the case during business hours.
We also know that batch processing is very different from OLTP. To accommodate this, we apply the method twice: once during the OLTP window and once during the BATCH window.
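The fixed/variable split described above reduces to a minimum and a peak-minus-minimum over a time window. The sketch below applies it twice, once per window, as the talk describes. The window boundaries (08:00 to 18:00 for OLTP, 22:00 to 06:00 for batch), the function name, and the hourly sample values are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def fixed_and_variable(samples, start_hour, end_hour):
    """Return (fixed, variable) utilization for one daily window.

    samples: iterable of (datetime, utilization_percent) pairs.
    The window covers hours start_hour up to, but not including, end_hour.
    A window that wraps past midnight (start_hour > end_hour) is supported.
    """
    def in_window(hour):
        if start_hour <= end_hour:
            return start_hour <= hour < end_hour
        return hour >= start_hour or hour < end_hour

    values = [util for ts, util in samples if in_window(ts.hour)]
    if not values:
        raise ValueError("no samples fall inside the window")
    fixed = min(values)               # cost of keeping the system running
    variable = max(values) - fixed    # cost that moves with the load
    return fixed, variable


# Hypothetical CPU utilization (%), one sample per hour for a single day.
day = datetime(2012, 1, 2)
cpu = [12, 10, 35, 60, 55, 20, 11, 10, 15, 30, 45, 62,
       70, 66, 58, 49, 40, 25, 14, 12, 11, 10, 13, 30]
samples = [(day + timedelta(hours=h), util) for h, util in enumerate(cpu)]

oltp_fixed, oltp_variable = fixed_and_variable(samples, 8, 18)
batch_fixed, batch_variable = fixed_and_variable(samples, 22, 6)
print("OLTP: ", oltp_fixed, oltp_variable)
print("BATCH:", batch_fixed, batch_variable)
```

For this sample day the OLTP window has a fixed cost of 15% and a variable cost of 55%, while the batch window has a fixed cost of 10% and a variable cost of 50%.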
We have a very large VMware installation. We have many thousands of Virtual Machines. We know that many of them are overprovisioned and some are too small. How do we sift through all of that data and find them? How do we determine what the configuration should be?
We have developed a physical-to-virtual (P2V) worksheet. We took our rules and methods, and plugged them in. Now we can quickly and easily test a server for virtual candidacy, understand its resource consumption, and know what size VM is required. All of this is done auto-magically. Just pick a server and a timeframe, and then press the Go button.
We used Performance Surveyor to analyze our VMware clusters. The analysis determines which servers are idle, too big, and too small. It also determines what the VM’s size should be.
In this report, we identified three idle virtual machines. We would target these VMs for shutdown. We also found thirteen VMs that were too big. The report recommends that the first VM be reduced to one virtual CPU and increased to four gigabytes of RAM.
All of this is done automatically based on a set of rules embedded in the report. These rules are easily modified and could be changed based on needs and environment. For example, the rules may be very different for Virtual Desktops versus Virtual Servers. We have found that we like our rules better than being trapped into a vendor’s proprietary rules.
In 2011, because of this report, we optimized a tenth of our VMs to reclaim over two thousand vCPUs. We have already identified thousands more vCPUs to reclaim in 2012, and we will add memory and storage reclamation.
Before we had this capability, we were constantly building VMware clusters. Because of this report and our reclamation activities, we have not built a cluster in quite a while and don’t have plans to build one soon. As you know, overprovisioning in the VMware space can affect performance. Due to reclamation, our applications’ performance increased and our trouble tickets decreased.
This was a huge win on so many fronts!
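A rule set of the kind described above can be sketched as a small function. The talk does not state the actual rules, so everything here is an assumption: the 5% idle cutoff, the 70% target utilization, the field names, and the sample VMs are hypothetical. A VM can shrink in one dimension and grow in another, as in the example from the report; this sketch labels such a VM "too big".

```python
import math
from dataclasses import dataclass

IDLE_CPU_PCT = 5.0   # hypothetical: below this peak CPU, treat the VM as idle
TARGET_PCT = 70.0    # hypothetical: size the VM so its peak lands near this level


@dataclass
class VmStats:
    name: str
    vcpus: int
    ram_gb: int
    peak_cpu_pct: float  # peak CPU utilization across the allocated vCPUs
    peak_mem_pct: float  # peak memory utilization of the allocated RAM


def recommend(vm):
    """Return (label, recommended_vcpus, recommended_ram_gb) for one VM."""
    if vm.peak_cpu_pct < IDLE_CPU_PCT:
        return ("idle", 0, 0)  # candidate for shutdown
    rec_vcpus = max(1, math.ceil(vm.vcpus * vm.peak_cpu_pct / TARGET_PCT))
    rec_ram = max(1, math.ceil(vm.ram_gb * vm.peak_mem_pct / TARGET_PCT))
    if rec_vcpus < vm.vcpus or rec_ram < vm.ram_gb:
        label = "too big"
    elif rec_vcpus > vm.vcpus or rec_ram > vm.ram_gb:
        label = "too small"
    else:
        label = "right-sized"
    return (label, rec_vcpus, rec_ram)


# Hypothetical VMs.
fleet = [
    VmStats("vm-a", vcpus=2, ram_gb=4, peak_cpu_pct=1.0, peak_mem_pct=20.0),
    VmStats("vm-b", vcpus=4, ram_gb=2, peak_cpu_pct=15.0, peak_mem_pct=95.0),
    VmStats("vm-c", vcpus=2, ram_gb=4, peak_cpu_pct=98.0, peak_mem_pct=60.0),
    VmStats("vm-d", vcpus=2, ram_gb=4, peak_cpu_pct=60.0, peak_mem_pct=65.0),
]
for vm in fleet:
    print(vm.name, recommend(vm))
```

Keeping the thresholds as named constants reflects the point made in the talk: the rules should be easy to change, for example to use different values for virtual desktops than for virtual servers.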
With Performance Surveyor’s assistance, my team has successfully delivered on our mission to reduce, optimize, consolidate, and virtualize. Performance Surveyor allowed us to build repeatable, powerful processes. It helped us do our analysis faster and easier. It provides a flexible capability that accepts our analysis methods, seamlessly integrates our data sources, and delivers human-readable reports.
We are pleased with what we have accomplished, and have plans for much, much more.