A multimedia recording of this presentation can be found at http://developers.sun.com/learning/javaoneonline/j1sessn.jsp?sessn=TS-6048&yr=2008&track=coolstuff
Abstract
The Orbitz Worldwide Data Centers host numerous leading online travel agency websites that use thousands of services distributed across hundreds of Jini-connected VMs to serve millions of monthly visitors. Monitoring and managing these large-scale, complex applications is a daunting task. Failures happen, and downtime is money! Orbitz Worldwide has harnessed the power of Complex Event Processing to handle a torrent of monitoring events with minimal application development and hardware costs. The resulting system has improved manageability by reducing the Mean Time To Resolution (MTTR) for customer-impacting events caused by software availability, reliability, and performance issues.
Orbitz Worldwide has developed a proprietary Java instrumentation API named the Extremely Reusable Monitoring API (ERMA). ERMA is as simple to use as a logging API, yet flexible enough through configuration to satisfy most requirements for logging, monitoring, analytics and other event processing needs. ERMA dynamically correlates events across distributed VMs servicing a user request, enabling efficient drill-down root cause analysis for errors and latency as well as bottom-up impact analysis. ERMA has been applied using Filters, Interceptors, Listeners, Spring-AspectJ AOP integration and custom instrumentation of core Orbitz Worldwide object models. As a result we have access to data for over 100k distinct event types with minimal development cost.
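The correlation idea can be pictured as grouping events by an identifier that travels with the user request. The following is a minimal sketch under stated assumptions, not ERMA's implementation: the `Event` class, its fields, the `requestId` key, and the `RequestCorrelator` helper are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical event shape; ERMA's real event model is not shown in this abstract.
class Event {
    final String requestId; // assumed identifier shared by all events of one user request
    final String vm;        // VM that emitted the event
    final String name;      // event name
    final boolean failed;   // whether the monitored operation failed

    Event(String requestId, String vm, String name, boolean failed) {
        this.requestId = requestId;
        this.vm = vm;
        this.name = name;
        this.failed = failed;
    }
}

class RequestCorrelator {
    // Drill-down view: group events from many VMs by request, preserving arrival order.
    static Map<String, List<Event>> byRequest(List<Event> events) {
        Map<String, List<Event>> grouped = new LinkedHashMap<>();
        for (Event e : events) {
            grouped.computeIfAbsent(e.requestId, k -> new ArrayList<>()).add(e);
        }
        return grouped;
    }

    // Bottom-up view: which requests were touched by a failure on the given VM?
    static Set<String> requestsImpactedBy(String vm, List<Event> events) {
        Set<String> impacted = new LinkedHashSet<>();
        for (Event e : events) {
            if (e.failed && e.vm.equals(vm)) {
                impacted.add(e.requestId);
            }
        }
        return impacted;
    }
}
```

Grouping by request supports the drill-down direction (start from one user request and inspect every event it produced), while the second method supports the bottom-up direction (start from a failing VM and list the requests it affected).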
Monitoring data corresponding to discrete events is streamed through ERMA from hundreds of VMs to a Complex Event Processing (CEP) engine in real time, where it is aggregated and processed with high throughput and low latency. A single 2-way commodity computer executing our most elaborate event processing application is able to handle nearly 100,000 events per second. The ability to handle such a large volume of data lets us monitor services at a very fine-grained resolution as needed. As a result, the hardware cost of adding new monitoring applications is minimal.
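The CEP engine and its event processing language are not shown in this abstract. As a rough illustration of what aggregating a stream of discrete events means, here is a plain-Java sketch of a tumbling-window counter. The class name `TumblingWindowCounter` and its methods are hypothetical, and a real CEP engine would do this declaratively and far more efficiently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts events per type in fixed, non-overlapping time windows.
class TumblingWindowCounter {
    private final long windowMillis;
    // window start time -> (event type -> count)
    private final Map<Long, Map<String, Long>> counts = new TreeMap<>();

    TumblingWindowCounter(long windowMillis) {
        if (windowMillis <= 0) {
            throw new IllegalArgumentException("windowMillis must be positive");
        }
        this.windowMillis = windowMillis;
    }

    // Assign the event to the window containing its timestamp and bump the count.
    void record(String eventType, long timestampMillis) {
        long windowStart = Math.floorDiv(timestampMillis, windowMillis) * windowMillis;
        counts.computeIfAbsent(windowStart, k -> new HashMap<>())
              .merge(eventType, 1L, Long::sum);
    }

    // Number of events of the given type in the window starting at windowStart.
    long count(long windowStart, String eventType) {
        Map<String, Long> window = counts.get(windowStart);
        return window == null ? 0L : window.getOrDefault(eventType, 0L);
    }
}
```

With a one-second window, events stamped 100 ms and 999 ms land in the window starting at 0, while an event stamped 1000 ms starts the next window.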
A high-level event processing language provided with the CEP engine makes it possible to develop new monitoring applications quickly and easily. A visual development environment makes it easy to trace event flow and wire in new functionality. The event processing language has been extended by Orbitz Worldwide with custom Java functions and operators tailored to the Orbitz Worldwide environment. For example, we have developed an operator that can deliver streams of data via SNMP using the OpenNMS API in order to integrate with our Service Operations Center infrastructure.
A Java portal has been developed to visualize the output from the CEP engine. The portal presents tabular and graphical views of vital system statistics. It also publishes RSS feeds for alarms. Users can subscribe to feeds for particular alarm severities and/or affected applications.
The future of Complex Event Processing at Orbitz Worldwide includes Event Pattern Monitoring capabilities. We are developing a solution that will reduce the volume of alarms delivered to the operator by bundling customer-impacting event information with a root cause estimate derived from detected patterns of discrete events. As our business grows, it is imperative that our Operations team can manage the system in a scalable manner by relying on automated detection of actionable events. Complex Event Processing is the solution to this problem for Orbitz Worldwide.
21. The MonitorProcessor Interface
    public interface MonitorProcessor {
        public void startup();
        public void shutdown();
        public void monitorCreated(Monitor monitor);
        public void monitorStarted(Monitor monitor);
        public void process(Monitor monitor);
    }
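To make the lifecycle callbacks concrete, here is a minimal sketch of a processor that simply counts how often each callback fires. The `MonitorProcessor` interface is copied from the slide. Everything else is an assumption for illustration: the `Monitor` stand-in models only the three methods that appear on the slides, and `MapMonitor` and `CountingMonitorProcessor` are hypothetical classes, not part of ERMA.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Stand-in for ERMA's Monitor type (assumption: only the methods used on the slides).
interface Monitor {
    boolean hasAttribute(String key);
    Object get(String key);
    void set(String key, Object value);
}

// Hypothetical map-backed Monitor, for illustration only.
class MapMonitor implements Monitor {
    private final Map<String, Object> attributes = new HashMap<>();
    public boolean hasAttribute(String key) { return attributes.containsKey(key); }
    public Object get(String key) { return attributes.get(key); }
    public void set(String key, Object value) { attributes.put(key, value); }
}

// The interface as shown on the slide.
interface MonitorProcessor {
    void startup();
    void shutdown();
    void monitorCreated(Monitor monitor);
    void monitorStarted(Monitor monitor);
    void process(Monitor monitor);
}

// Hypothetical processor that counts lifecycle callbacks in a thread-safe way.
class CountingMonitorProcessor implements MonitorProcessor {
    final AtomicLong created = new AtomicLong();
    final AtomicLong started = new AtomicLong();
    final AtomicLong processed = new AtomicLong();
    volatile boolean running;

    public void startup() { running = true; }
    public void shutdown() { running = false; }
    public void monitorCreated(Monitor monitor) { created.incrementAndGet(); }
    public void monitorStarted(Monitor monitor) { started.incrementAndGet(); }
    public void process(Monitor monitor) { processed.incrementAndGet(); }
}
```

Atomic counters are used because a processor of this kind would plausibly be invoked from many request threads at once; whether ERMA makes that guarantee is not stated in the slides.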
22. A MonitorProcessorAdapter Example
    public class ResultCodeAnnotatingMonitorProcessor extends MonitorProcessorAdapter {
        public void process(Monitor monitor) {
            if (monitor.hasAttribute("failureThrowable")) {
                Throwable t = (Throwable) monitor.get("failureThrowable");
                // Walk the cause chain down to the root cause.
                while (t.getCause() != null) {
                    t = t.getCause();
                }
                // Use the root cause's class name as the result code.
                monitor.set("resultCode", t.getClass().getName());
            } else {
                monitor.set("resultCode", "success");
            }
        }
    }
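The effect of the cause-chain loop in `process` is easiest to see on a nested exception. The sketch below isolates that loop in a standalone helper that needs only the JDK. `RootCauseDemo` and `resultCodeFor` are hypothetical names, and the exception chain is made up for the demonstration.

```java
import java.io.IOException;

class RootCauseDemo {
    // Same logic as the slide: follow getCause() to the innermost Throwable
    // and report its class name; no failure at all maps to "success".
    static String resultCodeFor(Throwable failure) {
        if (failure == null) {
            return "success";
        }
        Throwable t = failure;
        while (t.getCause() != null) {
            t = t.getCause();
        }
        return t.getClass().getName();
    }

    public static void main(String[] args) {
        // Made-up chain: RuntimeException -> IllegalStateException -> IOException
        Throwable wrapped = new RuntimeException("search failed",
                new IllegalStateException("remote call failed",
                        new IOException("connection reset")));

        System.out.println(resultCodeFor(wrapped)); // prints java.io.IOException
        System.out.println(resultCodeFor(null));    // prints success
    }
}
```

Reporting the innermost cause rather than the outermost wrapper keeps the result code stable no matter how many layers rewrap the exception on its way up.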
25. EventPatternLoggingMonitorProcessor Output
    wl|AirSearchExecuteAction.search| NoSearchResultsAvailableException
    wl|jiniOut_ShopService_createResultSet| NoSearchResultsAvailableException
    tbs-shop|jiniIn_ShopService_createResultSet| NoSearchResultsAvailableException
    tbs-shop|c.o.t.h.s.ShopServiceImpl.createResultSet.AIR| NoSearchResultsAvailableException
    tbs-shop|c.o.t.s.SpiShopService.createResultSet.AIR| NoSearchResultsAvailableException
    tbs-shop|jiniOut_LowFareSearchService_execute| SearchSolutionNotFoundException
    air-search|jiniIn_LowFareSearchService_execute| SearchSolutionNotFoundException
    air-search|LowFareSearchRequest| SearchSolutionNotFoundException
Follow the trail of Exceptions… don’t bother the on-call engineers for the higher layers… save time by narrowing your log search query!
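Following the trail can be automated once the lines are parsed. The sketch below is an illustration only and makes two assumptions that the slide does not confirm: that the three pipe-separated fields are application, monitor name, and exception, and that lines are ordered from the outermost caller inward. `TrailEntry` and `ExceptionTrail` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// One parsed line of the trail (field meanings are assumed, see above).
class TrailEntry {
    final String application;
    final String monitorName;
    final String exception;

    TrailEntry(String application, String monitorName, String exception) {
        this.application = application;
        this.monitorName = monitorName;
        this.exception = exception;
    }

    // Parse "application|monitorName|exception"; surrounding whitespace is trimmed.
    static TrailEntry parse(String line) {
        String[] parts = line.split("\\|");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields: " + line);
        }
        return new TrailEntry(parts[0].trim(), parts[1].trim(), parts[2].trim());
    }
}

class ExceptionTrail {
    static List<TrailEntry> parseAll(List<String> lines) {
        List<TrailEntry> trail = new ArrayList<>();
        for (String line : lines) {
            trail.add(TrailEntry.parse(line));
        }
        return trail;
    }

    // The last entry is the deepest layer where the failure was observed,
    // given the assumed outermost-to-innermost ordering.
    static TrailEntry deepest(List<TrailEntry> trail) {
        if (trail.isEmpty()) {
            throw new IllegalArgumentException("Empty trail");
        }
        return trail.get(trail.size() - 1);
    }

    // Distinct applications in the order the failure passed through them.
    static List<String> applicationsInOrder(List<TrailEntry> trail) {
        LinkedHashSet<String> seen = new LinkedHashSet<>();
        for (TrailEntry entry : trail) {
            seen.add(entry.application);
        }
        return new ArrayList<>(seen);
    }
}
```

Applied to the eight lines on the slide, `deepest` returns the `air-search` entry for `LowFareSearchRequest`, which is where a log search would start.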