Welcome to Session BC2520. We are going to talk about a little project we have been working on for a few years that revolves around a 911 Call Center in Pennsylvania.
synergIT and Washington County Public Safety have no control over how you may use information obtained from this presentation and do not intend this material to be a template for all 911 Public Safety environments. Each 911 center is highly unique and requires advance analysis and risk identification before any change affecting life-saving systems is undertaken.
“The project team was a joint effort, but the success of this project lies squarely on the shoulders of the Public Safety team in Washington County. It was their vision to think outside the box and their unwavering dedication that resulted in the quality of the solution that exists today. It is unfortunate that those folks are not able to be with us today.”
Speakers:
Dan Briner, Washington County Public Safety, Director of Information Technology (CIO): heavily involved in the project and played a pivotal role in its initial conception. Dan had a 30+ year IT career in private industry prior to joining Washington County, with extensive banking, healthcare and IT consulting experience. Dan is a state-certified EMT and holds an Extra Class amateur radio license. This led him to pursue the initial concept of the virtualized call center.
Keith Martin, Senior System Engineer, synergIT, since 1990. Currently, Keith applies his expertise as a consultant on installation, troubleshooting and support of local and wide area networks and enterprise systems. Keith’s dedication to his projects and to client concerns has earned him the highest compliments from clients and partners. Keith has over 15 years of experience with HP/Compaq file server and blade server hardware. Keith has nearly 10 years of experience as a search and recovery diver in addition to his IT career.
Bob Walker is currently serving as the Director of Enterprise Services with synergIT, where he applies broad expertise across a variety of client projects ranging from virtualization and network assessments through business process re-engineering and general project management. Bob also manages a software development practice that provides unique solutions to our clients centered around customer applications integrated with MS SharePoint.
Since joining synergIT, Bob has project managed many of the largest projects in the firm’s history and most recently served as the Project Manager and engineer for the 911 Technology Refresh project that we are discussing here today.
Good afternoon, I am Dan Briner, the CIO of Washington County, PA. I would like to set the stage for our presentation with a short history of the 911 Project. This has been a long project, but I will keep the history lesson short. It began in the later part of 2003 when Greg Clark, the Operations Manager of the PSAP (Public Safety Answering Point), Jeff Yates, the Public Safety Director, and I sat down to begin planning to get the County ready for the Phase II Wireless requirements, that is, being able to receive and work with the GPS information that was becoming available from the newer cell phones. PEMA (Pennsylvania’s little brother to FEMA) had asked the various PSAPs for plans to address the Phase II requirements. We considered several approaches to refreshing the PSAP’s capabilities, but after lots of discussion decided to begin with a clean sheet of paper and base our approach on the guidelines for Next Generation 9-1-1, or NG9-1-1. We presented our proposal to PEMA and asked for funding (this comes from the monthly charge on cell phones in PA). To our surprise they fully funded our plan, and the County agreed to pick up any unfunded costs. And thus in 2005 began our approximately 14 million dollar adventure, which could not have even begun without VMware at the core. One of our design goals was to develop an NG9-1-1 based system that had 7x9 (seven nines, or 99.99999%) system reliability. That is about 3 seconds of system unavailability a year. Lofty goals, you say! Why 7x9’s?
Heartbeat clip playing in background….
Dan: More than 900 people a day die because a defibrillator did not arrive within 5-7 minutes of a cardiac arrest. 1,300 people a year die from choking. A delay of 1 minute in emergency response reduces survival by 7% to 10%, and significant brain damage can occur after 3-4 minutes without oxygen. That is why we have a goal of 7x9’s for the system, and consider VMware the linchpin of our whole system.
*** It is the first bullet (last to appear) that is at the core of what this project was about, and largely the justification for the use of VMware.
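As a quick check on the figure quoted for the reliability goal, here is a minimal sketch of the arithmetic. It assumes that "7x9" means 99.99999% availability measured over an average year of 365.25 days; the function name is illustrative only and does not come from the project.

```python
# Back-of-the-envelope check of the "7x9" goal: how much downtime per year
# a given number of nines of availability allows.

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # average year, about 31.6 million seconds


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed unavailability per year for an availability of `nines` nines.

    For example, nines=7 means 99.99999% availability.
    """
    unavailability = 10.0 ** (-nines)
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for nines in (3, 5, 7):
        print(f"{nines} nines: {downtime_seconds_per_year(nines):,.2f} seconds per year")
```

Under these assumptions seven nines works out to a little over 3 seconds a year, consistent with the "about 3 seconds" goal, while five nines would allow a little over 5 minutes.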
Background
Southwestern Pennsylvania’s Washington County has a population of 208,000 within its land area of 863.6 square miles. First formed in 1781, the county today boasts a major university, three colleges, three premier hospitals, a racetrack, casino, outlet mall and several large mixed use technology centers as well as efficient access to all cultural, retail and sports activities in the Pittsburgh region. The junction of the I-70 and I-79 interstates in the center of the county increases the Public Safety responsibilities with a high volume of traffic and the associated problems. Washington County’s public safety call center handles roughly 800-1000 calls per day and is responsible for dispatching all fire, police, and emergency medical services for the entire county, 24 hours a day, seven days per week. As part of its vision for evolving its public safety services, the county sought to upgrade the datacenter in its 911 facilities by transitioning from physical servers to a virtual IT infrastructure powered by solutions like VMware Infrastructure 3. Performing this upgrade was easier said than done, however. “The biggest constraint was the fact that we were not constructing a new building; this was an in-place upgrade. That meant we couldn’t just shut things down and move people into another facility while we were performing the upgrade; the call center needed to stay up and running while the upgrade occurred.”
The 9-1-1 center has one supervisor position, two call taking positions, and seven dispatching positions. The center has the capability of communicating on fifteen different frequencies, with full cross-patch and telephone-patch capabilities. The 9-1-1 Center also provides 24 hour monitoring of several Emergency Management systems.
The project started in 2005 and officially kicked off in 2006, and much of the design was born right here at VMworld in discussions with key folks around possibilities and our goals.
All inclusive
This upgrade covered everything from physical building improvements to complete technology replacement, all while answering live 911 calls. You’ll get more details in a few slides about the unique roles that VMware virtualization played in the project, but it’s key to call out here that the use of virtualization technology, and specifically VMotion, allowed the project to run several concurrent tracks. We were deploying the initial proof of concept equipment while the physical building was being radically transformed to support the new systems. Electrical capacity (street power, generator and UPS systems), cooling, structural changes: all performed in a live environment with 911 calls being taken. Extreme risk, handled without a single missed call.
For wireless: Phase I and Phase II
Phase I conveys call back number + Pseudo-ANI (cell face identifier) to PSAP
Phase II provides caller location (e.g., via GPS or TOA)
Call Flow
Person dials 911 either from a cell phone or a traditional wireline phone
Telco switch sends ANI to 911 tandem via ES trunk group
Tandem looks up the ESN for the calling number
Based on ESN, call is routed to the appropriate PSAP via EM trunk group (same basic flow for cell calls with GPS)
ANI/ALI data is retrieved and sent to PSAP by ALI computer and displayed on dispatcher’s phone and CAD consoles
Intergraph CAD system shows/maps caller location, recommends a response plan to the dispatcher and tracks the incident (SAN, Servers, VMware)
Dispatcher notifies responders and works the incident with coordination of responders and escalation as required.
The core of the PSAP operation is the CAD (Computer Aided Dispatch) system. It ties everything together and tracks the emergency start to finish. When we began this project our first step was to select a full-time project manager to work with us to implement our vision for the Washington County NG9-1-1 Center. It was a critical decision for us, and we are happy it has worked out well. I would like to turn the presentation over to Bob Walker of synergIT, who served as our overall Project Manager and put our plans to work.
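For readers less familiar with the routing steps in the call flow above, here is a minimal sketch of the two lookups the tandem performs: calling number (ANI) to ESN, then ESN to PSAP. The phone numbers, ESN values, second PSAP name and function name are all hypothetical, invented for illustration; real tandem routing tables and their handling of unknown numbers are not described in this session.

```python
# Illustrative sketch of selective routing at the 911 tandem.
# Every value in these tables is made up for the example.

# Calling number (ANI) -> Emergency Service Number (ESN)
ANI_TO_ESN = {
    "7245550100": 101,
    "7245550199": 202,
}

# ESN -> PSAP that should receive the call
ESN_TO_PSAP = {
    101: "Washington County PSAP",
    202: "Neighboring County PSAP",
}


def route_call(ani: str) -> str:
    """Return the PSAP for a calling number, following ANI -> ESN -> PSAP.

    Raises KeyError for an unknown number; how a real tandem treats that
    case is outside the scope of this sketch.
    """
    esn = ANI_TO_ESN[ani]      # tandem looks up the ESN for the calling number
    return ESN_TO_PSAP[esn]    # call is routed to the PSAP assigned to that ESN


if __name__ == "__main__":
    print(route_call("7245550100"))
```

The point of the indirection through the ESN is that many calling numbers can share one routing decision, so a change of PSAP assignment touches one table entry rather than every number.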
We will come back to these goals later in our discussion and see how we did….
We’ve mentioned that the goal was to build a state of the art emergency response center around VMware technology, and that much of the initial concept was honed from discussions rooted right here at VMworld. We have talked about the 1000 calls per day and the risk involved with a complete in-place overhaul of a 24x7x365 center (that had NO backup/redundancy). What I hope to share with you in the next few slides is specific to the initial goal of ensuring the 911 center could continue to respond to emergencies under all foreseeable circumstances.
Bob
The use of VMware went from a proof of concept to being the production lifeline, and in record time. VMware played a larger role than anyone imagined (while dispatching live 911 calls!). This was accomplished without a single interruption to the ability to dispatch Police/Fire/EMS. VMware provided the means to migrate the back office services to the alternate site (again, with live 911 calls being dispatched). To this center, VMware = "Survivability".
We jumped into the project with a bold purchase of HP gear and VMware for a proof of concept. During the upgrade and moving forward, the center had to be able to remain online through:
Loss of an entire site
Any SAN outage
A failure of a blade chassis
A failure of any server
Loss or outage of an entire call taker position or individual software component
It’s not the concept of virtualization that made the riskiest and most challenging parts of the upgrade possible, but the maturity and breadth of the tool set that makes up VMware. VMware played a significant part in nearly every aspect of the upgrade.
Power: monitoring systems for power (street, UPS and generator) were virtualized
HVAC: all software controls for both the primary and secondary site were converted to run as virtual machines
Network: network monitoring, alerting and much of the security systems are virtual
SAN connectivity provided the common storage as well
600 KW generator, 550 KVA UPS, 40 tons of N+1 cooling
Redundant phone, data, power to each dispatch position
2 Cisco 6500 series switches and 2 3750 switches in Site B
ESX deployed to HP blades via Rapid Deployment Pack (RDP)
Using RDP allows drag & drop patching
Virtual Center, HP iLO
Customized SolarWinds Orion Network Performance Monitor and Application Performance Monitor, custom tuned for VMware and SQL database mirroring
Each cluster consists of an HP blade chassis of 16 ESX servers in Site A and another chassis/16 in Site B.
Total of xxx ESX Servers – xx per site
Each Cluster Consists of Workstation 16 Workstation Blades….
HP Blade Servers
Workstation Blades, Thin Clients
Synchronous Data Replication
Virtualization Benefits
CAD, SQL, Web, Document Management, AD
Impacting core vendors to adopt architecture
Continue efforts with HP and core vendors to drive solutions beyond current audio limitations
VMware Provides Unique Workflow Otherwise Not Possible
2 VMware Clusters with redundant disk groups spread across both datacenters
The rate of new technology was incredible and in and of itself presented significant risks. The complexity of the new technology was identified early on as a major hurdle that would need to be overcome. The Public Safety team showed amazing dedication and the ability to master the complex in record time. The time to recovery of the center has been reduced and continues to fall.
Sanitized slice of Enterprise Monitoring Framework highly customized around ESX hosts and Virtual Machines.
Overall Design. The overall design of the 911 center consists of a primary and a secondary data center. Each data center contains a blade server chassis and a workstation blade chassis. The server chassis in each data center contain both Windows servers and ESX hosts. There is a physical Windows domain controller in each data center. Each domain controller also functions as a DNS server, and standard Windows timing has been implemented, with the domain controller holding the PDC emulator role pulling time from a local atomic clock. The ESX hosts are configured in two HA/DRS clusters with cluster members residing in each data center. Each data center also contains a 40 TB SAN. All of the virtual machines reside on SAN storage, and the majority of the hardware based Windows servers boot from SAN, with a few exceptions. The primary data center also contains the imaging server and the SAN management server. A second SAN management server is planned for the secondary data center, to be used only if the primary management server becomes unavailable. We have an ancillary data center that houses our enterprise backup system and the UPS system for the primary data center. In addition, not depicted here, there are two separate generator rooms, one for each data center, and another UPS room dedicated to the secondary data center. The secondary data center shares floor space with the county network systems as well. However, the county domain and the 911 domain are two separate domains, and there are no trust relationships between them. Several members of the county IT staff have accounts in the 911 domain, allowing login for administration. Access into the 911 domain from the county side is controlled by redundant firewalls allowing only limited inbound administrator access.
The firewalls are also configured to permit outbound SMTP traffic from select systems for email alerting, outbound Internet access for a limited number of systems, and access for the CAD system, and for terminal emulators at each station, to the State Police NCIC system.
A closer look at the storage architecture. We’ll refer to the primary data center as “Site A” and the secondary as “Site B”. There are ESX hosts dedicated to each cluster that reside in each site, but when it comes to storage for the clusters, Cluster A’s primary storage is in Site A on SAN A. That data is synchronously replicated to the SAN in Site B. Cluster A contains the majority of our primary systems: the primary print server, our primary SAM database and Web servers and our GIS database. Cluster A also contains a number of ancillary systems. SAN A also contains storage for the majority of the physical Windows servers, including our primary CAD, primary CAD communications and primary 911-phone servers. Of these three physical servers, the 911-phone system is physical due to hard serial connections, and its internal systems are fully redundant from the vendor; the 911 CAD communication server is physical due to serial and modem connections. Its secondary server, CAD-Comm (cold), resides on SAN B (not SAN A) and is THE ONLY COLD failover system in the environment. The primary CAD server is physical only by the requirement of the vendor, even though we had the system running as a VM for over a year during the proof-of-concept phase.
Site B, SAN B houses the storage for Cluster B. The VMs running in Cluster B are our secondary or failover servers. They consist of a number of systems, but the critical ones are the 2nd SAM Web server, SAM DB failover, 2nd print server and the CAD failover server. Our two cluster design was put in place to maximize redundancy and immediate stateful failover. The cost of failure in this environment couldn’t be any higher: if it fails, people can die! As far as Windows servers go, the only hardware based Windows server with primary storage on SAN B is the secondary CAD comm server. Our enterprise backup solution consists of two physical servers for HP Data Protector and VMware Consolidated Backup.
Examples (several slides each) of why this would NOT have been possible without the tools that VMware technology provided to meet the needs of the project
2 specific occasions where the design was put into action to save a life (one while still in the proof of concept phase!)
Share stories where VMware was used
CAD: VMotion many times to Site B during facilities upgrades (SQL DB mirroring a heavy player here as well)
Electrical upgrades over 2 years. The active center ran CAD, GIS and all AD file/print/Web services from VMs that would ‘float’ as needed between the sites
Snapshots, used heavily to test & restore
In general, the multiple levels of redundancy (electrical, cooling, network, server, SAN chassis, host to VM) allow the flexibility required. There are some growth areas with specific aspects that will continue to evolve as we work with vendors and the public safety industry.
…regardless of location (see next slide)
911 calls, GIS data, dispatch records and audio logs are fully integrated
VMware provided (and continues to provide) test and training environments outside the production cluster.
Graphic showing failure of disk group in Site A (CAD outage)
The entire room is designed for redundancy. Half of the stations are connected to completely separate network gear served by redundant back office services residing in VMs split across disk groups, with the ability to fail over to the secondary site blades as a unit or as an individual station.
A couple of the roadblocks that we had to contend with were vendor pushback on using or supporting a full VMware solution and/or Windows clustering. Either would have been helpful. Our initial implementation had data for all of the systems residing on SAN A. We quickly realized that was going to be a problem when we had to perform one of several firmware upgrades, which forced us to migrate primary data from SAN A to SAN B, forcing the reboot of every SAN attached system. This migration, or failover, was a manual and time-consuming process. Our solution, as stated previously, was to create the second cluster, migrate all of our secondary systems to Cluster B and have its storage reside on SAN B, replicating back to SAN A. Now if we have a failure in SAN A, causing the loss of functionality in Cluster A and the physical servers on SAN A, database processing fails over to SAN B through the use of SQL mirroring. Call processing continues from Cluster B, and we are able to move the storage for Cluster A and the physical servers on SAN A over to SAN B and again restore some level of redundancy. By having the second cluster reside on SAN B in Site B, and by using database mirroring, failover is immediate and transparent to the operator. In February of this year we had to perform firmware upgrades to both SANs. Although the firmware upgrades can be done with the SANs online, there is about a two minute window where the SAN controllers have to reboot. At that point you have basically lost all communications with the SAN, so there is no way that dispatching can continue. With the two cluster configuration we were able to upgrade the firmware first on SAN B, then manually fail over the critical primary systems to their secondary counterparts residing in Cluster B on SAN B, and upgrade the firmware on SAN A.
The process took a couple of days because we wanted to be sure that the firmware upgrade to SAN B was OK before doing the same on SAN A, but we were able to move systems between the data centers without losing dispatching capabilities.
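The ordering of the February firmware upgrade can be sketched as a small model. This is only an illustration of the sequencing rule described above (never reboot the SAN that dispatch currently depends on); the class, method names and log strings are hypothetical and do not represent any SAN or VMware tooling.

```python
# Toy model of the rolling SAN firmware upgrade: upgrade the standby SAN,
# fail over to it, then upgrade the SAN that was previously active.

class DispatchCenter:
    def __init__(self, active_san: str = "A") -> None:
        self.active_san = active_san   # SAN currently serving dispatch
        self.upgraded = set()          # SANs already on the new firmware
        self.log = []                  # ordered record of the steps taken

    def upgrade_firmware(self, san: str) -> None:
        # The controllers reboot during the upgrade, so the SAN that
        # dispatch is running on must never be the one being upgraded.
        if san == self.active_san:
            raise RuntimeError(f"SAN {san} is serving dispatch; fail over first")
        self.upgraded.add(san)
        self.log.append(f"upgrade SAN {san}")

    def fail_over(self, to_san: str) -> None:
        # Stands in for the manual failover of critical primary systems
        # to their secondary counterparts.
        self.active_san = to_san
        self.log.append(f"fail over to SAN {to_san}")


def rolling_upgrade(center: DispatchCenter) -> list:
    """Run the upgrade in the order described in the talk: B, fail over, A."""
    center.upgrade_firmware("B")
    center.fail_over("B")
    center.upgrade_firmware("A")
    return center.log


if __name__ == "__main__":
    for step in rolling_upgrade(DispatchCenter()):
        print(step)
```

The guard in `upgrade_firmware` is the whole design point: the procedure is safe only because each step leaves dispatch on a SAN that is not rebooting.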
As I mentioned earlier, our clusters don’t completely reside in Site A or Site B, but are spanned between the sites. Therefore with any loss of an ESX host in Site A, HA will move the system to another ESX host, or if we lose all of the hosts in Site A, the systems will be moved to hosts within the cluster in Site B. Similarly with Cluster B, processing will move to Site A should anything happen to hosts in Site B. This configuration comes in handy not only in the case of a failure, but also for maintenance. In June we found out that we had to perform a complete firmware upgrade to all of the blades and interconnect modules in the blade chassis due to some issues we had been experiencing. We were able to manually fail over the appropriate systems from one data center to the other, upgrade firmware, update drivers, reboot servers and drop portions of the network without ever losing dispatching capabilities. The process basically had to be choreographed and took a few days, but was accomplished without incident.
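To make the spanned-cluster behavior concrete, here is a minimal sketch of choosing where a system lands after a host is lost. It is not VMware HA's actual placement logic; the same-site-first preference is an assumption drawn from the description above, and the host names and function name are hypothetical.

```python
# Sketch of restart placement in a cluster that spans two sites:
# use a surviving host in the same site if one exists, otherwise
# fall back to a surviving host in the other site.

from typing import List, NamedTuple, Optional


class Host(NamedTuple):
    name: str
    site: str   # "A" or "B"
    up: bool    # True if the host is still running


def pick_restart_host(failed: Host, cluster: List[Host]) -> Optional[Host]:
    """Pick a host for the systems that were running on `failed`.

    Returns None when no host in the cluster is available.
    """
    survivors = [h for h in cluster if h.up and h.name != failed.name]
    same_site = [h for h in survivors if h.site == failed.site]
    candidates = same_site or survivors   # cross-site only if the site is gone
    return candidates[0] if candidates else None


if __name__ == "__main__":
    cluster = [
        Host("esx-a1", "A", up=False),
        Host("esx-a2", "A", up=False),
        Host("esx-b1", "B", up=True),
    ]
    target = pick_restart_host(cluster[0], cluster)
    print(target.name if target else "no host available")
```

With every Site A host down, as in the usage example, the only candidate left is in Site B, which mirrors the site-loss case described in the talk.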
A closer look at the CAD failover process. We keep going back to CAD because it encompasses just about all of our redundancy techniques: hardware, VM/HA and database mirroring. Also, of the three primary systems in the center (Phone, CAD and Radio), the CAD system was the biggest challenge because it doesn’t have its own built-in hardware based redundancy, whereas the phone and the radio do. So a call comes in to the center and the 911 operator is actively working the call. Let’s say it’s an EMS call. The EMS operator may not be the first operator to take the call, but they are immediately notified through the system and they begin the process of dispatching EMS. All of a sudden something happens to the primary CAD server. It could be a hardware failure or a blue screen (yes, they happen in a 911 center too). Whatever the case may be, that server is no longer available. Processing immediately fails over to the secondary CAD server, and within one or two packets the stations all begin processing on the secondary CAD server. No loss of dispatch capability is experienced. Now the CAD system is running from a VM, and HA and VMotion are an option for redundancy as well. It was our intention to keep both the primary and secondary CAD servers in VMs, but again the vendor was reluctant to support that configuration and wanted at least the primary server to be hardware based. However, we have experienced these seamless failovers several times without a problem and in fact have run the CAD system in a VM for extended periods without issues. I’m now going to turn the presentation back over to Dan.
May 2009 CAD based dispatch via Command vehicle – mobile dispatch terminal reaching into VM based 911 GIS dispatch application
3 takeaways from this session?
- Migration facilitator
- Survivability of the business: VMware is a key player!
- Drive towards the best solution for your business
“Keep Moving Forward”
- Big goals continue to 9-1-1 Next Generation
- Taking it on the road. Leveraging a custom built, state of the art command vehicle as a rolling datacenter, online/offline access. Mobile VMware servers, solid state disks…
The previous slide shows the Mobile Command Post getting ready for deployment to support Allegheny County for the September G20 Summit. The command post can support direct dispatch as though the operator is sitting in the PSAP. It is fully functional with VoIP and CAD and a whole host of capabilities designed to be seamlessly integrated with other counties’ command posts.