OSMC 2013 | Enterprise Platforms Monitoring at s IT Solutions AT by Johannes Egger

Enterprise Platforms Monitoring
at s IT Solutions AT, the IT service provider of Erste Group
OSMC 2013
Nuremberg, October 2013
Johannes Egger johannes.egger@s-itsolutions.at

Erste Group at a Glance
Erste Group
• 16.6 million customers
• EUR 213 billion total assets
• 47,000 employees
• 2,900 branches in 7 countries
• Market leader in AT, CZ, RO, SK
s IT Solutions AT
• IT service provider of Erste Group and the savings banks (Sparkassen) in Austria
• Development, implementation, support and service of IT solutions in Austria and in Central and Eastern Europe
• Data center operations for Erste Group, on-site technical service in Austria
29.10.2013
Enterprise Platforms Monitoring at s IT Solutions AT, the IT service provider of Erste Group
- Page 2

EP Monitoring Facts and Figures
Active checks
• 2,850 monitored hosts
• 42,000 monitored services
• Average check interval: 5 minutes
• Environment designed for at least 65,000 checks / 5 minutes
Passive checks (EventDB)
• Typically ~15,000 – 25,000 events per day (many "info" events)
• Theoretically ~1 million events per day can be processed; 350,000 per day tested
UNIX            950 hosts (Linux, Solaris, AIX)
Windows         1,400 hosts
Applications    2,000 apps (apache, tomcat, jboss, weblogic, …)
Databases       320 instances (Oracle, PostgreSQL, Informix, Sybase)
Storage         100 SAN & storage components (NetApp, VMAX, VNX, …)
Virtualization  75 ESX hosts, 3 Virtual Centers
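
To put the sizing figures above in perspective, the following Python sketch converts them into average per-second rates. It only does arithmetic on numbers taken from the slide; the function and variable names are illustrative and not part of any tool mentioned in the talk.

```python
FIVE_MINUTES = 5 * 60      # seconds
ONE_DAY = 24 * 60 * 60     # seconds


def per_second(count, window_seconds):
    """Average rate per second for `count` items spread evenly over a window."""
    return count / window_seconds


# All input figures are taken from the slide above.
current_check_rate = per_second(42_000, FIVE_MINUTES)
design_check_rate = per_second(65_000, FIVE_MINUTES)
typical_event_rate = per_second(25_000, ONE_DAY)        # upper end of the typical range
tested_event_rate = per_second(350_000, ONE_DAY)
theoretical_event_rate = per_second(1_000_000, ONE_DAY)

print(f"active checks today:       {current_check_rate:.1f} / s")
print(f"active checks, design:     {design_check_rate:.1f} / s")
print(f"passive events, typical:   {typical_event_rate:.2f} / s")
print(f"passive events, tested:    {tested_event_rate:.2f} / s")
print(f"passive events, theory:    {theoretical_event_rate:.2f} / s")
```

These are averages over the whole window. Events in particular rarely arrive evenly over a day, so peak rates can be well above these values.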

Contents
■ Goals and requirements
■ Proof of concept, performance tests
■ Large installation: problems & solutions
■ Technical architecture
■ Summary & discussion
■ Live sample

Initial Situation
• Tivoli / TEC in use company-wide
• Replacement by BMC failed
• New plan: company-wide monitoring system with interfaces to "distributed" monitoring systems in the various business units (midrange, network, mainframe, …)
• Each business unit monitors the technical components it is responsible for
Company-wide "Central Monitoring System" (CeMOS)
• Interface to monitoring backends (midrange, network, mainframe, …)
• End-user interface
• Service views
• Event correlation (the TEC correlation engine is replaced by Omnibus)
• Interfaces to incident, problem and change management
• SLA reporting, performance / capacity reports

Goals and Requirements
Business unit Enterprise Platforms: "EP Monitoring" (EPM)
• One of the backends for CeMOS
• Replacement of the Tivoli endpoints and their functionality by Icinga
• Monitoring events are forwarded to CeMOS
• Performance data is made available to CeMOS
• Web UI for operations and integration
• No customer access
Extensive scope within Enterprise Platforms:
• UNIX: AIX (incl. VIO Server), Solaris (SPARC, x86), Linux (RHEL5 / RHEL6)
• Windows: Windows Server 2008, Windows Server 2012
• Applications: Apache web server, tomcat, jboss, weblogic, etc.
• Database systems: Oracle, PostgreSQL, Informix, Sybase, MySQL
• SAN/storage: FC switches, storage systems (NetApp, VMAX, VNX, …)
• Virtualization: VMware infrastructure (vCenter, ESX hosts, …)
• Backup/restore, hardware monitoring

Initial Concept - Overview

Project Timeline | 1
Initial proof of concept with Icinga (POCv1)
• Gaining experience with Icinga, integration of servers, apps & DBs
Evaluation of alternatives (Zabbix, …)
Project kickoff: workshops, project scope, architecture design
• Categorization of system components, definition of default checks
• Sizing: executing 65,000 checks / 5 minutes must be possible!
Project phase Proof of Concept II (POCv2): goals (duration: 9 weeks)
• Proof of Concept II environment setup (future production)
• Environment assessment by Netways
• Integration of additional system components (mainly storage & virtualization, Windows)
• Performance tests: reach 65k checks / 5 min. or POCv2 fails!
• Simulation of 65k checks / 5 min. by reducing the check interval

Project Timeline | 2
After Proof of Concept II
• Lessons learned / fixes
• Automation mechanisms (configuration & deployment)
• Agent / plugin rollout (continuous)
Event DB for passive checks
• HTTP interface, Python script for submitting events (send_eventdb)
Custom checks
• Tivoli scripts: analysis of the current situation, gaps versus the defined default checks
• Development of custom plugins (AIX, Solaris, …)
Parallel operation of Tivoli and Icinga
• Integration tests
• Shutdown of Tivoli functionality
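
The slides only state that events are submitted to the Event DB over an HTTP interface by a Python script called send_eventdb; the actual API is not shown. The following is a minimal sketch of what such a client could look like, using only the Python standard library. The URL, the JSON encoding and all field names (host, program, priority, message) are assumptions made for illustration.

```python
import json
import urllib.request

# Hypothetical endpoint: the real EventDB URL is not given in the slides.
EVENTDB_URL = "http://eventdb.example.invalid/api/events"


def build_event_request(host, program, priority, message, url=EVENTDB_URL):
    """Build (but do not send) an HTTP POST request describing one event.

    The payload layout is an assumption, not the real EventDB format.
    """
    payload = {
        "host": host,
        "program": program,
        "priority": priority,
        "message": message,
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_event(request, timeout=10):
    """Send a prepared request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


request = build_event_request(
    "db01", "oracle", "warning", "tablespace USERS 91% full"
)
print(request.get_method(), request.full_url)
```

Building the request separately from sending it keeps the payload logic testable without a running Event DB.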

Schedule Overview
[Figure: Gantt chart "EP Monitoring: Project timeline" (Johannes Egger), May 2012 to May 2013; project status by PM for the status period 27.05.13 to 31.05.13]
Project start: 07-2012; project end: 05-2013; original end: 03-2013
Phases / deliverables
• Proof of Concept I (building know-how; servers, apps & DBs only)
• Project setup, scope definition
• Technical workshops
• Environment setup
• Proof of Concept II
• Performance tests
• Implementation of Proof of Concept II findings
• Implementation of plugins, configuration management, and automation
• Agent rollout
• Plugin development & rollout
• Replacement of Tivoli checks
Milestones: Netways architecture review, OSMC 2012, kick-off, go-live, EOY freeze

Proof of Concept – Performance Tests | 1

Proof of Concept - Performance Tests | 2
Initial situation
• About 3,500 active service checks, 450 of them unknown / warning / critical
• About 350 hosts
Approach
• Simulation of 65k checks by gradually reducing the check interval
• Assumption: the average plugin execution time is negligible
• The POCv2 environment is monitored from the POCv1 environment (systems, Icinga latency, …)
Performance tests
• Run #1: MON 08:00 – 15,000 checks / 5 mins
• Run #2: … – 30,000 checks / 5 mins
• Run #3: … – WED 08:00 75,000 checks / 5 mins
• 75,000 checks / 5 mins for a duration of 12 hours
• More than 100,000 checks / 5 mins for a duration of one hour
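
The load simulation above relies on simple arithmetic: with a fixed number of real service checks, shortening the check interval raises the number of executions per 5-minute window. The sketch below works out the interval needed for each target rate. It assumes exactly 3,500 checks (the slide says "about") and, as in the talk, negligible plugin execution time; the function name is illustrative.

```python
WINDOW_SECONDS = 5 * 60


def interval_for_target(real_checks, target_per_window, window=WINDOW_SECONDS):
    """Check interval in seconds at which `real_checks` configured checks
    produce `target_per_window` executions per window."""
    return real_checks * window / target_per_window


REAL_CHECKS = 3_500  # "about 3,500" on the slide, treated as exact here

for target in (15_000, 30_000, 65_000, 75_000, 100_000):
    interval = interval_for_target(REAL_CHECKS, target)
    print(f"{target:>7} checks / 5 min -> check interval of {interval:.1f} s")
```

Under these assumptions the 75,000-check run corresponds to every check being executed every 14 seconds.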

Proof of Concept - Performance Tests | 3
Runs #1, #2
• App and DB on separate systems => I/O has a massive performance impact, Icinga latency ~38 secs
• Mitigation: app and DB run together on one physical system
• Optimizations: sync commits in the DB, tuned-adm 'throughput-performance'
• dnscache enabled on the Icinga workers (firewall sessions!)
• Gearman worker parameter tuning
During run #3
• DB backup during the load tests succeeded
• Icinga latency initially below 3 seconds, rising continuously with application uptime
• 75k checks / 5 min. simulated over a period of 12 hours
• Loss of one data center site simulated (50% of the workers disabled) => additional CPU cores required to handle the load
• Switch from 6 virtual workers to 3 physical workers
Proof of Concept II successful!

Problems and Solutions | 1
Latency problems after the proof of concept
• High active service latency leads to delays in check scheduling
• Latency rises continuously with uptime; not solved by upgrading to 1.8.x
• Finding: latency correlates with memory leaks in ido2db (directly proportional)
• The problem also occurs with MySQL, but less severely
Root cause
• Massive memory leaks in ido2db (incorrect libdbi handling), patch upstream
• Patch: additional option max_entries_processed, analogous to Apache's MaxRequestsPerChild, as a workaround for memory leaks
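
The idea behind a MaxRequestsPerChild-style limit is that a long-running worker is replaced after it has handled a fixed number of items, so leaked memory is released before it accumulates. The Python sketch below illustrates that pattern with a simulated leak. It is not the ido2db patch: only the option name max_entries_processed comes from the slide, while the classes and the leak size are invented for illustration.

```python
class LeakyWorker:
    """Stand-in for a worker process that leaks a little memory per entry."""

    def __init__(self):
        self.leaked_bytes = 0
        self.processed = 0

    def process(self, entry):
        self.leaked_bytes += 128  # simulated leak per processed entry
        self.processed += 1


class RecyclingPool:
    """Replaces its worker once it has processed `max_entries_processed` entries."""

    def __init__(self, max_entries_processed):
        self.max_entries_processed = max_entries_processed
        self.worker = LeakyWorker()
        self.restarts = 0
        self.peak_leak = 0

    def handle(self, entry):
        if self.worker.processed >= self.max_entries_processed:
            # Discarding the worker frees whatever it has leaked so far.
            self.worker = LeakyWorker()
            self.restarts += 1
        self.worker.process(entry)
        self.peak_leak = max(self.peak_leak, self.worker.leaked_bytes)


pool = RecyclingPool(max_entries_processed=1000)
for entry in range(10_000):
    pool.handle(entry)

print(f"restarts: {pool.restarts}, peak leaked bytes: {pool.peak_leak}")
```

Without recycling, the simulated leak would grow to 1,280,000 bytes over the same 10,000 entries; with the limit it never exceeds 128,000.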

Latency problems after the proof of concept
• After the patch: no memory leaks!
Problem not known in the community?
• "Small" installations are not affected
• MySQL vs PostgreSQL
• Sites with multiple Icinga instances ("instantiation")
• Daily restarts: latency reset
• In some installations the active service latency is unknown or not monitored
Further problems
• Checks scheduled twice -> resolved by upgrading to 1.8.x
• Session cookies not deleted on logout -> switch to LDAP auth
• Few installations use PostgreSQL (SQL statements in Icinga-Web, …)
• Bug fixes in some plugins
• Support for Linux is very good, other platforms are problematic

Blocksize issues
• NRPE's fixed block size (compile-time setting!) is incompatible with large installations
• Small vs large block size (network traffic!)
• Even a large block size is not sufficient for some checks (Oracle tablespaces)
• Mitigation: multi-line patch (up to 32 x 1K chunks), Icinga-Core block size: 32K
Scalability, performance
• Nagios/Icinga is not designed to scale: Icinga = single-core app; ido2db: blocking I/O, no prepared statements, no thread handling, no connection pool, etc.
• Workarounds: instantiation vs. Gearman
• Icinga-Web performance
Agent / plugin concept of Icinga
• Difficult across different platforms
• Dependencies in an inhomogeneous environment (Perl, Python versions, etc.)
• Cf. Zabbix agent / number of KPIs out of the box
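
The multi-line mitigation listed under "Blocksize issues" amounts to splitting long plugin output into fixed-size chunks and reassembling them on the receiving side. The sketch below shows only that idea in Python, using the limits from the slide (up to 32 chunks of 1 KiB). It does not reproduce the NRPE wire format or the actual patch; the function names and the truncation behaviour are assumptions.

```python
CHUNK_SIZE = 1024   # 1K per chunk, as on the slide
MAX_CHUNKS = 32     # up to 32 chunks, i.e. 32K in total


def split_output(output, chunk_size=CHUNK_SIZE, max_chunks=MAX_CHUNKS):
    """Split plugin output (bytes) into at most `max_chunks` chunks.

    Output beyond chunk_size * max_chunks is truncated (an assumption).
    """
    limited = output[: chunk_size * max_chunks]
    return [limited[i:i + chunk_size] for i in range(0, len(limited), chunk_size)]


def join_output(chunks):
    """Reassemble the chunks in order on the receiving side."""
    return b"".join(chunks)


# Hypothetical long check result, e.g. one line per tablespace.
long_output = b"\n".join(
    b"TS%03d OK - 42%% used" % n for n in range(200)
)
chunks = split_output(long_output)
print(len(long_output), "bytes in", len(chunks), "chunks")
```

A fixed upper bound keeps memory use on the server predictable, at the cost of truncating anything longer than 32K.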

Architecture | 1
Icinga
• Icinga-Core 1.8.x
• gearman, mod_gearman
• Event DB (modified: HTTP API, SNMP traps)
• Interface to Central Monitoring
Web/tool layer
• Icinga-Web
• PNP4Nagios
• BPM addon, BP Views
• Jasper reporting engine
DB layer
• PostgreSQL 9.2.x

Architecture | 2

Conf. Management & Deployment | 1
Icinga Core
• Manual, file-based configuration is not practicable
• CMDB integration not possible: a large CMDB migration project is running in parallel
• Existing tools (lconf, nconf) unsuitable
"Configuration Merger"
• Merges data from different data sources into one Icinga configuration
• Files from the data sources are checked into Subversion in a specific format
• Data sources: processed CMDB exports or manually maintained lists
• UNIX: systems are assigned to predefined hostgroups depending on their platform
Conf. mgmt for plugins & NRPE (UNIX)
• Automatic deployment of NRPE & plugins via Puppet
• Plugins & NRPE command definitions are kept in Subversion, for all systems, groups of systems or individual systems
• Override mechanism: per-machine configs can override default configs
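
One way to picture the override mechanism is as a layered merge: settings for all systems come first, then settings for a group of systems, then settings for a single machine, with later layers winning. The Python sketch below assumes that precedence order and uses made-up command definitions, group names and host names. The slides do not describe the actual file format or tooling.

```python
def effective_config(defaults, group_configs, host_configs, host, groups):
    """Merge command definitions: defaults < group settings < per-host settings."""
    merged = dict(defaults)
    for group in groups:
        merged.update(group_configs.get(group, {}))
    merged.update(host_configs.get(host, {}))
    return merged


# Illustrative data only; thresholds and names are invented.
DEFAULTS = {
    "check_load": "check_load -w 8,6,4 -c 12,10,8",
    "check_disk": "check_disk -w 20% -c 10%",
}
GROUP_CONFIGS = {
    "aix": {"check_disk": "check_disk -w 15% -c 5%"},
}
HOST_CONFIGS = {
    "aixdb01": {"check_load": "check_load -w 16,12,8 -c 24,20,16"},
}

config = effective_config(DEFAULTS, GROUP_CONFIGS, HOST_CONFIGS, "aixdb01", ["aix"])
for name, command in sorted(config.items()):
    print(f"{name}: {command}")
```

Here aixdb01 inherits the group's check_disk thresholds and keeps its own check_load thresholds, while a host without overrides simply gets the defaults.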

Icinga deployment
• The "Configuration Merger" generates the Icinga configuration once a day from the data in Subversion
• If the configuration is valid, Icinga is reloaded
• If there are configuration problems, the current config is kept and the Icinga admins are notified
Organizational responsibility
• The Icinga environment is the responsibility of two departments (UNIX, Application & DB Management)
• The departments (UNIX, Windows, App&DB, Storage&Virt) are themselves responsible for their plugins and their part of the Icinga configuration
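
The daily deployment described above follows a validate-then-activate pattern: a freshly generated configuration only replaces the running one if it passes verification. The sketch below shows that control flow in Python. The function names are hypothetical, and the validator assumes that an `icinga` binary is on the PATH and that `icinga -v <main config file>` is used for verification, as in Icinga 1.x; how the original setup validates, reloads and notifies is not described in the slides.

```python
import subprocess


def deploy(candidate_cfg, validate, reload, notify):
    """Activate `candidate_cfg` only if it validates; otherwise keep the
    current configuration and report the problem."""
    ok, details = validate(candidate_cfg)
    if ok:
        reload(candidate_cfg)
        return True
    notify(details)
    return False


def validate_with_icinga(cfg_path):
    """Verify a main config file with `icinga -v` (assumed binary name/path)."""
    result = subprocess.run(
        ["icinga", "-v", cfg_path], capture_output=True, text=True
    )
    return result.returncode == 0, result.stdout + result.stderr


# Dry run with stand-in callables instead of a real Icinga installation.
log = []
deploy(
    "generated/icinga.cfg",
    validate=lambda path: (False, "Error: could not find host 'db99'"),
    reload=lambda path: log.append(("reload", path)),
    notify=lambda details: log.append(("notify", details)),
)
print(log)
```

Passing the validate, reload and notify steps in as callables makes the decision logic testable without touching a live monitoring server.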

Icinga-Web

BP Views | 1
Icinga-Web
• Not suitable for large installations
• Need to group and correlate checks by "products"
• Business Process Addon: functionality OK, UI not suitable for our purposes
• "Top Level Views" (ING-DiBa, OSMC 2012): clear UI, but no correlation
BP Views
• Universal business process view UI
• User interface similar to "Top Level Views" (ING-DiBa, OSMC 2012)
• Additionally the functionality of the Business Process Addon (event correlation)
• Multiple levels: environments, product groups and products / business cases
• Dashboards: what is displayed (e.g. for different customers / teams)
• Detail view for products or business cases (host and service checks)
• Service status taken directly from Icinga
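
Grouping checks into products and product groups implies some rule for rolling service states up the hierarchy. The sketch below uses the simplest such rule, worst state wins, with the standard Nagios/Icinga service state codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The severity ordering, the nested-dict layout and all product and check names are assumptions for illustration; the slides do not describe how BP Views actually correlates states.

```python
STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

# Assumed severity ranking: OK < WARNING < UNKNOWN < CRITICAL.
SEVERITY = {0: 0, 1: 1, 3: 2, 2: 3}


def worst(states):
    """Return the most severe state code; an empty group counts as OK."""
    return max(states, key=SEVERITY.__getitem__, default=0)


def roll_up(node):
    """Aggregate a tree of groups (dicts) and service states (ints)."""
    if isinstance(node, int):
        return node
    return worst(roll_up(child) for child in node.values())


# Hypothetical environment -> product -> check hierarchy.
environment = {
    "Online Banking": {
        "web": {"apache01/http": 0, "apache02/http": 1},
        "db": {"ora01/tablespace": 0},
    },
    "Payments": {"jboss01/heap": 2, "jboss02/heap": 3},
}

for product, subtree in environment.items():
    print(f"{product}: {STATE_NAMES[roll_up(subtree)]}")
print(f"Environment: {STATE_NAMES[roll_up(environment)]}")
```

Other rules, such as requiring only a minimum number of healthy members in a cluster, would replace `worst` with a different aggregation function.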

BP Views | 2

BP Views | 3

Result of the Migration
Tivoli replacement
• Icinga & EventDB established as the monitoring system
• Checks defined and implemented per system component
• Interface to the company-wide Central Monitoring implemented
• Active: ~2,850 monitored hosts, ~42,000 monitored services (5-minute interval)
• Tivoli logfile adapter migrated 1:1
• Environment designed for 65,000 – 80,000 active checks / 5 minutes
• (Partially) automatic generation of the Icinga configuration
Furthermore …
• Icinga passive: SCOM interface implemented
• Event DB: HTTP & SNMP interface and a client for the HTTP interface implemented
• Additional checks added to the monitoring
• BP Views as the user interface

Thank you for your attention!
