
Monitoring microservices platform

7,860 views


In this talk we explore some of the tools we built at Hailo to monitor our microservices platform. By combining instrumentation, in-depth service monitoring, request tracing, event correlation and automation frameworks, we manage to present a holistic view of our infrastructure.

Published in: Technology


  1. Monitoring micro-services platform. Boyan Dimitrov, Platform Engineering @ Hailo, @nathariel
  2. Outline
     • Intro to the Hailo world
     • Platform Overview
     • Monitoring Evolution
  3. The Platform. Image: “Troll a platform” by Swinsto101 / CC BY-SA 3.0 / desaturated from original
  4. Platform specifics
     • SOA based on Go (and Java…)
     • 1000+ AWS instances spanning multiple regions
     • 160+ services in production
     • Designed specifically for the cloud: different building blocks and components will constantly be in flux, broken or unavailable.
  5. Diagram labels: two regions, eu-west-1 and us-east-1, each showing Proxy Layer, Message Bus+, Go Services, Java Services, C*
  6. Inside an environment (diagram labels): CI Pipeline (Janky/Jenkins), Amazon S3, Docker Registry, Provisioning Manager, Provisioning Service (shown three times)
  7. A micro-service under the hood
     Diagram labels: Service (Handler, Logic, Storage); platform-layer: library for abstracting service-to-service comms; service-layer: self-configuring external service adapters
     Any service gets for free:
     • Provisioning
     • Discovery
     • Configuration
     • Authentication/Authorization
     • A/B testing capabilities
     • Self-configuring connectivity to third-party services
     • Monitoring
     • Instrumentation
  8. Mission:
     Define high-level platform and business metrics
     Gather as many insights as possible
     Add automatic failover and recovery capabilities
     Image: “Apollo 8 Launch Control Room” by Tfawls / desaturated from original
  9. Aspiration vs Reality (diagram labels): PHP, Java, Host Instance, Zabbix Agent, StatsD, Carbon, Graphite, Zabbix, CloudWatch
  10. Challenges
     • A single StatsD instance and a generic Graphite setup cannot cope with all the traffic (surprise!)
     • No easy way of generating and searching for graphs quickly
     • We didn’t instrument everything
     • “Traditional” monitoring systems can only give basic app insights
     • Setting up app templates is a daunting manual process and does not scale
     • No in-depth visibility into our main KPIs
     • No way of identifying platform / release / config / cloud infrastructure changes
  11. Instrumentation++. Image: “Airplaine board” by Smithore / desaturated from original
  12. Iterate on what we already know (diagram labels): Host Instance, CollectD, StatsD, Zabbix Agent, Relay, Cache (shown three times), Graphite, Zabbix, CloudWatch
  13. Result
     • Scaling up Graphite and moving StatsD to every box allowed us to collect millions of metrics
     • Instrumenting everything gives us a lot of insights.
     • Grafana allows us to quickly build, store and search for important graphs. Widely adopted by the whole development team!
     Tip: Focus on the upper 95th and 99th percentiles and work out from there.
  14. Monitoring & Instrumentation
  15. Rethink / Raziel Service Monitoring
  16. Monitoring V2 (diagram labels): New Service, Publish, Healthchecks, Message bus, Monitoring Service, Provisioning Manager, Binding, Discovery, Provisioning Service (shown twice), Host Instance (shown twice)
  17. healthcheck.Register(&healthcheck.HealthCheck{
          Id:             "MyHCId",
          ServiceName:    ServiceName,
          ServiceVersion: ServiceVersion,
          Hostname:       Hostname,
          InstanceId:     InstanceID,
          Interval:       time.Minute,
          Checker:        myCallbackFunc,
          Priority:       hc.Warning,
      })
  18. Service level health checks
  19. Result
     • Service health checks give us in-depth service performance details
     • The monitoring service has a holistic view of our platform health and can identify degraded availability zones
     • Developers can identify what is important for their service and track & alert on it.
  20. Trace++ Monitoring & Instrumentation. Image: “Abstract conception of network and communication” by Leszekglasner / desaturated from original
  21. Trace Architecture (diagram labels): Host Instance, CollectD, StatsD, Zabbix Agent, Provisioning Service, Phosphor, Async UDP, Publish, Trace Service, In-memory Aggregates, Optional persistent storage, Dashboards, Monitoring
  22. Live traffic flows
  23. Live traffic flows
  24. Automatic request tracing
  25. Result
     • Trace incoming requests and pinpoint bottlenecks & SLA offenders
     • Easily identify problems on the request/response path
     • Quickly find out exactly which services participate in the request path
  26. Robomon
  27. Automated Jobs
  28. Result
     • Identify business-impacting issues immediately
     • Highlight the service on the critical path that is most likely responsible for the problems
  29. Event Correlation. Image: “Connection” by A2bb5s / desaturated from the original
  30. Diagram labels: Host Instance, CollectD, StatsD, Zabbix Agent, Provisioning Service, Phosphor, Publish, SNS, Platform Events, Whisper Service, Persistent Storage, Dashboards, Monitoring
  31. Result
     • Answer to the most important “Did anything change?” question
     • Audit trail for any platform changes
     • Holistic view of our platform status
  32. It is not over yet!
     ++ Machine Learning
     ++ Event source weighting
  33. Thanks! PS. We’re hiring! @nathariel, boyan@hailocab.com, London DevOps
