Real time monitoring-alerting: storing 2Tb of logs a day in Elasticsearch

Building any moderately complex system in the cloud requires telemetry to be the number one concern: you would probably even start by planning and building it first (or perhaps you wish you had!). As quoted by Werner Vogels, “Netflix is a log generating application that happens to stream video”. Logging, monitoring and alerting have been central to the success of Netflix.

At ASOS, we currently generate more than 1TB of logs daily, which are stored and analysed in our Elasticsearch cluster for monitoring and alerting purposes. The ELK stack (Elasticsearch, Logstash and Kibana) has been a very popular tool for logging and monitoring, but tuning Elasticsearch to handle such a load is an art form in itself.

In this talk, we start with an overview of the ELK stack (at ASOS we use ConveyorBelt instead of Logstash, so ECK for us) and then share what we have learned from scaling our Elasticsearch cluster for this load. From tuning various configuration parameters to planning your shards and mapping strategy, this talk aims to equip you to build or tune an ELK stack in your own company.

Real time monitoring-alerting: storing 2Tb of logs a day in Elasticsearch

  1. 1. >>> Realtime Monitoring Storing 2TB of logs a day in Elasticsearch @aliostad Ali Kheyrollahi, ASOS
  2. 2. @aliostad
  3. 3. @aliostad The joy of hitting F5
  4. 4. @aliostad The joy of a single process
  5. 5. @aliostad The joy of having a production-size database locally
  6. 6. @aliostad The joy of having a dev machine build running all services
  7. 7. @aliostad /// What if your system is “Microservices”?
  8. 8. @aliostad
  9. 9. @aliostad - +40 platform teams - 100s of microservices - some services >10k rps
  10. 10. @aliostad > stackoverflow > £1.5 bln global fashion destination > 35% year-on-year
  11. 11. @aliostad /// elements of observability
  12. 12. @aliostad /// observability >>> Control Theory “a measure for how well internal states of a system can be inferred by knowledge of its external outputs”
  13. 13. @aliostad Logging Telemetry Tracing /// mixed concerns
  14. 14. @aliostad Tracing Logging Credit: Peter Bourgon Events Aggregations Request Scope Telemetry Alerting /// Scope
  15. 15. @aliostad Logging Telemetry Tracing Alerting Log4Net ✓ Time-series DBs ✓ ✓ ✓ Zipkin ✓ ✓ ✓ ✓ Prometheus ✓ ✓ ✓ Elasticsearch ✓ ✓ ✓ ✓ New Relic* ✓ ✓ ✓ Circonus* ✓ ✓ ✓ * paid services /// comparison
  16. 16. @aliostad /// aggregations 1 At source (perf counters) 2 At the storage (Circonus) 3 In the visualisation tool (Kibana) 4 In the pipeline (Riemann)
  17. 17. @aliostad /// data sources
  18. 18. @aliostad /// use cases • Metrics (Visualisation) • CPU, number of errors • Response time percentiles • Full-text search capability (logs and errors) • Correlating across services • Alerting when there is an SLO breach
  19. 19. @aliostad /// azure logs • Azure Diagnostics (WADLogs table) • IIS logs • VM Windows Event Logs • Performance Counters (standard + custom)
  20. 20. @aliostad /// application logs Microservice ETW SLAB Azure Table Sink ETW Application Logs EC e.g. CRIT_ORD_API_DatabaseDown
  21. 21. @aliostad /// instrumentation logs Microservice Perf Counters SLAB Azure Table Sink ETW Instrumentation Logs Azure Performance Counter Logs PerfIt Azure Agent
  22. 22. @aliostad /// ingest process
  23. 23. @aliostad /// pull vs push
  24. 24. @aliostad /// logstash QUEUE VM Logstash collectd syslog Logstash app logs nginx To Elasticsearch UDP File-tailing
  25. 25. @aliostad /// ConveyorBelt Performance Counters ConveyorBelt Azure WAD logs ETW Logs Elasticsearch Instrumentation Logs IIS Logs Woodpecker Outputs (Pull Logs) Sources Config Up to 2TB/day
  26. 26. @aliostad /// ConveyorBelt Source Source Source Config Scheduler Parser units of work To Elasticsearch Source Actor Actor Actor
  27. 27. @aliostad /// Woodpecker Source Source Source Config Pull Telemetry record Azure Table (Regular Intervals) Source
  28. 28. @aliostad /// elasticsearch intro
  29. 29. @aliostad /// elasticsearch • Linearly-scalable and HA* search (and visualisation) • ELK Stack • Open Source (enterprise features require license) • Speaks JSON • REST API and very developer-friendly
  30. 30. @aliostad /// cluster • Cluster: No ZK • Gossip / discovery • Node type: • master - leader election • data • client
  31. 31. @aliostad /// data hierarchy • Index • Shard • Replica • Type/Mapping • Document: JSON, immutable, versioned INDEX MAPPING MAPPING MAPPING … Document Document Document …
  32. 32. @aliostad /// data types • JSON data types: bool, long, float, string*, datetime • Array? • String tokenisation/analysers • Best of both worlds? • Object • nested { "a": { "b": { "c": 42 } } }
  33. 33. @aliostad /// doc operations • Upsert • Delete • Partial Update • Search (JSON-based query DSL)
  34. 34. @aliostad /// DEMO 1
  35. 35. @aliostad /// more advanced
  36. 36. @aliostad /// index shard/replica Write Read Master Shard Shard Replica Replica Index Index
  37. 37. @aliostad /// index • Daily indices • Hot/Cold with index alias • Creation => templates • Settings: • refresh_interval
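
The daily-indices convention on this slide can be sketched as follows. This is a minimal illustration, not the talk's code: the `logs` prefix and the `YYYY.MM.DD` suffix are assumptions chosen for the example.

```python
from datetime import date, datetime, timezone


def daily_index_name(prefix: str, day: date) -> str:
    """Return the index name for one calendar day, e.g. logs-2017.06.01."""
    return f"{prefix}-{day:%Y.%m.%d}"


def index_for_event(prefix: str, timestamp: datetime) -> str:
    """Route an event to the index of the UTC day it occurred on.

    `timestamp` is expected to be timezone-aware.
    """
    return daily_index_name(prefix, timestamp.astimezone(timezone.utc).date())
```

With one index per day, ageing data out becomes a matter of moving or deleting whole indices rather than deleting individual documents.
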
  38. 38. @aliostad /// mapping/type • Schema • How many mappings per index? • Dynamic mapping • Operations • Upsert • Delete
  39. 39. @aliostad /// templates PUT https://es_cluster:9200/_template/my_template { "template": "my_index_*", "settings": {…}, "mappings": { "mapping_1": {…}, "mapping_2": {…} } }
  40. 40. @aliostad /// bulk api • Always use Bulk API to index documents • Batches of 1K-5K documents • Watch-out for error 429 and back-off pattern • Check bulk rejects [change bulk queue length]
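
The batching and back-off advice on this slide might look roughly like the sketch below. It assumes a hypothetical `send` callable that posts one batch to the Bulk API and returns the HTTP status code; the retry count and delays are illustrative values, not figures from the talk.

```python
import random
import time
from typing import Callable, Iterable, Iterator, List


def batches(docs: Iterable[dict], size: int = 1000) -> Iterator[List[dict]]:
    """Yield lists of at most `size` documents."""
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def send_with_backoff(
    send: Callable[[List[dict]], int],  # hypothetical: returns an HTTP status
    batch: List[dict],
    max_retries: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry a batch while the cluster answers 429 (too many requests)."""
    for attempt in range(max_retries + 1):
        status = send(batch)
        if status != 429:
            return 200 <= status < 300
        if attempt < max_retries:
            # Exponential back-off with jitter so clients do not retry in step.
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    return False
```

Rejections can also be reported per item inside an otherwise successful bulk response, so a fuller client would inspect the response body as well as the status code.
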
  41. 41. @aliostad /// DEMO 2
  42. 42. @aliostad /// physical architecture
  43. 43. @aliostad /// resources node type • Data: Disk, RAM, CPU, Network • Master: CPU, Network, (RAM) • Client: Network, CPU, (RAM) • Kibana: CPU, Network, RAM
  44. 44. @aliostad /// simple data/master/kibana • CPU • RAM • Disk • Network
  45. 45. @aliostad /// next level d a t a / m a s t e r kibana
  46. 46. @aliostad c l i e n t k i b a n a m a s t e r d a t a traffic traffic
  47. 47. @aliostad /// our setup 3x client 2x kibana 3x master 20x data (hot) 10x data (warm) traffic traffic ARM Template Desired State Configuration
  48. 48. @aliostad /// hot/warm • Hot => CPU, Warm => Memory • Index Allocation/Routing • At the index:
 "index.routing.allocation.require.box_type" : "warm" • At the node (elasticsearch.yml)
 box_type: warm
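
As a rough sketch of how the index-level setting above could be driven by index age: the helper below picks `hot` or `warm` for a daily index and builds the settings body, which would then be applied through the index settings API. The seven-day hot window is an assumed value for illustration only.

```python
from datetime import date


def box_type_for(index_day: date, today: date, hot_days: int = 7) -> str:
    """Keep recent indices on hot nodes; hot_days is an assumed threshold."""
    age_in_days = (today - index_day).days
    return "hot" if age_in_days < hot_days else "warm"


def allocation_settings(box_type: str) -> dict:
    """Settings body that pins an index to nodes tagged with box_type."""
    return {"index.routing.allocation.require.box_type": box_type}
```
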
  49. 49. @aliostad /// security • x-pack: SSL + username/password security (basic, Kerberos) • No Federated Authentication • Proxy (nginx, apache, etc) • IP-whitelisting
  50. 50. @aliostad /// administration • Like all: logs, slow query logs, etc • top, htop, iostat • collectd + local logstash • two clusters, each watching the other • curator for hot/cold and deleting old indices
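
The talk uses curator for deleting old indices. The selection logic behind such a retention job can be sketched as below; this is not curator's API, just an illustration that assumes daily indices named `<prefix>-YYYY.MM.DD` and a hypothetical retention period.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List


def expired_indices(
    names: Iterable[str], prefix: str, today: date, keep_days: int
) -> List[str]:
    """Return daily indices older than keep_days; ignore everything else."""
    cutoff = today - timedelta(days=keep_days)
    expired: List[str] = []
    for name in names:
        if not name.startswith(prefix + "-"):
            continue  # e.g. the kibana indices, which must not be touched
        try:
            day = datetime.strptime(name[len(prefix) + 1:], "%Y.%m.%d").date()
        except ValueError:
            continue  # not a daily index, leave it alone
        if day < cutoff:
            expired.append(name)
    return expired
```
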
  51. 51. @aliostad /// tracing
  52. 52. @aliostad /// ActivityId Microservice Id IdId Thread Local Storage Id To Other APIs Id Event
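
The ActivityId flow on this slide (take the id from the incoming request, keep it in thread-local storage, attach it to events and to calls to other APIs) can be sketched in Python as below. Two assumptions: `contextvars` stands in for thread-local storage, and the `X-Activity-Id` header name is hypothetical.

```python
import contextvars
import uuid
from typing import Dict, Optional

HEADER = "X-Activity-Id"  # hypothetical header name

_activity_id = contextvars.ContextVar("activity_id", default=None)


def begin_request(incoming_headers: Dict[str, str]) -> str:
    """Reuse the caller's id if present, otherwise start a new activity."""
    activity_id = incoming_headers.get(HEADER) or str(uuid.uuid4())
    _activity_id.set(activity_id)
    return activity_id


def outgoing_headers() -> Dict[str, str]:
    """Headers to attach when calling other APIs from this request."""
    activity_id = _activity_id.get()
    return {HEADER: activity_id} if activity_id else {}


def log_event(message: str) -> Dict[str, Optional[str]]:
    """Build a log event that carries the current activity id."""
    return {"message": message, "activityId": _activity_id.get()}
```

Because every service logs the same id, a single search on that field in Elasticsearch returns the whole request path across services.
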
  53. 53. @aliostad /// alerting
  54. 54. @aliostad /// watcher • Trigger • Input • Condition • Action
  55. 55. @aliostad /// watcher notes • All watches get executed on the active master • Use Action Throttling to limit alerts • Use watch templates when you see common patterns • Use transforms and metadata to include context in actions/emails
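
To make the four parts of a watch concrete, here is a sketch that builds a watch definition as a Python dict. The field names `level` and `@timestamp`, the intervals and the action name are assumptions for illustration. The path `ctx.payload.hits.total` follows the older response shape and may need adjusting on newer Elasticsearch versions.

```python
import json


def error_count_watch(index_pattern: str, threshold: int, email_to: str) -> dict:
    """Build a watch that emails when recent error logs exceed a threshold."""
    return {
        # Trigger: how often the watch runs.
        "trigger": {"schedule": {"interval": "1m"}},
        # Input: the search whose result becomes ctx.payload.
        "input": {
            "search": {
                "request": {
                    "indices": [index_pattern],
                    "body": {
                        "size": 0,
                        "query": {
                            "bool": {
                                "filter": [
                                    {"term": {"level": "error"}},
                                    {"range": {"@timestamp": {"gte": "now-5m"}}},
                                ]
                            }
                        },
                    },
                }
            }
        },
        # Condition: fire only above the threshold.
        "condition": {"compare": {"ctx.payload.hits.total": {"gt": threshold}}},
        # Action: throttled so a sustained breach does not flood inboxes.
        "actions": {
            "notify_team": {
                "throttle_period": "15m",
                "email": {
                    "to": email_to,
                    "subject": "Error threshold breached",
                    "body": "{{ctx.payload.hits.total}} errors in the last 5 minutes",
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(error_count_watch("logs-*", 100, "team@example.com"), indent=2))
```
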
  56. 56. @aliostad /// lessons learnt
  57. 57. @aliostad /// Do you speak CAP? • Consistency? Treat all data as dispensable. Back up data that gets mastered in Elasticsearch. Not a document db. • Highly-Available? For >99.9% availability use redundancy • Partition-Intolerance? Node intercommunication is highly chatty; ideally keep nodes in the same data centre and even in the same VPC (AWS)/VNet (Azure)
  58. 58. @aliostad /// Beware • Split brain is common • Data corruption is possible • Back up data that gets mastered in ES (kibana indices) • The safest route to High Availability seems to be redundancy (expensive)
  59. 59. @aliostad Thank you :) Questions…?
  60. 60. @aliostad Credits • Picture: Embroidery thread macro - https://www.flickr.com/photos/39908901@N06/ • Picture: Calculate Red - https://www.flickr.com/photos/93277085@N08/10398245145 • Picture: 1950's wristwatch workings - https://www.flickr.com/photos/134832191@N08/27301612554/ • Picture: Tokyo Tower_58 - https://www.flickr.com/photos/ajari/2756645901 • Picture: 1Bamboo and Rust - https://www.flickr.com/photos/hammershaug/5816522126/ • Picture: IMG_1899 - https://www.flickr.com/photos/johnas/9650255412/ • Picture: fan2 - https://www.flickr.com/photos/sidelong/444054290/ • Picture: Glass jar filled with pasta - https://www.flickr.com/photos/76588981@N02/16766079567/ • Picture: Rusty cogs - https://www.flickr.com/photos/paperpariah/25375888671/ • Picture: Do you see the world in different colours? - https://www.flickr.com/photos/luopl/6012467435/ • Picture: danger - https://www.flickr.com/photos/armydre2008/9650951334/ • Link: ETW equivalent for Linux - http://blogs.microsoft.co.il/sasha/2017/04/02/tracing-net-core-on-linux-with-usdt-and-bcc/