Building a multi-data-center log aggregation framework at Squarespace on top of the open source ELK stack, featuring Elasticsearch, Logstash, Kibana, Filebeat and Kafka.
12. 01 Logs for all environments: corporate, QA, staging, production
02 Logs for all software services: monolith, microservices
03 Logs for all system components: search, caching, discovery, etc.
04 Logs for all data centers
05 Enough room for ad-hoc log aggregation by different teams
06 Scaling != $$$
Goals
22. Elastic Stack
Application Process
(e.g. Java)
Filebeat Tags: source_host and environment
Routing: automatic routing to the corresponding Kafka cluster based on data center and environment
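A minimal sketch of what this Filebeat side could look like. The paths, hostnames, and topic naming scheme are assumptions for illustration, not the actual Squarespace configuration:

```yaml
# filebeat.yml (sketch) -- tag each event with source_host and environment,
# then ship to the Kafka cluster for this data center / environment.
filebeat.inputs:
  - type: log
    paths:
      - /var/log/app/*.log          # assumption: app log location
    fields:
      source_host: "${HOSTNAME}"    # assumption: injected via environment
      environment: production
    fields_under_root: true

output.kafka:
  # assumption: broker list templated per data center and environment
  # at deploy time (e.g. by Ansible), which is what makes routing "automatic"
  hosts: ["kafka-dc1-prod-1:9092", "kafka-dc1-prod-2:9092"]
  topic: "logs-%{[environment]}"
```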
26. Elastic Stack
Kafka: 10 nodes
Ingestion bottleneck: helped identify the ingestion bottleneck and ruled out Filebeat as the root cause
Retention: gave us retention beyond Filebeat’s local buffer; we now have an 8-hour buffer
Operational issues: very high-traffic logs would rotate quickly, and Filebeat would hold onto deleted file handles and fill up disks on servers
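The 8-hour buffer boils down to one broker-side setting. A sketch of the relevant `server.properties` lines (per-topic `retention.ms` overrides would work equally well; the check interval shown is an assumption):

```properties
# Keep log segments for 8 hours before deletion
log.retention.hours=8
# How often the broker checks for expired segments (assumed default-ish value)
log.retention.check.interval.ms=300000
```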
31. Elastic Stack
Log filters: specify how to parse individual log types using the full power
of Logstash filter plugins
Logstash Indexers: 35 nodes
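A hedged sketch of what one such Logstash pipeline might look like: a Kafka input feeding per-log-type filters. The topic, type name, and broker address are assumptions:

```conf
input {
  kafka {
    bootstrap_servers => "kafka-dc1-prod-1:9092"  # assumption
    topics            => ["logs-production"]      # assumption
    codec             => "json"
  }
}

filter {
  # Per-log-type parsing: any Logstash filter plugin is available here
  if [type] == "nginx-access" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
  }
  date {
    match => ["timestamp", "ISO8601"]
  }
}
```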
32. Elastic Stack
Index definitions (new or existing): index durations and retention per
environment can be configured using Ansible and applied automatically
Handles routing within the indexers and index retention time (Curator)
Elasticsearch Workhorses: 16 nodes, 1.5 TB disk, 64 GB RAM each
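Retention of this kind is typically expressed as a Curator action file. A sketch, assuming daily indexes named by date; the prefix and the 14-day count are illustrative, not the actual per-environment values:

```yaml
actions:
  1:
    action: delete_indices
    description: "Apply per-environment retention to daily log indexes"
    options:
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: logs-production-      # assumption: index naming scheme
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 14               # assumption: retention for this environment
```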
33. Elastic Stack
Index definitions (new or existing): specify how many shards and
replicas are required; any field -> data type mappings can also be
configured using Ansible and automatically applied to the workhorses
Elasticsearch Workhorses: 16 nodes, 1.5 TB disk, 64 GB RAM each
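Shards, replicas, and mappings of this kind would typically land in a legacy-style index template (the pre-6.x format, which matches the Shield/Marvel era of this deck). The pattern, counts, and field names below are assumptions:

```json
{
  "template": "logs-production-*",
  "settings": {
    "number_of_shards": 8,
    "number_of_replicas": 1
  },
  "mappings": {
    "_default_": {
      "properties": {
        "source_host": { "type": "keyword" },
        "environment": { "type": "keyword" }
      }
    }
  }
}
```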
39. 01 Output logs in a predictable format (e.g. JSON), save a lot of time!
02 Pay attention to Elasticsearch shard sizes!
● Shard sizes should be as even as possible; we target 20-30 GB shards
● Helps when moving shards during constant cluster rebalancing
● We recommend daily or weekly indexes and tweaking retention settings
● Consider index lifespan, number of shards, and sizes of logs ingested per lifespan
Lessons Learned
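The 20-30 GB shard target above implies a simple sizing calculation per index lifespan. A hypothetical sketch (function name and the 25 GB midpoint target are assumptions):

```python
import math

def shards_for_index(ingest_gb_per_lifespan: float,
                     target_shard_gb: float = 25.0) -> int:
    """Primary shard count so each shard lands near the 20-30 GB target.

    ingest_gb_per_lifespan: GB of logs ingested over one index lifespan
    (one day for daily indexes, one week for weekly ones).
    """
    return max(1, math.ceil(ingest_gb_per_lifespan / target_shard_gb))

# e.g. a daily index receiving 200 GB/day -> 8 primary shards of ~25 GB each
print(shards_for_index(200))
```

Low-volume log types would get 1 shard and possibly a weekly index instead, which is why the lifespan, shard count, and ingest size have to be considered together.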
40. 03 Use X-Pack security (Shield)
04 Use monitoring (Marvel)
● Export the monitoring metrics from every node into a separate ES cluster
● Monitor Kibana and Logstash using that separate cluster
● We had to add two security realms to our ES configuration: LDAP, local filesystem
● The fully-privileged admin user was hitting our LDAP servers hard!
Lessons Learned
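The two-realm setup can be sketched in Shield-era `elasticsearch.yml` syntax: ordering the local file realm before LDAP means the fully-privileged admin user authenticates locally and stops hammering the LDAP servers. Realm names and the LDAP URL are assumptions:

```yaml
shield.authc.realms:
  file1:
    type: file        # local filesystem realm, checked first
    order: 0
  ldap1:
    type: ldap
    order: 1
    url: "ldaps://ldap.example.com:636"   # assumption: placeholder host
    # assumption: group-to-role mapping configured separately
```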
43. Future: Elastic Stack
2 processes per node: run two Elasticsearch processes on each server
Beefier nodes: double disk capacity from 1.5 to 3 TBs
Retention: 30 days or more of retention for all indexes, as necessary
Elasticsearch Workhorses: 16 nodes, 3 TB disk, 64 GB RAM each