2. About me
DevOps at Viki, Inc - A global
video streaming site with
subtitles.
Previously a Twitter SRE,
National University of Singapore
Twitter @angadsg,
Github @angad
4. Metrics vs Logging
Metrics
● Numeric timeseries data
● Actionable
● Counts, Statistical (p90, p99 etc.)
● Scalable cost-effective solutions
already available
5. Logging
● Useful for debugging
● Catch-all
● Full text searching
● Computationally intensive, harder
to scale
Metrics vs Logging
Metrics
● Numeric timeseries data
● Actionable
● Counts, Statistical (p90, p99 etc.)
● Scalable cost-effective solutions
already available
6. Alerting and Monitoring at Viki
Deeper level
debugging with
application logs
Success Rate
Alert for
service X
7. Logs
● Application logs - Stack Traces, Handled Exceptions
● Access Logs - Status codes, URI, HTTP Method at all levels of the stack
● Client Logs - Direct HTTP requests containing log events from client-side
Javascript or Mobile application (android/ios)
● Standardized log format to JSON - easy to add / remove fields.
● Request tracing through various services using Unique-ID at Load Balancer
13. ● Golang program that sits next to log files, lumberjack protocol
● Forwards logs from a file to a logstash server
● Removes the need for a buffer (such as redis, or a queue) for
logs pending ingestion to logstash.
● Docker container with volume mounted /var/log.
Configuration stored in Consul.
● Application containers with volume mounted /var/log to
/var/log/docker/<container>/application.log
Logstash Forwarder
14. Logstash pool with HAProxy
4 x logstash machines, 8 cores, 16 GB
RAM
7 x logstash processes per machine, 5 for
application logs, 2 for HTTP client logs.
Fronted by HAProxy for both lumberjack
protocol as well as HTTP protocol.
Easily scalable by adding more machines
and spinning up more logstash processes.
16. Elasticsearch Hardware
12 core, 64GB RAM with RAID 0 - 2 x 3TB 7200rpm disks.
20 nodes, 20 shards, 3 replicas (with 1 primary).
Each day ~300GB x 4 copies (3 + 1) ~ 3 months of data on 120TB.
Average 6k-8k logs per second, peak 25k logs per second.
https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html
18. ● < 30.5 GB Heap - JAVA compressed pointers below 30.5GB heap
● Sweet spot - 64GB of RAM with half available for Lucene file buffers.
● SSD or RAID 0 (or multiple path directories similar to RAID 0).
● If SSD then set I/O scheduler to deadline instead of cfq.
● RAID0 - no need to worry about disks failing as machines can easily be
replaced due to multiple copies of data.
● Disable swap.
Hardware Tuning
19. ● 20 days of indexes open based on available memory, rest closed - open on
demand
● Field data - cache used while sorting and aggregating data.
● Circuit breaker - cancels requests which require large memory, prevent OOM,
http://elasticsearch:9200/_cache/clear if field data is very close to memory
limit.
● Shards >= Number of nodes
● Lucene forceMerge - minor performance improvements for older indexes
(https://www.elastic.co/guide/en/elasticsearch/client/curator/current/optimize.
html)
Elasticsearch Configuration
20. Prevent split brain situation to avoid losing data - set minimum number of master
eligible nodes to (n/2 + 1)
Set higher ulimit for elasticsearch process
Daily cronjob which deletes data older than 90 days, closes indices older than 20
days, optimizes (forceMerge) indices older than 2 days
And also...
21.
22. Marvel - Official plugin from Elasticsearch
KOPF - Index management plugin
CAT APIs - REST APIs to view cluster information
Curator - Data management
Monitoring