Logging at scale is a common source of infrastructure expense and frustration. While every organization does logging, there is still no silver bullet: no simple, scalable solution without trade-offs. After studying the most popular logging systems, Aliaksandr came up with his own design and vision for the problem. The proposed solution is an ideal fit for SREs, DevOps engineers, and system engineers who need to provide logging as a platform for an entire company or team. It accepts logs from existing logging agents, pipelines, and streams, stores them efficiently in a highly optimised log database, and can be queried at lightning speed, with excellent integration with tools like jq, awk, cut, etc.
13. The purpose of logging: debugging
● Which errors have occurred in the app during the last hour?
● Why did the app return an unexpected response?
● Why wasn't the app working correctly yesterday?
● What was the app doing during a particular time range?
17. The purpose of logging: security
● Who dropped the database in production?
● Which IP addresses were used for logging in as admin during the last hour?
● Who performed a particular action at a given time?
● How many failed login attempts occurred during the last day?
22. The purpose of logging: stats and metrics
● How many requests were served per hour during the last day?
● How many unique users accessed the app during the last month?
● How many requests were served for a particular IP range yesterday?
● What percentage of requests finished with errors during the last hour?
● What was the 95th percentile of request duration for a given web page yesterday?
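The last question hints at what log analysis must support: order statistics over fields extracted from log lines. A minimal sketch with standard CLI tools, assuming a hypothetical log format whose last field is the request duration in milliseconds:

```shell
# Hypothetical access log: the last field is request duration in ms.
cat > /tmp/access.log <<'EOF'
GET /page 200 12
GET /page 200 34
GET /page 200 56
GET /page 200 78
GET /page 200 90
EOF

# Sort the durations numerically and pick the value at the
# 95th percentile rank.
awk '{print $NF}' /tmp/access.log | sort -n | awk '
  { v[NR] = $1 }
  END {
    idx = int(NR * 0.95); if (idx < 1) idx = 1
    print "p95:", v[idx]
  }'
```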
25. Traditional logging
● Save logs to files on the local filesystem
● Use command-line tools for log analysis: cat, grep, awk, sort, uniq, head, tail, etc.
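A runnable sketch of this workflow on a sample log file (the log format here is an assumption for illustration):

```shell
# Sample application log on the local filesystem.
cat > /tmp/app.log <<'EOF'
2023-06-01T10:00:01Z INFO request served
2023-06-01T10:00:02Z ERROR connection refused
2023-06-01T10:00:03Z ERROR connection refused
2023-06-01T10:00:04Z ERROR timeout
EOF

# Count distinct error messages, most frequent first:
# grep selects the error lines, cut drops the timestamp and level,
# sort | uniq -c counts duplicates, sort -rn orders by frequency.
grep ERROR /tmp/app.log | cut -d' ' -f3- | sort | uniq -c | sort -rn
```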
29. Traditional logging: advantages
● Easy to set up and operate
● Easy to debug
● Easy to analyze logs with command-line tools and bash scripts
● Has worked perfectly for 50 years (since the 1970s)
33. Traditional logging: disadvantages
● Hard to analyze logs from hundreds of hosts (hello, Kubernetes and microservices)
● Slow search over large log files (e.g. scanning a 1TB log file may take an hour)
● Imperfect support for structured logging (e.g. logs with arbitrary labels)
37. Large-scale logging: core principles
● Push logs from a large number of apps to a centralized system
● Provide fast queries over all the ingested logs
● Support structured logging
43. Large-scale logging: operational complexity
● Cloud: easy - the cloud provider operates the system
● On-prem: harder - you need to set up and operate the system yourself
51. Large-scale logging: on-prem: setup and operation
● Elasticsearch: hard because of non-trivial indexing configs for logs
● Grafana Loki: hard because of its microservice architecture and complex configs
● VictoriaLogs: easy because it runs out of the box as a single binary with default configs
54. Large-scale logging: on-prem: costs
● Elasticsearch: high - it needs a lot of RAM and disk space
● Grafana Loki: medium - it needs a lot of RAM for high-cardinality labels
● VictoriaLogs: low - a single VictoriaLogs instance can replace a 30-node Elasticsearch or Loki cluster
57. Large-scale logging: on-prem: full-text search support
● Elasticsearch: yes, but needs proper index configuration
● Grafana Loki: yes, but very slow
● VictoriaLogs: yes, out of the box for all the ingested log fields and labels, without additional configs
60. Large-scale logging: on-prem: how to efficiently query 100TB of logs?
● Elasticsearch: run a cluster with 200TB of disk space and 6TB of RAM. Infrastructure costs at GCE or AWS: ~€50K/month
● Grafana Loki: impossible - the query would take hours to execute
● VictoriaLogs: run a single node with 6TB of disk space and 200GB of RAM. Infrastructure costs at GCE or AWS: ~€2K/month
65. VictoriaLogs for large-scale logging
● Satisfies the requirements for large-scale logging
○ Efficiently stores logs from a large number of distributed apps
○ Provides fast full-text search
○ Supports both structured and unstructured logs
● Provides traditional logging features
○ Ease of use
○ Great integration with CLI tools - grep, awk, head, tail, less, etc.
73-82. Which errors have occurred in all the apps during the last hour?
● A simple bash wrapper around curl sends a LogsQL query
● Plain old CLI tools are connected via Unix pipes
● The result can be saved to a file at any stage with "… > response_file" for later analysis
● The response is JSON lines: each line carries the log message, the log stream (aka app instance) and the log timestamp
● Other log fields can be requested if needed
● DEMO
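The idea above can be sketched end to end. The host, port and endpoint path in the comment are assumptions to check against your VictoriaLogs deployment; here the JSON-lines response is simulated so the CLI part of the pipeline runs anywhere (sed stands in for jq to keep the sketch dependency-free):

```shell
# The real query is a bash wrapper around curl, roughly
# (host, port and endpoint path are assumptions - check your setup):
#   curl -s http://localhost:9428/select/logsql/query -d 'query=_time:1h error'
#
# Simulated JSON-lines response: each line carries the log message (_msg),
# the log stream (_stream) and the log timestamp (_time).
response() {
  cat <<'EOF'
{"_msg":"error: connection refused","_stream":"{app=\"api\"}","_time":"2023-06-01T10:00:02Z"}
{"_msg":"error: timeout","_stream":"{app=\"web\"}","_time":"2023-06-01T10:00:03Z"}
EOF
}

# Plain old CLI tools connected via Unix pipes: extract the message and
# count the distinct errors (jq -r '._msg' would work equally well).
response | sed -n 's/.*"_msg":"\([^"]*\)".*/\1/p' | sort | uniq -c | sort -rn
```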
115-120. Top client IPs for the last 4 weeks with 400 or 404 response status codes
● Find logs with the "remote_addr=" phrase … and with the "status=404" or "status=400" phrases
● Extract the IP address from remote_addr=...
● Drop the "remote_addr=" prefix
● DEMO
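The extraction steps above can be sketched with CLI tools on sample log lines (the line format is an assumption for illustration):

```shell
# Sample log messages containing remote_addr= and status= phrases.
cat > /tmp/nginx.log <<'EOF'
msg: remote_addr=1.2.3.4 status=404
msg: remote_addr=1.2.3.4 status=400
msg: remote_addr=5.6.7.8 status=404
msg: remote_addr=9.9.9.9 status=200
EOF

# Keep only the 400/404 lines, drop the "remote_addr=" prefix while
# extracting the IP, then count occurrences per IP, top IPs first.
grep -E 'status=(400|404)' /tmp/nginx.log \
  | sed 's/.*remote_addr=\([0-9.]*\).*/\1/' \
  | sort | uniq -c | sort -rn
```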
122-123. Per-day stats for the given IP during the last 10 days
● Search for log messages with the given IP
● A bit of bash-fu: extract the log timestamp, cut it to days and count the per-day entries
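The "bash-fu" step can be sketched on simulated JSON lines (the RFC3339 `_time` field matches the response format shown earlier; the message contents are made up):

```shell
# Simulated query results for one IP across several days.
cat > /tmp/ip.jsonl <<'EOF'
{"_time":"2023-06-01T10:00:02Z","_msg":"1.2.3.4 GET /"}
{"_time":"2023-06-01T11:00:05Z","_msg":"1.2.3.4 GET /a"}
{"_time":"2023-06-02T09:12:44Z","_msg":"1.2.3.4 GET /b"}
EOF

# Extract the timestamp, cut it to the day (everything before the 'T'),
# then count the number of entries per day.
sed -n 's/.*"_time":"\([^T"]*\).*/\1/p' /tmp/ip.jsonl | sort | uniq -c
```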
137. VictoriaLogs: (temporary) drawbacks
● Missing data extraction and advanced stats functionality in LogsQL (but this can be replaced with traditional CLI tools, as we've seen before)
● Missing cluster version (but a single-node VictoriaLogs can replace a 30-node Elasticsearch or Loki cluster)
● Missing integration with Grafana (but it has its own web UI, which is going to be better than Grafana for logs)
143. VictoriaLogs: recap
● Easy to set up and operate
● The lowest RAM and disk space usage (up to 30x less than Elasticsearch and Grafana Loki)
● Fast full-text search
● Excellent integration with traditional command-line tools for log analysis
● Accepts logs from all the popular log shippers (Filebeat, Fluent Bit, Logstash, Vector, Promtail)
● Open source and free to use!