Vector is an open-source, high-performance solution for collecting and processing observability data. In this talk, I will share our experience using it for log collection in hundreds of Kubernetes clusters. Which features make Vector special enough that we preferred it over other solutions? How did we deploy it and set it up in Kubernetes? What were our main challenges in adopting Vector, and how did we overcome them? I will also summarise the lessons we learned about Kubernetes log collection in general.
3. DISCLAIMER
During this talk preparation,
no Kubernetes clusters were hurt
Just kidding, in reality,
there were ple-e-e-enty of outages
4. ABOUT
PALARK
We offer all-in-one DevOps-as-a-Service and pick
the best Open Source projects to fulfill our clients' goals
16 70
Years in Linux,
DevOps & Kubernetes
Managed
Kubernetes clusters
15 90
Awesome
engineers
Tech posts at
blog.palark.com
5. PLAN
1. LOGS IN KUBERNETES
Let’s recall what to collect in Kubernetes
2. WHAT IS VECTOR
And in which ways it is applicable
3. PRACTICAL USE
Exciting cases from our operations (Ops) experience
7. LOGS IN KUBERNETES: POD LOGS
The log file path consists of the pod name, container name, and UID
The format and location of the files depend on the CRI settings
Max size and rotation depend on the kubelet settings
kubernetes.io/docs/concepts/cluster-administration/logging/
[Diagram: stdout and stderr of pod-1 and pod-2, the kubelet, and /var/log/pods]
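The path structure described on this slide can be illustrated with a small parser. This is a minimal sketch, assuming the common kubelet layout `/var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container-name>/<restart-count>.log`; the namespace component and the `parse_pod_log_path` helper are not on the slide and are introduced here only for illustration.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class PodLogFile(NamedTuple):
    namespace: str
    pod_name: str
    pod_uid: str
    container_name: str
    file_name: str


def parse_pod_log_path(path: str) -> PodLogFile:
    """Split a pod log path into the identifiers encoded in it (hypothetical helper)."""
    parts = PurePosixPath(path).parts
    # Expected parts: ('/', 'var', 'log', 'pods', '<ns>_<pod>_<uid>', '<container>', '<file>')
    if len(parts) != 7 or parts[:4] != ("/", "var", "log", "pods"):
        raise ValueError(f"not a pod log path: {path}")
    # Namespaces, pod names, and UIDs cannot contain underscores,
    # so the directory name splits into exactly three fields.
    namespace, pod_name, pod_uid = parts[4].split("_")
    return PodLogFile(namespace, pod_name, pod_uid, parts[5], parts[6])


info = parse_pod_log_path(
    "/var/log/pods/default_pod-1_50fb55c5-d97e-4851-85c6-187465154db6/app/0.log"
)
print(info.pod_name, info.container_name)
```

A collector can use fields like these to look up the rest of the pod metadata (labels, annotations) that is not encoded in the path.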
8. LOGS IN KUBERNETES: NODE SERVICES
Files in the /var/log directory (probably)
Max size and rotation are configured by journald
Format can be anything…
kubernetes.io/docs/concepts/cluster-administration/logging/
containerd, kubelet, audit logs, syslog
9. LOGS IN KUBERNETES: EVENTS
Can only be collected from the Kubernetes API
Can be collected as logs, metrics, or traces
kubernetes.io/docs/concepts/cluster-administration/logging/
apiVersion: v1
kind: Event
count: 1
metadata:
  name: standard-worker-1.178264e1185b006f
  namespace: default
reason: RegisteredNode
firstTimestamp: '2023-09-06T19:08:47Z'
lastTimestamp: '2023-09-06T19:08:47Z'
involvedObject:
  apiVersion: v1
  kind: Node
  name: standard-worker-1
  uid: 50fb55c5-d97e-4851-85c6-187465154db6
message: 'Registered Node standard-worker-1 in Controller'
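To show what "collecting an event as a log" could look like, here is a minimal sketch that flattens the Event above into a single JSON log record. The `event_to_log_record` function and the choice of output fields are assumptions made for illustration; they are not taken from the talk or from any particular collector.

```python
import json


def event_to_log_record(event: dict) -> dict:
    """Flatten a Kubernetes Event object into one log record (illustrative field set)."""
    involved = event.get("involvedObject", {})
    metadata = event.get("metadata", {})
    return {
        "timestamp": event.get("lastTimestamp") or event.get("firstTimestamp"),
        "namespace": metadata.get("namespace"),
        "reason": event.get("reason"),
        "object": f'{involved.get("kind")}/{involved.get("name")}',
        "count": event.get("count", 1),
        "message": event.get("message"),
    }


# The event from the slide, as it would look after parsing the API response.
event = {
    "apiVersion": "v1",
    "kind": "Event",
    "count": 1,
    "metadata": {"name": "standard-worker-1.178264e1185b006f", "namespace": "default"},
    "reason": "RegisteredNode",
    "firstTimestamp": "2023-09-06T19:08:47Z",
    "lastTimestamp": "2023-09-06T19:08:47Z",
    "involvedObject": {
        "apiVersion": "v1",
        "kind": "Node",
        "name": "standard-worker-1",
        "uid": "50fb55c5-d97e-4851-85c6-187465154db6",
    },
    "message": "Registered Node standard-worker-1 in Controller",
}

line = json.dumps(event_to_log_record(event), sort_keys=True)
print(line)
```

The same object could instead feed a counter keyed by `reason`, which is the "events as metrics" option mentioned on the slide.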
16. WHAT IS VECTOR
A lightweight, ultra-fast tool
for building observability pipelines
vector.dev
18. WHAT IS VECTOR
An open source, efficient tool
for building log collecting pipelines
vector.dev
19. WHAT IS VECTOR
Vendor agnostic
You do not need to rewrite Vector in Rust
Performance by design and continuous benchmarking
Flexible building block
vector.dev
63. HOW TO SOLVE?
1. Tune buffer settings
Setting        Default      Alternative
When full      Blocking     Drop newest
Buffer type    In memory    Disk buffer
Max events     1000         10000
2. Rule of thumb
Let logs leave the node as quickly as possible
3. If you are brave enough
sysctl -w fs.file-max=1000 (unsafe)
vector.dev/docs/about/under-the-hood/architecture/buffering-model/
CASE #1: NO SPACE LEFT ON THE DEVICE
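The difference between the two "when full" behaviours can be sketched with a toy bounded buffer. This is not Vector's implementation; `BoundedBuffer` and `feed` are hypothetical names, and the model only illustrates the trade-off: blocking applies back-pressure to the producer, while dropping the newest events lets the producer keep going at the cost of lost data.

```python
from collections import deque


class BoundedBuffer:
    """Toy in-memory buffer with a 'when full' policy (not Vector's code)."""

    def __init__(self, max_events: int, when_full: str = "block"):
        if when_full not in ("block", "drop_newest"):
            raise ValueError(f"unknown policy: {when_full}")
        self.max_events = max_events
        self.when_full = when_full
        self.events = deque()
        self.dropped = 0

    def push(self, event: str) -> bool:
        """Return True if the event was stored, False if the buffer was full."""
        if len(self.events) < self.max_events:
            self.events.append(event)
            return True
        if self.when_full == "drop_newest":
            self.dropped += 1  # the incoming event is discarded
        return False


def feed(buffer: BoundedBuffer, events: list) -> int:
    """Push events while the sink is stalled; return how many the producer got through."""
    consumed = 0
    for event in events:
        accepted = buffer.push(event)
        if not accepted and buffer.when_full == "block":
            break  # back-pressure: the producer stops reading its input
        consumed += 1
    return consumed


lines = [f"line-{i}" for i in range(5)]

blocking = BoundedBuffer(max_events=3, when_full="block")
dropping = BoundedBuffer(max_events=3, when_full="drop_newest")

print(feed(blocking, lines), blocking.dropped)
print(feed(dropping, lines), dropping.dropped)
```

In this model the blocking producer stalls after three lines and the remaining input stays wherever it came from, whereas the dropping producer reads all five lines and loses two of them.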
76. HOW TO SOLVE? expire_metrics_secs=60
[Chart: vector_component_errors_total over time. The counter reaches 3 after 3 errors and 7 after 4 more errors; once expiration is triggered, the series is empty; 3 new errors bring it back at 3.]
This behavior makes the result of the PromQL rate function equal to zero.
CASE #2: PROMETHEUS EXPLODED
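The timeline on this slide can be reproduced with a toy counter that stops being exported after a period without updates. This is a simplified model under stated assumptions, not Vector's actual expiry logic: the `ExpiringCounter` class, the timestamps, and the rule "expire when idle for more than `expire_secs`" are all illustrative.

```python
class ExpiringCounter:
    """Toy counter that disappears after `expire_secs` without updates."""

    def __init__(self, expire_secs: int):
        self.expire_secs = expire_secs
        self.value = None        # None means the series is not exported
        self.last_update = None

    def _expire(self, now: int) -> None:
        # Drop the series if it has been idle for longer than the expiry window.
        if self.last_update is not None and now - self.last_update > self.expire_secs:
            self.value = None
            self.last_update = None

    def inc(self, now: int, amount: int = 1) -> None:
        self._expire(now)
        # After expiry the counter restarts from zero.
        self.value = (self.value or 0) + amount
        self.last_update = now

    def scrape(self, now: int):
        self._expire(now)
        return self.value


counter = ExpiringCounter(expire_secs=60)
samples = []

counter.inc(now=0, amount=3)           # 3 errors
samples.append(counter.scrape(now=30))
counter.inc(now=40, amount=4)          # 4 more errors
samples.append(counter.scrape(now=60))
samples.append(counter.scrape(now=120))  # idle for 80 s: expiration triggered
counter.inc(now=130, amount=3)         # 3 errors after the series was dropped
samples.append(counter.scrape(now=150))

print(samples)
```

The scraped values follow the same shape as the chart: the series grows, vanishes, and then reappears at a lower value than before.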
78. HOW TO SOLVE? expire_metrics_secs=60
Patch for Vector
to remove the file label
CASE #2: PROMETHEUS EXPLODED
93. HOW TO SOLVE?
1. Cache read (resourceVersion=0)
LIST /api/v1/pods?fieldSelector=spec.nodeName=$NODE_NAME&resourceVersion=0
use_apiserver_cache=true
CASE #3: KUBERNETES CONTROL PLANE OUTAGE
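The request on this slide can be assembled as follows. This is a minimal sketch that only builds the request path; `pods_list_path` is a hypothetical helper, and sending the request (authentication, TLS, pagination) is left out. Setting `resourceVersion=0` tells the API server that any version of the data is acceptable, so it can answer from its cache instead of reading from etcd.

```python
from urllib.parse import urlencode


def pods_list_path(node_name: str, use_cache: bool = True) -> str:
    """Build the LIST pods request path for a single node (hypothetical helper)."""
    params = {"fieldSelector": f"spec.nodeName={node_name}"}
    if use_cache:
        # resourceVersion=0: any version is acceptable, so the API server
        # may serve the response from its watch cache.
        params["resourceVersion"] = "0"
    # safe="=" keeps the '=' inside the field selector readable.
    return "/api/v1/pods?" + urlencode(params, safe="=")


print(pods_list_path("standard-worker-1"))
print(pods_list_path("standard-worker-1", use_cache=False))
```

With many nodes restarting their collectors at once, the cached variant keeps these LIST calls away from etcd, which is presumably why the option matters for the control plane.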
97. HOW TO SOLVE?
1. Cache read (resourceVersion=0)
2. Limit concurrent requests (Priority and Fairness API)
3. Use the kubelet API instead of the Kubernetes API
Pod metadata can be fetched from the kubelet's /pods endpoint
CASE #3: KUBERNETES CONTROL PLANE OUTAGE
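As an illustration of the third option, the sketch below extracts pod metadata from a response of the kubelet's /pods endpoint. It assumes the endpoint returns a PodList-shaped JSON document; the sample payload is hypothetical and heavily trimmed, `pods_metadata` is an illustrative helper, and fetching the response from the kubelet (authentication, TLS) is omitted.

```python
import json


def pods_metadata(pod_list_json: str) -> dict:
    """Map pod UID to the metadata a log collector typically attaches to events."""
    pod_list = json.loads(pod_list_json)
    result = {}
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        result[meta.get("uid")] = {
            "namespace": meta.get("namespace"),
            "name": meta.get("name"),
            "labels": meta.get("labels", {}),
        }
    return result


# Hypothetical, trimmed response body used only for this example.
sample_response = json.dumps({
    "kind": "PodList",
    "apiVersion": "v1",
    "items": [
        {
            "metadata": {
                "name": "pod-1",
                "namespace": "default",
                "uid": "11111111-2222-3333-4444-555555555555",
                "labels": {"app": "demo"},
            }
        }
    ],
})

metadata = pods_metadata(sample_response)
print(metadata["11111111-2222-3333-4444-555555555555"]["name"])
```

Because each kubelet only knows about the pods on its own node, this lookup never touches the Kubernetes API server.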
99. CONCLUSION
1. Great to build platforms
2. Vector is awesome, seriously, deploy it today
3. Share practical cases and learn together