2019.07
KOCOON
KAKAO Automatic k8s Monitoring
@@cloud.telemetry / issac.lim(임성국)
Before We Start
Who am I?
Telemetry ?
Telemetry is an automated communications process by which measurements
and other data are collected at remote or inaccessible points
and transmitted to receiving equipment for monitoring.
ref: https://en.wikipedia.org/wiki/Telemetry
cloud.telemetry ?
• Remote & Inaccessible points
• Baremetal, IaaS, CaaS …
• We develop & provide an automated communications process
• MaaS (Monitoring as a Service)
• KEMI Stats, KEMI Logs, KOCOON
ref: https://en.wikipedia.org/wiki/Telemetry
KEMI Stats
KEMI Logs
Is it enough?
Head: KEMI-*
Longtail : ?
ref: https://mgcabral.wordpress.com/2012/03/04/thelongtaileconomics/
Longtail
• Users want
  • To get their own resources
  • To deal with resources in their own way
• We want
  • To divide resources by user
  • To provide a self-monitoring service
DKOSv3
• k8s based container orchestrator @KAKAO
• Kubernetes v1.11.5
• Kubernetes as a Service on Kakao Cloud, seen through the Kakao T Taxi case (openinfradays day 2, Track 2, 12:00 ~ 12:40)
So then …
Let's separate monitoring resources per service cluster!
KOCOON
img ref: https://www.treehugger.com/green-architecture/cocoon-tree-prefab-spherical-treehouse-pod.html
KOCOON
• KakaO COntainer based service mONitoring
  • Everything (collection, querying, alerting, and so on) is handled inside the service's own resources
KOCOON-* Overview
Metric based Self Monitoring
Log Routing
Rule based Log Event
How to Use?
(KOCOON-Prometheus)
KOCOON-Prometheus Install
  • Prometheus-operator, Prometheus, Alertmanager
  • PrometheusRule (for alarms & recreated metrics)
  • Cupido (Prometheus webhook manager for kakao)
  • kube state metrics, node exporter
  • Grafana & Dashboard
  • kakao etcd & kakao ingress controller service monitor
helm install kakao-stable/kocoon-prometheus --name $(kubectl config current-context) --namespace monitoring
KOCOON-Prometheus Set LB
## grafana ingress
helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-values --set grafana.ingress.enabled=true --set 'grafana.ingress.hosts[0]=example-grafana.dev.9rum.cc'
## prometheus ingress
helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-values --set prometheus.ingress.enabled=true --set 'prometheus.ingress.hosts[0]=example-prometheus.dev.9rum.cc'
## alertmanager ingress
helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-values --set alertmanager.ingress.enabled=true --set 'alertmanager.ingress.hosts[0]=example-alertmanager.dev.9rum.cc'
KOCOON-Prometheus Set Kakaotalk
## When several watchcenter group IDs (e.g. 6663 and 4443) should receive alarms
helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-values --set cupido.watchcenterGroups="6663;4443;"
## When a single watchcenter group ID (6663) should receive alarms
helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-values --set cupido.watchcenterGroups="6663;"
## Even if no watchcenter group ID is set, alarms are also sent to the cloud.deploy cell, which operates DKOSv3, when a problem occurs.
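The two commands above differ only in the value passed to cupido.watchcenterGroups: every group ID is followed by a semicolon, including the last one. A minimal Python sketch of building that value and the matching argument list follows. The helper names are hypothetical, and the trailing-semicolon format is assumed from the slide examples alone, not from any chart documentation.

```python
from typing import Iterable, List


def watchcenter_groups_value(group_ids: Iterable[int]) -> str:
    """Join group IDs in the "6663;4443;" form used on the slide.

    Each ID is followed by ';', including the last one (assumption based
    on the two slide examples).
    """
    return "".join(f"{gid};" for gid in group_ids)


def helm_upgrade_args(release: str, group_ids: Iterable[int]) -> List[str]:
    """Build the argv for the helm upgrade command (hypothetical helper).

    Passing a list to subprocess.run avoids shell quoting of the ';'.
    """
    return [
        "helm", "upgrade", release, "kakao-stable/kocoon-prometheus",
        "--reuse-values",
        "--set", f"cupido.watchcenterGroups={watchcenter_groups_value(group_ids)}",
    ]


if __name__ == "__main__":
    print(" ".join(helm_upgrade_args("my-cluster", [6663, 4443])))
```

Here "my-cluster" stands in for the output of kubectl config current-context.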
k8s Status View
k8s DrillDown View
k8s Ingress Controller View
k8s ETCD View
KOCOON-Prometheus Alarm
Any Problems?
Goals
• Workers: 500
• Metric scrape interval: 30s ~ 60s
• Check resource status
• service namespace (default)
• kube-system
• ingress
Step 1
• Basic Resources for Prometheus
• CPU request 1, limit 4
• Memory : request 2GB, limit 6GB
• OK up to about 100 workers, but…
• Internal node IP issue (max 128 nodes…)
Step 2
Workers  Scrape interval  CPU  Memory  # of metrics/sec
12       30s              0.1  780MB   4605
271      30s              -    OOM     -
• Basic Resources for Prometheus
• CPU request 1, limit 4
• Memory : request 2GB, limit 6GB
Out Of Memory…
• Prometheus killed due to OOM in large scale
• https://groups.google.com/forum/#!topic/prometheus-users/DELLNNSVCSw
• https://github.com/prometheus/prometheus/issues/4553
• https://github.com/prometheus/prometheus/issues/1358
We considered …
• # of targets
• # of metrics
• # of rules
• frequency of scrapes & rule evaluations
• …
Must-have items
• # of targets
• # of rules
• frequency of scrapes & rule evaluations (every minute)
# of metrics & topk
• # of metrics per sec
rate(prometheus_tsdb_head_samples_appended_total[5m])
• Top 10 metrics
topk(10, count({job=~".+"}) by(__name__))
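The topk query above groups every series by its metric name, counts the series in each group, and keeps the ten largest groups. As an illustration of that logic only, here is a small Python sketch over in-memory label sets. The function name and sample data are hypothetical, and it does not talk to Prometheus.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def top_metric_names(series: Iterable[Dict[str, str]], k: int = 10) -> List[Tuple[str, int]]:
    """Count series per metric name and return the k largest groups.

    Mirrors the idea of topk(k, count by (__name__)): each input dict is
    the label set of one series and must contain the "__name__" label.
    """
    counts = Counter(labels["__name__"] for labels in series)
    return counts.most_common(k)


if __name__ == "__main__":
    # Hypothetical label sets, for illustration only.
    sample = [
        {"__name__": "container_memory_rss", "pod": "a"},
        {"__name__": "container_memory_rss", "pod": "b"},
        {"__name__": "container_memory_rss", "pod": "c"},
        {"__name__": "up", "job": "node-exporter"},
        {"__name__": "up", "job": "kubelet"},
        {"__name__": "kube_pod_info", "pod": "a"},
    ]
    for name, count in top_metric_names(sample, k=2):
        print(f"{name}: {count}")
```

A metric name with a very large series count, such as the container_network_tcp_usage_total figure in the Step 3 results, is the kind of candidate this view is meant to surface.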
Step 3
Workers  Scrape interval  CPU   Memory  # of metrics/sec  topk
271      30s              -     OOM     -                 -
271      60s              1.15  OOM     27500             container_network_tcp_usage_total: 246676
• Upgrade Resources for Prometheus
• CPU request 1, limit 4
• Memory : request 4GB, limit 8GB
• Scrape interval: 30s -> 60s
Useless Metrics
• cadvisor
• container_(network_tcp_usage_total|network_udp_usage_total|tasks_state|cpu_load_average_10s|memory_failures_total)
• container_(cpu_schedstat_run_seconds_total|cpu_schedstat_runqueue_seconds_total|cpu_schedstat_run_periods_total|cpu_system_seconds_total|cpu_user_seconds_total)
• container_(last_seen|memory_working_set_bytes|memory_cache|memory_failcnt|memory_max_usage_bytes|memory_swap|start_time_seconds)
• container_(network_receive_packets_total|network_transmit_packets_dropped_total|network_receive_errors_total|network_receive_packets_dropped_total|network_transmit_packets_total|network_transmit_errors_total)
Useless Metrics
• cadvisor
• container_(spec_([a-z_]+)|fs_([a-z_]+))
• kubelet_(runtime_operations_latency_microseconds|docker_operations_latency_microseconds)
Useless Metrics
• kube api
• apiserver_(admission_controller_admission_latencies_seconds_bucket|admission_step_admission_latencies_seconds_bucket|admission_controller_admission_latencies_seconds_sum|admission_controller_admission_latencies_seconds_count|admission_step_admission_latencies_seconds_summary)
• apiserver_(request_latencies_bucket|response_sizes_bucket|request_latencies_summary)
Useless Metrics
• kube state metrics
• kube_pod_container_status_waiting_reason
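The slides list the metric-name patterns that were dropped but not the configuration used to drop them. In Prometheus this kind of filtering is commonly done with a drop action in metric_relabel_configs matching on __name__, where the regex is anchored at both ends. The Python sketch below only illustrates that matching behaviour on a subset of the patterns above. The function names are hypothetical, and this is not the KOCOON configuration.

```python
import re
from typing import Iterable, List

# A subset of the patterns from the slides, copied as-is.
DROP_PATTERNS = [
    r"container_(network_tcp_usage_total|network_udp_usage_total|tasks_state"
    r"|cpu_load_average_10s|memory_failures_total)",
    r"container_(spec_([a-z_]+)|fs_([a-z_]+))",
    r"apiserver_(request_latencies_bucket|response_sizes_bucket|request_latencies_summary)",
    r"kube_pod_container_status_waiting_reason",
]

# One combined regex; each pattern is wrapped in a non-capturing group.
DROP_RE = re.compile("|".join(f"(?:{p})" for p in DROP_PATTERNS))


def keep_metric(name: str) -> bool:
    """Return False when the whole metric name matches a drop pattern.

    fullmatch is used because relabel regexes are anchored at both ends.
    """
    return DROP_RE.fullmatch(name) is None


def filter_metrics(names: Iterable[str]) -> List[str]:
    """Keep only the metric names that are not dropped."""
    return [n for n in names if keep_metric(n)]


if __name__ == "__main__":
    scraped = [
        "container_memory_rss",
        "container_network_tcp_usage_total",
        "container_fs_usage_bytes",
        "kube_pod_container_status_waiting_reason",
    ]
    print(filter_metrics(scraped))
```

Checking candidate names against the patterns this way is a cheap way to confirm that a metric still needed by a dashboard or alarm is not caught by a broad pattern such as container_fs_([a-z_]+).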
Step 4
Workers  Scrape interval  CPU   Memory  # of metrics/sec  topk
271      60s              1.15  OOM     27500             container_network_tcp_usage_total: 246676
310      60s              0.4   2.9GB   9748              storage_operation_duration_seconds_bucket: 31834
                                                          container_memory_usage_bytes: 25737
                                                          container_memory_rss: 25737
• Upgrade Resources for Prometheus
• CPU request 1, limit 4
• Memory : request 4GB, limit 8GB
• Drop useless metrics: kube api, cadvisor, kube state metrics
• 1629 pods
Step 5
Workers  Scrape interval  CPU  Memory                  # of metrics/sec  topk
500      60s              1.1  12.88GB, 8.35GB (RSS)   18324             storage_operation_duration_seconds_bucket: 50402
                                                                         container_memory_rss: 45149
                                                                         container_memory_usage_bytes: 45149
• Upgrade Resources for Prometheus
• CPU request 1, limit 4
• Memory : request 9GB, limit 14GB
• 4107 pods
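The Step 4 and Step 5 figures can be related with simple arithmetic. The Python sketch below computes the ingest rate per worker and a rough active-series estimate from the reported rates. The estimate assumes every target is scraped at the same 60s interval and each series receives one sample per scrape, which the slides do not state. The names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One row of the measurement tables."""
    workers: int
    scrape_interval_s: int
    samples_per_sec: int


def samples_per_worker(run: Run) -> float:
    """Samples appended per second, per worker node."""
    return run.samples_per_sec / run.workers


def approx_active_series(run: Run) -> int:
    """Rough series count: ingest rate times scrape interval.

    Assumes one sample per series per scrape and a uniform interval.
    """
    return run.samples_per_sec * run.scrape_interval_s


STEP4 = Run(workers=310, scrape_interval_s=60, samples_per_sec=9748)
STEP5 = Run(workers=500, scrape_interval_s=60, samples_per_sec=18324)

if __name__ == "__main__":
    for label, run in (("Step 4", STEP4), ("Step 5", STEP5)):
        print(label,
              f"{samples_per_worker(run):.1f} samples/sec per worker,",
              f"~{approx_active_series(run)} series")
```

Under those assumptions the two runs correspond to roughly 0.58 million and 1.1 million series, which is in line with the jump in memory between Step 4 and Step 5.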
KOCOON-Prometheus Default SLA
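• Metric retention: 3 days
• Metric scrape interval: every 60s
• Alarm interval:
• If an event keeps repeating for a set period (5 minutes), notify once per hour
• Target Cluster
• ~ 200 nodes, ~ 1500 pods
• Metrics appended per second (5-minute average): ~ 10,000/sec
• The current default settings can be used as-is
• CPU 1~4 cores, memory 4~6GB
• What if you want to run more nodes / pods?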
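The alarm interval above (an event must persist for 5 minutes, then at most one notification per hour) can be modelled in a few lines. In a Prometheus and Alertmanager setup this behaviour usually corresponds to the `for` field of an alerting rule and the `repeat_interval` of a route, but the slides do not show how KOCOON configures it. The Python sketch below is a simplified model with hypothetical names and ignores grouping and other Alertmanager timers.

```python
from typing import Iterable, List, Optional, Tuple

HOLD_S = 5 * 60      # the condition must hold this long before notifying
REPEAT_S = 60 * 60   # at most one notification per hour while it persists


def notification_times(evals: Iterable[Tuple[int, bool]],
                       hold_s: int = HOLD_S,
                       repeat_s: int = REPEAT_S) -> List[int]:
    """Return the timestamps (seconds) at which a notification is sent.

    evals: (timestamp, condition_is_true) pairs in increasing time order,
    for example one pair per rule evaluation.
    """
    sent: List[int] = []
    active_since: Optional[int] = None  # start of the current episode
    last_sent: Optional[int] = None     # last notification in this episode
    for ts, is_true in evals:
        if not is_true:
            # The condition cleared: reset the episode.
            active_since = None
            last_sent = None
            continue
        if active_since is None:
            active_since = ts
        if ts - active_since < hold_s:
            continue  # still pending, not yet held for hold_s
        if last_sent is None or ts - last_sent >= repeat_s:
            sent.append(ts)
            last_sent = ts
    return sent


if __name__ == "__main__":
    # Condition true at every 60s evaluation for two hours.
    always_true = [(t, True) for t in range(0, 7201, 60)]
    print(notification_times(always_true))
```

With a condition that stays true for two hours, this model notifies once after the 5-minute hold and once more an hour later; a condition that clears within 5 minutes produces no notification.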
KOCOON-Prometheus SLA for 500
• Target Cluster
• ~ 500 nodes (worker + ingress), ~ 4000 pods
• Metrics appended per second (5-minute average): 18,000/sec
• upgrade memory!
## Upgrade prometheus memory request & limit
helm upgrade $(kubectl config current-context) kocoon-stg/kocoon-prometheus --reuse-values --set prometheus.prom
KOCOON-Prometheus Requirements
• ~ 200 nodes / ~1500 pods
• krane VM / PM: 8 Core, 8GB * 2
• ~ 500 nodes / ~4000 pods
• krane VM / PM : 8 Core, 16GB * 2
Considerations for Services
• Resource Request & Limit
• Set CPU and memory requests to the minimum and remove the limits
• While running performance tests, check the CPU and memory actually needed in the drill-down (namespace to pod) view shared earlier
• When going to production, set requests and limits based on those test results
• A 500-node cluster is possible, but
• placing services / service groups in units of 100 nodes,
• with 5 master nodes and 2 Prometheus & Alertmanager instances,
• and reaching each cluster round-robin through an LB is a more stable way to operate
Sum Up
• KOCOON-* is provided with helm charts
• KOCOON-Prometheus -> today’s main topic
• KOCOON-Cupido : included in KOCOON-Prometheus
• KOCOON-Hermes -> will be presented at ifKakao 2019
• KOCOON-DIKE -> in development…
helm? : package manager for k8s https://helm.sh/
Thanks
@@cloud.telemetry with andi, beemo, issac, jenny, joanne & cloud part
KOCOON – KAKAO Automatic K8S Monitoring

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 

KOCOON – KAKAO Automatic K8S Monitoring

  • 1. 2019.07 KOCOON KAKAO Automatic k8s Monitoring @@cloud.telemetry / issac.lim(임성국)
  • 4. Telemetry ? Telemetry is an automated communications process by which measurements and other data are collected at remote or inaccessible points and transmitted to receiving equipment for monitoring. ref: https://en.wikipedia.org/wiki/Telemetry
  • 5. cloud.telemetry ? • Remote & Inaccessible points • Baremetal, IaaS, CaaS … • We develop & provide an automated communications process • MaaS (Monitoring as a Service) • KEMI Stats, KEMI Logs, KOCOON ref: https://en.wikipedia.org/wiki/Telemetry
  • 9. Head: KEMI-* Longtail : ? ref: https://mgcabral.wordpress.com/2012/03/04/thelongtaileconomics/
  • 10. Longtail
      • Users want
          • Get own resource
          • Deal with resources in their own way
      • We want
          • Divide resources by users
          • Provide self-monitoring service
  • 11. DKOSv3 • k8s based container orchestrator @KAKAO • Kubernetes v1.11.5 • Kakao Cloud's Kubernetes as a Service, examined through the Kakao T Taxi case (openinfradays day 2, Track 2, 12:00 ~ 12:40)
  • 12. So then … let's separate monitoring resources per service cluster!
  • 14. KOCOON • KakaO COntainer based service mONitoring • Handle everything (collection, querying, alarms, etc.) inside the service's own resources
  • 16. Metric based Self Monitoring
  • 18. Rule based Log Event
  • 21. KOCOON-Prometheus Install
      • Prometheus-operator, Prometheus, Alertmanager
      • PrometheusRule (for alarms & recreated metrics)
      • Cupido (Prometheus webhook manager for kakao)
      • kube state metrics, node exporter
      • Grafana & Dashboard
      • kakao etcd & kakao ingress controller service monitor
      helm install kakao-stable/kocoon-prometheus --name $(kubectl config current-context) --namespace monitoring
  • 23. KOCOON-Prometheus Set LB
      ## grafana ingress
      helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-values --set grafana.ingress.enabled=true --set grafana.ingress.hosts[0]=example-grafana.dev.9rum.cc
      ## prometheus ingress
      helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-values --set prometheus.ingress.enabled=true --set prometheus.ingress.hosts[0]=example-prometheus.dev.9rum.cc
      ## alertmanager ingress
      helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-values --set alertmanager.ingress.enabled=true --set alertmanager.ingress.hosts[0]=example-alertmanager.dev.9rum.cc
  • 24. KOCOON-Prometheus Set Kakaotalk
      ## When several watchcenter group ids should receive alarms (e.g. 6663 and 4443)
      helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-values --set cupido.watchcenterGroups="6663;4443;"
      ## When a single watchcenter group id (6663) should receive alarms
      helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-values --set cupido.watchcenterGroups="6663;"
      ## Even if no watchcenter group id is set, alarms also go to the cloud.deploy cell that operates dkosv3 when a problem occurs.
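The `cupido.watchcenterGroups` value on this slide looks like a list of group ids in which every id is followed by a semicolon. A minimal Python sketch, assuming that format holds for any number of ids, builds the value and the matching `helm upgrade` argument list. The helper names `format_watchcenter_groups` and `build_upgrade_args` and the release name `my-cluster` are hypothetical and not part of KOCOON; only the chart name and flags come from the slides.

```python
from typing import Iterable, List

CHART = "kakao-stable/kocoon-prometheus"  # chart name as shown on the slides


def format_watchcenter_groups(group_ids: Iterable[int]) -> str:
    """Join group ids as "6663;4443;" (each id followed by a semicolon)."""
    return "".join(f"{int(gid)};" for gid in group_ids)


def build_upgrade_args(release: str, group_ids: Iterable[int]) -> List[str]:
    """Return the helm argv; hand it to subprocess.run() to execute it."""
    groups = format_watchcenter_groups(group_ids)
    return [
        "helm", "upgrade", release, CHART,
        "--reuse-values",
        "--set", f"cupido.watchcenterGroups={groups}",
    ]


if __name__ == "__main__":
    # "my-cluster" is a placeholder for $(kubectl config current-context).
    print(" ".join(build_upgrade_args("my-cluster", [6663, 4443])))
```

Passing the arguments as a list avoids shell quoting of the semicolons, which is why the slide's commands wrap the value in double quotes.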
  • 31. Goals • Workers: 500 • Metric scrape interval: 30s ~ 60s • Check resource status • service namespace (default) • kube-system • ingress
  • 32. Step 1 • Basic Resources for Prometheus • CPU: request 1, limit 4 • Memory: request 2GB, limit 6GB • OK up to ~100 workers, but… • Internal node IP issue (max 128 nodes…)
  • 33. Step 2
      Workers | Scrape Interval | CPU | Memory | # of metrics/sec
      12      | 30s             | 0.1 | 780MB  | 4605
      271     | 30s             |     | OOM    |
      • Basic Resources for Prometheus
          • CPU: request 1, limit 4
          • Memory: request 2GB, limit 6GB
  • 34. Out Of Memory… • Prometheus gets killed by OOM at large scale • https://groups.google.com/forum/#!topic/prometheus-users/DELLNNSVCSw • https://github.com/prometheus/prometheus/issues/4553 • https://github.com/prometheus/prometheus/issues/1358
  • 35. We considered … • # of targets • # of metrics • # of rules • frequency of scrapes & rule evaluations • …
  • 36. Must-have items • # of targets • # of rules • frequency of scrapes & rule evaluations (every minute)
  • 37. # of metrics & topk
      • # of metrics per sec
        rate(prometheus_tsdb_head_samples_appended_total[5m])
      • Top 10 metrics
        topk(10, count({job=~".+"}) by(__name__))
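Both expressions can be typed into the Prometheus web UI, or sent to Prometheus' HTTP API instant-query endpoint (`/api/v1/query`). A small Python sketch that only builds the request URLs for the two queries; the base address `http://localhost:9090` is an assumption (for example a local port-forward), and `instant_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# The two PromQL expressions from the slide.
SAMPLES_PER_SEC = "rate(prometheus_tsdb_head_samples_appended_total[5m])"
TOP10_METRICS = 'topk(10, count({job=~".+"}) by(__name__))'


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for Prometheus' instant-query endpoint (/api/v1/query)."""
    query_string = urlencode({"query": promql})  # percent-encodes the PromQL
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


if __name__ == "__main__":
    base = "http://localhost:9090"  # assumed address, not from the slides
    for promql in (SAMPLES_PER_SEC, TOP10_METRICS):
        print(instant_query_url(base, promql))
```

The printed URLs can be fetched with any HTTP client; the response is JSON with the result under `data.result`.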
  • 38. Step 3
      Workers | Scrape Interval | CPU  | Memory | # of metrics/sec | topk
      271     | 30s             |      | OOM    |                  |
      271     | 60s             | 1.15 | OOM    | 27500            | container_network_tcp_usage_total: 246676
      • Upgrade Resources for Prometheus
          • CPU: request 1, limit 4
          • Memory: request 4GB, limit 8GB
          • Scrape interval: 30s -> 60s
  • 39. Useless Metrics • cadvisor
      • container_(network_tcp_usage_total|network_udp_usage_total|tasks_state|cpu_load_average_10s|memory_failures_total)
      • container_(cpu_schedstat_run_seconds_total|cpu_schedstat_runqueue_seconds_total|cpu_schedstat_run_periods_total|cpu_system_seconds_total|cpu_user_seconds_total)
      • container_(last_seen|memory_working_set_bytes|memory_cache|memory_failcnt|memory_max_usage_bytes|memory_swap|start_time_seconds)
      • container_(network_receive_packets_total|network_transmit_packets_dropped_total|network_receive_errors_total|network_receive_packets_dropped_total|network_transmit_packets_total|network_transmit_errors_total)
  • 40. Useless Metrics • cadvisor
      • container_(spec_([a-z_]+)|fs_([a-z_]+))
      • kubelet_(runtime_operations_latency_microseconds|docker_operations_latency_microseconds)
  • 41. Useless Metrics • kube api
      • apiserver_(admission_controller_admission_latencies_seconds_bucket|admission_step_admission_latencies_seconds_bucket|admission_controller_admission_latencies_seconds_sum|admission_controller_admission_latencies_seconds_count|admission_step_admission_latencies_seconds_summary)
      • apiserver_(request_latencies_bucket|response_sizes_bucket|request_latencies_summary)
  • 42. Useless Metrics • kube state metrics • kube_pod_container_status_waiting_reason
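The slides list the metric-name patterns but not how they are applied. One common approach, assumed here, is a Prometheus metric relabeling rule with `action: drop` on the `__name__` label. Prometheus anchors relabel regexes at both ends, so the Python sketch below uses `re.fullmatch` to preview which metric names a pattern would drop. It covers only two of the cAdvisor patterns, and `is_dropped` is a hypothetical helper; Python's `re` is not the RE2 engine Prometheus uses, although these simple alternations behave the same way in both.

```python
import re

# Two of the cAdvisor patterns from the slides, joined into one alternation.
DROP_PATTERN = re.compile(
    r"container_(network_tcp_usage_total|network_udp_usage_total"
    r"|tasks_state|cpu_load_average_10s|memory_failures_total)"
    r"|container_(spec_([a-z_]+)|fs_([a-z_]+))"
)


def is_dropped(metric_name: str) -> bool:
    """True if the whole metric name matches (relabel regexes are anchored)."""
    return DROP_PATTERN.fullmatch(metric_name) is not None


if __name__ == "__main__":
    for name in (
        "container_network_tcp_usage_total",  # the top series in Step 3
        "container_fs_usage_bytes",
        "container_memory_rss",               # still listed in Step 4's topk
    ):
        print(f"{name}: {'drop' if is_dropped(name) else 'keep'}")
```

Checking a pattern this way before rolling it out helps avoid dropping a metric that a dashboard or alert rule still depends on.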
  • 43. Step 4
      Workers | Scrape Interval | CPU  | Memory | # of metrics/sec | topk
      271     | 60s             | 1.15 | OOM    | 27500            | container_network_tcp_usage_total: 246676
      310     | 60s             | 0.4  | 2.9GB  | 9748             | storage_operation_duration_seconds_bucket: 31834; container_memory_usage_bytes: 25737; container_memory_rss: 25737
      • Upgrade Resources for Prometheus
          • CPU: request 1, limit 4
          • Memory: request 4GB, limit 8GB
          • Drop useless metrics: kube api, cadvisor, kube state metrics
      • 1629 pods
  • 44. Step 5
      Workers | Scrape Interval | CPU | Memory                 | # of metrics/sec | topk
      500     | 60s             | 1.1 | 12.88GB, 8.35GB (RSS)  | 18324            | storage_operation_duration_seconds_bucket: 50402; container_memory_rss: 45149; container_memory_usage_bytes: 45149
      • Upgrade Resources for Prometheus
          • CPU: request 1, limit 4
          • Memory: request 9GB, limit 14GB
      • 4107 pods
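The Step 2 to Step 5 tables can be compared on a common basis. As a rough approximation that is not stated on the slides, the number of samples appended per second is about the number of active series divided by the scrape interval, ignoring samples written by rules and series churn. The Python sketch below applies that assumption to the rows that report a metrics/sec value; the `Measurement` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    step: str
    workers: int
    scrape_interval_s: int
    samples_per_sec: int  # the "# of metrics/sec" column


# Rows from the Step 2-5 tables that report a metrics/sec value.
MEASUREMENTS = [
    Measurement("Step 2", 12, 30, 4605),
    Measurement("Step 3", 271, 60, 27500),
    Measurement("Step 4", 310, 60, 9748),
    Measurement("Step 5", 500, 60, 18324),
]


def estimated_series(m: Measurement) -> int:
    """Approximate active series: samples/sec times the scrape interval."""
    return m.samples_per_sec * m.scrape_interval_s


def series_per_worker(m: Measurement) -> float:
    """Estimated active series divided by the number of worker nodes."""
    return estimated_series(m) / m.workers


if __name__ == "__main__":
    for m in MEASUREMENTS:
        print(f"{m.step}: ~{estimated_series(m):,} series, "
              f"~{series_per_worker(m):,.0f} per worker")
```

Under this approximation, dropping the unused metrics between Step 3 and Step 4 cuts the estimated series per worker by roughly two thirds, which is consistent with Prometheus no longer running out of memory within the same 8GB limit.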
  • 45. KOCOON-Prometheus Default SLA
      • Metric retention: 3 days
      • Metric scrape interval: every 60s
      • Alarm interval:
          • If an event keeps repeating for a given period (5m), notify once per hour
      • Target Cluster
          • ~ 200 nodes, ~ 1500 pods
          • Metrics appended (5-minute average): ~ 10,000/sec
          • Usable with the current settings as-is
          • cpu 1~4 Core, memory 4~6GB
      • What if you want to run more nodes / pods?
  • 46. KOCOON-Prometheus SLA for 500
      • Target Cluster
          • ~ 500 nodes (worker + ingress), ~ 4000 pods
          • Metrics appended (5-minute average): 18,000/sec
          • upgrade memory!
      ## Upgrade prometheus memory request & limit
      helm upgrade $(kubectl config current-context) kocoon-stg/kocoon-prometheus --reuse-values --set prometheus.prom
  • 47. KOCOON-Prometheus Requirements • ~ 200 nodes / ~1500 pods • krane VM / PM: 8 Core, 8GB * 2 • ~ 500 nodes / ~4000 pods • krane VM / PM : 8 Core, 16GB * 2
  • 48. Consideration for Service..
      • Resource Request & Limit
          • Set cpu and memory requests to the minimum and remove the limits
          • While running performance tests, use the drill down (namespace to pod) view shared earlier to check the cpu and memory actually needed
          • When going to Production, set request and limit based on the results of those tests
      • A 500 node cluster is possible, but it is more stable to
          • place services / service groups in units of 100 nodes,
          • with 5 master nodes and 2 Prometheus & Alertmanager instances,
          • and reach each cluster through a LB in RoundRobin fashion
  • 49. Sum Up • KOCOON-* is provided with helm charts • KOCOON-Prometheus -> today’s main topic • KOCOON-Cupido : included in KOCOON-Prometheus • KOCOON-Hermes -> will be presented at ifKakao 2019 • KOCOON-DIKE -> developing… helm? : package manager for k8s https://helm.sh/
  • 50. Thanks @@cloud.telemetry with andi, beemo, issac, jenny, joanne & cloud part

Editor's notes

  1. Long tail, Pareto