How I Botched a Calico Deployment and Nearly Broke a Production Cluster
2021/03/12
Cloud Native Days Online
WHO AM I?
- Name: Kawabe Katsuya
- Team: CyberAgent Group Infrastructure Unit
- Position: Software, Infra Engineer, 2020 New Graduate
- Hobby: Music, Comic
ABOUT US
@KKawabe108
TABLE OF CONTENTS
1. What Happened: the incident caused by a botched Calico deployment
2. How To Resolve: explaining the issue to the product teams, reviewing our monitoring, and working to prevent a recurrence
CNI: Calico
Figure source: https://docs.projectcalico.org/reference/architecture/overview
Alerts always come without warning
Monitoring on AKE
We monitor multiple clusters with VictoriaMetrics.
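What Happened
A flood of alerts reported that the BGP peers between the Ingress and the nodes were down.
Running kubectl get po -A showed that almost every Pod on the master nodes had been Evicted 😇
After SSHing into a master node, we found that the disk had filled up to the point of hitting the eviction threshold.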
What Happened: Ingress implementation
[Figure: Nodes running calico-node, nginx ctrl, and exporter; Big IP VS; BGP Routing]
What Happened: Ingress implementation
[Figure: the same layout of Nodes, calico-node, nginx ctrl, exporter, and Big IP VS, with the BGP link marked "BGP Link is Down"]
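Why did the disk fill up?
What Happened: the ipamhandles resource explosion
ipamhandles is the resource calico-ipam uses to tie Pods to IP addresses.
Normally Calico manages it in step with Pod creation and deletion, or at least it was supposed to...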
What Happened: the ipamhandles resource explosion
😇
Running kubectl get made the API server eat up its memory and die, so we checked with etcdctl instead.
For reference, there were only about 30 Pods.
An API server we cannot operate = a cluster we cannot operate = serious trouble
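To illustrate the kind of check described above, here is a minimal sketch that counts keys per resource from etcdctl-style key listings without going through the API server. It assumes the keys were dumped beforehand with something like etcdctl get --prefix --keys-only, and that Calico custom resources live under the prefix /registry/crd.projectcalico.org/; both the prefix and the sample key names are assumptions for illustration, not taken from the slides.

```python
from collections import Counter


def count_keys_by_resource(keys_output, prefix="/registry/crd.projectcalico.org/"):
    """Count etcd keys per resource type under an assumed key prefix.

    `keys_output` is text with one key per line (blank lines are ignored).
    The key layout <prefix><resource>/<name> is an assumption of this sketch.
    """
    counts = Counter()
    for line in keys_output.splitlines():
        key = line.strip()
        if not key.startswith(prefix):
            continue  # skip blank separator lines and unrelated keys
        resource = key[len(prefix):].split("/", 1)[0]
        counts[resource] += 1
    return counts


# Hypothetical sample of a keys-only dump.
sample = """\
/registry/crd.projectcalico.org/ipamhandles/k8s-pod-network.aaa

/registry/crd.projectcalico.org/ipamhandles/k8s-pod-network.bbb

/registry/crd.projectcalico.org/ipamblocks/10-0-0-0-26
"""

for resource, count in count_keys_by_resource(sample).most_common():
    print(resource, count)
```

On a healthy cluster the ipamhandles count would be expected to stay close to the number of Pods, so a count far above roughly 30 would stand out immediately.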
What Happened: the ipamhandles resource explosion
[Figure: systemd, /var/backup/etcd, and three etcd backups of 2GB each 🧨]
The ipamhandles explosion bloated the etcd backup data, which squeezed a disk of only 20GB.
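The arithmetic behind this can be sketched as follows. The 20GB disk and 2GB backup size come from the slide; the 10% free-space eviction threshold (kubelet's default hard threshold for nodefs.available) and the 8GB of other disk usage are assumptions made only for this example.

```python
def backups_until_eviction(disk_gb, used_gb, backup_gb, eviction_free_percent):
    """Return how many more backups fit before free space falls below the threshold.

    Eviction is assumed to start once free space < eviction_free_percent of the disk.
    """
    usable_gb = disk_gb * (100 - eviction_free_percent) / 100 - used_gb
    if usable_gb <= 0:
        return 0
    return int(usable_gb // backup_gb)


# 20GB disk and 2GB backups are from the slide; the other values are assumed.
remaining = backups_until_eviction(
    disk_gb=20, used_gb=8, backup_gb=2, eviction_free_percent=10
)
print("backups that still fit:", remaining)
```

Under these assumed numbers only a handful of backup generations fit, which is why retaining several 2GB snapshots on a 20GB disk was enough to push the node into eviction.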
What Happened: Summary
Step 1: Calico's ipamhandles explode.
Step 2: The etcd backup data grows (2GB).
Step 3: The master's disk fills up, and calico-node and other Pods are Evicted.
Step 4: With calico-node down, the BGP peers disconnect and the alerts fire.
Root Cause and Response
How To Resolve
We had not deployed the calico-kube-controllers component.
The permissions in the calico-node ClusterRole were wrong.
Figure source: https://docs.projectcalico.org/reference/architecture/overview
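How To Resolve: Lessons learned
Because calico-node had not been given the delete permission, GC never ran.
The mistake crept in because we had been modifying a manifest originally based on 3.8.
Had we downloaded the new official manifest and used it as the base, this mistake would not have happened.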
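A missing verb like this can be caught mechanically before deploying. The sketch below checks a ClusterRole, represented as the plain dict a YAML parser would produce, for verbs that are not granted on a resource. The role fragment and the list of required verbs are hypothetical and are not copied from the official Calico manifest; the check also ignores apiGroups for brevity.

```python
def missing_verbs(cluster_role, resource, required_verbs):
    """Return the required verbs that no rule grants for `resource`.

    Simplification of this sketch: apiGroups and resourceNames are not checked.
    """
    granted = set()
    for rule in cluster_role.get("rules", []):
        resources = rule.get("resources", [])
        if resource in resources or "*" in resources:
            granted.update(rule.get("verbs", []))
    if "*" in granted:
        return set()
    return set(required_verbs) - granted


# Hypothetical ClusterRole fragment with the delete verb left out.
role = {
    "kind": "ClusterRole",
    "metadata": {"name": "calico-node"},
    "rules": [
        {
            "apiGroups": ["crd.projectcalico.org"],
            "resources": ["ipamblocks", "ipamhandles"],
            "verbs": ["get", "list", "create", "update"],
        }
    ],
}

required = ["get", "list", "create", "update", "delete"]
print(missing_verbs(role, "ipamhandles", required))
```

Running a check of this kind in CI against the manifest being deployed would flag the role above, since delete is never granted on ipamhandles.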
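How To Resolve: Response for the product teams
The alert came from a cluster we run for monitoring. In the clusters the products use, Pods were not created and deleted that often, so the growth was not large enough to affect the disks.
We explained the situation right away and fixed the manifests.
After confirming that the risk of a recurrence had been contained, we closed the response.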
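How To Resolve: Response on the monitoring platform
The monitoring platform now tracks the etcd db size and the object count for every cluster.
The disk-usage alert threshold was the same as the Eviction Policy, so we are resetting it to a lower value.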
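The relationship between the two thresholds can be sketched as a small check. The percentages below are assumptions for illustration (the slides do not state the actual values): eviction is assumed to start at 10% free space, and the alert is expressed as a percentage of disk used.

```python
def alert_headroom_percent(alert_used_percent, eviction_free_percent):
    """Return the gap, in percentage points of disk usage, between alert and eviction.

    A positive value means the alert fires before eviction starts; zero or a
    negative value means the alert fires together with eviction or after it.
    """
    eviction_used_percent = 100 - eviction_free_percent
    return eviction_used_percent - alert_used_percent


# Assumed values: eviction at 10% free, old alert at 90% used, new alert at 80% used.
print("old headroom:", alert_headroom_percent(90, 10))
print("new headroom:", alert_headroom_percent(80, 10))
```

With no headroom the alert and the eviction arrive together, which matches what happened here: the first visible symptom was the BGP alert, not a disk warning.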
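How To Resolve: On updating Calico
If you can get by without updating, don't.
We basically use the official manifest as is, so there should be no problems (our changes amount to the Pod CIDR and the IPIP enable flag).
The manifest is large, so a triple-check review of any modification is about right (that is how critical the CNI is).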
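One way to keep such a review manageable is to show reviewers only the lines that differ from the official manifest. The sketch below uses Python's difflib for that; the two manifest fragments and the CIDR values are hypothetical, and on a full-size manifest a dedicated diff tool may be faster.

```python
import difflib


def changed_lines(official, customized):
    """Return only the removed ("- ") and added ("+ ") lines between two manifests."""
    diff = difflib.ndiff(official.splitlines(), customized.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical fragments of an official and a customized manifest.
official = """\
- name: CALICO_IPV4POOL_IPIP
  value: "Always"
- name: CALICO_IPV4POOL_CIDR
  value: "192.168.0.0/16"
"""
customized = """\
- name: CALICO_IPV4POOL_IPIP
  value: "Always"
- name: CALICO_IPV4POOL_CIDR
  value: "10.244.0.0/16"
"""

for line in changed_lines(official, customized):
    print(line)
```

If the output contains anything beyond the intended Pod CIDR and IPIP changes, such as a dropped verb in a ClusterRole or a missing Deployment, that is the part that deserves the extra review passes.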
Take the utmost care when updating your CNI!
Thank you for listening!
