Kubernetes Clusters At Scale
Managing Hundreds Apache Pinot Kubernetes Clusters
Inside Each End User’s Own Cloud Infrastructure
Xiaoman Dong
DevOps, Software Engineer @StarTree
Apache Pinot
• OLAP Datastore
• Columnar, Indexed storage
• Real-time Low latency analytics
• Distributed – highly available, reliable, scalable
• Lambda architecture
• SQL Interface
• Open Source - Apache TLP
Typical/Traditional SaaS
● K8S Owned by SaaS Company
● Data Stays in the SaaS Company’s Virtual Private Cloud
We Offer a Delegated Management Solution
● K8S Owned by Customer
● Data Stays inside Customer’s Virtual Private Cloud
● Fully Managed by Us
Design Context throughout the Talk
The 3 Major Constraints
● Cloud Boundaries
● Optimized for Apache Pinot
● Scale to hundreds or more
We will focus on how these three make our system special
How do we design such a system?
(My job is safe from ChatGPT ... for now)
The journey: design such a system
• We are going to start small, automate, and dive deeper
• Always think about our context: customer’s cloud, our backend
Step 1: Creating the Clusters
• Each customer will be able to create and see their own clusters
• Self-serve provisioning via UI
• Multi-cloud support (AWS, GCP, Azure)
Step 1: Provisioning
The Manual Way
● Log into the AWS UI with credentials provided by the customer
● Create Account, Networking, Kubernetes Cluster
Automate this!
● ❌ Bash script the aws eks creation
● ✅ Write your own microservice
- Use aws client library
- Terraform
Step 1: Provisioning (Cont’d)
- Scale to 1k customers?
Step 1: Provisioning - Orchestration
Orchestration Engine
Workflow Needed:
1. Create Account
2. Create Network
3. Create NodeGroup
4. Create K8S
5. Create …
6. Notify Finished
Retry in each step, report status
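The workflow above can be sketched as a tiny in-process engine. This is a minimal illustration only: the `Workflow` class, the step functions, and the `max_attempts` setting are hypothetical names invented here, and a production orchestration engine would also persist state and back off between retries.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class StepResult:
    name: str
    attempts: int
    succeeded: bool


@dataclass
class Workflow:
    """Runs named steps in order, retrying each one and recording its status."""
    max_attempts: int = 3
    steps: List[Tuple[str, Callable[[Dict], None]]] = field(default_factory=list)

    def add_step(self, name: str, action: Callable[[Dict], None]) -> None:
        self.steps.append((name, action))

    def run(self, context: Dict) -> List[StepResult]:
        report: List[StepResult] = []
        for name, action in self.steps:
            attempts, succeeded = 0, False
            while attempts < self.max_attempts and not succeeded:
                attempts += 1
                try:
                    action(context)
                    succeeded = True
                except Exception:
                    # A real engine would log the error and back off here.
                    pass
            report.append(StepResult(name, attempts, succeeded))
            if not succeeded:
                break  # Later steps depend on this one, so stop the workflow.
        return report


def flaky_create_network(context: Dict) -> None:
    """Hypothetical step that fails once, standing in for a transient cloud error."""
    context["network_calls"] = context.get("network_calls", 0) + 1
    if context["network_calls"] < 2:
        raise RuntimeError("transient cloud API error")
    context["network_id"] = "net-demo"


def build_demo_workflow() -> Workflow:
    workflow = Workflow(max_attempts=3)
    workflow.add_step("create_account", lambda ctx: ctx.update(account_id="acct-demo"))
    workflow.add_step("create_network", flaky_create_network)
    workflow.add_step("create_k8s", lambda ctx: ctx.update(cluster="k8s-demo"))
    workflow.add_step("notify_finished", lambda ctx: ctx.update(notified=True))
    return workflow
```

Calling `build_demo_workflow().run({})` returns one `StepResult` per step, which is the "report status" part: a UI or API can show which step a provisioning request is on and how many attempts it took.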
Step 2: Installing Applications
Goal: Customers need to access their clusters with Pinot running
Step 2: Installing Applications
The Manual Way
Automate This
● ❌ kubectl apply -f all-apps.yaml
● ✅ helm upgrade --install startree-platform …
● Build our own helm charts
● Run our own private helm repo (or pay for AWS ECR)
● All applications deployed via Helm Chart
● Call helm libraries in our code
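As a rough sketch of driving Helm from code: the slide mentions calling Helm libraries, whereas this example swaps in the simpler approach of building the argument list for the `helm` CLI. The chart reference, namespace, and value keys used in the usage note are placeholders, not names from the talk.

```python
from typing import Dict, List, Optional


def helm_upgrade_install_args(
    release: str,
    chart: str,
    version: str,
    namespace: str,
    values: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build the argv for an idempotent `helm upgrade --install` call."""
    args = [
        "helm", "upgrade", "--install", release, chart,
        "--version", version,        # Pin the chart version explicitly.
        "--namespace", namespace,
        "--create-namespace",        # First install and later upgrades share one command.
        "--wait",                    # Block until resources report ready.
    ]
    # Sort overrides so the same input always yields the same command line.
    for key, value in sorted((values or {}).items()):
        args += ["--set", f"{key}={value}"]
    return args
```

The returned list can be passed to `subprocess.run(args, check=True)`. For example, `helm_upgrade_install_args("startree-platform", "example-repo/platform", "1.2.3", "pinot")` yields the same command whether the release already exists or not, which is what makes it safe to run across a whole fleet.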
K8S Cluster Runs as a Platform, Applications are Pluggable
Charts and docker owned by separate teams 😍
Step 3: Networking
A huge topic worth a dedicated session
Public facing vs. “Internal” facing (VPC Peering)
Kubernetes Has Good Network Modeling and EcoSystems
● Ingress - We choose Traefik, easy for teams to define ingress
● LoadBalancer by Each Cloud Provider
● Extra VPC Peering on demand
● Multi-Zone High Availability
Step 4: TLS and Certificates - Problem
Secure connection is required nearly everywhere
● Even within a VPC/Firewall, customers request it
● Manual certificate generation will not scale
Certificates have expiration dates
● Automated renewal is needed
● First Time Creation == Future Renewal
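One way to read "First Time Creation == Future Renewal" is that a single reconcile routine handles both cases. The sketch below assumes a 30-day renewal window and a hypothetical `issue` callback that obtains a certificate and returns its expiry; neither comes from the talk.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional

RENEW_BEFORE = timedelta(days=30)  # Assumed renewal window, not from the talk.


def needs_issuance(not_after: Optional[datetime], now: datetime) -> bool:
    """True when there is no certificate yet or it expires within the window."""
    if not_after is None:
        return True  # First-time creation.
    return not_after - now <= RENEW_BEFORE  # Renewal.


def reconcile_certificate(
    not_after: Optional[datetime],
    now: datetime,
    issue: Callable[[], datetime],
) -> Optional[datetime]:
    """Return the expiry of the certificate in effect after reconciling.

    `issue` is a hypothetical callback; creation and renewal both go through it,
    so the renewal path is exercised every time a new cluster is onboarded.
    """
    if needs_issuance(not_after, now):
        return issue()
    return not_after
```

Running this on a schedule for every cluster means there is no separate, rarely tested renewal code path.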
Step 4: TLS and Certificates - Knowledge
Facts of Certificates
- Proves that you properly own this DNS name
- To generate a certificate, we need to complete a DNS-related challenge to prove ownership
- Established by a chain of trust
- Issued by well-known/pre-installed 3rd party issuers like ZeroSSL
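To make the DNS challenge concrete, here is a small sketch that assumes the issuer speaks ACME (RFC 8555) and uses its DNS-01 challenge; the talk does not say which protocol is used. The token and account thumbprint are values the ACME server and client would supply, and the function names are invented for illustration.

```python
import base64
import hashlib


def dns01_record_name(domain: str) -> str:
    """Name of the TXT record the issuer looks up for `domain`."""
    return f"_acme-challenge.{domain}"


def dns01_txt_value(token: str, account_thumbprint: str) -> str:
    """Compute the TXT record value for an ACME DNS-01 challenge."""
    # The key authorization joins the server's token and the account key thumbprint.
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    # ACME uses base64url encoding with the trailing padding removed.
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

Whoever can write that TXT record can obtain a certificate for the name, which is why the choice of where DNS credentials live matters for both security and scale.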
Step 4: TLS and Certificates: Centralized
Option 1: Centralized solution
✅ Better Security
❌ Harder to Scale
Step 4: TLS and Certificates: Distributed
Decentralized Certificate Renewal
❌ Less Secure
✅ Easier to Scale Up
Special Part for Delegated Management Solution
Step 5,6,7…
The Usual DevOps stuff
● OIDC for AuthZ/AuthN
● Prometheus + AlertManager for Observability
● Logging, Debugging
● Backup and Disaster Recovery
● Metrics push to centralized monitoring and/or customer’s metrics storage
● Backup to customer’s deep store
Checkpoint 1: Kubernetes Fleet Management
Architecture So Far
A mini version of a multi-cloud Kubernetes fleet management system, like KubeSphere
Wait, What About Apache Pinot?
Pinot Kubernetes Operator
Configuration/Customization
Templated Environment Creation
● Some customers like to enable Groovy in queries, some don't
● Customizations/Configurations are applied onto templates
● Customizations are applied like the Visitor pattern in the old Design Patterns
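A simplified take on this idea: each customization is a callable that visits a copy of the base template and modifies it. This is looser than the classic double-dispatch Visitor, and the template keys below are illustrative placeholders rather than real Pinot settings.

```python
import copy
from typing import Any, Callable, Dict, Iterable

Config = Dict[str, Any]
Customization = Callable[[Config], None]

# Hypothetical base template; keys are illustrative, not real Pinot settings.
BASE_TEMPLATE: Config = {
    "pinot": {"replicas": 3, "groovy_enabled": False},
}


def enable_groovy(config: Config) -> None:
    """Customization for customers who want Groovy in queries."""
    config["pinot"]["groovy_enabled"] = True


def set_replicas(count: int) -> Customization:
    """Build a customization that overrides the replica count."""
    def apply(config: Config) -> None:
        config["pinot"]["replicas"] = count
    return apply


def render_environment(customizations: Iterable[Customization]) -> Config:
    """Copy the template, then let each customization visit and modify it."""
    config = copy.deepcopy(BASE_TEMPLATE)
    for customize in customizations:
        customize(config)
    return config
```

Because every environment starts from the same template, a per-customer difference is just a short list such as `[enable_groovy, set_replicas(5)]`, which is easy to store and review.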
Are we there yet?
“Ops” part of DevOps!
* Image courtesy https://devopedia.org/devops
Version and Upgrades
Version and Upgrades (Cont’d)
The version matrix
Lessons Learnt
● Create good release pipeline with tests
● Discipline: avoid releasing versions with breaking changes
● Keep helm chart and image tag the same as release version
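The last rule is easy to enforce in a release pipeline. A minimal sketch of such a check, with a hypothetical function name and plain string comparison as the assumed policy:

```python
from typing import List


def version_mismatches(release: str, chart_version: str, image_tag: str) -> List[str]:
    """Describe each artifact whose version differs from the release version."""
    problems = []
    if chart_version != release:
        problems.append(f"chart version {chart_version} != release {release}")
    if image_tag != release:
        problems.append(f"image tag {image_tag} != release {release}")
    return problems
```

A pipeline step could fail the build whenever the returned list is non-empty, so the version matrix collapses to a single number per release.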
Efficiency and Reliability
Efficiency and Reliability are key to scaling up
● Discipline in DevOps is important
● No architecture is bulletproof
● Fewer Outages == Better Efficiency
● DevOps was created for end-to-end ownership
Efficiency and Reliability - Cont’d
Best Practices
● Build Good Infra Integration/Regression Test
● Trunk-Based Release Pipelines
○ Always release from master
○ Say no to release branches
● Do not customize clusters with manual kubectl commands
Operations and OnCalls
There is no silver bullet for OnCall
• Discipline and Process
- Root Cause every outage
- Follow up on every outage
• Effective Alerts
- Differentiate alerts from signals
- Review and Keep Improving
- Build metrics to measure effectiveness
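As one possible effectiveness metric (an assumption here, since the talk does not define one), the share of fired alerts that actually required on-call action can be tracked over time. The `AlertRecord` type is hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AlertRecord:
    name: str
    actionable: bool  # Did the on-call engineer need to act on it?


def actionable_ratio(alerts: Iterable[AlertRecord]) -> float:
    """Fraction of fired alerts that required action; 0.0 when none fired."""
    records = list(alerts)
    if not records:
        return 0.0
    return sum(1 for record in records if record.actionable) / len(records)
```

A falling ratio suggests that signals are being paged as alerts and the rules are due for review.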
Lessons Learnt
Security design in a Provisioned Cluster is hard
• Centralized Control, less Scalability
• Decentralized Control, harder to protect credentials
• Build good debugging support on TLS certificates
Do not run complicated Terraform configurations
• Bugs appear when state gets complicated, including unwanted recreation of resources
• Terraform’s internal state is hard to keep track of
Lessons Learnt (cont’d)
A certificate issuer like ZeroSSL may partially go down for half a day
• No new customer can onboard during that downtime
One 3rd party Helm repo going down can block customer cluster upgrades
• Serve Helm charts from your own repo, such as JFrog
What’s Ahead
• Improving Design For Layering
• Improve Resource Efficiency
• No Downtime Upgrade
• Cluster Federation
• …
Questions?
Thank you!
Reach me via https://www.linkedin.com/in/xiaoman/
