SlideShare ist ein Scribd-Unternehmen logo
1 von 26
Downloaden Sie, um offline zu lesen
Workday’s Next Generation Private Cloud
The Fifth Generation of an OpenStack Platform inside Workday
The Leading Enterprise
Cloud for Finance and HR
Customer Satisfaction:
Workers Supported
97%
60M+
Fortune 500 Companies:
50%+
Silvano Buback
Principal, Software Development Engineer
Jan Gutter
Senior Software Development Engineer
Workday
Private
Cloud
OpenStack at Workday
8 SREs
9 Developers
SLO: 99% API Call
success
87 Clusters
2 Million Cores
12.5 PB RAM
60k concurrent VMs
241k VMs recreated weekly
Simple set of OpenStack components to deliver a resilient platform.
● Single client (PaaS)
● ~300 compute nodes per cluster
● Workday service weekend maintenance “The Patch”
● OpenStack projects are used to denote Workday services
● Unique tooling for batch scheduling and capacity planning
Workday’s Use Case
Regular maintenance window every weekend where where service
VMs are recreated and the Workday application gets upgraded
● “The Power of One” is an important mission for us
● Largest impact to control and data plane during this time
● SLO target is 99% success for all API calls over the week
● 60% of instances deleted/created during “The Patch”
● Remaining 40% are recreated throughout the week
“The Patch”
Development Environment
We Treat
Everything
as a High
Security
Environment
Weekly
Builds in
Dev
Clusters
Dev Clusters
Run Internal
Services
Dev and
Production
Run Very
Different
Workloads
Workday
Private
Cloud
5
Fourth Generation
Private Cloud Evolution
• OpenStack Victoria
• CentOS Stream 8
• Kolla-Ansible + plain Ansible
• Kolla Containers (built from source)
• Calico
• L3 only BGP Fabric
• Zuul CI
• Internal solution for CD
• Branch for each stable series
Fifth Generation
• OpenStack Mitaka
• CentOS Linux 7
• Chef
• RPM
• Contrail
• Overlay Networks
• Jenkins for CI
• Jenkins + Internal solution for CD
• Single branch, releases are snapshots
First use of gated development!
CI Tooling
Target multiple scenarios:
• CLI
• Zuul
• Custom Ansible Orchestration Service
Three types of clusters:
• Overcloud - a cluster built from instances in a single tenant
• Zuul - a cluster built from a nodeset
• Baremetal
Pain Point - Multiple Deployment Scenarios
Zuul: Expectations vs Reality
Successfully
keeps a lot
of core code
stable
Naively
expected to
reuse
community
pipeline
Evolved pipeline
multiple times
with no
interruptions
Community
pipelines tied
to community
infrastructure
● Use branches for stable releases
● Nothing new about this: OpenStack community also uses this
● “branch for stable release” model was a new concept for us
● We forked https://opendev.org/openstack/releases to handle this
Zuul Pipeline Design
For every tool / service, there’s a Workday
name!
Home Grown
Tools
List of Home Grown Tools
DNS
Infrastructure IP Address
Management
Certificate
Authority
Ansible
Orchestration
Multi Cluster
Cloud
Overview
Compute
Node
Health
Check
List of (more) Home Grown Tools
Capacity
Management
Chef
Implementation
Batch
Scheduling
PaaS
(Image Build
Service, Instance
lifecycle
management) BM Lifecycle
Tracking
Bare Metal
Provisioning
Service
Differences with community version
Downstream
Changes
Downstream Changes
TLS everywhere
Compute nodes use
Prometheus/OpenStack
integration
Prometheus upgraded to newer version
Custom tags based on
Kolla-Ansible inventory
Wavefront integration
while we transition to
Cortex
● New Prometheus Exporters (some are upgrades)
○ libvirt exporter
○ OpenStack exporter upgrade
○ BIRD exporter (BGP router)
● Fluentd parses HAproxy/Apache logs to provide API request metrics
● “Singleton” containers
○ One running container per cluster
○ Using Keepalived for HA
○ Examples: Prometheus, DB Backup, openstack-exporter
● Timeouts/Retry/Performance improvements on K-A deployment
(more) Downstream Changes
● Kolla containers for Calico
● Enabled etcdv3 in Kolla-Ansible
● Building C8 binaries
● Using a local fork of the Neutron plugin
● Wrote our own metadata proxy (TLS support)
● Numerous small changes
○ MTU
○ Newer version of OpenStack
○ DHCP service monitoring
● Most of the changes were in the Neutron plugin, Felix code is
essentially unchanged
Calico Fork
Q & A
Random notes about our environment
Other
Interesting Bits
● Every instance gets an internally routable IPv4 address. 🤯
● Multiple layers of network security
● Previously: Contrail with virtual overlay networks
● Now: Calico with routing fabric
Requirements for Networking
● In preparation for OpenStack Victoria, we reduced the use of file
injection in our PaaS system significantly
● We were fortunate because we could move service accounts from
one cluster to another
● To reduce transition time, we allocate overlapping ranges
● During The Patch, instances running on the previous generation
are removed
Forklift
Thank You

Weitere ähnliche Inhalte

Was ist angesagt?

The Top 5 Apache Kafka Use Cases and Architectures in 2022
The Top 5 Apache Kafka Use Cases and Architectures in 2022The Top 5 Apache Kafka Use Cases and Architectures in 2022
The Top 5 Apache Kafka Use Cases and Architectures in 2022
Kai Wähner
 

Was ist angesagt? (20)

Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
 
Apache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceApache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage Service
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Pub/Sub Messaging
Pub/Sub MessagingPub/Sub Messaging
Pub/Sub Messaging
 
Elasticsearch Monitoring in Openshift
Elasticsearch Monitoring in OpenshiftElasticsearch Monitoring in Openshift
Elasticsearch Monitoring in Openshift
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
 
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connect
 
Cloud computing and OpenStack
Cloud computing and OpenStackCloud computing and OpenStack
Cloud computing and OpenStack
 
Kafka Intro With Simple Java Producer Consumers
Kafka Intro With Simple Java Producer ConsumersKafka Intro With Simple Java Producer Consumers
Kafka Intro With Simple Java Producer Consumers
 
Kafka Retry and DLQ
Kafka Retry and DLQKafka Retry and DLQ
Kafka Retry and DLQ
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
 
Real time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaReal time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
The Top 5 Apache Kafka Use Cases and Architectures in 2022
The Top 5 Apache Kafka Use Cases and Architectures in 2022The Top 5 Apache Kafka Use Cases and Architectures in 2022
The Top 5 Apache Kafka Use Cases and Architectures in 2022
 
Envoy and Kafka
Envoy and KafkaEnvoy and Kafka
Envoy and Kafka
 

Ähnlich wie Workday's Next Generation Private Cloud

Intel open stack-summit-session-nov13-final
Intel open stack-summit-session-nov13-finalIntel open stack-summit-session-nov13-final
Intel open stack-summit-session-nov13-final
Deepak Mane
 
Kubernetes for Beginners
Kubernetes for BeginnersKubernetes for Beginners
Kubernetes for Beginners
DigitalOcean
 
20141111_SOS3_Gallo
20141111_SOS3_Gallo20141111_SOS3_Gallo
20141111_SOS3_Gallo
Andrea Gallo
 

Ähnlich wie Workday's Next Generation Private Cloud (20)

Running Production-Grade Kubernetes on AWS
Running Production-Grade Kubernetes on AWSRunning Production-Grade Kubernetes on AWS
Running Production-Grade Kubernetes on AWS
 
Pivotal Container Service Overview
Pivotal Container Service Overview Pivotal Container Service Overview
Pivotal Container Service Overview
 
Intel open stack-summit-session-nov13-final
Intel open stack-summit-session-nov13-finalIntel open stack-summit-session-nov13-final
Intel open stack-summit-session-nov13-final
 
Free GitOps Workshop
Free GitOps WorkshopFree GitOps Workshop
Free GitOps Workshop
 
Red Hat presentatie: Open stack Latest Pure Tech
Red Hat presentatie: Open stack Latest Pure TechRed Hat presentatie: Open stack Latest Pure Tech
Red Hat presentatie: Open stack Latest Pure Tech
 
Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloud
 
Kubernetes for Beginners
Kubernetes for BeginnersKubernetes for Beginners
Kubernetes for Beginners
 
20141111_SOS3_Gallo
20141111_SOS3_Gallo20141111_SOS3_Gallo
20141111_SOS3_Gallo
 
Sven Vogel: Running CloudStack and OpenShift with NetApp on KVM
Sven Vogel: Running CloudStack and OpenShift with NetApp on KVMSven Vogel: Running CloudStack and OpenShift with NetApp on KVM
Sven Vogel: Running CloudStack and OpenShift with NetApp on KVM
 
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
 
Kubernetes 1.12 Update and Container Security with Liz Rice
Kubernetes 1.12 Update and Container Security with Liz RiceKubernetes 1.12 Update and Container Security with Liz Rice
Kubernetes 1.12 Update and Container Security with Liz Rice
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 
OpenEBS hangout #4
OpenEBS hangout #4OpenEBS hangout #4
OpenEBS hangout #4
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
 
Sanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansSanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticians
 
Flexible compute
Flexible computeFlexible compute
Flexible compute
 
OpenNebulaConf 2016 - OpenNebula 5.0 Highlights and Beyond by Ruben S. Monter...
OpenNebulaConf 2016 - OpenNebula 5.0 Highlights and Beyond by Ruben S. Monter...OpenNebulaConf 2016 - OpenNebula 5.0 Highlights and Beyond by Ruben S. Monter...
OpenNebulaConf 2016 - OpenNebula 5.0 Highlights and Beyond by Ruben S. Monter...
 
GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
 
Oracle week Israel - OpenStack Platform - 2013
Oracle week Israel - OpenStack Platform - 2013Oracle week Israel - OpenStack Platform - 2013
Oracle week Israel - OpenStack Platform - 2013
 
Container orchestration and microservices world
Container orchestration and microservices worldContainer orchestration and microservices world
Container orchestration and microservices world
 

Kürzlich hochgeladen

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Kürzlich hochgeladen (20)

Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

Workday's Next Generation Private Cloud

  • 1. Workday’s Next Generation Private Cloud The Fifth Generation of an OpenStack Platform inside Workday
  • 2. The Leading Enterprise Cloud for Finance and HR Customer Satisfaction: Workers Supported 97% 60M+ Fortune 500 Companies: 50%+
  • 3. Silvano Buback Principal, Software Development Engineer Jan Gutter Senior Software Development Engineer
  • 5. OpenStack at Workday 8 SREs 9 Developers SLO: 99% API Call success 87 Clusters 2 Million Cores 12.5 PB RAM 60k concurrent VMs 241k VMs recreated weekly
  • 6. Simple set of OpenStack components to deliver a resilient platform. ● Single client (PaaS) ● ~300 compute nodes per cluster ● Workday service weekend maintenance “The Patch” ● OpenStack projects are used to denote Workday services ● Unique tooling for batch scheduling and capacity planning Workday’s Use Case
  • 7. Regular maintenance window every weekend where where service VMs are recreated and the Workday application gets upgraded ● “The Power of One” is an important mission for us ● Largest impact to control and data plane during this time ● SLO target is 99% success for all API calls over the week ● 60% of instances deleted/created during “The Patch” ● Remaining 40% are recreated throughout the week “The Patch”
  • 8. Development Environment We Treat Everything as a High Security Environment Weekly Builds in Dev Clusters Dev Clusters Run Internal Services Dev and Production Run Very Different Workloads
  • 10. Fourth Generation Private Cloud Evolution • OpenStack Victoria • CentOS Stream 8 • Kolla-Ansible + plain Ansible • Kolla Containers (built from source) • Calico • L3 only BGP Fabric • Zuul CI • Internal solution for CD • Branch for each stable series Fifth Generation • OpenStack Mitaka • CentOS Linux 7 • Chef • RPM • Contrail • Overlay Networks • Jenkins for CI • Jenkins + Internal solution for CD • Single branch, releases are snapshots
  • 11. First use of gated development! CI Tooling
  • 12. Target multiple scenarios: • CLI • Zuul • Custom Ansible Orchestration Service Three types of clusters: • Overcloud - a cluster built from instances in a single tenant • Zuul - a cluster built from a nodeset • Baremetal Pain Point - Multiple Deployment Scenarios
  • 13. Zuul: Expectations vs Reality Successfully keeps a lot of core code stable Naively expected to reuse community pipeline Evolved pipeline multiple times with no interruptions Community pipelines tied to community infrastructure
  • 14. ● Use branches for stable releases ● Nothing new about this: OpenStack community also uses this ● “branch for stable release” model was a new concept for us ● We forked https://opendev.org/openstack/releases to handle this Zuul Pipeline Design
  • 15. For every tool / service, there’s a Workday name! Home Grown Tools
  • 16. List of Home Grown Tools DNS Infrastructure IP Address Management Certificate Authority Ansible Orchestration Multi Cluster Cloud Overview Compute Node Health Check
  • 17. List of (more) Home Grown Tools Capacity Management Chef Implementation Batch Scheduling PaaS (Image Build Service, Instance lifecycle management) BM Lifecycle Tracking Bare Metal Provisioning Service
  • 18. Differences with community version Downstream Changes
  • 19. Downstream Changes TLS everywhere Compute nodes use Prometheus/OpenStack integration Prometheus upgraded to newer version Custom tags based on Kolla-Ansible inventory Wavefront integration while we transition to Cortex
  • 20. ● New Prometheus Exporters (some are upgrades) ○ libvirt exporter ○ OpenStack exporter upgrade ○ BIRD exporter (BGP router) ● Fluentd parses HAproxy/Apache logs to provide API request metrics ● “Singleton” containers ○ One running container per cluster ○ Using Keepalived for HA ○ Examples: Prometheus, DB Backup, openstack-exporter ● Timeouts/Retry/Performance improvements on K-A deployment (more) Downstream Changes
  • 21. ● Kolla containers for Calico ● Enabled etcdv3 in Kolla-Ansible ● Building C8 binaries ● Using a local fork of the Neutron plugin ● Wrote our own metadata proxy (TLS support) ● Numerous small changes ○ MTU ○ Newer version of OpenStack ○ DHCP service monitoring ● Most of the changes were in the Neutron plugin, Felix code is essentially unchanged Calico Fork
  • 22. Q & A
  • 23. Random notes about our environment Other Interesting Bits
  • 24. ● Every instance gets an internally routable IPv4 address. 🤯 ● Multiple layers of network security ● Previously: Contrail with virtual overlay networks ● Now: Calico with routing fabric Requirements for Networking
  • 25. ● In preparation for OpenStack Victoria, we reduced the use of file injection in our PaaS system significantly ● We were fortunate because we could move service accounts from one cluster to another ● To reduce transition time, we allocate overlapping ranges ● During The Patch, instances running on the previous generation are removed Forklift