SlideShare ist ein Scribd-Unternehmen logo
1 von 42
Downloaden Sie, um offline zu lesen
Moving from
CellsV1 to CellsV2
at CERN
OpenStack Summit - Vancouver 2018
Belmiro Moreira
belmiro.moreira@cern.ch @belmiromoreira
CERN - Cloud resources status board - 11/05/2018@09:23
Cells at CERN
â—Ź CERN uses cells since 2013
â—Ź Why cells?
â—‹ Single endpoint. Scale transparently between different Data Centres
â—‹ Availability and Resilience
â—‹ Isolate failure domains
â—‹ Dedicate cells to projects
â—‹ Hardware type per cell
â—‹ Easy to introduce new configurations
Cells at CERN
â—Ź Disadvantages
â—‹ Unmaintained upstream
â—‹ Only few deployments using Cells
â—‹ Several functionality missing
â–  Flavor propagation
â–  Aggregates
â–  Server groups
â–  Security groups
â–  ...
CellsV1 architecture at CERN
Nova API Servers
CellA compute nodes
CellA controller
CellB compute nodes
CellB controller
TOP Cell controllers
nova DB
nova DB
nova DB
RabbitMQ
RabbitMQ
RabbitMQ
CellsV1 architecture at CERN (Newton)
compute node
nova-compute
child cell controller
nova-api
nova-conductor
nova-scheduler
nova-network
nova-cells
RabbitMQ
top cell controller
nova-novncproxy
nova-consoleauth
nova-cellsnova DB
API nodes
nova-api
nova DB
x10
RabbitMQ cluster
x4
x200
x1
x70
nova_api DB
Journey to CellsV2
Newton Ocata Pike Queens
Before Ocata Upgrade
â—Ź Enable Placement
â—‹ Introduced in Newton release
â—‹ Required in Ocata
â—‹ nova-scheduler runs per cell in cellsV1
â—Ź How to deploy Placement with cellsV1 in a large production environment?
â—‹ Placement retrieves the allocation candidates to the scheduler
â—‹ Placement is not cell aware
â—‹ Global vs Local (in the Cell)
â–  Global: scheduler gets all allocation candidates available in the cloud
â–  Local: scheduler gets only the allocation candidates available in the cloud
Setup Placement per cell
â—Ź Create a region per cell
â—Ź Create a placement endpoint per region
● Configure a “nova_api” DB per cell
â—Ź Run a placement service per cell in each cell controller
â—Ź Configure the compute nodes of the cell to use the cell placement
CellsV1 architecture with local placement
compute node
nova-compute
child cell controller
nova-api
nova-conductor
nova-scheduler
nova-network
nova-cells
RabbitMQ
top cell controller
nova-novncproxy
nova-consoleauth
nova-cellsnova DB
API nodes
nova-api
nova DB
RabbitMQ cluster
placement-api
nova_api DB
nova_api DB
â—Ź Issues
○ “build_requests” not deleted in the top “nova_api”
â—‹ https://review.openstack.org/#/c/523187/
â—Ź Keystone needs to scale accordingly
Enable placement per cell
Keystone - number of requests when enabling placement
Upgrade to Ocata
â—Ź Data migrations
â—‹ flavors, keypairs, aggregates moved to nova_api (Top cell DB)
â—‹ migrate_instance_keypairs required to run in cells DBs
â–  However keypairs only exist in Top cell DB
â–  https://bugs.launchpad.net/nova/+bug/1761197
■ Migration tool that populates cells “instance_extra” table from “nova_api” DB
â—‹ No data migrations required in cells DBs
○ “db sync” in child cells fails because there are flavors not moved to nova_api (local)
â—Ź DB schema
â—‹ migration 346 can take a lot of time (remove 'schedule_at' column from instances table)
â–  consider archive and then truncate shadow tables
○ “api_db sync” fails if cells not defined even if running cellsV1
Upgrade to Ocata
● Add cells mapping in all “nova_api” DBs
â—‹ cell0 (will not be used) and Top cell
â—‹ Other cells mapping are not required
● “use_local” removed in Ocata
â—‹ Changed nova-network to continue to support it!
â—Ź Inventory min_unit, max_unit and step_size constraints are enforced in Ocata
â—‹ https://bugs.launchpad.net/nova/+bug/1638681
â—‹ Problematic if not all compute nodes are upgraded to Ocata
Consolidate Placement
compute node
nova-compute
child cell controller
nova-api
nova-conductor
nova-scheduler
nova-network
nova-cells
RabbitMQ
top cell controller
nova-novncproxy
nova-consoleauth
nova-cellsnova DB
API nodes
nova-api
nova DB
RabbitMQ cluster
nova_api DB
top cell controller
placement-api
nova_api DB
placement-api
Consolidate Placement
● Change endpoints to “central” placement
○ “placement_region” and “nova_api”
â—‹ Applied per cell (few cells per day)
â–  Need to learning how to scale placement-api
â—‹ Scheduling time expected to go up
Consolidate Placement
â—Ź Local placement disabled in all cells
○ Moved last 15 cells to “central” placement
â—‹ Scheduler time increased
â—‹ Placement request time also increased
Consolidate Placement
â—Ź Fell apart during the night...
● Memcached reached the “max_connections”
○ Increased “max_connections”
○ Increased the number of “placement-api” servers
Consolidate Placement
â—Ź Moved 70 local Placements to the central Placement
○ Didn’t copied the data from the local nova_api DBs
â—‹ resource_providers, inventory and allocations are recreated
â—Ź Running on apache WSGI
â—Ź 10 servers (VMs 4 vcpus/8 GiB)
â—‹ 4 processes/20 threads
○ Increased number of connections on “nova_api” DB
â—Ź ~1000 compute nodes per placement-api server
â—Ź memcached cluster for keystone auth_token
Scheduling time after Consolidate Placement
â—Ź Scheduling time in few cells was better than expected
â—Ź Ocata scheduler only uses Placement after all compute nodes are upgraded
if service_version < 16:
LOG.debug("Skipping call to placement, as upgrade in progress.")
CellsV2 in Queens
â—Ź Advantages
○ Finally using the “loved” code
â—‹ Can remove all internal cellsV1 patches
â—Ź Concerns
â—‹ Is someone else running cellsV2 with more than one cell?
â—‹ Scheduling limitations
â—‹ Availability/Resilience issues
Scheduling
â—Ź How to dedicate cells to projects?
â—‹ No cell_filters equivalent in cellsV2
â—Ź Scheduler is global
â—‹ Scheduler doesn't know about cells
○ Placement doesn’t know about cells
â—‹ Scheduler needs to receive all available allocation candidates from placement
â–  https://review.openstack.org/#/c/531517/ (scheduler/max_placement_results)
â—‹ Availability zone selection is a scheduler filter
● Can’t enable/disable scheduler filters per cell
● Can’t enable/disable a cell
â—‹ https://review.openstack.org/#/c/546684/
Scheduling
â—Ź Placement request-filter
â—‹ https://review.openstack.org/#/c/544585/
â—Ź Initial work already done for Rocky
â—Ź CERN backported it for Queens
â—Ź Created our own filters
â—‹ AVZ support
â—‹ project-cell mapping
â—‹ flavor-cell mapping
â—Ź Few commits you may want to consider to backport to Queens
â—‹ https://review.openstack.org/#/q/project:openstack/nova+branch:master+topic:bp/placement-req-filter
Scheduling
â—Ź Placement request-filter uses aggregates
â—‹ Create an aggregate per cell
â—‹ Add hosts to the aggregates
â—‹ Add the aggregate metadata for the request-filter
â—‹ Placement aggregates are created and resource providers mapped
â–  Mirror host aggregates to placement: https://review.openstack.org/#/c/545057/
â—Ź Difficult to manage in large deployments
○ “Forgotten” nodes will not receive instances
â—‹ Mistakes can lead to wrong scheduling
○ Deleting a cell doesn’t delete resource_providers, resource_provider_aggregates,
aggregate_hosts
â–  https://bugs.launchpad.net/nova/+bug/1749734
Availability
â—Ź If a cell/DB is down all cloud is affected
○ Can’t list instances
○ Can’t create instances
○ …
â—Ź Looking back we only had few issues with DBs
â—‹ Felt confident to move to CellsV2
â—Ź Upstream discussion on how to fix/improve the availability problem
â—‹ https://review.openstack.org/#/c/557369/
Upgrade to Queens
● “Shutdown the cloud”
â—Ź Steps we followed for the upgrade
â—‹ Upgrade packages
â—‹ Data migrations / DB schema
â–  Pike/Queens data migrations
â—Ź Quotas, service UUIDs, block_device UUIDs, migrations UUIDs
â–  Top cell DB will be removed
â—‹ Create cells in nova_api DB
â—‹ Delete current instance_mappings
â—‹ Recreate instance_mappings per cell
â—‹ Discover hosts
Upgrade to Queens
â—‹ Create aggregates per cell and populate aggregate_hosts, aggregate_metadata
â—‹ Create placement aggregates and populate resource_provider_aggregates
â—‹ Setup AVZs
â—‹ Enable nova-scheduler and nova-conductor services in the top control plane
â—‹ Remove nova-cells service from parent and child cells
â—‹ Remove nova-scheduler from child cells controllers
â—‹ Upgrade compute nodes
â—Ź Start the cloud
After Queens upgrade
CPU load in nova-api servers CPU load in placement-api servers
Placement - number of requests
90
What changed in Placement?
â—Ź Refresh aggregates, traits and aggregate-associated sharing providers
â—‹ ASSOCIATION_REFRESH = 5m
â—‹ Made the option configurable:
â–  Master: https://review.openstack.org/#/c/565526/
â–  Backported to Queens: https://review.openstack.org/#/c/566288/
â—‹ Set it to a very large value
â—Ź However it still runs when nova-compute restarts
â—‹ Problematic with Ironic
â—‹ At the end we removed this code path
Placement - number of requests
90
Placement
â—Ź Doubled the number of placement-api nodes
â—‹ ~500 compute nodes per placement-api server
â—Ź In average request time < 100ms
Nova API request time
Database load pattern
â—Ź Number of queries in Cell DBs more than double after the upgrade
â—‹ APIs only available to few users
â—Ź Connection rate increased
â—‹ Clients could not connect. API calls failed
â—‹ Reviewed DB configuration. Related with ulimits of mysql processes
nova list / nova boot
â—Ź To list instances the request goes to all cells DBs
â—‹ Problematic if a group of DBs is slow or has connection issues
â—‹ Fails if a DB is down
â—Ź DBs for Wigner data centre cells are located in Wigner
â—‹ API servers are located in Geneva
â—Ź To minimize the impact deployed few patches
â—‹ Nova list only queries the cells DBs where the project has instances
â–  https://review.openstack.org/#/c/509003
â—‹ Quota calculation only queries the cells DBs where the project has instances
â–  https://bugs.launchpad.net/nova/+bug/1771810
Minor issues
â—Ź Availability zones in api-metadata
â—‹ https://bugs.launchpad.net/nova/+bug/1768876
â—Ź nova-compute (ironic) creates new resource provider when failover
â—‹ resource_provider_aggregate lost
â—‹ https://bugs.launchpad.net/nova/+bug/1771806
â—Ź Scheduler host_manager gathering info
â—‹ Makes it parallel. Ignore cells down: https://review.openstack.org/#/c/539617/
â—Ź Service list
â—‹ Not parallel. Fails if a cell is down: https://bugs.launchpad.net/nova/+bug/1726310
● nova-network doesn’t start when using cellsV2
Today - metrics
CellsV2 architecture at CERN (Queens)
compute node
nova-compute
child cell controller
nova-api
nova-conductor
nova-network
RabbitMQ
top cell controller
nova-scheduler
nova-conductor
nova-placementnova DB
API nodes
nova-api
nova DB
RabbitMQ cluster
nova_api DB
x20
x16
x200
x1
x70
nova-novncproxy
nova-consoleauth
Summary
CERN cloud is running Nova Queens with CellsV2
â—Ź Moving from CellsV1 is not a trivial upgrade
â—Ź CellsV2 works at scale
â—Ź Availability/Resilience issues
Thanks to all Nova Team!
belmiro.moreira@cern.ch
@belmiromoreira
http://openstack-in-production.blogspot.com

Weitere ähnliche Inhalte

Was ist angesagt?

20161025 OpenStack at CERN Barcelona
20161025 OpenStack at CERN Barcelona20161025 OpenStack at CERN Barcelona
20161025 OpenStack at CERN BarcelonaTim Bell
 
The OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack NordicThe OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack NordicTim Bell
 
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014Belmiro Moreira
 
20150924 rda federation_v1
20150924 rda federation_v120150924 rda federation_v1
20150924 rda federation_v1Tim Bell
 
20190620 accelerating containers v3
20190620 accelerating containers v320190620 accelerating containers v3
20190620 accelerating containers v3Tim Bell
 
Learning to Scale OpenStack
Learning to Scale OpenStackLearning to Scale OpenStack
Learning to Scale OpenStackRainya Mosher
 
OpenContrail Implementations
OpenContrail ImplementationsOpenContrail Implementations
OpenContrail ImplementationsJakub Pavlik
 
OpenStack @ CERN, by Tim Bell
OpenStack @ CERN, by Tim BellOpenStack @ CERN, by Tim Bell
OpenStack @ CERN, by Tim BellAmrita Prasad
 
Deep Dive Into the CERN Cloud Infrastructure - November, 2013
Deep Dive Into the CERN Cloud Infrastructure - November, 2013Deep Dive Into the CERN Cloud Infrastructure - November, 2013
Deep Dive Into the CERN Cloud Infrastructure - November, 2013Belmiro Moreira
 
Operators experience and perspective on SDN with VLANs and L3 Networks
Operators experience and perspective on SDN with VLANs and L3 NetworksOperators experience and perspective on SDN with VLANs and L3 Networks
Operators experience and perspective on SDN with VLANs and L3 NetworksJakub Pavlik
 
Swami osi bangalore2017days pike release_updates
Swami osi bangalore2017days pike release_updatesSwami osi bangalore2017days pike release_updates
Swami osi bangalore2017days pike release_updatesRanga Swami Reddy Muthumula
 
OpenStack Ousts vCenter for DevOps and Unites IT Silos at AVG Technologies
OpenStack Ousts vCenter for DevOps and Unites IT Silos at AVG Technologies OpenStack Ousts vCenter for DevOps and Unites IT Silos at AVG Technologies
OpenStack Ousts vCenter for DevOps and Unites IT Silos at AVG Technologies Jakub Pavlik
 
The OpenStack Cloud at CERN
The OpenStack Cloud at CERNThe OpenStack Cloud at CERN
The OpenStack Cloud at CERNArne Wiebalck
 
OpenContrail Experience tcp cloud OpenStack Summit Tokyo
OpenContrail Experience tcp cloud OpenStack Summit TokyoOpenContrail Experience tcp cloud OpenStack Summit Tokyo
OpenContrail Experience tcp cloud OpenStack Summit TokyoJakub Pavlik
 
Kubernetes and OpenStack at Scale
Kubernetes and OpenStack at ScaleKubernetes and OpenStack at Scale
Kubernetes and OpenStack at ScaleStephen Gordon
 
Mirantis OpenStack-DC-Meetup 17 Sept 2014
Mirantis OpenStack-DC-Meetup 17 Sept 2014Mirantis OpenStack-DC-Meetup 17 Sept 2014
Mirantis OpenStack-DC-Meetup 17 Sept 2014Mirantis
 
High availability and fault tolerance of openstack
High availability and fault tolerance of openstackHigh availability and fault tolerance of openstack
High availability and fault tolerance of openstackDeepak Mane
 
Topologies of OpenStack
Topologies of OpenStackTopologies of OpenStack
Topologies of OpenStackharibabu kasturi
 
A Container Stack for Openstack - OpenStack Silicon Valley
A Container Stack for Openstack - OpenStack Silicon ValleyA Container Stack for Openstack - OpenStack Silicon Valley
A Container Stack for Openstack - OpenStack Silicon ValleyStephen Gordon
 

Was ist angesagt? (20)

20161025 OpenStack at CERN Barcelona
20161025 OpenStack at CERN Barcelona20161025 OpenStack at CERN Barcelona
20161025 OpenStack at CERN Barcelona
 
The OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack NordicThe OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack Nordic
 
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014
 
20150924 rda federation_v1
20150924 rda federation_v120150924 rda federation_v1
20150924 rda federation_v1
 
20190620 accelerating containers v3
20190620 accelerating containers v320190620 accelerating containers v3
20190620 accelerating containers v3
 
Learning to Scale OpenStack
Learning to Scale OpenStackLearning to Scale OpenStack
Learning to Scale OpenStack
 
OpenContrail Implementations
OpenContrail ImplementationsOpenContrail Implementations
OpenContrail Implementations
 
OpenStack @ CERN, by Tim Bell
OpenStack @ CERN, by Tim BellOpenStack @ CERN, by Tim Bell
OpenStack @ CERN, by Tim Bell
 
Deep Dive Into the CERN Cloud Infrastructure - November, 2013
Deep Dive Into the CERN Cloud Infrastructure - November, 2013Deep Dive Into the CERN Cloud Infrastructure - November, 2013
Deep Dive Into the CERN Cloud Infrastructure - November, 2013
 
Operators experience and perspective on SDN with VLANs and L3 Networks
Operators experience and perspective on SDN with VLANs and L3 NetworksOperators experience and perspective on SDN with VLANs and L3 Networks
Operators experience and perspective on SDN with VLANs and L3 Networks
 
Swami osi bangalore2017days pike release_updates
Swami osi bangalore2017days pike release_updatesSwami osi bangalore2017days pike release_updates
Swami osi bangalore2017days pike release_updates
 
OpenStack Ousts vCenter for DevOps and Unites IT Silos at AVG Technologies
OpenStack Ousts vCenter for DevOps and Unites IT Silos at AVG Technologies OpenStack Ousts vCenter for DevOps and Unites IT Silos at AVG Technologies
OpenStack Ousts vCenter for DevOps and Unites IT Silos at AVG Technologies
 
The OpenStack Cloud at CERN
The OpenStack Cloud at CERNThe OpenStack Cloud at CERN
The OpenStack Cloud at CERN
 
OpenContrail Experience tcp cloud OpenStack Summit Tokyo
OpenContrail Experience tcp cloud OpenStack Summit TokyoOpenContrail Experience tcp cloud OpenStack Summit Tokyo
OpenContrail Experience tcp cloud OpenStack Summit Tokyo
 
Kubernetes and OpenStack at Scale
Kubernetes and OpenStack at ScaleKubernetes and OpenStack at Scale
Kubernetes and OpenStack at Scale
 
Mirantis OpenStack-DC-Meetup 17 Sept 2014
Mirantis OpenStack-DC-Meetup 17 Sept 2014Mirantis OpenStack-DC-Meetup 17 Sept 2014
Mirantis OpenStack-DC-Meetup 17 Sept 2014
 
High availability and fault tolerance of openstack
High availability and fault tolerance of openstackHigh availability and fault tolerance of openstack
High availability and fault tolerance of openstack
 
Topologies of OpenStack
Topologies of OpenStackTopologies of OpenStack
Topologies of OpenStack
 
Meetup 23 - 02 - OVN - The future of networking in OpenStack
Meetup 23 - 02 - OVN - The future of networking in OpenStackMeetup 23 - 02 - OVN - The future of networking in OpenStack
Meetup 23 - 02 - OVN - The future of networking in OpenStack
 
A Container Stack for Openstack - OpenStack Silicon Valley
A Container Stack for Openstack - OpenStack Silicon ValleyA Container Stack for Openstack - OpenStack Silicon Valley
A Container Stack for Openstack - OpenStack Silicon Valley
 

Ă„hnlich wie Moving from CellsV1 to CellsV2 at CERN

M|18 Creating a Reference Architecture for High Availability at Nokia
M|18 Creating a Reference Architecture for High Availability at NokiaM|18 Creating a Reference Architecture for High Availability at Nokia
M|18 Creating a Reference Architecture for High Availability at NokiaMariaDB plc
 
Kubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the DatacenterKubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the DatacenterKevin Lynch
 
Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015Belmiro Moreira
 
2010 12 mysql_clusteroverview
2010 12 mysql_clusteroverview2010 12 mysql_clusteroverview
2010 12 mysql_clusteroverviewDimas Prasetyo
 
Clusternaut: Orchestrating  Percona XtraDB Cluster with Kubernetes
Clusternaut:  Orchestrating  Percona XtraDB Cluster with KubernetesClusternaut:  Orchestrating  Percona XtraDB Cluster with Kubernetes
Clusternaut: Orchestrating  Percona XtraDB Cluster with KubernetesRaghavendra Prabhu
 
Rook - cloud-native storage
Rook - cloud-native storageRook - cloud-native storage
Rook - cloud-native storageKarol Chrapek
 
hbaseconasia2019 Test-suite for Automating Data-consistency checks on HBase
hbaseconasia2019 Test-suite for Automating Data-consistency checks on HBasehbaseconasia2019 Test-suite for Automating Data-consistency checks on HBase
hbaseconasia2019 Test-suite for Automating Data-consistency checks on HBaseMichael Stack
 
Kafka meetup seattle 2019 mirus reliable, high performance replication for ap...
Kafka meetup seattle 2019 mirus reliable, high performance replication for ap...Kafka meetup seattle 2019 mirus reliable, high performance replication for ap...
Kafka meetup seattle 2019 mirus reliable, high performance replication for ap...Nitin Kumar
 
Large scale overlay networks with ovn: problems and solutions
Large scale overlay networks with ovn: problems and solutionsLarge scale overlay networks with ovn: problems and solutions
Large scale overlay networks with ovn: problems and solutionsHan Zhou
 
Container orchestration and microservices world
Container orchestration and microservices worldContainer orchestration and microservices world
Container orchestration and microservices worldKarol Chrapek
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016aspyker
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Sharma Podila
 
A first look at MariaDB 11.x features and ideas on how to use them
A first look at MariaDB 11.x features and ideas on how to use themA first look at MariaDB 11.x features and ideas on how to use them
A first look at MariaDB 11.x features and ideas on how to use themFederico Razzoli
 
Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17aspyker
 
OSMC 2022 | Let’s build a private cloud – how hard can it be? by Kevin Honka
OSMC 2022 | Let’s build a private cloud – how hard can it be? by Kevin HonkaOSMC 2022 | Let’s build a private cloud – how hard can it be? by Kevin Honka
OSMC 2022 | Let’s build a private cloud – how hard can it be? by Kevin HonkaNETWAYS
 
DevOps Fest 2019. Stanislav Kolenkin. Сonnecting pool Kubernetes clusters: Fe...
DevOps Fest 2019. Stanislav Kolenkin. Сonnecting pool Kubernetes clusters: Fe...DevOps Fest 2019. Stanislav Kolenkin. Сonnecting pool Kubernetes clusters: Fe...
DevOps Fest 2019. Stanislav Kolenkin. Сonnecting pool Kubernetes clusters: Fe...DevOps_Fest
 
OpenNebulaConf2018 - Is Hyperconverged Infrastructure what you need? - Boyan ...
OpenNebulaConf2018 - Is Hyperconverged Infrastructure what you need? - Boyan ...OpenNebulaConf2018 - Is Hyperconverged Infrastructure what you need? - Boyan ...
OpenNebulaConf2018 - Is Hyperconverged Infrastructure what you need? - Boyan ...OpenNebula Project
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingAlluxio, Inc.
 
How the OOM Killer Deleted My Namespace
How the OOM Killer Deleted My NamespaceHow the OOM Killer Deleted My Namespace
How the OOM Killer Deleted My NamespaceLaurent Bernaille
 

Ă„hnlich wie Moving from CellsV1 to CellsV2 at CERN (20)

M|18 Creating a Reference Architecture for High Availability at Nokia
M|18 Creating a Reference Architecture for High Availability at NokiaM|18 Creating a Reference Architecture for High Availability at Nokia
M|18 Creating a Reference Architecture for High Availability at Nokia
 
Kubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the DatacenterKubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the Datacenter
 
Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015
 
2010 12 mysql_clusteroverview
2010 12 mysql_clusteroverview2010 12 mysql_clusteroverview
2010 12 mysql_clusteroverview
 
Clusternaut: Orchestrating  Percona XtraDB Cluster with Kubernetes
Clusternaut:  Orchestrating  Percona XtraDB Cluster with KubernetesClusternaut:  Orchestrating  Percona XtraDB Cluster with Kubernetes
Clusternaut: Orchestrating  Percona XtraDB Cluster with Kubernetes
 
Rook - cloud-native storage
Rook - cloud-native storageRook - cloud-native storage
Rook - cloud-native storage
 
hbaseconasia2019 Test-suite for Automating Data-consistency checks on HBase
hbaseconasia2019 Test-suite for Automating Data-consistency checks on HBasehbaseconasia2019 Test-suite for Automating Data-consistency checks on HBase
hbaseconasia2019 Test-suite for Automating Data-consistency checks on HBase
 
Kafka meetup seattle 2019 mirus reliable, high performance replication for ap...
Kafka meetup seattle 2019 mirus reliable, high performance replication for ap...Kafka meetup seattle 2019 mirus reliable, high performance replication for ap...
Kafka meetup seattle 2019 mirus reliable, high performance replication for ap...
 
BigData Developers MeetUp
BigData Developers MeetUpBigData Developers MeetUp
BigData Developers MeetUp
 
Large scale overlay networks with ovn: problems and solutions
Large scale overlay networks with ovn: problems and solutionsLarge scale overlay networks with ovn: problems and solutions
Large scale overlay networks with ovn: problems and solutions
 
Container orchestration and microservices world
Container orchestration and microservices worldContainer orchestration and microservices world
Container orchestration and microservices world
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016
 
A first look at MariaDB 11.x features and ideas on how to use them
A first look at MariaDB 11.x features and ideas on how to use themA first look at MariaDB 11.x features and ideas on how to use them
A first look at MariaDB 11.x features and ideas on how to use them
 
Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17
 
OSMC 2022 | Let’s build a private cloud – how hard can it be? by Kevin Honka
OSMC 2022 | Let’s build a private cloud – how hard can it be? by Kevin HonkaOSMC 2022 | Let’s build a private cloud – how hard can it be? by Kevin Honka
OSMC 2022 | Let’s build a private cloud – how hard can it be? by Kevin Honka
 
DevOps Fest 2019. Stanislav Kolenkin. Сonnecting pool Kubernetes clusters: Fe...
DevOps Fest 2019. Stanislav Kolenkin. Сonnecting pool Kubernetes clusters: Fe...DevOps Fest 2019. Stanislav Kolenkin. Сonnecting pool Kubernetes clusters: Fe...
DevOps Fest 2019. Stanislav Kolenkin. Сonnecting pool Kubernetes clusters: Fe...
 
OpenNebulaConf2018 - Is Hyperconverged Infrastructure what you need? - Boyan ...
OpenNebulaConf2018 - Is Hyperconverged Infrastructure what you need? - Boyan ...OpenNebulaConf2018 - Is Hyperconverged Infrastructure what you need? - Boyan ...
OpenNebulaConf2018 - Is Hyperconverged Infrastructure what you need? - Boyan ...
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio Caching
 
How the OOM Killer Deleted My Namespace
How the OOM Killer Deleted My NamespaceHow the OOM Killer Deleted My Namespace
How the OOM Killer Deleted My Namespace
 

KĂĽrzlich hochgeladen

How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfWilly Marroquin (WillyDevNET)
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 

KĂĽrzlich hochgeladen (20)

How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 

Moving from CellsV1 to CellsV2 at CERN

  • 1.
  • 2. Moving from CellsV1 to CellsV2 at CERN OpenStack Summit - Vancouver 2018 Belmiro Moreira belmiro.moreira@cern.ch @belmiromoreira
  • 3.
  • 4.
  • 5. CERN - Cloud resources status board - 11/05/2018@09:23
  • 6. Cells at CERN â—Ź CERN uses cells since 2013 â—Ź Why cells? â—‹ Single endpoint. Scale transparently between different Data Centres â—‹ Availability and Resilience â—‹ Isolate failure domains â—‹ Dedicate cells to projects â—‹ Hardware type per cell â—‹ Easy to introduce new configurations
  • 7. Cells at CERN â—Ź Disadvantages â—‹ Unmaintained upstream â—‹ Only few deployments using Cells â—‹ Several functionality missing â–  Flavor propagation â–  Aggregates â–  Server groups â–  Security groups â–  ...
  • 8. CellsV1 architecture at CERN Nova API Servers CellA compute nodes CellA controller CellB compute nodes CellB controller TOP Cell controllers nova DB nova DB nova DB RabbitMQ RabbitMQ RabbitMQ
  • 9. CellsV1 architecture at CERN (Newton) compute node nova-compute child cell controller nova-api nova-conductor nova-scheduler nova-network nova-cells RabbitMQ top cell controller nova-novncproxy nova-consoleauth nova-cellsnova DB API nodes nova-api nova DB x10 RabbitMQ cluster x4 x200 x1 x70 nova_api DB
  • 10. Journey to CellsV2 Newton Ocata Pike Queens
  • 11. Before Ocata Upgrade â—Ź Enable Placement â—‹ Introduced in Newton release â—‹ Required in Ocata â—‹ nova-scheduler runs per cell in cellsV1 â—Ź How to deploy Placement with cellsV1 in a large production environment? â—‹ Placement retrieves the allocation candidates to the scheduler â—‹ Placement is not cell aware â—‹ Global vs Local (in the Cell) â–  Global: scheduler gets all allocation candidates available in the cloud â–  Local: scheduler gets only the allocation candidates available in the cloud
  • 12. Setup Placement per cell â—Ź Create a region per cell â—Ź Create a placement endpoint per region â—Ź Configure a “nova_api” DB per cell â—Ź Run a placement service per cell in each cell controller â—Ź Configure the compute nodes of the cell to use the cell placement
  • 13. CellsV1 architecture with local placement compute node nova-compute child cell controller nova-api nova-conductor nova-scheduler nova-network nova-cells RabbitMQ top cell controller nova-novncproxy nova-consoleauth nova-cellsnova DB API nodes nova-api nova DB RabbitMQ cluster placement-api nova_api DB nova_api DB
  • 14. â—Ź Issues â—‹ “build_requests” not deleted in the top “nova_api” â—‹ https://review.openstack.org/#/c/523187/ â—Ź Keystone needs to scale accordingly Enable placement per cell Keystone - number of requests when enabling placement
  • 15. Upgrade to Ocata â—Ź Data migrations â—‹ flavors, keypairs, aggregates moved to nova_api (Top cell DB) â—‹ migrate_instance_keypairs required to run in cells DBs â–  However keypairs only exist in Top cell DB â–  https://bugs.launchpad.net/nova/+bug/1761197 â–  Migration tool that populates cells “instance_extra” table from “nova_api” DB â—‹ No data migrations required in cells DBs â—‹ “db sync” in child cells fails because there are flavors not moved to nova_api (local) â—Ź DB schema â—‹ migration 346 can take a lot of time (remove 'schedule_at' column from instances table) â–  consider archive and then truncate shadow tables â—‹ “api_db sync” fails if cells not defined even if running cellsV1
  • 16. Upgrade to Ocata â—Ź Add cells mapping in all “nova_api” DBs â—‹ cell0 (will not be used) and Top cell â—‹ Other cells mapping are not required â—Ź “use_local” removed in Ocata â—‹ Changed nova-network to continue to support it! â—Ź Inventory min_unit, max_unit and step_size constraints are enforced in Ocata â—‹ https://bugs.launchpad.net/nova/+bug/1638681 â—‹ Problematic if not all compute nodes are upgraded to Ocata
  • 17. Consolidate Placement compute node nova-compute child cell controller nova-api nova-conductor nova-scheduler nova-network nova-cells RabbitMQ top cell controller nova-novncproxy nova-consoleauth nova-cellsnova DB API nodes nova-api nova DB RabbitMQ cluster nova_api DB top cell controller placement-api nova_api DB placement-api
  • 18. Consolidate Placement â—Ź Change endpoints to “central” placement â—‹ “placement_region” and “nova_api” â—‹ Applied per cell (few cells per day) â–  Need to learning how to scale placement-api â—‹ Scheduling time expected to go up
  • 19. Consolidate Placement â—Ź Local placement disabled in all cells â—‹ Moved last 15 cells to “central” placement â—‹ Scheduler time increased â—‹ Placement request time also increased
  • 20. Consolidate Placement â—Ź Fell apart during the night... â—Ź Memcached reached the “max_connections” â—‹ Increased “max_connections” â—‹ Increased the number of “placement-api” servers
  • 21. Consolidate Placement â—Ź Moved 70 local Placements to the central Placement â—‹ Didn’t copied the data from the local nova_api DBs â—‹ resource_providers, inventory and allocations are recreated â—Ź Running on apache WSGI â—Ź 10 servers (VMs 4 vcpus/8 GiB) â—‹ 4 processes/20 threads â—‹ Increased number of connections on “nova_api” DB â—Ź ~1000 compute nodes per placement-api server â—Ź memcached cluster for keystone auth_token
  • 22. Scheduling time after Consolidate Placement â—Ź Scheduling time in few cells was better than expected â—Ź Ocata scheduler only uses Placement after all compute nodes are upgraded if service_version < 16: LOG.debug("Skipping call to placement, as upgrade in progress.")
  • 23. CellsV2 in Queens â—Ź Advantages â—‹ Finally using the “loved” code â—‹ Can remove all internal cellsV1 patches â—Ź Concerns â—‹ Is someone else running cellsV2 with more than one cell? â—‹ Scheduling limitations â—‹ Availability/Resilience issues
  • 24. Scheduling â—Ź How to dedicate cells to projects? â—‹ No cell_filters equivalent in cellsV2 â—Ź Scheduler is global â—‹ Scheduler doesn't know about cells â—‹ Placement doesn’t know about cells â—‹ Scheduler needs to receive all available allocation candidates from placement â–  https://review.openstack.org/#/c/531517/ (scheduler/max_placement_results) â—‹ Availability zone selection is a scheduler filter â—Ź Can’t enable/disable scheduler filters per cell â—Ź Can’t enable/disable a cell â—‹ https://review.openstack.org/#/c/546684/
  • 25. Scheduling â—Ź Placement request-filter â—‹ https://review.openstack.org/#/c/544585/ â—Ź Initial work already done for Rocky â—Ź CERN backported it for Queens â—Ź Created our own filters â—‹ AVZ support â—‹ project-cell mapping â—‹ flavor-cell mapping â—Ź Few commits you may want to consider to backport to Queens â—‹ https://review.openstack.org/#/q/project:openstack/nova+branch:master+topic:bp/placement-req-filter
  • 26. Scheduling â—Ź Placement request-filter uses aggregates â—‹ Create an aggregate per cell â—‹ Add hosts to the aggregates â—‹ Add the aggregate metadata for the request-filter â—‹ Placement aggregates are created and resource providers mapped â–  Mirror host aggregates to placement: https://review.openstack.org/#/c/545057/ â—Ź Difficult to manage in large deployments â—‹ “Forgotten” nodes will not receive instances â—‹ Mistakes can lead to wrong scheduling â—‹ Deleting a cell doesn’t delete resource_providers, resource_provider_aggregates, aggregate_hosts â–  https://bugs.launchpad.net/nova/+bug/1749734
  • 27. Availability â—Ź If a cell/DB is down all cloud is affected â—‹ Can’t list instances â—‹ Can’t create instances â—‹ … â—Ź Looking back we only had few issues with DBs â—‹ Felt confident to move to CellsV2 â—Ź Upstream discussion on how to fix/improve the availability problem â—‹ https://review.openstack.org/#/c/557369/
  • 28. Upgrade to Queens â—Ź “Shutdown the cloud” â—Ź Steps we followed for the upgrade â—‹ Upgrade packages â—‹ Data migrations / DB schema â–  Pike/Queens data migrations â—Ź Quotas, service UUIDs, block_device UUIDs, migrations UUIDs â–  Top cell DB will be removed â—‹ Create cells in nova_api DB â—‹ Delete current instance_mappings â—‹ Recreate instance_mappings per cell â—‹ Discover hosts
  • 29. Upgrade to Queens â—‹ Create aggregates per cell and populate aggregate_hosts, aggregate_metadata â—‹ Create placement aggregates and populate resource_provider_aggregates â—‹ Setup AVZs â—‹ Enable nova-scheduler and nova-conductor services in the top control plane â—‹ Remove nova-cells service from parent and child cells â—‹ Remove nova-scheduler from child cells controllers â—‹ Upgrade compute nodes â—Ź Start the cloud
  • 30. After Queens upgrade CPU load in nova-api servers CPU load in placement-api servers
  • 31. Placement - number of requests 90
  • 32. What changed in Placement? â—Ź Refresh aggregates, traits and aggregate-associated sharing providers â—‹ ASSOCIATION_REFRESH = 5m â—‹ Made the option configurable: â–  Master: https://review.openstack.org/#/c/565526/ â–  Backported to Queens: https://review.openstack.org/#/c/566288/ â—‹ Set it to a very large value â—Ź However it still runs when nova-compute restarts â—‹ Problematic with Ironic â—‹ At the end we removed this code path
  • 33. Placement - number of requests 90
  • 34. Placement â—Ź Doubled the number of placement-api nodes â—‹ ~500 compute nodes per placement-api server â—Ź In average request time < 100ms
  • 36. Database load pattern â—Ź Number of queries in Cell DBs more than double after the upgrade â—‹ APIs only available to few users â—Ź Connection rate increased â—‹ Clients could not connect. API calls failed â—‹ Reviewed DB configuration. Related with ulimits of mysql processes
  • 37. nova list / nova boot â—Ź To list instances the request goes to all cells DBs â—‹ Problematic if a group of DBs is slow or has connection issues â—‹ Fails if a DB is down â—Ź DBs for Wigner data centre cells are located in Wigner â—‹ API servers are located in Geneva â—Ź To minimize the impact deployed few patches â—‹ Nova list only queries the cells DBs where the project has instances â–  https://review.openstack.org/#/c/509003 â—‹ Quota calculation only queries the cells DBs where the project has instances â–  https://bugs.launchpad.net/nova/+bug/1771810
  • 38. Minor issues â—Ź Availability zones in api-metadata â—‹ https://bugs.launchpad.net/nova/+bug/1768876 â—Ź nova-compute (ironic) creates new resource provider when failover â—‹ resource_provider_aggregate lost â—‹ https://bugs.launchpad.net/nova/+bug/1771806 â—Ź Scheduler host_manager gathering info â—‹ Makes it parallel. Ignore cells down: https://review.openstack.org/#/c/539617/ â—Ź Service list â—‹ Not parallel. Fails if a cell is down: https://bugs.launchpad.net/nova/+bug/1726310 â—Ź nova-network doesn’t start when using cellsV2
  • 40. CellsV2 architecture at CERN (Queens) compute node nova-compute child cell controller nova-api nova-conductor nova-network RabbitMQ top cell controller nova-scheduler nova-conductor nova-placementnova DB API nodes nova-api nova DB RabbitMQ cluster nova_api DB x20 x16 x200 x1 x70 nova-novncproxy nova-consoleauth
  • 41. Summary CERN cloud is running Nova Queens with CellsV2 â—Ź Moving from CellsV1 is not a trivial upgrade â—Ź CellsV2 works at scale â—Ź Availability/Resilience issues Thanks to all Nova Team!