OVHcloud has used Ceph for five years for some of its storage needs. Although this infrastructure comprises 2,000 physical servers and 20,000 containers, it is run day to day by a single person. We present the various means used to achieve this and share the lessons learned.
1. 1 sysadmin vs 250 clusters
Etienne Menguy
SysadminDays
November 19, 2019
2. OVHcloud
Date
Footer can be personalized as follow: Insert / Header and footer
1 500 000 customers
2200 employees
380 000 Bare-metal servers
3. Ceph at OVHcloud
[Diagram: Ceph usage at OVHcloud]
Public Cloud
• Virtual machines
• Additional disks
Cloud Disk Array
• As a service
4. Evolution
▪ 2015
• 4 dev
• 1 ops
• 8 clusters
• 4 regions
▪ 2019
• 9 dev
• 250 clusters
• 10 regions
5. Daily work
▪ 1 sysadmin
• Monitoring
• Prodding
• Support
• Training
• Deploying regions, servers
• And the daily surprises
8 devs
• Ceph as a service
• Infra as code
• Code review
• Tests
• R&D
6. Ceph setup
[Diagram: one bare-metal server]
• 40 Gbps NIC
• NVMe, split into partitions (x12)
• HDD (x12)
• Flashcache in front of each HDD
• One LXC container per Flashcache device, holding the data
7. Ceph as a service
▪ Autonomous users
• Creating cluster
• Managing users, pools, rights
• Managing network
• Cluster growth
▪ Backup management
• 500TB/day
• Ceph -> Swift
• Ceph -> Ceph
▪ Managing our infrastructure
• Cluster upgrade
• Deploy new ceph versions
• Manage tasks
• Host management
• Network management
• Containers management
8. Infrastructure
[Diagram: infrastructure components]
Servers
Containers
VMs
Instances
Database
Puppet
Python API
OVH API
RabbitMQ
Celery
9. Task management
▪ RabbitMQ
▪ Celery
• https://github.com/ovh/celery-dyrygent
• Complex workflow
• Reliable
• Monitoring
• Web interface
• Planned tasks
• NVME replacement
• Self healing
• Triggered by monitoring probe
• Executes any operation
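The self-healing bullets above (a monitoring probe triggers a task) can be illustrated with a small dispatch table. This is only a sketch under assumptions: the probe names, the `heals` decorator, and the handler functions are hypothetical, and in a Celery-based setup each handler would presumably enqueue a task rather than return a string.

```python
from typing import Callable, Dict

# Hypothetical registry: monitoring probe name -> remediation function.
HANDLERS: Dict[str, Callable[[str], str]] = {}


def heals(probe: str):
    """Register a function as the automatic remediation for a probe."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        HANDLERS[probe] = func
        return func
    return decorator


@heals("disk_failed")
def replace_disk(host: str) -> str:
    # Assumption: a real handler would enqueue a disk replacement workflow.
    return f"scheduled disk replacement on {host}"


@heals("container_down")
def restart_container(host: str) -> str:
    # Assumption: a real handler would enqueue a container restart task.
    return f"scheduled container restart on {host}"


def on_alert(probe: str, host: str) -> str:
    """Run the registered remediation, or fall back to a human."""
    handler = HANDLERS.get(probe)
    if handler is None:
        return f"no automatic action for {probe} on {host}, paging the sysadmin"
    return handler(host)
```

Keeping the mapping in one registry makes it easy to see which alerts are handled automatically and which still reach the single sysadmin.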
10. Example
[Flowchart: disk removal]
start → Check operation safety → Lower disk weight → Wait cluster_health_ok → Weight equals 0?
• No: lower disk weight again
• Yes: Remove disk from cluster
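The disk removal flow on this slide can be sketched in Python as follows. This is a minimal illustration, not the actual OVHcloud task: every cluster interaction is passed in as a hypothetical callable, the `step` size is an assumption, and the "No" branch is assumed to loop back to lowering the weight.

```python
from typing import Callable


def drain_and_remove_disk(
    get_weight: Callable[[], float],
    set_weight: Callable[[float], None],
    is_safe: Callable[[], bool],
    wait_health_ok: Callable[[], None],
    remove_disk: Callable[[], None],
    step: float = 0.25,  # assumed decrement per iteration
) -> int:
    """Lower a disk's weight step by step, then remove it.

    Returns the number of lowering steps performed.
    """
    # Check operation safety before touching anything.
    if not is_safe():
        raise RuntimeError("operation is not safe, aborting")

    steps = 0
    while True:
        # Lower disk weight, never going below zero.
        set_weight(max(0.0, get_weight() - step))
        # Wait until the cluster reports a healthy state again.
        wait_health_ok()
        steps += 1
        # Weight equals 0? Yes: leave the loop. No: lower again.
        if get_weight() == 0:
            break

    remove_disk()
    return steps
```

Lowering the weight gradually and waiting for a healthy cluster between steps spreads the data movement over time instead of triggering one large rebalance.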
12. Infra as code
13. Inconsistent hardware
▪ Hardware profile
• 12 profiles in production
• CPU
• NVMe
• HDD
▪ Firmware versions
▪ Ceph versions
• Generic tools
• 1 profile = 1 cluster
14. Monitoring
▪ Downtimes set automatically by tasks
▪ Some alarms raised only during working hours
▪ Services/hosts aggregation
▪ 143 000 services
▪ 25 000 hosts
▪ 3 infrastructures
• 6 masters
• 12 satellites
15. Metrics
▪ Cluster metrics
• Usage
• Latency
▪ Hardware
• CPU, memory usage
• Cache hit ratio
▪ Service
• KPI
• Usage per OpenStack region
▪ Metrics Data Platform
• https://www.ovh.com/fr/data-platforms/metrics/
▪ 13 million series
▪ 13 billion points per day
▪ Performance
• IO/s
• Latency
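To put the volume figures on this slide in perspective, the short calculation below converts them into per-second and per-series rates. It assumes points are spread evenly over the day and across series, which the slides do not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60

series = 13_000_000              # 13 million series (from the slide)
points_per_day = 13_000_000_000  # 13 billion points per day (from the slide)

# Average ingestion rate, assuming an even spread over the day.
points_per_second = points_per_day / SECONDS_PER_DAY

# Average number of points per series per day, assuming an even spread.
points_per_series_per_day = points_per_day / series

# Average interval between two points of the same series.
seconds_between_points = SECONDS_PER_DAY / points_per_series_per_day

print(f"{points_per_second:,.0f} points/s")
print(f"{points_per_series_per_day:,.0f} points per series per day")
print(f"one point every {seconds_between_points:.1f} s per series")
```

Under these assumptions the platform ingests roughly 150,000 points per second, with each series receiving a point about every minute and a half.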
16. Logs
▪ Infrastructure
• OS
• Ceph
▪ Applications
• CAAS
• Celery / RabbitMQ
• Unique step/task ID
▪ API
▪ Logs Data Platform
• https://www.ovh.com/fr/data-platforms/logs/
▪ 15 000 logs/second
▪ Graylog
▪ Filebeat
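The "unique step/task ID" idea from this slide can be illustrated with Python's standard `logging` module: every line emitted by a task carries the same identifier, so all lines of one task can be retrieved with a single search. The logger name `caas.task`, the log format, and the messages are assumptions made for this sketch.

```python
import io
import logging
import uuid


def make_task_logger(task_id: str, stream) -> logging.LoggerAdapter:
    """Return a logger that stamps every record with the given task ID."""
    # Hypothetical logger name; one logger per task keeps handlers separate.
    logger = logging.getLogger(f"caas.task.{task_id}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s task_id=%(task_id)s %(message)s")
    )
    logger.addHandler(handler)
    # LoggerAdapter injects the "extra" dict into every record it emits.
    return logging.LoggerAdapter(logger, {"task_id": task_id})


stream = io.StringIO()
task_id = str(uuid.uuid4())
log = make_task_logger(task_id, stream)
log.info("lowering disk weight")
log.info("waiting for cluster health")
print(stream.getvalue(), end="")
```

In a log platform such as Graylog, searching for one `task_id` value would then return every line produced by that task, whichever host wrote it.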
17. Conclusion
18. Questions?