Object storage optimization in Swift
Alexandre LECUYER
DevOps / irc: alecuyer
Romain LE DISEZ
DevOps / irc: rledisez
What’s the problem?
• Performance is bad
• Disks 100% busy
• Replication/reconstruction is very (very) slow
Replica in Swift
/srv/node/<device>/objects/<partition>/<suffix>/<hash>/<timestamp>.data
[Diagram: the object "012345" is stored in full on each replica]
Erasure Coding in Swift
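/srv/node/<device>/objects-1/<partition>/<suffix>/<hash>/<timestamp>#<fragment>#d.data
[Diagram: the object "012345" is split into the fragments "03", "14" and "25", plus an extra fragment "a9" (presumably parity)]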
Comparison
• Replica:
– Good performance
– High storage overhead
– 3 files per object (3 replicas)
• Erasure coding:
– Cost effective
– Slow-ish
– 15 files per object (12+3 fragments)
Where inodes join the party…
• XFS:
– one inode per file
– one inode per directory
• Inode:
– ctime/mtime/atime
– owner/group
– Permissions
Bad things happen
• One inode takes 300 bytes to 1 KB of memory
• Average: 2.4 inodes per fragment
– Data file: 1
– Object directory: 1
– Suffix directory + partition directory: 0.4
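As a rough illustration of why this hurts, the figures above can be multiplied out for the 12+3 erasure coding policy mentioned earlier. This is only back-of-the-envelope arithmetic; the total object count is a hypothetical value chosen to make the scale visible, not a number from the talk.

```python
# Figures taken from the slides.
INODES_PER_FRAGMENT = 2.4      # data file + object dir + share of suffix/partition dirs
INODE_BYTES_MIN = 300          # low end of the in-memory inode size
INODE_BYTES_MAX = 1024         # high end ("1k")
FRAGMENTS_PER_OBJECT = 12 + 3  # erasure coding policy from the comparison slide


def inode_memory_per_object(fragments=FRAGMENTS_PER_OBJECT):
    """Return (min_bytes, max_bytes) of inode memory per object, summed over all disks."""
    inodes = fragments * INODES_PER_FRAGMENT
    return inodes * INODE_BYTES_MIN, inodes * INODE_BYTES_MAX


low, high = inode_memory_per_object()

# HYPOTHETICAL object count, used only for illustration.
objects = 1_000_000_000
print(f"per object: {low:.0f} to {high:.0f} bytes of inodes")
print(f"for {objects:,} objects: "
      f"{low * objects / 2**40:.1f} to {high * objects / 2**40:.1f} TiB")
```

The total is spread over every storage node, but it still has to fit in page cache somewhere for lookups to avoid disk reads.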
Memory issues
• Inodes cannot fit in cache anymore
– But every inode of the path must be checked to open a data file
• Only top-level directories are cached
– Only a 20% hit rate on the inode cache
– Up to 50% of device activity is spent reading inodes
Stability issues
• More filesystem corruption
• Inability to run xfs_repair
– 1 KB of memory per inode
• Need dedicated servers just to repair filesystems
– About 48 hours to repair one filesystem
Let’s fix it!
(a.k.a. inodes are useless, right?)
We tried crazy things
• Storing objects in a K/V (RocksDB, LevelDB, …)
– Not suited to synchronous IO. Write amplification.
• Storing the file handles of data files in a K/V
– Requires atomicity across two separate data structures
• Patching XFS to drop useless information
– It’s already well optimized, inodes may be compressed
• Storing in ZFS DMU
– Lots of very cool features, but performance issues if full, low-level development
Store multiple objects in large files
[Diagram: a volume file laid out as a Volume Header followed by repeated Object Header / Object Data pairs]
Each volume (a Volume Header followed by Object Header / Object Data pairs) is:
• Dedicated to a partition
• No concurrent writes
• Append only
Swift request path
[Diagram: PUT / GET requests flow from the proxy servers to the object servers]
How does Swift organize data?
• PUT: "photo.jpg" -> MD5 hash: bc6a624f493bf3042662064285f355c4
• Partition: bc6a -> 48234
• Suffix: 5c4
• Timestamp: 1449519086.42102.data
• /srv/node/sda/objects/48234/5c4/bc6a624f493bf3042662064285f355c4/1449519086.42102.data
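The mapping above can be reproduced with a few lines of Python. This is a minimal sketch that starts from the hash shown on the slide; real Swift hashes the full account/container/object path together with a cluster-specific prefix and suffix, which is not modelled here. The partition power of 16 is an assumption, picked because it makes "bc6a" map to 48234 as on the slide.

```python
PART_POWER = 16  # ASSUMPTION: chosen so that the hash prefix "bc6a" maps to 48234


def object_path(obj_hash, timestamp, device="sda", part_power=PART_POWER):
    """Build the on-disk path of a replicated object from its hash and timestamp."""
    # Partition: the top `part_power` bits of the first 32 bits of the hash.
    partition = int(obj_hash[:8], 16) >> (32 - part_power)
    # Suffix: the last three hex characters of the hash.
    suffix = obj_hash[-3:]
    return f"/srv/node/{device}/objects/{partition}/{suffix}/{obj_hash}/{timestamp}.data"


path = object_path("bc6a624f493bf3042662064285f355c4", "1449519086.42102")
print(path)
```

Each path component (partition, suffix, hash directory, data file) is a separate inode that has to be resolved before the data file can be opened.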
Example: writing an object (PUT)
[Diagram: proxy server, object server, index server, and three volumes]
• Obtain a write lock on a volume (fcntl)
• Write the object at the end of the volume
• Register the object with the index server
Example: reading an object (GET)
[Diagram: proxy server, object server, index server, and three volumes]
• Get the object location from the index server
• Open the volume
• Read the object at the given offset
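The write and read paths above can be sketched as a round trip. This is not the authors' code: the 8-byte object header, the `put`/`get` helpers, and the plain dict standing in for the index server are all hypothetical, and the sketch ignores volume preallocation. It assumes a POSIX system where `fcntl` and `os.fdatasync` are available.

```python
import fcntl
import os
import struct
import tempfile

# HYPOTHETICAL object header: payload length only (8 bytes, little-endian).
HEADER = struct.Struct("<Q")


def append_object(volume_path, payload):
    """Append one object to a volume and return the offset of its header."""
    fd = os.open(volume_path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        # Exclusive, non-blocking fcntl lock: fail fast if another process writes here.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        offset = os.lseek(fd, 0, os.SEEK_END)
        os.write(fd, HEADER.pack(len(payload)) + payload)
        os.fdatasync(fd)  # the volume write is the synchronous one
        return offset
    finally:
        os.close(fd)  # closing the descriptor also releases the lock


def put(index, volume_path, name, payload):
    """Write the object, then register its location (dict stands in for the index server)."""
    index[name] = (volume_path, append_object(volume_path, payload))


def get(index, name):
    """Look up the location, open the volume, and read at the given offset."""
    volume_path, offset = index[name]
    with open(volume_path, "rb") as volume:
        volume.seek(offset)
        (length,) = HEADER.unpack(volume.read(HEADER.size))
        return volume.read(length)


with tempfile.TemporaryDirectory() as tmp:
    volume_file = os.path.join(tmp, "volume-0001")
    index = {}
    put(index, volume_file, "photo.jpg", b"JPEG bytes")
    put(index, volume_file, "notes.txt", b"hello")
    first = get(index, "photo.jpg")
    second = get(index, "notes.txt")
    second_offset = index["notes.txt"][1]
```

A GET never touches a per-object inode: one index lookup, one open of an already-hot volume file, one positioned read.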
Index server
• Stores data in a key/value store: LevelDB
• Communicates over gRPC
• Key: hash + filename
• Value: volume index + offset
• Keys are sorted on disk for efficient seeks
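One possible encoding of such an entry is sketched below. The field widths (a 4-byte volume index and an 8-byte offset, little-endian) are assumptions made for illustration; the slides do not give the actual layout used by the index server.

```python
import struct

# ASSUMPTION: 4-byte volume index + 8-byte offset, little-endian, no padding.
VALUE_FORMAT = struct.Struct("<IQ")


def make_key(obj_hash, filename):
    """Key: object hash immediately followed by the file name."""
    return (obj_hash + filename).encode("ascii")


def make_value(volume_index, offset):
    """Value: which volume holds the object, and where it starts."""
    return VALUE_FORMAT.pack(volume_index, offset)


def parse_value(value):
    """Return (volume_index, offset) from an encoded value."""
    return VALUE_FORMAT.unpack(value)


key = make_key("bc6a624f493bf3042662064285f355c4", "1449519086.42102.data")
value = make_value(volume_index=7, offset=4096)
print(key, parse_value(value))
```

Because the hash comes first in the key, entries for the same partition and the same object sort next to each other.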
Index server – keys example
• ……
• bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data
• bc6a624f493bf3042662064285f355c41449519086.42102.data
• bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data
• ……
What about directories?
• bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data
• bc6a624f493bf3042662064285f355c41449519086.42102.data
• bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data
[Diagram: the directory tree implied by the keys]
48234/
  9e3/
    bc6a46b.../1475194591.74265.data
  5c4/
    bc6a624.../1449519086.42102.data
48235/
  7d1/
    bc6b78b…/1415965115.56792.data
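Since the keys are sorted, a directory listing can be emulated with a range scan over a key prefix. The sketch below uses a sorted Python list and `bisect` as a stand-in for an ordered key/value store; the `list_partition` helper and the partition power of 16 are assumptions for illustration, not the authors' implementation.

```python
import bisect

PART_POWER = 16  # ASSUMPTION: matches "bc6a" -> 48234 from the earlier slide

KEYS = sorted([
    "bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data",
    "bc6a624f493bf3042662064285f355c41449519086.42102.data",
    "bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data",
])


def list_partition(keys, partition, part_power=PART_POWER):
    """Return {suffix: [(hash, filename), ...]} for one partition via a prefix scan."""
    # The sketch assumes part_power is a multiple of 4, so the partition
    # corresponds to a whole number of leading hex digits.
    prefix = format(partition, f"0{part_power // 4}x")
    start = bisect.bisect_left(keys, prefix)  # seek to the first matching key
    tree = {}
    for key in keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order: no later key can match
        obj_hash, filename = key[:32], key[32:]
        tree.setdefault(obj_hash[-3:], []).append((obj_hash, filename))
    return tree


print(list_partition(KEYS, 48234))
print(list_partition(KEYS, 48235))
```

No directory inodes are involved: the partition and suffix levels are derived from the hash inside each key.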
Deletion - Hole punching
[Figure: sparse file illustration, https://en.wikipedia.org/wiki/Sparse_file#/media/File:Sparse_file_(en).svg]
Deletion
• Hole-punching with fallocate()
• Reclaim space without changing the file size!
[Diagram: volume layout (Volume Header, Object Header / Object Data pairs) with a deleted object's range marked "Space reclaimed by the filesystem"]
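Python's standard library does not expose the `mode` argument of `fallocate()`, so the sketch below calls it through `ctypes`. It assumes 64-bit Linux (where `off_t` is 64 bits) and a filesystem that supports hole punching; the `punch_hole` helper is hypothetical and returns False instead of failing when the operation is unsupported.

```python
import ctypes
import ctypes.util
import errno
import os
import tempfile

# Constants from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
# ASSUMPTION: 64-bit Linux, so off_t is passed as a 64-bit integer.
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def punch_hole(fd, offset, length):
    """Deallocate [offset, offset + length) in place; return False if unsupported."""
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    if _libc.fallocate(fd, mode, offset, length) == 0:
        return True
    err = ctypes.get_errno()
    if err in (errno.EOPNOTSUPP, errno.ENOSYS):
        return False
    raise OSError(err, os.strerror(err))


with tempfile.TemporaryFile() as volume:
    # Three 4 KiB "objects"; the middle one is deleted.
    volume.write(b"A" * 4096 + b"B" * 4096 + b"C" * 4096)
    volume.flush()
    punched = punch_hole(volume.fileno(), 4096, 4096)
    size_after = os.fstat(volume.fileno()).st_size
    volume.seek(0)
    contents = volume.read()
```

The offsets of the surviving objects are untouched, so index entries pointing into the volume remain valid after a deletion.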
Implementation overview
• Swift code, patched: diskfile.py
• vfile.py module
• gRPC
• Index server, with LevelDB as the backing key-value store
vfile.py
• Provides a file-like interface
• f = vfile.open("/path/to/file")
• f.read()
• vfile.listdir("/srv/node/<disk>/<partition>/")
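To give an idea of what a file-like interface over a volume might look like, here is a minimal read-only sketch. The `VolumeSlice` class is hypothetical and is not the actual vfile.py; it simply exposes one object's byte range inside an already-open volume as a standard Python stream.

```python
import io


class VolumeSlice(io.RawIOBase):
    """HYPOTHETICAL read-only, file-like view of one object inside a volume."""

    def __init__(self, volume, offset, length):
        super().__init__()
        self._volume = volume   # an open binary file object for the volume
        self._start = offset    # where the object's data begins
        self._length = length   # size of the object's data
        self._pos = 0           # read position relative to the object

    def readable(self):
        return True

    def readinto(self, buf):
        remaining = self._length - self._pos
        if remaining <= 0:
            return 0  # end of the object, even if the volume continues
        self._volume.seek(self._start + self._pos)
        chunk = self._volume.read(min(len(buf), remaining))
        buf[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


# An in-memory "volume" with some bytes before and after the object.
volume = io.BytesIO(b"HEADERhello worldTRAILER")
whole = VolumeSlice(volume, offset=6, length=11).read()

partial = VolumeSlice(volume, offset=6, length=11)
head = partial.read(5)
tail = partial.read()
```

Callers that expect an ordinary file object can use `read()` without knowing the data lives inside a larger file.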
Managing fragmentation
Dedicated volumes for short-lived files
[Diagram: two groups of volumes, one holding ".data" files and one holding ".ts" files]
Write performance
• We cannot afford two synchronous writes
• The large file write is synchronous (fdatasync)
• The large file is preallocated
• K/V writes are asynchronous
Recovery
• Scan the volumes backwards
• Add missing information to the key/value store
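A backward scan needs some way to find record boundaries from the end of the file. The sketch below assumes a hypothetical record format with a length trailer, and it stops at the first record that is already indexed, on the assumption that only the most recent writes can be missing. Neither the format nor the stopping rule comes from the slides.

```python
import struct

# HYPOTHETICAL record format: header, key, data, then a length trailer.
HEAD = struct.Struct("<HI")  # key length, data length
TAIL = struct.Struct("<I")   # total record length, enabling a backward walk


def encode_record(key, data):
    """Serialize one record, including the trailer used for backward scans."""
    total = HEAD.size + len(key) + len(data) + TAIL.size
    return HEAD.pack(len(key), len(data)) + key + data + TAIL.pack(total)


def recover(volume, index):
    """Walk `volume` (bytes) from its end and add records missing from `index`.

    Stops at the first record that is already indexed. Returns the number
    of entries added; `index` maps key -> record offset.
    """
    end, added = len(volume), 0
    while end > 0:
        (total,) = TAIL.unpack_from(volume, end - TAIL.size)
        start = end - total
        key_len, _data_len = HEAD.unpack_from(volume, start)
        key = volume[start + HEAD.size:start + HEAD.size + key_len]
        if key in index:
            break  # everything before this point is assumed registered
        index[key] = start
        added += 1
        end = start
    return added


volume = (encode_record(b"k1", b"aaa")
          + encode_record(b"k2", b"bb")
          + encode_record(b"k3", b"c"))
index = {b"k1": 0}            # only the first write reached the key/value store
added = recover(volume, index)
```

Scanning from the end keeps recovery cost proportional to the number of unregistered writes rather than to the size of the volume.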
How does it perform?
• K/V footprint: 42 bytes per object
• Latency: slightly worse when empty, much better when full
• REPLICATE: served from memory
• Saved space
• Room for improvement
Benchmarks
• PUT single thread
– XFS: 17/s
– Volumes: 40/s
• PUT 20 threads
– XFS: 4.7s (99%)
– Volumes: 615ms (99%)
• GET
– XFS: 39/s
– Volumes: 93/s
What’s next
• Upstream
• Store short-lived objects in dedicated volumes
• Replication of volumes
• Choose replica/erasure-coding on the fly
Credits
• Haystack (Facebook project)
• OpenStack Swift community
Thank you
Metadata storage
• (extra slide if time)
• Previously stored as extended attributes
• Now serialized with protobuf and stored in the volume

Openstack Swift - Lots of small files

  • 1. Object storage optimization in Swift Alexandre LECUYER DevOps / irc: alecuyer Romain LE DISEZ DevOps / irc: rledisez
  • 2. What’s the problem? • Performance is bad • Disks 100% busy • Replication/reconstruction is very (very) slow 2
  • 4. Erasure Coding in Swift 4 012345 03 14 25 /srv/node/<device>/objects-1/<partition>/<suffix>/<hash>/<timestamp>#<fragment>#d.data a9
  • 5. Comparison • Replica: – Performance – Overhead – 3 files per object (3 replicas) • Erasure coding – Cost effective – Slow-ish – 15 files per object (12+3 fragments) 5
  • 6. Where inodes join the party… • XFS: – one inode per file – one inode per directory • Inode: – ctime/mtime/atime – owner/group – Permissions 6
  • 7. Bad things happen • One inode takes 300 bytes to 1k of memory • Average: 2.4 inodes per fragment – Data file: 1 – Object directory: 1 – Suffix directory + Partition directory: 0.4 7
  • 8. Memory issues • Inodes cannot fit in cache anymore – But every inode of the path must be checked to open a data file • Only top level directories are cached – Only 20% of hit on inode cache – Up to 50% of devices activity to read inodes 8
  • 9. Stability issues • More filesystem corruptions • Inability to run xfs_repair – 1K of memory per inode • Need a dedicated servers just to repair filesystems – About 48 hours to repair one filesystem 9
  • 10. Let’s fix it! (a.k.a. inodes are useless, right?) 10
  • 11. We tried crazy things • Storing objects in a K/V (RocksDB, LevelDB, …) – Not suited to synchronous IO. Write amplification. • Storing in a K/V the file handle of datafiles – Atomicity on two separate data structures • Patching XFS to drop useless information – It’s already well optimized, inodes may be compressed • Storing in ZFS DMU – Lots of very cool features, but performance issues if full, low level development 11
  • 12. 12 Object Header Volume Header Object Data Object Header Object Data Store multiple objects in large files
  • 13. 13 Object Header Volume Header Object Data Object Header Object Data Dedicated to a partition No concurrent writes Append only
  • 14. Swift request path 14 Proxy server Proxy server Object server Object server Object server PUT / GET requests
  • 15. How does Swift organize data ? • PUT: « photo.jpg » -> MD5 hash: bc6a624f493bf3042662064285f355c4 • Partition : bc6a -> 48234 • Suffix : 5c4 • Timestamp : 1449519086.42102.data • /srv/node/sda/objects/48234/5c4/bc6a624f493b f3042662064285f355c4/1449519086.42102.data 15
  • 16. Example : writing an object 16 Proxy server Object server Index server Volume Volume Volume Obtain a write lock on a volume (fcntl) Write the object at the end of the volume Register the objectPUT
  • 17. Example : reading an object 17 Proxy server Object server Index server Volume Volume Volume Open the volume Read the object at the given offset Get object locationGET
  • 18. Index server • Stores data in a key/value store: LevelDB • Communication with gRPC • Key: hash + filename • Value: volume index + offset • Keys are sorted on-disk for efficient seeks
  • 19. Index server – keys example • …… • bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data • bc6a624f493bf3042662064285f355c41449519086.42102.data • bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data • ……
  • 20. What about directories? • bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data • bc6a624f493bf3042662064285f355c41449519086.42102.data • bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data (diagram: the directory tree derived from these keys: 48234/9e3/bc6a46b.../1475194591.74265.data, 48234/5c4/bc6a624.../1449519086.42102.data, 48235/7d1/bc6b78b…/1415965115.56792.data)
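
Because the keys are sorted, a directory listing can be emulated with a range scan: seek to the first key of the partition and read until the first key of the next one. The sketch below shows the idea on the three example keys, with a sorted Python list and `bisect` standing in for LevelDB. The partition power of 16 and the helper names are assumptions made for the illustration.

```python
import bisect

PART_POWER = 16  # assumption, consistent with partition 48234 == 0xbc6a

KEYS = sorted([
    "bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data",
    "bc6a624f493bf3042662064285f355c41449519086.42102.data",
    "bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data",
])


def keys_in_partition(keys, partition, part_power=PART_POWER):
    """Range scan: every key whose hash falls in `partition`."""
    shift = 32 - part_power
    # Smallest 8-hex-digit prefix belonging to the partition.
    start = bisect.bisect_left(keys, "%08x" % (partition << shift))
    if partition + 1 < 2 ** part_power:
        # First prefix of the next partition marks the end of the scan.
        end = bisect.bisect_left(keys, "%08x" % ((partition + 1) << shift))
    else:
        end = len(keys)
    return keys[start:end]


def list_suffixes(keys, partition):
    """Emulate listdir() on a partition: the suffix is the last 3 hex chars of the 32-char hash."""
    return sorted({key[29:32] for key in keys_in_partition(keys, partition)})


print(list_suffixes(KEYS, 48234))
print(list_suffixes(KEYS, 48235))
```

No directory inode is touched; the cost moves to a scan over the keys of the partition, which is the CPU-for-memory trade mentioned in the speaker notes.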
  • 21. Deletion - Hole punching (image: https://en.wikipedia.org/wiki/Sparse_file#/media/File:Sparse_file_(en).svg)
  • 22. Deletion • Hole-punching with fallocate() • Reclaim space without changing the file size! (diagram: volume layout with a region labeled "Space reclaimed by the filesystem")
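
Python's standard library has no wrapper for hole punching, so the sketch below calls the Linux `fallocate()` system call through `ctypes`. It assumes 64-bit Linux and a filesystem that supports `FALLOC_FL_PUNCH_HOLE` (XFS does); the helper name `punch_hole` is made up, and this is not the code used in the talk.

```python
import ctypes
import os

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

# Assumption: 64-bit Linux, where off_t is 64 bits wide.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def punch_hole(fd, offset, length):
    """Deallocate [offset, offset + length) of the file; its size does not change."""
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    if _libc.fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

After a successful call, reads of the punched range return zeros and the filesystem can reuse the blocks, while the offsets of all other objects in the volume stay valid.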
  • 23. Implementation overview • Swift code, patched: diskfile.py • vfile.py module • gRPC • Index server, with LevelDB as the backing key-value store
  • 24. vfile.py • Provides a file-like interface • f = vfile.open("/path/to/file") • f.read() • vfile.listdir("/srv/node/<disk>/<partition>/")
  • 25. Managing fragmentation • Dedicated volumes for short-lived files (diagram: separate groups of volumes for ".data" files and ".ts" files)
  • 26. Write performance • We cannot afford two synchronous writes • The large file write is synchronous (fdatasync) • The large file is preallocated • K/V writes are asynchronous
  • 27. Recovery • Scan the volumes backwards • Add missing information to the key/value store
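
The recovery idea can be illustrated with a toy record format. Everything about the layout below is an assumption: the talk does not describe the on-disk format, and the length trailer is added here only because it makes walking backwards possible. The stop condition (the first key the index already knows) is also an assumption, and torn records at the end of a volume are not handled.

```python
import io
import struct

HEADER = struct.Struct("<II")  # hypothetical header: key length, data length
TRAILER = struct.Struct("<I")  # hypothetical trailer: total record length


def append_record(volume, key, data):
    """Append one record to the volume and return the offset where it starts."""
    offset = volume.seek(0, io.SEEK_END)
    record_len = HEADER.size + len(key) + len(data) + TRAILER.size
    volume.write(HEADER.pack(len(key), len(data)) + key + data
                 + TRAILER.pack(record_len))
    return offset


def recover(volume, index):
    """Scan the volume backwards and add records missing from `index`.

    Stops at the first record already registered, assuming only the most
    recent K/V writes were lost. Returns the number of keys added.
    """
    added = 0
    end = volume.seek(0, io.SEEK_END)
    while end > 0:
        volume.seek(end - TRAILER.size)
        (record_len,) = TRAILER.unpack(volume.read(TRAILER.size))
        start = end - record_len
        volume.seek(start)
        key_len, _data_len = HEADER.unpack(volume.read(HEADER.size))
        key = volume.read(key_len)
        if key in index:
            break
        index[key] = start
        added += 1
        end = start
    return added


volume = io.BytesIO()  # stand-in for a volume file
index = {b"a": append_record(volume, b"a", b"alpha")}  # registered before the crash
off_b = append_record(volume, b"b", b"bravo")          # K/V write lost
off_c = append_record(volume, b"c", b"charlie")        # K/V write lost
print(recover(volume, index))
```

Scanning from the end means a volume whose recent writes all reached the K/V is dismissed after reading a single record.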
  • 28. How does it perform? • Bytes per object in K/V: 42 bytes • Latency: slightly worse when empty, much better when full • REPLICATE: served from memory • Saved space • Room for improvement
  • 29. Benchmarks • PUT single thread – XFS: 17/s – Volumes: 40/s • PUT 20 threads – XFS: 4.7s (99%) – Volumes: 615ms (99%) • GET – XFS: 39/s – Volumes: 93/s
  • 30. What’s next • Upstream • Store short-lived objects in dedicated volumes • Replication of volumes • Choose replica/erasure-coding on the fly
  • 31. Credits • Haystack (Facebook project) • OpenStack Swift community
  • 33. Metadata storage • (extra slide if time) • Previously stored as extended attributes • Now serialized with protobuf and stored in the volume

Editor's notes

  1. I’m going to talk about optimization work done on OpenStack Swift. OVH operates several Swift clusters, known commercially as Hubic and PCS. Our customers tend to store huge numbers of small files on these infrastructures, especially on Hubic. Look at the audience (laptop between me and the audience). Don’t repeat too much (replica / EC). Explain vfile = file, on the implementation slide. Discuss afterwards at the booth.
  2. This is really the case on Hubic. No problem on PCS, because there are more spindles.
  3. I’m going to quickly recall some differences between replica and erasure coding in Swift. In a replica policy, each object is written several times, on different devices. The usual replication factor is 3, but this is configurable. The durability of the object depends on the replication factor. In this example, each object is written 3 times, which means that even if you lose 2 replicas, the object is still available. It is also a good way to increase download bandwidth by distributing the requests over the devices. The drawback of replication is the overhead: each byte is written N times. In this example, 6 bytes from the user become 18 bytes on the cluster. Each replica of an object is stored in a file; you can see the path at the top. The important parts are the hash, which is computed from the URL of the object, and the partition and suffix, which are extracted from the hash. The timestamp is the upload date of the object; it is set by the cluster during the upload, and the user can’t set it. It is essential to Swift’s « eventual consistency » model: in case of an incident, by comparing the timestamps of the copies of a single object, Swift can decide which one is the good one (the latest).
  4. Erasure coding is a bit different. I’m not going to go through all the theoretical explanation, with Reed-Solomon and so on; there is a good introduction in the Swift documentation. Each object is split into N data fragments, and M parity fragments are added to provide redundancy, and therefore durability. In this example, the cluster is configured with 3 data fragments and 1 parity fragment. It means that if I lose 1 device, my object is still accessible. All the computation (fragmenting and calculating parity) is done on the Swift proxies. The major benefit of erasure coding is that you can balance overhead and durability in your cluster. In this example, the overhead is 1.3, but durability is not that good (2 devices down and the object is unavailable). If you choose 10 data fragments and 2 parity fragments, you get the same level of durability as 3 replicas, but with an overhead of only 1.2. (Well, durability is not that simple, because the more devices, the more risk, it’s statistics, but I’m simplifying.) Compared to replica, you can’t scale the downloads: each fragment must be accessed to rebuild the object. Also, you have to anticipate the CPU consumption on the proxies. To summarize, you can think of replication as RAID-1, while erasure coding is like RAID-5 or RAID-6 but with more configuration possibilities. Looking at the file path, there is a new piece of information: the fragment number. As each fragment is unique, they must be accessed in the correct order to rebuild the object.
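
The overhead figures quoted in notes 3 and 4 follow from a one-line formula: bytes stored per user byte is (data fragments + parity fragments) / data fragments. A small sketch, with a made-up function name:

```python
def storage_overhead(data_fragments, parity_fragments):
    """Bytes stored on the cluster per user byte for a data+parity erasure-coding policy."""
    return (data_fragments + parity_fragments) / data_fragments


print(round(storage_overhead(3, 1), 2))   # the 3+1 example ("1.3" in the notes)
print(round(storage_overhead(10, 2), 2))  # the 10+2 example
print(6 * 3)                              # 3 replicas: 6 user bytes stored as 18 bytes
```

The number of lost devices an object survives is the number of parity fragments (1 for 3+1, 2 for 10+2), against replicas minus one for replication (2 for 3 replicas).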
  5. It was even 30 files per object at the beginning, because of the durable file. Thankfully, that has been dropped since then. x5 factor in the number of files -> the problem is most acute for erasure coding.
  6. 40M (to confirm?) inodes per device, 36 devices per server, for 64GB of RAM => would require 700+GB of RAM to have everything in cache. Bad choice at first: too many partitions per device. Reducing the number of partitions would tend towards 2 inodes per fragment (17% improvement).
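
The 700+GB estimate in note 6 can be checked with quick arithmetic. The inode and device counts come from the note (the first one is flagged there as unconfirmed); the per-inode sizes span the 300 bytes to 1k range of slide 7, and the 500-byte midpoint is an assumption added for the comparison.

```python
INODES_PER_DEVICE = 40_000_000  # "40M (to confirm?)" in the note
DEVICES_PER_SERVER = 36


def inode_cache_gb(bytes_per_inode):
    """Memory needed to keep every inode of one server in cache, in decimal GB."""
    return INODES_PER_DEVICE * DEVICES_PER_SERVER * bytes_per_inode / 1e9


for size in (300, 500, 1000):
    print(size, "bytes/inode:", inode_cache_gb(size), "GB")
```

At roughly 500 bytes per inode the total is 720 GB, which matches the 700+GB in the note and is more than ten times the 64GB of RAM per server.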
  7. K/V is not suited at all to synchronous IO, which is required before the proxy replies that the object is actually safe on disk. Explain write amplification. Persistent file handle: open a file without having to walk through all the inodes in the path. So what’s the solution? Too many inodes means we have too many files. Let’s have fewer files!
  8. Limiting inodes means limiting the number of files. Obvious! We call them "volumes". What are their characteristics?
  9. Three important characteristics. Dedicated to a partition: not one large volume the size of the disk! Make a volume dedicated to a partition. It makes it easier to move a partition to another node (ring change). Append-only: we only append new objects at the end of the file. Nothing is ever overwritten. We don’t want to write a space allocator. No concurrent writes: we must still support concurrent writes to the same partition, so we create multiple volumes. Now, we need a way to locate the objects we write in those large files. Let’s take a step back first.
  10. Very simplified overview, for a replica configuration, not discussing authentication, the container server, etc. An object server may have multiple disks, with multiple object-server processes. Explain PUT, GET (one server only). The request arrives on one proxy server, which contacts specific object servers based on the ring. Won’t go into details about that, just explain that we are modifying the object server code only, nothing above. We are at the bottom of the stack. The problem we described is on the object server. This is where we are working, let’s zoom in.
  11. Explain consistent hashing. We calculate an MD5 hash from the object name. Then the partition is extracted from the hash, given the cluster configuration. The ring tells us which object servers will store a partition. The suffix is used to limit the number of entries in a directory (XFS developers unhappy about that). Timestamp: to manage versions, e.g. the user uploads a new version of photo.jpg. Now, let’s see in practice how this works with the new system.
  12. Take care to explain the request again: the object server receives something like PUT toto.jpg. Will calculate the object hash, and then PUT that to the object server.
  13. Explain the GET. Now let’s zoom in on the index server.
  14. A bit of detail on the index server. It is written in Go. There is one instance per disk: one database + one process.
  15. Explain key, value. We are now able to find our files. What about directories? Files are stored below multiple directories: partition, suffix. These are necessary for the cluster (replicator, reconstructor).
  16. Give examples of operations happening: per partition (placement through the ring configuration), per suffix (replication). Explain the partition power and its relation to the partition. Explain how we seek to the prefix and continue scanning until the next partition number. For suffixes, just get the end of the name. We trade CPU for memory. OK, we can write, read, and listdir. What about deletion?
  17. Explain hole punching mechanism. Reclaim space without changing the file size. Extent count will increase.
  18. Explain hole punching mechanism. Reclaim space without changing the file size.
  19. Explain the flow. One golang process and database per disk: avoid hanging or slowing down everyone if a disk is being slow. I left out a few details.
  21. Hole punching is great, but there is still a small cost: more extents in the file. Tombstone volumes can be closed and deleted once all files have been deleted. Also planned for files with an X-Delete-At header. Not a problem until you have lots of extents. Not expected to be needed often.
  22. Explain why we can’t sync the KV. Describe the recovery procedure in case we crashed.
  24. For 10 million files: 400MB, vs 3 to 8GB with inodes. Explain REPLICATE (non-intuitive name). Improvement: smaller keys.
  25. Better performance expected now (fdatasync)
  26. Add hybrid access