Lots of small objects in a Swift cluster can lead to performance issues on the object servers. We propose a backend change to improve performance for this workload.
6. Where inodes join the party…
• XFS:
– one inode per file
– one inode per directory
• Inode:
– ctime/mtime/atime
– owner/group
– permissions
7. Bad things happen
• One inode takes 300 bytes to 1 KB of memory
• Average: 2.4 inodes per fragment
– Data file: 1
– Object directory: 1
– Suffix directory + Partition directory: 0.4
8. Memory issues
• Inodes no longer fit in cache
– But every inode in the path must be checked to open a data file
• Only top-level directories are cached
– Only a 20% hit rate on the inode cache
– Up to 50% of device activity goes to reading inodes
9. Stability issues
• More filesystem corruption
• Inability to run xfs_repair
– It needs 1 KB of memory per inode
• Need dedicated servers just to repair filesystems
– About 48 hours to repair one filesystem
11. We tried crazy things
• Storing objects in a K/V store (RocksDB, LevelDB, …)
– Not suited to synchronous IO; write amplification
• Storing the file handles of data files in a K/V store
– Requires atomicity across two separate data structures
• Patching XFS to drop unneeded information
– It is already well optimized; inodes may be compressed
• Storing in the ZFS DMU
– Lots of very cool features, but performance issues when full, and low-level development
14. Swift request path
[Diagram: PUT / GET requests arrive at the proxy servers, which dispatch them to the object servers]
15. How does Swift organize data?
• PUT "photo.jpg" -> MD5 hash: bc6a624f493bf3042662064285f355c4
• Partition: bc6a -> 48234
• Suffix: 5c4
• Timestamp: 1449519086.42102.data
• /srv/node/sda/objects/48234/5c4/bc6a624f493bf3042662064285f355c4/1449519086.42102.data
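To make the mapping concrete, here is a minimal Go sketch of the hash -> partition/suffix -> path derivation. The partition power of 16 is an assumption chosen to match the bc6a -> 48234 example above, and the salt is a placeholder: real Swift salts the hashed path with per-cluster hash_path_prefix/hash_path_suffix values.

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

const partPower = 16 // assumed: 0xbc6a... >> (32-16) = 48234, as above

func main() {
	// Swift hashes the full /account/container/object path, salted with
	// per-cluster values; both the path and the salts here are placeholders.
	name := "/AUTH_account/container/photo.jpg"
	sum := md5.Sum([]byte("prefix" + name + "suffix"))
	hash := fmt.Sprintf("%x", sum)

	// Partition: the top partPower bits of the first 4 bytes of the hash.
	partition := binary.BigEndian.Uint32(sum[0:4]) >> (32 - partPower)

	// Suffix: the last three hex characters of the hash.
	suffix := hash[len(hash)-3:]

	// Timestamp set by the cluster at upload time.
	timestamp := "1449519086.42102"

	fmt.Printf("/srv/node/sda/objects/%d/%s/%s/%s.data\n",
		partition, suffix, hash, timestamp)
}
```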
16. Example: writing an object
[Diagram: proxy server -> object server -> index server, each disk holding several volumes]
• PUT: obtain a write lock on a volume (fcntl)
• Write the object at the end of the volume
• Register the object in the index server
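A minimal Go sketch of the locking step, using the golang.org/x/sys/unix bindings; the volume path is hypothetical, and the slides only specify that the lock is taken with fcntl:

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// lockVolume takes an exclusive, non-blocking fcntl write lock on the
// whole volume file, so only one writer appends to a volume at a time.
func lockVolume(f *os.File) error {
	lk := unix.Flock_t{
		Type:   unix.F_WRLCK, // exclusive write lock
		Whence: 0,            // from the start of the file
		Start:  0,
		Len:    0, // 0 = through the end of the file
	}
	// F_SETLK fails immediately if another writer holds the lock;
	// the caller can then simply try another volume of the partition.
	return unix.FcntlFlock(f.Fd(), unix.F_SETLK, &lk)
}

func main() {
	f, err := os.OpenFile("/srv/node/sda/volumes/48234.vol", // hypothetical path
		os.O_WRONLY|os.O_APPEND, 0)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if err := lockVolume(f); err != nil {
		fmt.Println("volume busy, trying another one:", err)
		return
	}
	// ... append object header + data, fdatasync, register in the index server ...
}
```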
17. Example: reading an object
[Diagram: proxy server -> object server -> index server, each disk holding several volumes]
• GET: get the object location from the index server
• Open the volume
• Read the object at the given offset
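The read side in Go could look like the sketch below; the index lookup is elided, and the object size is assumed to be known to the caller (from the index or the object header), since the slides only record volume index + offset:

```go
package main

import (
	"fmt"
	"os"
)

// readObject reads an object's bytes from a volume at the location the
// index server returned. volumePath, offset and size are assumptions
// about what the caller resolved before the read.
func readObject(volumePath string, offset, size int64) ([]byte, error) {
	f, err := os.Open(volumePath)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	buf := make([]byte, size)
	// ReadAt is a pread(2): it does not move a shared file offset, so
	// concurrent readers of the same volume do not interfere.
	if _, err := f.ReadAt(buf, offset); err != nil {
		return nil, err
	}
	return buf, nil
}

func main() {
	data, err := readObject("/srv/node/sda/volumes/48234.vol", 4096, 1024) // illustrative values
	if err != nil {
		panic(err)
	}
	fmt.Println(len(data), "bytes read")
}
```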
18. Index server
• Stores data in a key/value store: LevelDB
• Communication via gRPC
• Key: hash + filename
• Value: volume index + offset
• Keys are sorted on-disk for efficient seeks
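A sketch of how such an entry could be built in Go. The key concatenation follows the slide; the varint value encoding is an assumption, since the wire format is not specified here:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// makeKey concatenates the object hash and the on-disk filename
// (timestamp + extension), as described on the slide.
func makeKey(hash, filename string) []byte {
	return []byte(hash + filename)
}

// encodeValue packs the volume index and the offset inside the volume.
// Varints are an assumed encoding, chosen here to keep entries small.
func encodeValue(volumeIndex uint32, offset uint64) []byte {
	buf := make([]byte, 0, 2*binary.MaxVarintLen64)
	buf = binary.AppendUvarint(buf, uint64(volumeIndex))
	buf = binary.AppendUvarint(buf, offset)
	return buf
}

func main() {
	key := makeKey("bc6a624f493bf3042662064285f355c4", "1449519086.42102.data")
	val := encodeValue(3, 1048576) // illustrative volume index and offset
	fmt.Printf("key=%s value=%d bytes\n", key, len(val))
}
```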
19. Index server – keys example
• ……
• bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data
• bc6a624f493bf3042662064285f355c41449519086.42102.data
• bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data
• ……
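Because keys are sorted on disk, listing a partition (or a suffix) is a bounded prefix scan rather than a directory walk. A sketch using the goleveldb bindings; the library choice and database path are assumptions, as the slides only name LevelDB:

```go
package main

import (
	"fmt"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/util"
)

func main() {
	db, err := leveldb.OpenFile("/srv/node/sda/index", nil) // hypothetical path
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// With a partition power of 16, partition 48234 covers every hash
	// starting with "bc6a": seek to that prefix and iterate until the
	// next partition begins.
	iter := db.NewIterator(util.BytesPrefix([]byte("bc6a")), nil)
	defer iter.Release()
	for iter.Next() {
		fmt.Printf("%s -> %d-byte location\n", iter.Key(), len(iter.Value()))
	}
	if err := iter.Error(); err != nil {
		panic(err)
	}
}
```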
22. Deletion
• Hole punching with fallocate()
• Reclaim space without changing the file size!
[Diagram: volume layout: a volume header followed by object header + object data records; a punched-out object's data becomes space reclaimed by the filesystem]
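A minimal sketch of the deletion path in Go, assuming Linux and the golang.org/x/sys/unix bindings; the offset and length of the record to punch would come from the index and the object header:

```go
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

// punchHole releases the blocks backing a deleted object. KEEP_SIZE is
// required together with PUNCH_HOLE, so the volume's apparent size is
// unchanged and every offset recorded in the index stays valid.
func punchHole(f *os.File, offset, length int64) error {
	return unix.Fallocate(int(f.Fd()),
		unix.FALLOC_FL_PUNCH_HOLE|unix.FALLOC_FL_KEEP_SIZE,
		offset, length)
}

func main() {
	f, err := os.OpenFile("/srv/node/sda/volumes/48234.vol", os.O_RDWR, 0) // hypothetical path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Punch out one object's data (offset and length are illustrative).
	if err := punchHole(f, 4096, 1048576); err != nil {
		panic(err)
	}
}
```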
26. Write performance
• We cannot afford two synchronous writes
• The large file write is synchronous (fdatasync)
• The large file is preallocated
• K/V writes are asynchronous
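A sketch of that ordering in Go: the volume append is made durable with fdatasync before the PUT is acknowledged, while the index write is only queued (after a crash the index can be rebuilt from the volume, as the next slide shows). The channel-based queue is an assumption:

```go
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

type indexEntry struct{ key, value []byte }

// appendObject writes a record at the end of a volume and returns only
// once the data is on disk; the K/V insert is handed to a background
// writer. Preallocating the volume keeps fdatasync cheap, because the
// file metadata rarely changes on append.
func appendObject(vol *os.File, record []byte, queue chan<- indexEntry, e indexEntry) error {
	if _, err := vol.Write(record); err != nil {
		return err
	}
	if err := unix.Fdatasync(int(vol.Fd())); err != nil {
		return err
	}
	queue <- e // asynchronous: the index may lag the volume, never the reverse
	return nil
}

func main() {
	queue := make(chan indexEntry, 1024)
	go func() { // hypothetical background index writer
		for range queue {
			// ... batch the entries into the LevelDB index ...
		}
	}()
	_ = appendObject // wiring of volumes and records is elided
}
```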
27. Recovery
• Scan the volumes backwards
• Add the missing entries to the key/value store
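A sketch of the recovery pass, assuming goleveldb and an already-parsed list of volume records (the record format and the backwards parsing are elided): entries near the tail are the ones most likely missing from the asynchronous index.

```go
package main

import (
	"github.com/syndtr/goleveldb/leveldb"
)

// record is a hypothetical parsed volume entry: the index key for an
// object and the encoded location (volume index + offset) to store.
type record struct{ key, value []byte }

// recoverVolume walks a volume's records backwards and re-inserts any
// entry missing from the index, repairing the gap left by K/V writes
// that were still queued when the server crashed.
func recoverVolume(db *leveldb.DB, records []record) error {
	for i := len(records) - 1; i >= 0; i-- {
		r := records[i]
		_, err := db.Get(r.key, nil)
		switch err {
		case nil:
			// Already indexed; keep scanning to the start to stay safe.
		case leveldb.ErrNotFound:
			if err := db.Put(r.key, r.value, nil); err != nil {
				return err
			}
		default:
			return err
		}
	}
	return nil
}

func main() {
	db, err := leveldb.OpenFile("/srv/node/sda/index", nil) // hypothetical path
	if err != nil {
		panic(err)
	}
	defer db.Close()
	_ = recoverVolume // volume scanning and record parsing are elided
}
```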
28. How does it perform?
• Bytes per object in the K/V: 42 bytes
• Latency: slightly worse when empty, much better when full
• REPLICATE: served from memory
• Saved space
• Room for improvement
29. Benchmarks
• PUT, single thread
– XFS: 17/s
– Volumes: 40/s
• PUT, 20 threads (99th percentile latency)
– XFS: 4.7 s
– Volumes: 615 ms
• GET
– XFS: 39/s
– Volumes: 93/s
30. What’s next
• Upstream
• Store short-lived objects in dedicated volumes
• Replication of volumes
• Choose replica/erasure-coding on the fly
33. Metadata storage
• (extra slide, if time permits)
• Previously stored as extended attributes
• Now serialized with protobuf and stored in the volume
Editor's Notes
I am going to talk about an optimization effort on OpenStack Swift.
OVH operates several Swift clusters, known commercially as Hubic and PCS.
Our customers tend to store enormous numbers of small files on these infrastructures, especially on Hubic.
Look at the audience (laptop between me and the audience).
Don't repeat too much (replica / EC).
Explain vfile = file in the implementation.
Discuss afterwards at the booth.
This is really the case on Hubic.
No problem on PCS, because there are more spindles.
I'm going to quickly recall some differences between replica and erasure-code policies in Swift. In a replica policy, each object is written several times, on different devices. The usual replication factor is 3, but this is configurable. The durability of the object depends on the replication factor.
In this example, each object is written 3 times, which means that even if you lose 2 replicas, the object is still available. It is also a good way to increase download bandwidth by distributing the requests over the devices.
The drawback of replication is the overhead: each byte is written N times. In this example, 6 bytes from the user become 18 bytes on the cluster.
Each replica of an object is stored in a file; you can see the path on top. The important parts are the hash, which is computed from the URL of the object; the partition and suffix are extracted from the hash. The timestamp is the date of the upload of the object; it is set by the cluster during the upload, and the user can't set it. It is essential in the "eventual consistency" model of Swift: in case of an incident, by comparing the different timestamps of a single object, Swift can decide which one is the good one (the latest, actually).
Erasure coding is a bit different. I'm not going to go through the whole theoretical explanation, with Reed-Solomon and such; there is a good introduction in the Swift documentation.
Each object is split into N data fragments, and M parity fragments are added to ensure redundancy, and therefore durability. In this example, the cluster is configured with 3 data fragments and 1 parity fragment. It means that if I lose 1 device, my object is still accessible. All the work of fragmenting and computing parity is done on the Swift proxies.
The major interest of erasure coding is that you can balance overhead against durability in your cluster. In this example, the overhead is 1.3, but the durability is not that good (2 devices down and the object is unavailable). If you choose 10 data fragments and 2 parity fragments, you get the same level of durability as 3 replicas, but with an overhead of only 1.2. (Well, durability is not that simple, because the more devices, the more risk; it's statistics, but I'm simplifying.)
Compared to replication, you can't scale the downloads: every fragment must be accessed to rebuild the object. Also, you have to anticipate the CPU consumption on the proxies.
To summarize, you can think of replication as RAID-1, while erasure coding is like RAID-5 or RAID-6, but with more configuration possibilities.
Looking at the file path, there is a new piece of information: the fragment number. As each fragment is unique, they must be accessed in the correct order to rebuild the object.
It was even 30 files per object at the beginning, because of the durable file. Thankfully, it has since been dropped.
A 5× factor in the number of files -> the problem is most acute for erasure coding.
40M (to confirm?) inodes per device, 36 devices per server, for 64 GB of RAM => would require 700+ GB of RAM to keep everything in cache.
Bad choice at first: too many partitions per device. Reducing the number of partitions would bring us close to 2 inodes per fragment (a 17% improvement).
A K/V store is not suited at all to synchronous IO, which is required before the proxy replies that the object is actually safe on disk.
Explain write amplification.
Persistent file handle: open a file without having to walk through all the inodes in the path.
So what's the solution? Too many inodes means we have too many files. Let's have fewer files!
Limiting inodes means limiting the number of files. Obvious!
We call them "volumes".
What are their characteristics?
Three important characteristics:
Dedicated to a partition: not one large volume the size of the disk!
Making a volume dedicated to a partition makes it easier to move the partition to another node (ring change).
Append-only: we only append new objects at the end of the file. Nothing is ever overwritten. We don't want to write a space allocator.
No concurrent writes to a single volume: we must still support concurrent writes to the same partition, so we create multiple volumes.
Now we need a way to locate the objects we write in those large files. Let's take a step back first.
A very simplified overview, for a replica configuration; not discussing authentication, the container server, etc.
An object server may have multiple disks, with multiple object-server processes.
Explain PUT, GET (one server only).
The request arrives on one proxy server, which contacts specific object servers based on the ring.
I won't go into detail about that; the point is that we are modifying the object-server code only, nothing above.
We are at the bottom of the stack.
The problem we described is on the object server. This is where we are working; let's zoom in.
Explain consistent hashing
We calculate an MD5 hash from the object name.
Then the partition is extracted from the hash, given the cluster configuration.
The ring tells us which object servers will store a partition.
The suffix is used to limit the number of entries in a directory. (XFS developers are unhappy about that.)
Timestamp: to manage versions, e.g. the user uploads a new version of photo.jpg.
Now, let’s see in practice how this works with the new system
Take care to explain the request again:
The proxy receives something like PUT toto.jpg, calculates the object hash, and then PUTs it to the object server.
Explain the GET.
Now let's zoom in on the index server.
A bit of detail on the index server. It is written in Go. There is one instance per disk: one database + one process.
Explain key, value
We are now able to find our files. What about directories?
Files are stored below multiple directories: partition, suffix.
These are necessary for the cluster (replicator, reconstructor).
Give examples of operations happening:
Per partition (placement through the ring configuration)
Per suffix (replication)
Explain the partition power and its relation to the partition.
Explain how we seek to the prefix and scan until the next partition number.
For suffixes, just take the end of the name.
We trade CPU for memory. OK, we can write, read, and listdir. What about deletion?
Explain hole punching mechanism.
Reclaim space without changing the file size
Extent count will increase
Explain the flow
One Go process and one database per disk: this avoids hanging or slowing everyone down if a disk is slow.
I left out a few details
Hole punching is great, but there is still a small cost: more extents in the file.
Tombstone volumes can be closed and deleted once all their files have been deleted.
Also planned for files with an X-Delete-At header.
Not a problem until you have lots of extents; not expected to be needed often.
Explain why we can’t sync the KV
Describe the recovery procedure in case we crashed
For 10 million files: 400 MB, vs. 3 to 8 GB with inodes.
Explain REPLICATE (non-intuitive name).
Improvement: smaller keys.