OpenStack Swift is a very powerful object store used in several of the largest object storage deployments around the globe. It provides a very high level of data durability and can withstand epic disasters if set up the right way.
3. Intro - eNovance
● Christian Schwede
● Developer @ eNovance / Red Hat
● Mostly working on Swift, testing, automation and developer tools
● Swift Core
● IRC: cschwede in #openstack-swift
● christian@enovance.com / cschwede@redhat.com
● Twitter: @cschwede_de
5. [Diagram: two proxy nodes connected over the network to a pool of disks]
6. [Diagram: two proxy nodes connected over the network to disks grouped into Zone 0, Zone 1 and Zone 2]
7. [Diagram: two proxy nodes connected over the network to disks split across two regions: Region 0 (⅔ of the data) containing Zone 0 and Zone 1, and Region 1 (⅓ of the data) containing Zone 2]
9. Ring: the map of data
● One Ring file per type of data. A Ring maps each copy of an object to a physical device through partitions.
● An object’s partition number is computed from the hash of the object’s name.
● A Ring file contains: a (replica, partition) → device ID table, a devices table, and the number of hash bits (the partition power).
● Visualize a Ring: https://github.com/victorlin/swiftsense
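The hash-to-partition step above can be sketched in a few lines of Python. This mirrors how Swift derives a partition from an object path (top bits of an MD5 digest), but it is a simplified sketch: the partition power of 3 matches the example on the next slide, and the per-cluster hash prefix/suffix that a real cluster mixes into the path is omitted.

```python
import hashlib
import struct

PART_POWER = 3  # assumed, matching the 8-partition example ring

def partition_for(account, container, obj, part_power=PART_POWER):
    """Map an object path to a partition number in [0, 2**part_power)."""
    path = f"/{account}/{container}/{obj}".encode("utf-8")
    digest = hashlib.md5(path).digest()
    # Take the top 32 bits of the digest, keep only `part_power` of them.
    return struct.unpack_from(">I", digest)[0] >> (32 - part_power)

print(partition_for("AUTH_test", "photos", "cat.jpg"))  # a value in 0..7
```

Because the partition depends only on the name hash, every proxy computes the same partition for the same object without any coordination.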
10. Concrete example of a Ring
Bit count (partition power) = 3 → 2³ = 8 partitions

Replica & partition → device ID table:

Replica \ Partition   0  1  2  3  4  5  6  7
        0             0  1  2  3  0  1  2  3
        1             1  2  3  0  1  2  3  0
        2             2  3  0  1  2  3  0  1

Devices table:

ID  Host          Port  Device
0   192.168.0.10  6000  sdb1
1   192.168.0.10  6000  sdc1
2   192.168.0.11  6000  sdb1
3   192.168.0.11  6000  sdc1
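A full lookup is then just table indexing. This sketch hard-codes the two example tables from the slide and returns the device holding each replica of a partition:

```python
# The (replica, partition) -> device ID table from the example ring.
REPLICA2PART2DEV = [
    [0, 1, 2, 3, 0, 1, 2, 3],  # replica 0
    [1, 2, 3, 0, 1, 2, 3, 0],  # replica 1
    [2, 3, 0, 1, 2, 3, 0, 1],  # replica 2
]

# The devices table from the example ring.
DEVICES = [
    {"id": 0, "host": "192.168.0.10", "port": 6000, "device": "sdb1"},
    {"id": 1, "host": "192.168.0.10", "port": 6000, "device": "sdc1"},
    {"id": 2, "host": "192.168.0.11", "port": 6000, "device": "sdb1"},
    {"id": 3, "host": "192.168.0.11", "port": 6000, "device": "sdc1"},
]

def devices_for_partition(partition):
    """Return the device entry holding each replica of a partition."""
    return [DEVICES[row[partition]] for row in REPLICA2PART2DEV]

for dev in devices_for_partition(5):
    print(dev["host"], dev["device"])
```

Note how the table staggers device IDs across rows, so the three replicas of any partition always land on three different devices.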
11. Storage policies
● Included in the Juno release (Swift ≥ 2.0.0)
● Applied on a per-container basis
● Flexibility to use multiple rings, for example:
○ Basic: 2 replicas on spinning disks, single datacenter
○ Strong: 3 replicas in three different datacenters around the globe
○ Fast: 3 replicas on SSDs and much more powerful proxies
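A minimal sketch of how the three example policies could be declared in `swift.conf` (the policy names are the ones from this slide; each policy uses its own ring file, e.g. `object.ring.gz`, `object-1.ring.gz`, …):

```ini
[storage-policy:0]
name = basic
default = yes

[storage-policy:1]
name = strong

[storage-policy:2]
name = fast
```

A container is bound to a policy at creation time, e.g. by sending the `X-Storage-Policy: fast` header with the PUT request.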
13. Object durability
● Disk failures: p_d ≈ 2-5% per year
● Unrecoverable bit read errors: p_b = 10⁻¹⁵ · 8 · objectsize
[Diagram: each failure reduces the object from 3 replicas to 2, then 1, then data loss; after each failure, replication restores the missing copy]
● Durability in the range of 10-11 nines with 3 replicas (99.99999999%)
● http://enovance.github.io/swift-durability-calculator/
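The order of magnitude can be checked with a back-of-envelope model (my own rough sketch, not the linked calculator): with 3 replicas, data is lost only if, after one replica fails, the other two also fail before replication restores the copy. The failure rate and recovery window below are assumed example values.

```python
P_DISK_YEAR = 0.03    # assumed annual disk failure probability (~3%)
RECOVERY_DAYS = 1.0   # assumed time to re-replicate a failed disk

# Probability that a given disk fails within the recovery window.
p_fail_in_window = P_DISK_YEAR * RECOVERY_DAYS / 365

# First failure can happen any time in the year; the other two
# replicas must then both fail inside the recovery window.
p_loss_year = P_DISK_YEAR * p_fail_in_window ** 2
durability = 1 - p_loss_year
print(f"{durability:.12f}")
```

Even this crude independent-failure model lands around ten nines, consistent with the range quoted above; the real calculator accounts for cluster layout and partition counts.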
14. Recover from a disk failure
Set the failed device's weight to 0, rebalance, and push the new ring to all nodes
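With the standard `swift-ring-builder` tool, the drain could look like this sketch (the device search value `d3` is an assumed example, meaning device ID 3):

```shell
# Drain the failed device: weight 0 means it receives no partitions.
swift-ring-builder object.builder set_weight d3 0
swift-ring-builder object.builder rebalance
# Then distribute the new object.ring.gz to every node (e.g. via rsync);
# replicators move the affected partitions to the remaining disks.
```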
15. Object availability & durability
[Diagram: the same two-region layout as slide 7: Region 0 (⅔ of the data) with Zone 0 and Zone 1, and Region 1 (⅓ of the data) with Zone 2]
17. Maintainability by Simplicity
● Standalone `swift-ring-builder` tool to manipulate the Ring
○ Uses `builder` files to keep architectural information about the cluster
○ Smartly assigns partitions to devices
○ Generates Ring files that are easy to check
● Processes on Swift nodes focus on ensuring that files are stored uncorrupted at the appropriate location
18. Splitting a running Swift Cluster
● Ensuring no data is lost
○ Move only 1 replica at a time
○ Small steps to limit the impact
○ Check for data corruption
○ Check data location
○ Rollback in case of failure
● Limiting the impact on performance
○ Availability of cluster resources
○ Load incurred by cluster being split
○ Small steps to limit the impact
○ Control nodes accessed by users
Natively available in Swift
19. Splitting a running Swift Cluster
(Same checklist as the previous slide, with emphasis on taking small steps to limit the impact)
Small steps - new in Swift 2.2!
20. Adding a new region
Add a new region smoothly by limiting the amount of data moved at each step
● Only practical since Swift 2.2
● The final weight of the new region should be at least ⅓ of the total cluster weight
Example of process:
1. Add devices to new region with a very low weight
2. Increase devices’ weights to store 5% of data in the new region
3. Progressively increase by steps of 5% the amount of data in the new region
More details: http://www.florentflament.com/blog/splitting-swift-cluster.html
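The process above can be sketched with `swift-ring-builder` (addresses, device names and weights are assumed examples; `r2z1-…` means region 2, zone 1):

```shell
# Step 1: add a device in the new region with a very low weight.
swift-ring-builder object.builder add r2z1-10.0.2.10:6000/sdb1 10
swift-ring-builder object.builder rebalance
# Steps 2-3: after each rebalance settles, raise the weight in
# small increments (~5% of the target) and rebalance again.
swift-ring-builder object.builder set_weight r2z1-10.0.2.10:6000/sdb1 50
swift-ring-builder object.builder rebalance
```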
22. Erasure coding
● Coming real soon now
● Instead of N copies of each object:
○ apply erasure coding to the object and split it into multiple fragments, for example 14
○ store them on different disks/nodes
○ the object can be rebuilt from any 10 fragments
■ Tolerates the loss of 4 fragments
● higher durability
■ Only ~40% storage overhead (compared to 200% for 3 replicas)
● much cheaper
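The overhead numbers follow directly from the fragment counts; a quick sketch using the 14/10 scheme from this slide (10 data + 4 parity fragments):

```python
def overhead(total_fragments, needed_fragments):
    """Extra storage as a fraction of the usable data size."""
    return (total_fragments - needed_fragments) / needed_fragments

ec = overhead(14, 10)      # erasure coding: 4 extra fragments per 10
replicas = overhead(3, 1)  # 3 full copies: 2 extra copies per 1
print(f"EC: {ec:.0%}, 3 replicas: {replicas:.0%}")  # EC: 40%, 3 replicas: 200%
```

So for the same usable capacity, the erasure-coded layout stores roughly a fifth of the redundant bytes that triple replication does, while tolerating more simultaneous losses (4 fragments vs. 2 replicas).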
23. Durability calculation
● More detailed calculation
○ Number of disks, servers, partitions
● Add erasure coding
● Include in Swift documentation?
● Community effort
○ Discussion started last Swift hackathon
■ NTT, SwiftStack, IBM, Seagate, Red Hat / eNovance
○ Ad-Hoc session on Thursday/Friday - join us!
24. Summary
● High availability, even if large parts of the cluster are not accessible
● Automatic failure correction ensures high durability and, depending on your cluster configuration, exceeds known industry standards
● Swift 2.2 (Juno release)
○ Even smoother and more predictable cluster upgrades
○ Storage policies allow fine-grained control over data placement
● Erasure coding will increase durability even further while lowering costs