OpenStack Swift is a very powerful object store used in several of the largest object storage deployments around the globe. It provides a very high level of data durability and can withstand epic disasters if set up the right way.
3. Intro - eNovance
● Christian Schwede
● Developer @ eNovance / Red Hat
● Mostly working on Swift, testing, automation and developer tools
● Swift Core
● IRC: cschwede in #openstack-swift
● christian@enovance.com / cschwede@redhat.com
● Twitter: @cschwede_de
5. [Diagram: two proxy nodes connected over the network to a pool of disks]
6. [Diagram: two proxy nodes connected over the network to disks grouped into Zone 0, Zone 1 and Zone 2]
7. [Diagram: two proxy nodes connected over the network to disks split across two regions: Region 0 (⅔ of the data) containing Zone 0 and Zone 1, and Region 1 (⅓ of the data) containing Zone 2]
9. Ring: the map of data
● One Ring file per type of data. A Ring maps each copy of an object to a physical device through partitions.
● An object’s partition number is computed from the hash of the object’s name.
● A Ring file contains: a (replica, partition) → device ID table, a devices table, and the number of hash bits (the partition power).
● Visualize a Ring: https://github.com/victorlin/swiftsense
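The hash-to-partition step above can be sketched in a few lines of Python. This mirrors how Swift derives a partition from an object path (top bits of an MD5 digest), but it is a simplified sketch: the partition power of 3 matches the example on the next slide, and the per-cluster hash prefix/suffix that a real cluster mixes into the path is omitted.

```python
import hashlib
import struct

PART_POWER = 3  # assumed, matching the 8-partition example ring

def partition_for(account, container, obj, part_power=PART_POWER):
    """Map an object path to a partition number in [0, 2**part_power)."""
    path = f"/{account}/{container}/{obj}".encode("utf-8")
    digest = hashlib.md5(path).digest()
    # Take the top 32 bits of the digest, keep only `part_power` of them.
    return struct.unpack_from(">I", digest)[0] >> (32 - part_power)

print(partition_for("AUTH_test", "photos", "cat.jpg"))  # a value in 0..7
```

Because the partition depends only on the name hash, every proxy computes the same partition for the same object without any coordination.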
10. Concrete example of a Ring
Bit count (partition power) = 3 → 2³ = 8 partitions

Replica & partition → device ID table:

Replica \ Partition   0  1  2  3  4  5  6  7
        0             0  1  2  3  0  1  2  3
        1             1  2  3  0  1  2  3  0
        2             2  3  0  1  2  3  0  1

Devices table:

ID  Host          Port  Device
0   192.168.0.10  6000  sdb1
1   192.168.0.10  6000  sdc1
2   192.168.0.11  6000  sdb1
3   192.168.0.11  6000  sdc1
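A full lookup is then just table indexing. This sketch hard-codes the two example tables from the slide and returns the device holding each replica of a partition:

```python
# The (replica, partition) -> device ID table from the example ring.
REPLICA2PART2DEV = [
    [0, 1, 2, 3, 0, 1, 2, 3],  # replica 0
    [1, 2, 3, 0, 1, 2, 3, 0],  # replica 1
    [2, 3, 0, 1, 2, 3, 0, 1],  # replica 2
]

# The devices table from the example ring.
DEVICES = [
    {"id": 0, "host": "192.168.0.10", "port": 6000, "device": "sdb1"},
    {"id": 1, "host": "192.168.0.10", "port": 6000, "device": "sdc1"},
    {"id": 2, "host": "192.168.0.11", "port": 6000, "device": "sdb1"},
    {"id": 3, "host": "192.168.0.11", "port": 6000, "device": "sdc1"},
]

def devices_for_partition(partition):
    """Return the device entry holding each replica of a partition."""
    return [DEVICES[row[partition]] for row in REPLICA2PART2DEV]

for dev in devices_for_partition(5):
    print(dev["host"], dev["device"])
```

Note how the table staggers device IDs across rows, so the three replicas of any partition always land on three different devices.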
11. Storage policies
● Included in the Juno release (Swift ≥ 2.0.0)
● Applied on a per-container basis
● Flexibility to use multiple rings, for example:
○ Basic: 2 replicas on spinning disks, single datacenter
○ Strong: 3 replicas in three different datacenters around the globe
○ Fast: 3 replicas on SSDs and much more powerful proxies
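A minimal sketch of how the three example policies could be declared in `swift.conf` (the policy names are the ones from this slide; each policy uses its own ring file, e.g. `object.ring.gz`, `object-1.ring.gz`, …):

```ini
[storage-policy:0]
name = basic
default = yes

[storage-policy:1]
name = strong

[storage-policy:2]
name = fast
```

A container is bound to a policy at creation time, e.g. by sending the `X-Storage-Policy: fast` header with the PUT request.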
13. Object durability
● Disk failures: p_d ≈ 2-5% per year
● Unrecoverable bit read errors: p_b = 10⁻¹⁵ · 8 · objectsize
[Diagram: each failure reduces the object from 3 replicas to 2, then 1, then data loss; after each failure, replication restores the missing copy]
● Durability in the range of 10-11 nines with 3 replicas (99.99999999%)
● http://enovance.github.io/swift-durability-calculator/
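The order of magnitude can be checked with a back-of-envelope model (my own rough sketch, not the linked calculator): with 3 replicas, data is lost only if, after one replica fails, the other two also fail before replication restores the copy. The failure rate and recovery window below are assumed example values.

```python
P_DISK_YEAR = 0.03    # assumed annual disk failure probability (~3%)
RECOVERY_DAYS = 1.0   # assumed time to re-replicate a failed disk

# Probability that a given disk fails within the recovery window.
p_fail_in_window = P_DISK_YEAR * RECOVERY_DAYS / 365

# First failure can happen any time in the year; the other two
# replicas must then both fail inside the recovery window.
p_loss_year = P_DISK_YEAR * p_fail_in_window ** 2
durability = 1 - p_loss_year
print(f"{durability:.12f}")
```

Even this crude independent-failure model lands around ten nines, consistent with the range quoted above; the real calculator accounts for cluster layout and partition counts.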
14. Recover from a disk failure
Set the failed device's weight to 0, rebalance, and push the new ring to all nodes
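With the standard `swift-ring-builder` tool, the drain could look like this sketch (the device search value `d3` is an assumed example, meaning device ID 3):

```shell
# Drain the failed device: weight 0 means it receives no partitions.
swift-ring-builder object.builder set_weight d3 0
swift-ring-builder object.builder rebalance
# Then distribute the new object.ring.gz to every node (e.g. via rsync);
# replicators move the affected partitions to the remaining disks.
```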
15. Object availability & durability
[Diagram: the same two-region layout as slide 7: Region 0 (⅔ of the data) with Zone 0 and Zone 1, and Region 1 (⅓ of the data) with Zone 2]
17. Maintainability by Simplicity
● Standalone `swift-ring-builder` tool to manipulate the Ring
○ Uses `builder` files to keep architectural information about the cluster
○ Smartly assigns partitions to devices
○ Generates Ring files that are easy to check
● Processes on Swift nodes focus on ensuring that files are stored uncorrupted at the appropriate location
18. Splitting a running Swift Cluster
● Ensuring no data is lost
○ Move only 1 replica at a time
○ Small steps to limit the impact
○ Check for data corruption
○ Check data location
○ Rollback in case of failure
● Limiting the impact on performance
○ Availability of cluster resources
○ Load incurred by cluster being split
○ Small steps to limit the impact
○ Control nodes accessed by users
Natively available in Swift
19. Splitting a running Swift Cluster
(Same checklist as the previous slide, with emphasis on taking small steps to limit the impact)
Small steps - new in Swift 2.2!
20. Adding a new region
Add a new region smoothly by limiting the amount of data moved at each step
● Only practical since Swift 2.2
● The final weight of the new region should be at least ⅓ of the total cluster weight
Example of process:
1. Add devices to new region with a very low weight
2. Increase devices’ weights to store 5% of data in the new region
3. Progressively increase by steps of 5% the amount of data in the new region
More details: http://www.florentflament.com/blog/splitting-swift-cluster.html
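The process above can be sketched with `swift-ring-builder` (addresses, device names and weights are assumed examples; `r2z1-…` means region 2, zone 1):

```shell
# Step 1: add a device in the new region with a very low weight.
swift-ring-builder object.builder add r2z1-10.0.2.10:6000/sdb1 10
swift-ring-builder object.builder rebalance
# Steps 2-3: after each rebalance settles, raise the weight in
# small increments (~5% of the target) and rebalance again.
swift-ring-builder object.builder set_weight r2z1-10.0.2.10:6000/sdb1 50
swift-ring-builder object.builder rebalance
```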
22. Erasure coding
● Coming real soon now
● Instead of N copies of each object:
○ apply erasure coding to the object and split it into multiple fragments, for example 14
○ store them on different disks/nodes
○ the object can be rebuilt from any 10 fragments
■ Tolerates the loss of 4 fragments
● higher durability
■ Only ~40% storage overhead (compared to 200% for 3 replicas)
● much cheaper
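The overhead numbers follow directly from the fragment counts; a quick sketch using the 14/10 scheme from this slide (10 data + 4 parity fragments):

```python
def overhead(total_fragments, needed_fragments):
    """Extra storage as a fraction of the usable data size."""
    return (total_fragments - needed_fragments) / needed_fragments

ec = overhead(14, 10)      # erasure coding: 4 extra fragments per 10
replicas = overhead(3, 1)  # 3 full copies: 2 extra copies per 1
print(f"EC: {ec:.0%}, 3 replicas: {replicas:.0%}")  # EC: 40%, 3 replicas: 200%
```

So for the same usable capacity, the erasure-coded layout stores roughly a fifth of the redundant bytes that triple replication does, while tolerating more simultaneous losses (4 fragments vs. 2 replicas).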
23. Durability calculation
● More detailed calculation
○ Number of disks, servers, partitions
● Add erasure coding
● Include in Swift documentation?
● Community effort
○ Discussion started last Swift hackathon
■ NTT, SwiftStack, IBM, Seagate, Red Hat / eNovance
○ Ad-Hoc session on Thursday/Friday - join us!
24. Summary
● High availability, even if large parts of the cluster are not accessible
● Automatic failure correction ensures high durability and, depending on your cluster configuration, exceeds known industry standards
● Swift 2.2 (Juno release)
○ Even smoother and more predictable cluster upgrades
○ Storage policies allow fine-grained control over data placement
● Erasure coding will increase durability even further while lowering costs