This document discusses Ceph, an open-source distributed storage system. It provides concise overviews of key Ceph components and features, including RADOS, the reliable, autonomic distributed object store; the RADOS Gateway, which provides S3 and Swift compatibility; RBD for virtual block devices; and tiered storage. CRUSH is described as Ceph's algorithm for computing data placement across the cluster in a way that tolerates failures and enables highly parallel recovery. Benchmark results are shown for a Dell hardware configuration running Ceph.
1. Block and Object Store
Stefan Coetzee
Senior Technical Specialist
Computing Platforms – Technical Support Services
University of Cape Town
Thursday, 21 April 2016
5. RADOS Gateway Features
• REST-based object storage proxy
  • Uses RADOS to store objects
  • Stripes large RESTful objects across many RADOS objects
• API supports buckets & accounts
  • Usage accounting for billing
  • Compatible with S3 and Swift applications (see the sketch below)
• Zones and regions
  • Topologies similar to S3 and others
  • Global bucket and user/account namespace
• Cross-data-center synchronization
  • Asynchronously replicate buckets between regions
• Read affinity
  • Serve local data from the local DC
  • Dynamic DNS to send clients to the closest DC
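Since RGW is S3-compatible, a stock S3 client can be pointed directly at it. Below is a minimal sketch using boto3; the endpoint URL, bucket name, and credentials are placeholders, not values from this deck:

```python
import boto3

# Point a standard S3 client at the RGW endpoint instead of AWS.
# Endpoint and credentials are hypothetical placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:7480",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello from RGW")
print(s3.get_object(Bucket="demo-bucket", Key="hello.txt")["Body"].read())
```

Because RGW maps both APIs onto the same underlying RADOS objects, the same bucket is also reachable through a Swift client.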
10. RADOS
• Flat object namespace within each pool
• Rich object API (librados)
• Strong consistency
• Infrastructure aware, dynamic topology
• Hash-based placement (CRUSH)
• Direct client to server data path
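The librados data path is easiest to see from the official Python binding. A minimal sketch, assuming a reachable cluster, the default ceph.conf location, and an existing pool named data (the pool name is a placeholder):

```python
import rados

# Connect using the local ceph.conf and the default client keyring.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

# An I/O context is bound to a single pool; object names are flat within it.
ioctx = cluster.open_ioctx("data")
ioctx.write_full("greeting", b"hello RADOS")  # strongly consistent write
print(ioctx.read("greeting"))                 # read the object back

ioctx.close()
cluster.shutdown()
```

The client computes the object's placement itself (via CRUSH) and talks to the responsible OSD directly; no proxy sits in the data path.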
12. RADOS Components
OSDs
• 10s to 1000s in a cluster
• One per disk (or per SSD, RAID group, …)
• Serve stored objects to clients
• Intelligently peer for replication & recovery
Monitors
• Maintain cluster membership and state
• Provide consensus for distributed decision-making
• Small, odd number (e.g. 5)
• Not part of data path
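Clients obtain the cluster map from the monitors and then bypass them for I/O, but aggregate cluster state is still available to a thin client. A small sketch with the same Python binding (the configuration path is an assumption):

```python
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

# Aggregate usage across the OSDs, as tracked by the cluster.
stats = cluster.get_cluster_stats()
print("kB used:", stats["kb_used"], "objects:", stats["num_objects"])

cluster.shutdown()
```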
21. CRUSH: Declustered placement
• Each PG independently maps to a pseudorandom set of OSDs
• PGs that map to the same OSD generally have their replicas on different OSDs
• When an OSD fails, each PG it stored is generally re-replicated by a different OSD
  • Highly parallel recovery (see the toy model below)
  • Avoids the single-disk recovery bottleneck
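A toy model makes the declustering effect concrete. The sketch below is not CRUSH, just a hash-based stand-in with made-up cluster dimensions, but it shows how the PGs stored on a failed OSD are re-replicated by many peers in parallel:

```python
import hashlib

NUM_OSDS, NUM_PGS, REPLICAS = 12, 256, 3  # hypothetical cluster dimensions

def place(pg: int) -> list[int]:
    """Map a PG to a deterministic, pseudorandom set of distinct OSDs."""
    osds, i = [], 0
    while len(osds) < REPLICAS:
        h = hashlib.md5(f"{pg}-{i}".encode()).digest()
        osd = int.from_bytes(h[:4], "big") % NUM_OSDS
        if osd not in osds:
            osds.append(osd)
        i += 1
    return osds

# Fail OSD 0 and count which surviving OSDs hold replicas of its PGs:
# recovery work fans out across nearly every device instead of one.
peers = set()
for pg in range(NUM_PGS):
    mapping = place(pg)
    if 0 in mapping:
        peers.update(o for o in mapping if o != 0)
print(f"PGs stored on OSD 0 recover from {len(peers)} different peer OSDs")
```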
22. CRUSH: Dynamic Data Placement
• Pseudo-random placement algorithm
  • Fast calculation, no lookup
  • Repeatable, deterministic
• Statistically uniform distribution
• Stable mapping
  • Limited data migration on change
• Rule-based configuration
  • Infrastructure topology aware
  • Adjustable replication
  • Weighted devices (different sizes; see the weighted-draw sketch below)
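In the same toy spirit, weighted, deterministic selection can be illustrated with a straw2-style draw: each device gets a hash-derived draw scaled by its weight, and the highest draw wins. This mimics the idea behind CRUSH's straw2 buckets rather than reproducing Ceph's code, and the device weights are invented:

```python
import hashlib
import math

# Hypothetical devices; weight is proportional to capacity.
OSDS = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 2.0}

def draw(pg: int, osd: str) -> float:
    """Deterministic per-(PG, OSD) draw, scaled by device weight."""
    h = hashlib.md5(f"{pg}:{osd}".encode()).digest()
    u = (int.from_bytes(h[:8], "big") + 1) / 2**64  # uniform in (0, 1]
    return math.log(u) / OSDS[osd]                  # straw2-style scaling

def primary(pg: int) -> str:
    # No lookup table: any client with the map computes the same answer.
    return max(OSDS, key=lambda osd: draw(pg, osd))

counts = {osd: 0 for osd in OSDS}
for pg in range(10000):
    counts[primary(pg)] += 1
print(counts)  # osd.2 should win roughly twice as often as osd.0 or osd.1
```

Because each (PG, OSD) draw is independent, adding or removing a device only remaps the PGs whose winning draw changes, which is the "limited data migration on change" property above.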
24. 2 ways to cache
• Within each OSD
  • Combining SSD & HDD under each OSD
  • Make localized promote/demote decisions
  • Leverage existing tools: dm-cache, bcache, FlashCache
• A separate cache tier
  • Cache on separate devices/nodes
  • Different hardware for different tiers
  • Slow nodes for cold data
  • High-performance nodes for hot data
  • Scale each tier independently
26. RADOS Tiering Principles
• Each tier is a RADOS pool
  • May be replicated or erasure coded
• Tiers are durable
  • e.g. replicated across SSDs in multiple hosts
• Each tier has its own CRUSH policy
  • e.g. map the cache pool to SSD devices/hosts only
• librados adapts to the tiering topology
  • Transparently directs requests accordingly
  • No changes to RBD, RGW, CephFS, etc. (see the setup sketch below)
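Cache tiers are attached with monitor commands. The sketch below drives them through python-rados' mon_command, assuming existing pools named base (slow) and cache (fast); the JSON fields mirror the ceph osd tier CLI arguments and should be verified against your Ceph release:

```python
import json
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

def mon(cmd: dict) -> None:
    # mon_command takes a JSON-encoded command; returns (ret, outbuf, outs).
    ret, _, errs = cluster.mon_command(json.dumps(cmd), b"")
    print(cmd["prefix"], "->", ret, errs)

# Attach the fast pool as a writeback cache in front of the base pool,
# then route client traffic through it transparently.
mon({"prefix": "osd tier add", "pool": "base", "tierpool": "cache"})
mon({"prefix": "osd tier cache-mode", "pool": "cache", "mode": "writeback"})
mon({"prefix": "osd tier set-overlay", "pool": "base", "overlaypool": "cache"})

cluster.shutdown()
```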
32. Tiering Agent
• Each PG has an internal tiering agent (modeled in the sketch below)
  • Manages the PG according to administrator-defined policy
• Flush dirty objects
  • When the pool reaches its target dirty ratio
  • Tries to select cold objects
  • Marks objects clean once they have been written back to the base pool
• Evict (delete) clean objects
  • Greater "effort" as the cache pool approaches its target size
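The agent's behaviour can be caricatured in a few lines. The sketch below is a deliberately simplified model of the flush/evict loop, with invented thresholds and an invented object record; it is not Ceph's implementation:

```python
import random
from dataclasses import dataclass

TARGET_DIRTY_RATIO = 0.4  # invented: flush once 40% of objects are dirty
TARGET_SIZE = 1000        # invented: target cache pool size, in objects

@dataclass
class Obj:
    name: str
    dirty: bool
    last_access: float    # lower = colder; cold objects go first

def agent_pass(cache: list[Obj]) -> list[Obj]:
    """One agent pass: flush cold dirty objects, then evict cold clean ones."""
    # Flush: while the dirty ratio exceeds the target, "write back" the
    # coldest dirty objects to the base pool and mark them clean.
    dirty = sorted((o for o in cache if o.dirty), key=lambda o: o.last_access)
    while dirty and len(dirty) / len(cache) > TARGET_DIRTY_RATIO:
        dirty.pop(0).dirty = False  # now clean; the base pool holds a copy
    # Evict: while the pool exceeds its target size, delete the coldest
    # clean objects (their data survives in the base pool).
    kept, over = [], len(cache) - TARGET_SIZE
    for o in sorted(cache, key=lambda o: o.last_access):
        if over > 0 and not o.dirty:
            over -= 1               # evicted
        else:
            kept.append(o)
    return kept

# Demo on a synthetic cache of 1500 objects, roughly 60% dirty.
cache = [Obj(f"o{i}", random.random() < 0.6, random.random()) for i in range(1500)]
cache = agent_pass(cache)
print(len(cache), "objects kept;", sum(o.dirty for o in cache), "still dirty")
```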