From SUSECon 2015: Smooth integration of emerging Software Defined Storage technologies into the traditional data center, using Fibre Channel and iSCSI as key values for success.
TUT18972: Unleash the power of Ceph across the Data Center
1. Unleash the Power of Ceph
Across the Data Center
TUT18972: FC/iSCSI for Ceph
Ettore Simone
Senior Architect
Alchemy Solutions Lab
ettore.simone@alchemy.solutions
2. 2
Agenda
⢠Introduction
⢠The Bridge
⢠The Architecture
⢠Use Cases
⢠How It Works
⢠Some Benchmarks
⢠Some Optimizations
⢠Q&A
⢠Bonus Tracks
4. 4
About Ceph
"Ceph is a distributed object store and file system
designed to provide excellent performance, reliability
and scalability." (http://ceph.com/)
FUT19336 - SUSE Enterprise Storage Overview and Roadmap
TUT20074 - SUSE Enterprise Storage Design and Performance
6. 6
Some facts
Common data center storage solutions are built
mainly on top of Fibre Channel (yes, and NAS too).
Source: Wikibon Server SAN Research Project 2014
7. 7
Is the storage mindset changing?
New/Cloud
– Micro-services Composed Applications
– NoSQL and Distributed Database (lazy commit, replication)
– Object and Distributed Storage
SCALE-OUT
Classic
– Traditional Application → Relational DB → Traditional Storage
– Transactional Process → Commit on DB → Commit on Disk
SCALE-UP
8. 8
Is the storage mindset changing? No.
New/Cloud
– Micro-services Composed Applications
– NoSQL and Distributed Database (lazy commit, replication)
– Object and Distributed Storage
Natural playground of Ceph
Classic
– Traditional Application → Relational DB → Traditional Storage
– Transactional Process → Commit on DB → Commit on Disk
Where we want to introduce Ceph!
9. 9
Is the new kid on the block so noisy?
Ceph is cool but I cannot rearchitect my storage!
And what about my shiny big disk arrays?
I already have N protocols, why another one?
<Add your own fear here>
10. 10
Our goal
How to achieve a non-disruptive introduction of Ceph
into a traditional storage infrastructure?
[Diagram: SAN (SCSI over FC), NAS (NFS/SMB/iSCSI over Ethernet), Ceph (RBD over Ethernet)]
11. 11
How to let Ceph coexist happily in your
datacenter with the existing neighborhood
(traditional workloads, legacy servers, FC switches, etc.)
14. 14
Back to our goal
How to achieve a non-disruptive introduction of Ceph
into a traditional storage infrastructure?
[Diagram: RBD alongside SAN and NAS]
15. 15
Linux-IO Target (LIO™)
LIO is the most common open-source SCSI target in
modern GNU/Linux distros:
Fabric modules: FC, FCoE, FireWire, iSCSI, iSER, SRP, loop, vHost
Backstores: FILEIO, IBLOCK, RBD, pSCSI, RAMDISK, TCMU
[Diagram: fabric modules ↔ LIO core ↔ backstores, in kernel space]
26. 26
Smooth transition
Native migration of SAN LUNs to RBD volumes helps with
migration, conversion and coexistence:
[Diagram: traditional workloads reach Ceph through the SAN gateway,
while new private cloud workloads use RBD natively]
30. 30
Storage replacement
No drama at the end of life/support of a traditional
storage array:
[Diagram: the gateway in front of Ceph serves both the traditional
and the new workloads once the old array is retired]
33. 33
Ceph and Linux-IO
SCSI commands from the fabrics are handled by the LIO
core, configured using targetcli or directly via configfs,
and proxied to the target block device through the
corresponding backstore module.
[Diagram: clients → LIO (kernel space, /sys/kernel/config/target) → Ceph cluster]
â kernel space â
34. 34
Enable QLogic HBAs in target mode
# modprobe qla2xxx qlini_mode="disabled"
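To make the target-only mode survive reboots, the module option can be persisted; a minimal sketch, assuming the driver is loaded early at boot (the file name is illustrative):
echo 'options qla2xxx qlini_mode="disabled"' > /etc/modprobe.d/qla2xxx-target.conf
mkinitrd    # rebuild the initrd so the option is applied at boot (SUSE; other distros use dracut)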
35. 35
Identify and enable HBAs
# cat /sys/class/scsi_host/host*/device/fc_host/host*/port_name |
    sed -e 's/../:&/g' -e 's/:0x://'
# targetcli qla2xxx/ create ${WWPN}
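A hedged sketch that wraps the two steps above into one loop, creating a target for every local HBA port (the variable name is illustrative):
for WWPN in $(cat /sys/class/scsi_host/host*/device/fc_host/host*/port_name |
              sed -e 's/../:&/g' -e 's/:0x://'); do
    targetcli qla2xxx/ create ${WWPN}    # one qla2xxx target per HBA port
done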
36. 36
Map RBDs and create backstores
# rbd map -p ${POOL} ${VOL}
# targetcli backstores/rbd create name="${POOL}-${VOL}" dev="${DEV}"
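A hedged sketch that maps every image in a pool and registers each one as an rbd backstore (the pool name and the rbd showmapped column layout are assumptions):
POOL=rbd    # illustrative pool name
for VOL in $(rbd ls -p ${POOL}); do
    rbd map -p ${POOL} ${VOL}
    # Look up the resulting /dev/rbdX device for this pool/image pair
    DEV=$(rbd showmapped | awk -v p="${POOL}" -v i="${VOL}" '$2 == p && $3 == i {print $5}')
    targetcli backstores/rbd create name="${POOL}-${VOL}" dev="${DEV}"
done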
37. 37
Create LUNs connected to RBDs
# targetcli qla2xxx/${WWPN}/luns create /backstores/rbd/${POOL}-${VOL}
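The same loop can export every backstore as a LUN, and the result can then be reviewed from targetcli; a sketch assuming the objects created above:
for VOL in $(rbd ls -p ${POOL}); do
    targetcli qla2xxx/${WWPN}/luns create /backstores/rbd/${POOL}-${VOL}
done
targetcli ls qla2xxx/    # review the resulting target/LUN tree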
38. 38
"Zoning" to filter access with ACLs
# targetcli qla2xxx/${WWPN}/acls create ${INITIATOR} true
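On the client side, once its initiator WWPN is in the ACL, the new LUNs can be discovered and the paths verified; a hedged sketch (rescan-scsi-bus.sh ships with sg3_utils and may live elsewhere on some distros):
rescan-scsi-bus.sh    # rescan the FC bus for the newly exported LUNs
multipath -r          # rebuild the multipath maps
multipath -ll         # check that all paths to the RBD-backed LUN are active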
40. 40
First of all...
This solution is NOT a drop-in replacement for a SAN or a
NAS (at the moment, at least!).
The main focus is to identify how to minimize the
overhead from native RBD to FC/iSCSI.
41. 41
Raw performance estimate on 15K RPM disks
Physical disk IOPS → Ceph IOPS:
– 4K RND Read: 193 x 24 = 4,632
– 4K RND Write: 178 x 24 / 3 (replicas) = 1,424, / 3 (no SSD journal) ≈ 475
Physical disk throughput → Ceph throughput:
– 512K RND Read: 108 MB/s x 24 ≈ 2,600 MB/s
– 512K RND Write: 105 MB/s x 24 / 3 (replicas) = 840, / 2 (no SSD journal) = 420 MB/s
NOTE:
– 24 OSDs and 3 replicas per pool
– No SSD for journals (so ~1/3 of the IOPS and ~1/2 of the bandwidth for writes)
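The estimate reduces to a simple rule of thumb; a sketch of the write-IOPS arithmetic above (figures from the slide, penalties are rough approximations):
DISK_IOPS=178; OSDS=24; REPLICAS=3; JOURNAL_PENALTY=3
echo $(( DISK_IOPS * OSDS / REPLICAS / JOURNAL_PENALTY ))    # ~475 sustained 4K random write IOPS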
44. 46
What we are working on
Centralized management with GUI/CLI:
– Deploy MON/OSD/GW nodes
– Manage nodes/disks/pools/maps/LIO
– Monitor cluster and node status
Reacting to failures
Using librados/librbd with TCMU for the backstore
46. 48
More integration with existing tools
Extend lrbd to accept multiple fabrics:
– iSCSI (native support)
– FC
– FCoE
Linux-IO:
– Use of librados via TCMU
48. 50
I/O schedulers matter!
On OSD nodes:
– deadline on physical disks (cfq if the scrub thread is deprioritized with ionice)
– noop on RAID-backed disks
– read_ahead_kb=2048
On gateway nodes:
– noop on mapped RBD devices
On client nodes:
– noop or deadline on multipath devices
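A hedged sketch of applying these settings at runtime through sysfs (device names are illustrative; a udev rule or a tuned profile would make them persistent):
echo deadline > /sys/block/sdb/queue/scheduler      # OSD node: rotational data disk
echo 2048 > /sys/block/sdb/queue/read_ahead_kb
echo noop > /sys/block/rbd0/queue/scheduler         # gateway node: mapped RBD device
echo noop > /sys/block/dm-0/queue/scheduler         # client node: multipath device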
50. 52
Design optimizations
⢠SSD on monitor nodes for LevelDB: decrease CPU,
memory usage and time during recovery
⢠SSD Journal decrease I/O latency: 3x IOPS and better
throughput
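A hedged sketch of provisioning an OSD with its journal on a separate SSD, assuming the ceph-disk tool of that release and illustrative device names:
ceph-disk prepare /dev/sdd /dev/sdb    # data on the rotational disk, journal on the SSD
ceph-disk activate /dev/sdd1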
55. 57
Business Continuity architecture
Low-latency connected sites:
WARNING: To improve availability, a third site hosting a
quorum node is highly encouraged.
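For such a stretched cluster, replica placement across the two sites is expressed in the CRUSH map; a hedged sketch of modelling the sites, with illustrative bucket and host names:
ceph osd crush add-bucket dc1 datacenter
ceph osd crush add-bucket dc2 datacenter
ceph osd crush move dc1 root=default
ceph osd crush move dc2 root=default
ceph osd crush move node1 datacenter=dc1
ceph osd crush move node2 datacenter=dc2
A matching CRUSH rule then has to pick two datacenters first and hosts within them (the map can be edited with ceph osd getcrushmap and crushtool).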
56. 58
Disaster Recovery architecture
High latency or disconnected sites:
As in the OpenStack Ceph plug-in for Cinder Backup:
# rbd export-diff pool/image@end --from-snap start - |
    ssh -C remote rbd import-diff - pool/image
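A hedged sketch of a periodic incremental cycle built on the same primitives (pool, image, snapshot and host names are illustrative; the very first transfer is a plain full export/import):
POOL=rbd; IMG=vol1
PREV=backup-$(date -d yesterday +%F); CUR=backup-$(date +%F)
rbd snap create ${POOL}/${IMG}@${CUR}
rbd export-diff --from-snap ${PREV} ${POOL}/${IMG}@${CUR} - |
    ssh -C remote rbd import-diff - ${POOL}/${IMG}
rbd snap rm ${POOL}/${IMG}@${PREV}    # only the newest common snapshot is needed next time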
57. 59
KVM Gateways
⢠VT-x Physical passthrough of QLogic
⢠RBD Volumes as VirtIO devices
⢠Linux-IO iblock backstore
71. Unpublished Work of SUSE LLC. All Rights Reserved.
This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE LLC.
Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their
assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated,
abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE.
Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.
General Disclaimer
This document is not to be construed as a promise by any participating company to develop, deliver, or market a
product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making
purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and
specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The
development, release, and timing of features or functionality described for SUSE products remains at the sole discretion
of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time,
without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this
presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-
party trademarks are the property of their respective owners.