4. What is GlusterFS?
GlusterFS is open-source software for creating a scale-out distributed
filesystem on top of commodity x86_64 servers.
– It aggregates local storage of many servers into a single logical volume.
– You can extend the volume just by adding more servers.
GlusterFS runs on top of Linux. You can use it wherever you can use Linux.
– Physical/virtual machines in your data center.
– Linux VM on public clouds.
GlusterFS provides a wide variety of APIs.
– FUSE mount (using the native client)
– NFSv3 (supporting distributed NFS locking)
– CIFS (using libgfapi native access from Samba)
– REST API (compatible with OpenStack Swift)
– Native application library (providing POSIX-like system calls)
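For example, the same volume can be reached through the FUSE native client or the built-in NFSv3 server (host, volume and mount points here are illustrative):
# mount -t glusterfs gluster01:/vol01 /mnt/glusterfs
# mount -t nfs -o vers=3 gluster01:/vol01 /mnt/nfs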
5. Brief history of GlusterFS
2005: The early days of Gluster Inc.
2011: Red Hat acquisition of Gluster Inc.
2012: GlusterFS 3.3
2013: GlusterFS 3.4
2014: GlusterFS 3.5
http://www.slideshare.net/johnmarkorg/gluster-where-weve-been-a-history
6. Architecture overview
The standard filesystem (typically XFS) of each storage node is used as the backend device
of the logical volume.
– Each file in the volume is physically stored in one of the storage nodes' filesystems, as the
same plain file seen from the client.
The hash value of the file name is used to decide the node that stores it.
– No metadata server storing file locations is used in GlusterFS.
[Diagram] The GlusterFS volume is seen by the client as a single filesystem mounted on a
local directory tree; files (file01, file02, file03) are distributed across the local
filesystems of the storage nodes.
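As an illustration of this (file and path names are hypothetical), listing the volume on a client shows every file, while listing a brick directory on one storage node shows only the files whose hash landed there, stored as plain files:
[root@client ~]# ls /vol01
file01 file02 file03
[root@gluster01 ~]# ls /data/brick01
file01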
7. Hierarchy structure consisting of Node / Brick / Volume
A volume is created as a "bundle" of bricks provided by storage nodes.
[Diagram] Volume vol01 bundles bricks such as /data/brick01 and /data/brick02 (each brick is
just a directory) from Node01, Node02 and Node03, on filesystems mounted on /data.
A single node can provide multiple bricks to create multiple volumes.
You don't need to use the same number of bricks nor the same directory name on each node.
You can add/remove bricks to extend/reduce the size of volumes, as shown in the sketch below.
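A minimal sketch of assembling such a volume with the gluster CLI (node names and brick paths are illustrative):
[root@node01 ~]# gluster volume create vol01 node01:/data/brick01 node02:/data/brick01 node03:/data/brick01
[root@node01 ~]# gluster volume start vol01
[root@node01 ~]# gluster volume add-brick vol01 node01:/data/brick02   # extend the volume later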
10. DHT: Distributed Hash Table
A distributed hash table is:
– A rule for deciding the brick in which to store a file, based on the hash value of its filename.
– More precisely, it's just a table of bricks and their corresponding hash ranges.
[Diagram] The client calculates the hash value of the filename (e.g. 127 for file01), and the
file is stored in the brick responsible for that hash value.
The actual hash length is 32 bits: 0x00000000 - 0xFFFFFFFF.
DHT (Distributed Hash Table):
             Brick1    Brick2    Brick3    ...
Hash range   0-99      100-199   200-299
11. DHT structure in GlusterFS
Hash tables are created per directory within a single volume.
– Two files with the same name (in different directories) may be placed in different bricks.
– By assigning different hash ranges to different directories, files are distributed more evenly.
The hash range of each brick (directory) is recorded in an extended attribute of the
directory.
Brick1
[root@gluster01 ~]# getfattr -d -m . /data/brick01/dir01
getfattr: Removing leading '/' from absolute path names
# file: data/brick01/dir01
trusted.gfid=0shk2IwdFdT0yI1K7xXGNSdA==
trusted.glusterfs.10d3504b-7111-467d-8d4f-d25f0b504df6.xtime=0sT+vTRwADqyI=
trusted.glusterfs.dht=0sAAAAAQAAAAB//////////w==
          Brick1    Brick2    Brick3    ...
/dir01    0-99      100-199   200-299   ...
/dir02    100-199   400-499   300-399   ...
/dir03    500-599   200-299   100-199   ...
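The same attribute can also be dumped in hex; the value encodes the layout version and the start/stop of the 32-bit hash range this brick holds for the directory (the exact bytes depend on the layout):
[root@gluster01 ~]# getfattr -n trusted.glusterfs.dht -e hex /data/brick01/dir01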
12. How the GlusterFS client recognizes the hash table
# mount -t glusterfs gluster01:/vol01 /vol01
[Diagram] At mount time, the client learns that volume "vol01" is provided by
gluster01-gluster04.
13. How the GlusterFS client recognizes the hash table
# cat /vol01/dir01/file01
[Diagram] On the first access to dir01, each storage node (gluster01-gluster04) reports the
hash range its brick holds for dir01 ("The hash range of dir01 is xxx", "The hash range of
dir01 is yyy", ...).
14. How the GlusterFS client recognizes the hash table
# cat /vol01/dir01/file01
[Diagram] The client collects these responses and constructs the whole hash table for dir01
in memory:
          Brick1    Brick2    Brick3   ...
dir01     0-99      100-199   200-299
15. Translator modules
GlusterFS works with multiple translator modules.
– Some modules run on clients, others run on servers.
– Each module has its own role.
– Translator modules are built as shared libraries.
– Custom modules can be added as plug-ins.
[root@gluster01 ~]# ls -l /usr/lib64/glusterfs/3.3.0/xlator/
total 48
drwxr-xr-x 2 root root 4096 Jun 16 15:25 cluster
drwxr-xr-x 2 root root 4096 Jun 16 15:25 debug
drwxr-xr-x 2 root root 4096 Jun 16 15:25 encryption
drwxr-xr-x 2 root root 4096 Jun 16 15:25 features
drwxr-xr-x 2 root root 4096 Jun 16 15:25 mgmt
drwxr-xr-x 2 root root 4096 Jun 16 15:25 mount
drwxr-xr-x 2 root root 4096 Jun 16 15:25 nfs
drwxr-xr-x 2 root root 4096 Jun 16 15:25 performance
drwxr-xr-x 2 root root 4096 Jun 16 15:25 protocol
drwxr-xr-x 2 root root 4096 Jun 16 15:25 storage
drwxr-xr-x 2 root root 4096 Jun 16 15:25 system
drwxr-xr-x 3 root root 4096 Jun 16 15:25 testing
cluster: DHT, replication, etc.
features: quota, file locking, etc.
performance: caching, read-ahead, etc.
storage: physical I/O
16. Typical combination of translator modules
Client modules (*1), stacked top to bottom:
– io-stats: recording statistics information
– md-cache: metadata caching
– quick-read / io-cache / read-ahead / write-behind: data caching
– dht: handling DHT
– replicate-1, replicate-2: replication
– client-1 .. client-4: communication with servers
Server modules (*2), one stack per brick, top to bottom:
– server: communication with clients
– marker, index
– io-threads: activating I/O threads
– locks: file locking
– access-control: ACL management
– posix: physical access to bricks
(*1) Defined in /var/lib/glusterd/vols/<Vol>/<Vol>-fuse.vol (*2) Defined in /var/lib/glusterd/vols/<Vol>/<Vol>.<Node>.<Brick>.vol
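For reference, a client-side vol file wires these translators into a stack, each translator naming its subvolumes. A heavily abbreviated, illustrative excerpt (volume names and options vary by version and configuration):
[root@gluster01 ~]# cat /var/lib/glusterd/vols/vol01/vol01-fuse.vol
volume vol01-client-0
  type protocol/client
  option remote-host gluster01
  option remote-subvolume /data/brick01
end-volume
volume vol01-dht
  type cluster/distribute
  subvolumes vol01-client-0 vol01-client-1
end-volume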
17. The past wish list for GlusterFS
Volume snapshot (master branch)
File snapshot (GlusterFS 3.5)
On-wire compression / decompression (GlusterFS 3.4)
Disk encryption (GlusterFS 3.4)
Journal-based distributed geo-replication (GlusterFS 3.5)
Erasure coding (not yet...)
Integration with OpenStack
etc...
http://www.gluster.org/
19. Four locations where you need a storage system in OpenStack
Swift (object store for application data): typically an original distributed object store
built on commodity x86_64 servers is used.
Glance (template images): typically Swift or NFS storage is used.
Nova Compute (OS disks of VM instances): typically local storage of the compute nodes is used.
Cinder (block volumes for application data): typically external hardware storage (iSCSI) is used.
20. Using GlusterFS for Glance backend
Just use a GlusterFS volume instead of local storage. So simple.
[Diagram] The Glance server stores images on a GlusterFS volume; the GlusterFS cluster
manages scalability, redundancy and consistency.
This is actually being used in many production clusters, for example as sketched below.
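A minimal sketch of this setup, assuming a volume named glance-vol and the stock filesystem store (paths follow RHEL-style defaults):
[root@glance ~]# mount -t glusterfs gluster01:/glance-vol /var/lib/glance/images
[root@glance ~]# grep filesystem_store_datadir /etc/glance/glance-api.conf
filesystem_store_datadir = /var/lib/glance/images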
21. How Nova and Cinder work together
In a typical configuration, block volumes are created as LUNs in iSCSI storage boxes.
Cinder operates on the management interface of the storage box through the corresponding
driver to create LUNs.
Nova Compute attaches the LUN to the host Linux using the software iSCSI initiator; it is
then attached to the VM instance through the KVM hypervisor.
[Diagram] The VM instance sees /dev/vdb (virtual disk) via Linux KVM; the host sees /dev/sdX
(iSCSI LUN) via the iSCSI software initiator talking to the iSCSI target on the storage box.
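From the user's point of view, this whole chain is driven by two commands (era-appropriate CLI; the IDs are placeholders):
[root@controller ~]# cinder create --display-name vol01 10    # a 10GB LUN is created on the storage box
[root@controller ~]# nova volume-attach <instance-id> <volume-id> /dev/vdb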
22. Using the NFS driver
Cinder also provides an NFS driver which uses an NFS server as the storage backend.
– The driver simply mounts the NFS exported directory and creates disk image files in it.
Compute nodes use NFS mounts to access the image files.
[Diagram] Cinder and each Nova Compute node NFS-mount the NFS server; the VM instance's
/dev/vdb is backed by a virtual disk image file via Linux KVM.
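A hedged sketch of the Cinder-side configuration for the NFS driver (option names as of that era; the shares file lists the NFS exports to mount):
[root@cinder ~]# grep -E 'volume_driver|nfs_shares_config' /etc/cinder/cinder.conf
volume_driver = cinder.volume.drivers.nfs.NfsDriver
nfs_shares_config = /etc/cinder/nfs_shares
[root@cinder ~]# cat /etc/cinder/nfs_shares
nfsserver01:/export/cinder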
23. Using the GlusterFS driver for Cinder
There is a driver for the GlusterFS distributed filesystem, too.
– Currently it uses the FUSE mount mechanism. This will be replaced with a more optimized
mechanism (libgfapi) which bypasses the FUSE layer.
[Diagram] Cinder and each Nova Compute node FUSE-mount the GlusterFS cluster; the virtual
disk image files live on the volume.
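The GlusterFS driver is configured the same way (illustrative sketch; the shares file lists <host>:/<volume> entries to FUSE-mount):
[root@cinder ~]# grep -E 'volume_driver|glusterfs_shares_config' /etc/cinder/cinder.conf
volume_driver = cinder.volume.drivers.glusterfs.GlusterfsDriver
glusterfs_shares_config = /etc/cinder/glusterfs_shares
[root@cinder ~]# cat /etc/cinder/glusterfs_shares
gluster01:/cinder-vol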
24. GlusterFS shared volume for Nova Compute
The same approach works for Nova Compute: you can store a running VM's OS image on a
locally mounted GlusterFS volume.
[Diagram] Each Nova Compute node FUSE-mounts the GlusterFS cluster; the VM instance's
/dev/vda virtual disk and the template image are files on the shared volume.
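A minimal sketch: mount a shared volume at Nova's instances directory on every compute node (the volume name is illustrative); since all hosts then see the same image files, this also helps live migration:
[root@compute01 ~]# mount -t glusterfs gluster01:/nova-vol /var/lib/nova/instances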
25. The challenge in Cinder/Nova Compute integration
The FUSE mount/file-based architecture is not well suited to the workload of VM disk
images (small random I/O).
How can we improve it?
26. The challenge in Cinder/Nova Compute integration
The FUSE mount/file-based architecture is not well suited to the workload of VM disk
images (small random I/O).
How can we improve it?
Using Ceph? CENSORED
http://www.inktank.com/
27. GlusterFS way for qemu integration
"libgfapi" is an application library with which user applications can directly access a
GlusterFS volume via the native protocol.
– It reduces the overhead of the FUSE architecture.
Now qemu is integrated with libgfapi so that it can directly access disk image files
placed on a GlusterFS volume.
– This feature is available since the Havana release.
[Diagram] Access path comparison: FUSE mount vs. direct libgfapi.
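With this integration, qemu addresses an image by a gluster:// URL instead of a mounted path, for example (volume and image names are illustrative):
[root@compute01 ~]# qemu-img create -f qcow2 gluster://gluster01/testvol01/vm01.qcow2 10G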
28. Architecture of Swift
Proxy servers: handle REST requests from clients, with an authentication server.
Account servers: maintain mappings between accounts and containers (DB).
Container servers: maintain lists and ACLs of objects in each container (DB).
Object servers: store object contents in the filesystem.
29. Architecture of GlusterFS with Swift API
Proxy / account / container / object servers run "all in one" on a server that is also a
GlusterFS client, with an authentication server.
One GlusterFS volume is used for one account; the volume for each account is locally
mounted at /mnt/gluster-object/AUTH_<account name>.
Account/container/object server modules retrieve the required information directly from
the locally mounted volumes.
GlusterFS manages scalability, redundancy and consistency.
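On such a node, each account's volume appears as an ordinary FUSE mount (abbreviated, illustrative output):
[root@swift01 ~]# mount | grep gluster-object
gluster01:/testvol01 on /mnt/gluster-object/AUTH_test type fuse.glusterfs (rw)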
31. Using libgfapi with RHEL 6 / CentOS 6
Install the development tools and the libgfapi library from the EPEL repository.
Build your application with libgfapi.
That's all!
The pseudo-POSIX I/O system calls are listed in the header file.
– https://github.com/gluster/glusterfs/blob/release-3.5/api/src/glfs.h
– file streams and mmap are not there :-(
# yum install http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
# yum groupinstall "Development Tools"
# yum install glusterfs-api-devel
# gcc hellogluster.c -lgfapi
# ./a.out
32. "Hello, World!" with libgfapi
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <glusterfs/api/glfs.h>
int main (int argc, char** argv) {
const char *gfserver = "gluster01";
const char *gfvol = "testvol01";
int ret;
glfs_t *fs;
glfs_fd_t *fd;
fs = glfs_new(gfvol);
glfs_set_volfile_server(fs, "tcp", gfserver, 24007);
ret = glfs_init (fs);
if (ret) {
printf( "Failed to connect server/volume: %s/%sn", gfserver, gfvol );
exit(ret);
}
char *greet = "Hello, Gluster!n";
fd = glfs_creat(fs, "greeting.txt", O_RDWR, 0644);
glfs_write(fd, greet, strlen(greet), 0);
glfs_close(fd);
return 0;
}
type struct representing the volume (filesystem) "testvol01"
Connecting to the volume.
Opening a new file on the volume.
Write and close the file.