Spectrum Scale Best Practices by Olaf Weiser

Spectrum Scale
news & other …
...beautiful things
SpectrumScale 2017
- Ehingen-
olaf.weiser@de.ibm.com
IBM Deutschland
SpectrumScale ESCC Germany

IBM Systems
Agenda
Handling node failures
CCR restore
Performance & Auto-tuning
NFSv4 ACL
ESS for SAP HANA workloads
ubiquity

IBM Systems
Spectrum Scale – being available
Q Q
Q
Q
Cluster is up n running, when
→ majority of Quorum nodes are up n running
→ CCR: configuration changes rely on availability of Q nodes
→ a Quorum node is a special node role

IBM Systems
Q Q
Q
Q
→ a cluster manager (CM) is a special node
CM
Q

IBM Systems
Q Q
Q
Q
→ the cluster manager (CM) is a special node
→ a File system manager (FM) is a special node
CM
Q
FM

IBM Systems
Simple node failure
Q Q
Q
Q
CM
Q
FM
→ the cluster manager (CM) is a special node
→ a File system manager (FM) is a special node

IBM Systems
Simple node failure
CM
FM
→ failureDetectionTime (default 35 seconds)
→ leaseRecoveryWait (default 35 seconds)
→ leaseDuration is set equal to failureDetectionTime
→ missedPingTimeout is set equal to recoveryWait minus a few seconds*
*to allow time for the cluster manager to run the node failure protocol before the recoveryWait runs out.
a) The last time the failed node renewed its lease
b) The cluster manager detects that the lease has
expired, and starts pinging the node
c) The cluster manager decides that the node is
dead and runs the node failure protocol
d) The file system manager starts log recovery

IBM Systems
failureDetectionTime
[root@beer1 beer]# tail -f /var/adm/ras/mmfs.log.latest
[….]
2017-03-01_08:31:28.934+0100: [N] Node 10.0.1.13 (beer3) lease renewal is overdue. Pinging to check if it is alive
Definition: The number of seconds it will take the GPFS cluster manager to detect that a node has not renewed it's disk lease.
If a node does does not renew it's disk lease in failureDetectionTime seconds, the GPFS cluster manger will start to ping the node
to determine if the node has failed.
Default Value: 35 - Minimum and Maximum Value: 10 and 300
hint:
clusters with a large number of nodes, FDT may be increased to reduce the number of lease renewal messages received by the GPFS cluster
manager. Example: 5000 nodes / 35 seconds = 142 lease renewals / second.
From experience, if the value of failureDetectionTime is increased, it is sometimes increased to 60 or 120 seconds.
Notes: GPFS must be down on all nodes to change the value of failureDetectionTime
mmchconfig failureDetectionTime = xxx
mmfs.log.latest

IBM Systems
FailureDetectionTime (cont.)
https://www.ibm.com/developerworks/community/blogs/storageneers/entry/Modern_Storage_Options_for_VMware?lang=en
..be careful to lower the FDT value ….
→ means , lease renewal is around every 5 seconds
→ network issues (Spanning Tree Protocol or MISTP ) can take several seconds
Have in mind:

IBM Systems
Cluster manager node failure
Q Q
Q
Q
CM
Q FM
a) The last time the old cluster manager answered a lease renewal request from one of the
other quorum nodes
b) The last time a quorum node sent a new lease request to the old cluster manager.
This is also the last time the old cluster manager could have renewed its own lease.
c) A quorum node detects that it is unable to renew its lease and starts pinging the old cluster mgr
d) The quorum node decides that the old cluster manager is dead and runs an election to take
over as new cluster manager.
e) The election completes and the new cluster manager runs the node failure protocol.
f) The file system manager starts log recovery

IBM Systems
Network issues between a some other nodes
Q Q
Q
Q
CM
Q FM
2017-03-04_10:08:49.510+0100: [N] Request sent to 10.0.1.11 (beer1) to expel 10.0.1.12 (beer2) from cluster beer1
2017-03-04_10:08:49.512+0100: [N] This node will be expelled from cluster beer1 due to expel msg from 10.0.1.13 (beer3)
on node2 (beer2)/* We have evidence that both nodes are still up. In this case, give
preference to
1. quorum nodes over non-quorum nodes
2. local nodes over remote nodes
3. manager-capable nodes over non-manager-capable nodes
4. nodes managing more FSs over nodes managing fewer FSs
5. NSD server over non-NSD server
Otherwise, expel whoever joined the cluster more recently.
After all these criteria are applied, give a chance to the user
script
to reverse the decision.
*/

IBM Systems
Active active cluster, side loss resistant
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1adv_actact.htm
– primary/secondary cluster configuration
master obsolete with CCR
– this configuration survives side loss
– don‘t forget to configure enough disk / fg
for file system descriptor quorum

IBM Systems
File system descriptor Quorum
Number of FGs
Number of FGs
Lost
FS Remain,
Mounted ?
>=5
3 N
2 Y
3
2 N
1 Y
2 1 Depends *
1 1 N
* If the FG that was lost contains 2 of the 3 file system descriptor tables, then the FS is
unmounted.
FGs = Failure Groups

IBM Systems
File system descriptor Quorum
Number of FGs
Number of FGs
Lost
FS Remain,
Mounted ?
>=5
3 N
2 Y
3~4
2 N
1 Y
2 1 Depends *
1 1 N
* If the FG that was lost contains 2 of the 3 file system descriptor tables, then the FS is
unmounted.
FGs = Failure Groups
If the automated failover didn't work cause you don‘t have a 3rd site and a manual
intervention is needed to deal with the site failure, you can simply exclude
the descOnly disk from fsdesc quorum consideration using following command:
+++ eliminate failed disk (because of FS descriptor quorum )
(beer1/root) /nim > mmfsctl prodfs exclude -d "nsdp1;nsdp2"
mmfsctl: 6027-1371 Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.

IBM Systems
Small clusters with tiebreaker – limited minimal configuration
beer1 beer2
Cluster quorum maintained by tiebreaker:
[root@beer1 ~]# mmlsconfig
Configuration data for cluster beer1:
-------------------------------------
clusterName beer1
clusterId 497768088122175956
autoload no
dmapiFileHandleSize 32
minReleaseLevel 4.2.3.0
ccrEnabled yes
cipherList AUTHONLY
tiebreakerDisks nsd1
adminMode central
File systems in cluster beer1:
------------------------------
– designed for small clusters
– single node survives
– enable with mmchconfig
mmchconfig tiebreakerdisks=nsd1…
– back to default (node quorum) with
mmchconfig tiebreakerdisks=no
If both nodes can access the tiebreaker
– the current cluster manager wins

IBM Systems
Small clusters with tiebreaker – limited minimal configuration
beer1 beer2
[root@beer1 ~]# mmlsmgr
file system manager node
---------------- ------------------
beer 10.0.1.11 (beer1)
Cluster manager node: 10.0.1.11 (beer1)
[root@beer1 ~]#
On node2 (beer2)
[N] Node 10.0.1.11 (beer1) lease renewal is overdue. Pinging to check if it is alive
[I] Lease overdue and disk tie-breaker in use. Probing cluster beer1
[I] Waiting for challenge 12 (node 1, sequence 44) to be responded during disk election
2017-05-05_15:26:46.642+0200: [N] Challenge response received; canceling disk election
2017-05-05_15:26:46.642+0200: [E] Attempt to run leader election failed with error 11.
2017-05-05_15:26:46.642+0200: [E] Lost membership in cluster beer1. Unmounting file systems.
2017-05-05_15:25:33.612+0200: [N] Disk lease period expired 1.360 seconds ago in cluster beer1. Attempting to reacquire the
2017-05-05_15:25:44.639+0200: [I] Waiting for challenge 13 (node 1, sequence 45) to be responded during disk election
2017-05-05_15:26:15.648+0200: [I] Waiting for challenge 14 (node 1, sequence 46) to be responded during disk election

IBM Systems
CCR / SDR restore in case of node failure – manual recover
germany france uk
– (0) my cluster example:
[root@germany ~]# mmlscluster
GPFS cluster information
========================
GPFS cluster name: europe.germany
GPFS cluster id: 497768088122175956
GPFS UID domain: europe.germany
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR
Node Daemon node name IP address Admin node name Designation
------------------------------------------------------------------
1 germany 10.0.1.11 germany quorum-manager
2 france 10.0.1.12 france quorum-manager
3 uk 10.0.1.13 uk quorum-manager
[root@germany ~]#

IBM Systems
– (1) – restore the node / reinstall the node
– (2) – check status
/var/mmfs empty
[root@uk ~]# cd /var/mmfs
-bash: cd: /var/mmfs: No such file or directory
– (3) – install gpfs rpms [root@uk 4.2.2.0]# rpm -ihv gpfs.base-4.2.2-0.x86_64.rpm gpfs.docs-4.2.2-0.noarch.rpm
gpfs.ext-4.2.2-0.x86_64.rpm gpfs.gpl-4.2.2-0.noarch.rpm gpfs.gskit-8.0.50-57.x
86_64.rpm gpfs.license.std-4.2.2-0.x86_64.rpm gpfs.msg.en_US-4.2.2-0.noarch.rpm
Preparing... ################################# [100%]
[...]
– (4) – mmbuildgpl / check status
[root@uk 4.2.2.0]# ll /var/mmfs/
total 0
drwxr-xr-x. 2 root root 64 Mar 4 10:44 ces
drwxr-xr-x. 2 root root 6 Mar 4 10:43 etc
drwxr-xr-x. 4 root root 40 Mar 4 10:43 gen
drwxr-xr-x. 2 root root 6 Mar 4 10:43 mmbackup
drwx------. 2 root root 6 Mar 4 10:43 mmpmon
drwxr-xr-x. 2 root root 73 Mar 4 10:43 mmsysmon
drwx------. 4 root root 34 Mar 4 10:43 ssl
drwxr-xr-x. 3 root root 26 Mar 4 10:47 tmp directories must exist
germany france uk

IBM Systems
– (5) – status gpfs on the failed node
[root@uk 4.2.2.0]# mmgetstate
mmgetstate: This node does not belong to a GPFS cluster.
mmgetstate: Command failed. Examine previous error messages to determine cause.
[root@uk 4.2.2.0]#
– (6) – Status from healthy node
[root@germany ~]# mmgetstate -a
uk: mmremote: determineMode: Missing file /var/mmfs/gen/mmsdrfs.
uk: mmremote: This node does not belong to a GPFS cluster.
mmdsh: uk remote shell process had return code 1.
Node number Node name GPFS state
------------------------------------------
1 germany active
2 france active
3 uk unknown
[root@germany ~]#
germany france uk

IBM Systems
– (7) – sdrrestore on the failed node
[root@uk ~]# mmsdrrestore -p germany -R /usr/bin/scp
Sat Mar 4 10:56:46 CET 2017: mmsdrrestore: Processing node uk
genkeyData1
mmsdrrestore: Node uk successfully restored.
[root@uk ~]#
– (8) – startup mmfsd & check status
[root@germany ~]# mmgetstate -a
Node number Node name GPFS state
----------------------------------------
1 germany active
2 france active
3 uk active
[root@germany ~]#
mmlsnode
GPFS nodeset Node list
------------- -------------------------------------------------------
europe germany france uk
[root@uk ~]#
germany france uk

IBM Systems
Performance changes from release to release ..
gpfs 3.5 ~ 10.000 file creates/s or 3 GB/s
SpectrumSc 4.1 ~ 12.000 file creates/s or 5 GB/s
SpectrumSc 4.2 ~ 25.000 file creates/s or 7-8 GB/s
SpectrumSc 4.2.1 ~ 40.000 file creates/s
client‘s performance
b3h0201 [data] # gpfsperf read seq /gpfs/test/data/tmp1/file100g -n 100g -r 8m -th 8 -fsync
gpfsperf read seq /gpfs/test/data/tmp1/file100g
recSize 8M nBytes 100G fileSize 100G
nProcesses 1 nThreadsPerProcess 8
file cache flushed before test
not using direct I/O
offsets accessed will cycle through the same file segment
not using shared memory buffer
not releasing byte-range token after open
fsync at end of test
Data rate was 10318827.72 Kbytes/sec, thread utilization 0.806,
bytesTransferred 107374182400

IBM Systems
Performance changes from release to release ..
gpfs 3.5 ~ 10.000 file creates/s or 3 GB/s
SpectrumSc 4.1 ~ 12.000 file creates/s or 5 GB/s
SpectrumSc 4.2 ~ 25.000 file creates/s or 7-8 GB/s
SpectrumSc 4.2.1 ~ 40.000 file creates/s or 10 GB/s
Current GA
client‘s performance
You would‘nt believe me… so try it yourself

IBM Systems
Data availability – replication and mmrestripefs
/u/gpfs0
mmrestripe can be used to ….
– rebalance data
– rewrite replicas
– change default replication factor
– reviewed and heavily improved since 4.2
– even more enhancements in plan
– mmadddisk / mmdeldisk / mmchdisk
[root@beer1 ~]# mmlsfs beer --rapid-repair
flag value description
------------------- ------------------------ -----------------------------------
--rapid-repair Yes rapidRepair enabled?
[root@beer1 ~]#

IBM Systems
Data availability – replication and mmrestripefs
/u/gpfs0
mmrestripe can be used to ….
– rebalance data
– rewrite replicas
– change default replication factor
– reviewed and heavily improved since 4.2
– even more enhancements in plan
– mmadddisk / mmdeldisk / mmchdisk
Be carefule in clusters with multiple nodes
SpectrumScale is multi-threaded and can over run your environment
– consider QOS
– many improvements in the code, so consider to upgrade soon
– or consider the following rule (next page)

IBM Systems
Data availability – replication and mmrestripefs (cont.)
pitWorkerThreadsPerNode default (0)
internally calculated by:
MIN(16, (numberOfDisks_in_filesystem * 4) / numberOfParticipatingNodes_in_mmrestripefs + 1)
[…]
mmrestripefs: The total number of PIT worker threads of
all participating nodes has been exceeded to safely
restripe the file system. The total number of PIT worker
threads, which is the sum of pitWorkerThreadsPerNode
of the participating nodes, cannot exceed 31.
[…]

IBM Systems
Data availability – replication and mmrestripefs (cont.)
pitWorkerThreadsPerNode default (0)
internally calculated by:
MIN(16, (numberOfDisks_in_filesystem * 4) / numberOfParticipatingNodes_in_mmrestripefs + 1)
with Releases 4.2.1 and 4.2.2: limit of 31 threads
– you‘ll get a warning
(– !!! with lower PTFs .. no warning )
– adjust PITworker with mmchconfig will force recycle mmfsd
– adjust mmrestripe command by -N node1,node2….
with Releases 4.2.3 (and above) will allow > 31 threads

IBM Systems
Automatic tuning - workerThreads
SA23-1452-06 Administration and Programming Reference

IBM Systems
[root@beer1 ~]# mmfsadm dump config | grep "^ ."
. flushedDataTarget 32
. flushedInodeTarget 32
. logBufferCount 3
. logWrapThreads 12
. maxAllocRegionsPerNode 4
. maxBackgroundDeletionThreads 4
. maxBufferCleaners 24
. maxFileCleaners 24
. maxGeneralThreads 512
. maxInodeDeallocPrefetch 8
. parallelWorkerThreads 16
. prefetchThreads 72
. sync1WorkerThreads 24
. sync2WorkerThreads 24
. syncBackgroundThreads 24
. syncWorkerThreads 24
. worker3Threads 8
[root@beer1 ~]#
workerThreads
[root@beer1 ~]# mmfsadm dump version | head -3
Dump level: verbose
Build branch "4.2.2.0 ".
[root@beer1 ~]# mmlsconfig
Configuration data for cluster beer1:
-------------------------------------
clusterName beer1
[...]
workerThreads 96
Automatic tuning - workerThreads

IBM Systems
Auto tuning – client side – ignorePrefetchLUNCount / pagepool
[root@beer1 gpfs]# mmfsadm dump config | grep -e ignorePrefetchLUNCount
ignorePrefetchLUNCount 0
[root@beer1 gpfs]#
Best practice:
→ set when using GNR based NSDs
→ set when using large LUNs from powerful storage back ends
[root@beer1 gpfs]# mmfsadm dump config | grep -i prefetchPct -w -e pagepool
prefetchPct 20
pagepool …..

IBM Systems
Auto tuning – NSDServer side – pagepool and NSDserver Threads
… if using ESS.... everything is preconfigured… …
[root@beer1 gpfs]# mmfsadm dump config | grep -i pagepool
nsdBufSpace (% of PagePool) 30
nsdRAIDBufferPoolSizePct (% of PagePool) 50
[root@beer1 gpfs]# mmfsadm dump config | grep -i -e worker -e smallthread | grep -i nsd[M,S]
nsdMaxWorkerThreads 512
nsdMinWorkerThreads 16
nsdSmallThreadRatio 0
[root@beer1 gpfs]#
if your backend is not an ESS …

IBM Systems
Agenda
Handling node failures
CCR restore
Auto tuning
NFSv4 ACL
ESS for SAP HANA workloads
ubiquity

IBM Systems
Spectrum Scale - NFSv4 ACLs
POSIX ACLs
NFSv4 ACLs
– Finer-grained control of user access for files and directories
– better NFS security
– improved interoperability with CIFS
– removal of the NFS limitation of 16 groups per user
– defined in RFC3530
http://www.ietf.org/rfc/rfc3530.txt
CIFs/Windows ACL
POSIX ACLs

IBM Systems
SpectrumScale – Windows and Unix client access
NFS-client
nfs4 nfs4
GPFS-client
UNIX
GPFS
nfs4*POSIX / nfs4*ACL type:
Win-native-client
Client side:
windows
Ganesha NFS server
CES

IBM Systems
ACLs in GPFS
GPFS
ACL File
– ACLs in GPFS stored in a hidden file
– POSIX ACLs / NFSv4 ACL format supported in parallel (mmlsfs -k)
– files having the same ACL , have the same hash value
[…]
extendedAcl 50
[…]

IBM Systems
NFSv4 ACL – understanding special names
[root@tlinc04 fs1]# mmgetacl file1
#NFSv4 ACL
#owner:root
#group:root
special:owner@:rw-c:allow
(X)READ/LIST (X)WRITE/CREATE (X)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL (X)READ_ATTR (X)READ_NAMED
(-)DELETE (-)DELETE_CHILD (X)CHOWN (-)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED
special:group@:r---:allow
(X)READ/LIST (-)WRITE/CREATE (-)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL (X)READ_ATTR (X)READ_NAMED
(-)DELETE (-)DELETE_CHILD (-)CHOWN (-)EXEC/SEARCH (-)WRITE_ACL (-)WRITE_ATTR (-)WRITE_NAMED
special:everyone@:----:allow
(-)READ/LIST (-)WRITE/CREATE (-)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL (X)READ_ATTR (X)READ_NAMED
– NFS V4 provides for a set of special names that are not associated with a specific
local UID or GID.
– representing/similar translated Unix ModeBits
- special:owner@
- special:group@
- special:everyone@
[root@tlinc04 fs1]# ls -l file1
-rw-r----- 1 root root 6 Sep 9 20:34 file1

IBM Systems
NFSv4 ACL – regular entry
[root@beer1 fs1]# mmgetacl /x/beer/fs1/subdir1
#NFSv4 ACL
#owner:root
#group:root
special:owner@:rw-c:allow:FileInherit:DirInherit
(-)DELETE (-)DELETE_CHILD (X)CHOWN (-)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED
special:group@:r---:allow:FileInherit:DirInherit
special:everyone@:r---:allow:FileInherit:DirInherit
user:laff:rwxc:allow:FileInherit:DirInherit
(X)DELETE (X)DELETE_CHILD (X)CHOWN (X)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED
[root@beer1 fs1]# touch /x/beer/fs1/subdir1/file.laff.from.root
[root@beer1 fs1]# ls -l /x/beer/fs1/subdir1/file.laff.from.root
-rw-r--r--. 1 root root 0 Mar 7 15:47 /x/beer/fs1/subdir1/file.laff.from.root
[root@beer1 fs1]# su - laff -c "echo " hallo " >> /x/beer/fs1/subdir1/file.laff.from.root "
[root@beer1 fs1]# cat /x/beer/fs1/subdir1/file.laff.from.root
hallo

IBM Systems
Spectrum Scale – handling ACLs ( 1 / 3)
(1) by default chmod overwrites NFSv4 ACLs
[root@beer1 fs1]# chmod g+w subdir1
[root@beer1 fs1]# mmgetacl /x/beer/fs1/subdir1
#NFSv4 ACL
#owner:root
#group:root
special:owner@:rw-c:allow
(-)DELETE (X)DELETE_CHILD (X)CHOWN (-)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED
special:group@:rw--:allow
(-)DELETE (X)DELETE_CHILD (-)CHOWN (-)EXEC/SEARCH (-)WRITE_ACL (-)WRITE_ATTR (-)WRITE_NAMED
special:everyone@:r---:allow
[root@beer1 fs1]#

IBM Systems
Spectrum Scale – handling ACLs (2 /3 )
(2) old way: ( older releases … )
mmlsconfig
[…]
AllowDeleteAclOnChmod 1
[…]
→ enables you to decide to accept / reject chmod on files with NFSv4
obsolete

IBM Systems
Spectrum Scale – handling ACLs ( 3 /3 )
– since current R 4.x
– supports fileset level permission change

IBM Systems
Spectrum Scale – NFSv4 example: allow permission change
[root@beer1 fs1]# mmputacl -i /tmp/acl subdir1
[root@beer1 fs1]# mmgetacl subdir1
[...]
(X)READ/LIST (X)WRITE/CREATE (X)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL (X)READ_ATTR (X)READ_NAM
(X)DELETE (X)DELETE_CHILD (X)CHOWN (X)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED
[root@beer1 fs1]# mmchfileset beer fs1 --allow-permission-change chmodAndUpdateAcl
Fileset fs1 changed.
[root@beer1 fs1]#
[root@beer1 fs1]# chmod g+w subdir1
[root@beer1 fs1]# mmgetacl subdir1
[...]
(X)READ/LIST (X)WRITE/CREATE (X)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL (X)READ_ATTR (X)RE
(X)DELETE (X)DELETE_CHILD (X)CHOWN (X)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE

IBM Systems
SpectrumScale & ESS for SAP HANA
HANA studio HDB
1.) all benefits from SpectrumScale (replication, filesets , management, provisioning)
2.) e2e integration SpectrumScale snapshots in HANA studio
3.) HANA DB workload – DIO intensive write workload

IBM Systems
GPFS client
DBfile from SAP in huge blocks (16M)
Page
Pool
GNR node1
Page
Pool
GNR node2
NVR
SAS raid adapter
/dev/sda10/dev/sda10
… small writes ...
Logtip 2WR
loghome
ack
sync
write

IBM Systems
GPFS client
DBfile from SAP in huge blocks (16M)
Page
Pool
GNR node1
Page
Pool
GNR node2
NVR
SAS raid adapter
/dev/sda10/dev/sda10
… small writes ...
Logtip 2WR
loghome
ack
sync
write
full track
full track write

IBM Systems
ESS and pagepool – now available with 1 TB memory
– If the client team is ordering the new ESS GL2S, GL4S or GL6S that
was announced on April 11, the sales configurator allows clients to
select up to 1TB memory on the 5148-22L ESS server.
– If the client team wants to order 1TB of memory on the ESS 5146
models that use 8247-22L server, there is a manual way over
your sales channel. (There is no RPQ currently set-up for this.)

IBM Systems
mmchconfig disableDIO=yes,aioSyncDelay=10 -N hananode
dioReentryThreshold
– performance optimization, when writing a new file sequentially with DIO
– once a block is allocated, next set of writes can be executed as DIO until end of block
– drop out of DIO into buffered to allocate next block generates a non trivial overhead
– better performance, when staying in buffered mode
– dioReentryThreshold=n means... wait until n blocks worth of I/Os
(that could have been executed as DIO before we actually switch back to DIO mode)
disableDIO
– DIO is always just a hint
– acc. POSIX , DIO versus O_SYNC in addition of O_DIRECT
– if set, GPFS will execute all DIO requests as buffered I/O
– this parameter does not cheat anything

IBM Systems
SpectrumScale and Ubiquity

IBM Systems
IBM research: Docker Adoption Behavior in 2016
• 30% increase in Docker
adoption in one year
• Docker is mostly used by large
companies with a large number
of hosts
• The number of containers running
in production quintuples (= 5x) 9
months after initial deployment
70% of Enterprises Use or Plan to Use Docker (Rightscale 2017 report)
| 54

IBM Systems
SpectrumScale and Ubiquity – added in 2016 / 2017

IBM Systems
SpectrumScale and Ubiquity – added in 2016 / 2017
2016

IBM Systems
SpectrumScale and Ubiquity
● Decoupled from SpectrumScale release
● Published / available on github
● it‘s now : open source
https://github.com/IBM/ubiquity

IBM Systems
Docker and SpectrumScale integration

IBM Systems
Docker and SpectrumScale integration
/path/filefoo
FilesetXYZ /x/beer/fs2/filefoo
Directory /x/beer/fs2/filefoo

IBM Systems
Ubiquity Storage Volume Service with Spectrum Scale
• Support native Spectrum Scale (POSIX) and CES NFS
• Support 2 types of volumes:
– Fileset volumes
• Support optional quota and setting Linux userid/group permissions
• Support both independent or dependent filesets
– Lightweight volumes
• Practically no limit
• Implemented as individual subdirectories in a fileset
• Current admin commands can set other features
• Can map existing dirs/filesets into Volumes
• Support ‘ssh’ to call remote admin commands
• Planned Items
– Support Spectrum Scale REST-API
– Support additional options for Spectrum Scale features
Ubiquity DVP
Mounters
SScale
Docker nodes
Kubelet nodes Kubernetes API
Engine Swarm Compose
(POD)
Web server
ssh (mmcli)
DB
Ubiquity Dynamic
Provisioner
Ubiquity
Service
Mounters
SScale
Ubiquity FlexVolume

IBM Systems
Ubiquity Storage Volume Service with Spectrum Scale
• Support native Spectrum Scale (POSIX) and CES NFS
• Support 2 types of volumes:
– Fileset volumes
• Support optional quota and setting Linux userid/group permissions
• Support both independent or dependent filesets
– Lightweight volumes
• Practically no limit
• Implemented as individual subdirectories in a fileset
• Current admin commands can set other features
• Can map existing dirs/filesets into Volumes
• Support ‘ssh’ to call remote admin commands
• Planned Items
– Support Spectrum Scale REST-API
– Support additional options for Spectrum Scale features
Ubiquity DVP
Mounters
SScale
Docker nodes
Kubelet nodes Kubernetes API
(POD)
Web server
ssh (mmcli)
DB
Ubiquity Dynamic
Provisioner
Ubiquity
Service
Mounters
SScale
Ubiquity FlexVolume
Now Available at
Ubiquity Service https://github.com/ibm/ubiquity
Ubiquity Docker Plugin https://github.com/IBM/ubiquity-docker-plugin
Ubiquity K8s DP and FV https://github.com/IBM/ubiquity-k8s
Available as an alpha release to gain experience with users and their use cases
Support on a best effort basis
Now Available at
Ubiquity Service https://github.com/ibm/ubiquity
Ubiquity Docker Plugin https://github.com/IBM/ubiquity-docker-plugin
Ubiquity K8s DP and FV https://github.com/IBM/ubiquity-k8s
Available as an alpha release to gain experience with users and their use cases
Support on a best effort basis

IBM Systems
Spectrum Scale and docker
→ creating fileset volumes
→ light weight volumes
docker volume create -d spectrum-scale --name demo1 --opt filesystem=beer
File set volumes
docker volume create -d spectrum-scale --name demo5 --opt type=lightweight
--opt fileset=LtWtVolFileset --opt filesystem=beer
Light wight volumes

IBM Systems
Docker - example
SpectrumScale‘s view container‘s (dockers) view

IBM Systems
Docker and MultiTenancy

IBM Systems
Docker and MultiTenancy
• Spectrum Scale commands not accessible
• Changes to image
– Private to that image
– Can be saved or discarded by admin
• Changes to external volumes
– Can only access its volumes (and no other)
– Volumes can be any file path
– Userids can be the same in container as in FS
• Linux user namespaces can also do mapping
– Root can access any file ‘in volume’
– ACLs work as per usual
• POSIX ACLs can be set from inside container
– SELinux can label volumes and only allow access from specific
containers
/fileset-orange
/fileset-green
/my/rand/dir

IBM Systems
Ubiquity Storage Volume Service Vision
Ubiquity Dynamic
Provisioner
Ubiquity Plugin
Docker nodes
Kubelet nodes
Ubiquity
FlexVolume
Kubernetes API
DS8000
Docker
Datacenter
Mounter
BackendsMounter
Backends
3rd
party
Storage
Single Volume Service for all of IBM Storage and Beyond
Ubiquity Volume Service

IBM Systems
Live demo and evaluation platform – “rent a lab”
– ESCC (Kelsterbach) hosts a fully equipped environment
– for tests and evaluation purposes
– feel free to contact:
ESCCCOMM@de.ibm.com
… let‘s do a live demo ...

72
olaf.weiser@de.ibm.com
IBM Deutschland
SpectrumScale Support Specialist

Spectrum Scale Best Practices by Olaf Weiser

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Spectrum Scale Best Practices by Olaf Weiser

Ähnlich wie Spectrum Scale Best Practices by Olaf Weiser (20)

Mehr von Sandeep Patil

Mehr von Sandeep Patil (10)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Spectrum Scale Best Practices by Olaf Weiser