4. Some relevant Scale Basics
"The Cluster"
• Cluster – a group of operating system instances, or nodes, on which an instance of Spectrum Scale is deployed
• Network Shared Disk (NSD) – any block storage (disk, partition, or LUN) given to Spectrum Scale for its use
• NSD server – a node that makes an NSD available to other nodes over the network
• GPFS filesystem – a combination of data and metadata, deployed over NSDs and managed by the cluster
• Client – a node running the Scale software that accesses a filesystem
[Diagram: NSD servers (nsd srv) exposing NSDs over the network to a set of client nodes]
5. Some relevant Scale Basics
MultiCluster
• A cluster can share some or all of its filesystems with other clusters
• Such a cluster is referred to as a "Storage Cluster"
• A cluster that doesn't have any local filesystems is referred to as a "Client Cluster"
• A client cluster can connect to up to 31 other clusters (outbound)
• A storage cluster can be mounted by a practically unlimited number of client clusters (inbound) – 16383, really
• Client cluster nodes "join" the storage cluster upon mount
[Diagram: many client-cluster nodes mounting filesystems from a storage cluster]
6. Some relevant Scale Basics
Node Roles
• While the general concept in Scale is "all nodes are created equal", some have special roles
Config Servers
– Hold the cluster configuration files
– 4.1 onward supports CCR – quorum
– Not in the data path
Cluster Manager
– One of the quorum nodes
– Manages leases, failures, and recoveries
– Selects filesystem managers
– mm*mgr -c
Filesystem Manager
– One of the "manager" nodes
– One per filesystem
– Manages filesystem configuration (disks)
– Space allocation
– Quota management
Token Manager/s
– Multiple per filesystem
– Each manages a portion of the tokens for each filesystem, based on inode number
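To illustrate the last point, here is a minimal sketch of how token ownership can be spread across manager nodes by inode number. The modulo mapping below is a hypothetical stand-in for illustration only, not Scale's actual distribution function:

```python
# Illustrative only: token state for a file lives on one of the
# token-manager nodes, chosen from the file's inode number so the
# token load spreads across all managers.

def token_server_for(inode: int, token_servers: list) -> str:
    # Hypothetical mapping: simple modulo over the manager list.
    return token_servers[inode % len(token_servers)]

servers = ["mgr1", "mgr2", "mgr3"]
print(token_server_for(12345, servers))
```

The key property this models is that no single node holds all token state for a filesystem; each manager serves only its slice of the inode space.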
7. Some relevant Scale Basics
Node Internal Structure
[Diagram: node components – User Mode Daemon (I/O optimization, communications, clustering) and Kernel Module; Pagepool (buffer pool, pinned memory) holding disk descriptors, filesystem descriptors, file metadata, and file data; other (non-pinned) memory; communication with other nodes and shared storage over TCP/IP sockets or RDMA; GPFS commands in /usr/lpp/mmfs/bin; configuration via CCR; application and shell on top]
9. Scale Memory Pools
• At a high level, Scale uses several memory pools:
• Pagepool – data
• Shared Segment – "memory metadata", shared (user + kernel)
• External/heap – used by "external" parts of Scale (AFM queue, Ganesha, Samba, etc.)
[Diagram: External/Heap, SharedSegment, Pagepool]
11. Scale Memory Pools
Pagepool
• The pagepool size is statically defined via the pagepool configuration option. It is allocated at daemon startup and can be changed at runtime using mmchconfig with the -i option (beware NUMA and fragmentation effects)
• The pagepool stores data only
• All pagepool content is referenced by the sharedSegment (data is worthless without metadata…)
• Not all of the sharedSegment points into the pagepool
[Diagram: SharedSegment referencing Pagepool]
12. Scale Memory Pools
SharedSegment
• The shared segment has two major pools
• Pool 2 – "Shared Segment"
– All references to data stored in the pagepool (bufferDesc, OpenFiles, IndBlocks, CliTokens, etc.)
– Stat cache (compactOpenfile)
• Pool 3 – "Token Memory"
– Tokens on token servers
– Some client tokens (byte range)
[Diagram: Pool 2 "UNPINNED", Pool 3 TokenMemory]
13. Scale Memory Pools
SharedSegment
• What's the difference between maxFilesToCache and maxStatCache?
– maxFilesToCache: how many files' data and metadata we can cache in the pagepool. This implies how many concurrent open files we can have
– maxStatCache: if a file has to be evicted from the pagepool because of the MFTC limit above, we can still take its stat structure, compact it, and store it here, so stat calls on it won't need an IOP
[Diagram: maxStatCache alongside maxFilesToCache]
15. Memory Calculations
"So how much memory do I need?"
• Pagepool – "it depends…":
How much data do you need to cache?
How much RAM can you spend?
• Shared Segment – "it depends…":
How many files do you need to cache?
How many inodes do you need to cache?
What is your access pattern? (byte range)
[Diagram: SharedSegment, Pagepool]
16. Memory Calculations
SharedSegment
[Graph: memory use per cached files – MB (up to ~500) vs. number of files (0–60,000)]
• Pool 2 – "Shared Segment"
– Scope: almost completely local (not much dependency on other nodes)
– Relevant parameters:
– maxFilesToCache
– maxStatCache
– Cost:
– Each MFTC entry ≈ 10 KB
– Each MSC entry ≈ 480 B
For example: using an MFTC of 50k and an MSC of 20k, and opening 60k files
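The Pool 2 cost above reduces to simple arithmetic. A minimal sketch, using the approximate per-object costs from this slide (the constants are the slide's estimates, not exact values):

```python
# Rough Pool 2 ("Shared Segment") sizing from the per-object costs
# quoted on this slide: ~10 KB per maxFilesToCache (MFTC) entry and
# ~480 B per maxStatCache (MSC) entry. Approximations, not guarantees.

MFTC_COST = 10_000  # bytes per fully cached file (approx.)
MSC_COST = 480      # bytes per compacted stat entry (approx.)

def pool2_bytes(mftc: int, msc: int) -> int:
    # Worst case: both caches fully populated.
    return mftc * MFTC_COST + msc * MSC_COST

# The slide's example: MFTC = 50k, MSC = 20k.
print(f"~{pool2_bytes(50_000, 20_000) / 1e6:.0f} MB")
```

With MFTC=50k and MSC=20k this comes to roughly 510 MB, which matches the ~500 MB ceiling shown in the graph.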
17. Memory Calculations
SharedSegment
• Pool 3 – "Token Memory"
– Scope: "storage cluster" manager nodes
– Relevant parameters:
– MFTC, MSC, number of nodes, "hot objects"
– Cost:
– TokenMemPerFile ≈ 464 B, ServerBRTreeNode = 56 B
– BasicTokenMem = (MFTC + MSC) × NoOfNodes × TokenMemPerFile
– ClusterBRTokens = NoOfHotBlocksRandomAccess × ServerBRTreeNode
– TotalTokenMemory = BasicTokenMem + ClusterBRTokens
– PerTokenServerMem = TotalTokenMemory / NoOfTokenServers
* In the worst-case scenario, the number of byte-range tokens would be one byte range per block, for all blocks being accessed randomly
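The formulas above transcribe directly into code. A sketch, using the slide's own constants and variable names (the example inputs are illustrative):

```python
# Pool 3 ("Token Memory") sizing, transcribed from the slide's formulas.
TOKEN_MEM_PER_FILE = 464   # bytes per tracked file (approx., per slide)
SERVER_BR_TREE_NODE = 56   # bytes per byte-range tree node (per slide)

def token_memory(mftc, msc, no_of_nodes, hot_blocks_random, no_of_token_servers):
    basic_token_mem = (mftc + msc) * no_of_nodes * TOKEN_MEM_PER_FILE
    cluster_br_tokens = hot_blocks_random * SERVER_BR_TREE_NODE
    total = basic_token_mem + cluster_br_tokens
    return total, total / no_of_token_servers

# Illustrative inputs: MFTC = 50k, MSC = 20k, 25 nodes, no random
# byte-range hot spots, 5 token servers.
total, per_server = token_memory(50_000, 20_000, 25, 0, 5)
print(f"total ~{total / 1e6:.0f} MB, ~{per_server / 1e6:.0f} MB per token server")
```

Note how the node count multiplies the whole first term: the token servers pay for every node's cached objects, which is why 25 clients approach the gigabyte range in the later graph.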
18. Memory Calculations
SharedSegment
• Rule of thumb: overall number of cached objects in the cluster (different nodes might cache different numbers) × ~0.5 KB
• And: monitor token memory usage
19. Memory Calculations
SharedSegment
[Graph: token server memory use per cached file – MB (up to ~1200) vs. number of files (0–60,000); series: SrvTM (single client) and SrvTM*25 (25 clients)]
• Pool 3 – "Token Memory"
– As the graph shows, a single client doesn't usually leave a large memory footprint (blue line)
– With 25 clients we already get close to 1 GB
– There is no difference between the "cost" of MFTC and MSC here
For example: using an MFTC of 50k and an MSC of 20k, and opening 60k files
22. Is that all?
Related OS memory usage
[Graph: OS memory utilization – MB (up to ~500) vs. number of files (0–60,000); series: Shared2, dEntryOSz, inodeOvSz]
• "Scale doesn't use the operating system cache"… well… sort of:
– Directory cache (a.k.a. dcache) – ~192 B per entry
– VFS inode cache – ~1 KB per inode
• But is it really important? The SharedSegment takes much more memory
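To put numbers on that last point, a hedged estimate of the OS-side overhead per cached file, using the approximate per-entry costs from this slide:

```python
# OS-side memory that rides along with Scale's own caches:
# ~192 B per dcache (dentry) entry and ~1 KB per VFS inode,
# per the slide's estimates. Compare with Pool 2's ~10 KB per
# cached file to see why the SharedSegment dominates.

DENTRY_COST = 192   # bytes per directory-cache entry (approx.)
INODE_COST = 1024   # bytes per VFS inode (approx.)

def os_overhead_bytes(cached_files: int) -> int:
    # One dentry and one VFS inode per cached file.
    return cached_files * (DENTRY_COST + INODE_COST)

print(f"~{os_overhead_bytes(50_000) / 1e6:.0f} MB for 50k cached files")
```

At 50k cached files this is only ~60 MB, roughly an order of magnitude below the SharedSegment's cost for the same file count, which is the slide's point.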