1. Network filesystems in
heterogeneous cloud applications
Supervisor: Massimo Masera (Università di Torino, INFN)
Company Tutor: Stefano Bagnasco (INFN, TO)
Tutor: Dario Berzano (INFN, TO)
Candidate: Matteo Concas
2. Computing @LHC: how is the GRID
structured?
[Diagram: the experiments (ATLAS, CMS, ALICE, LHCb) produce ~15 PB/year of raw data at the Tier-0 (CERN); data flow (~1 GB/s) to Tier-1 centres such as FZK (Karlsruhe), CNAF (Bologna) and IN2P3 (Lyon), and are further distributed to Tier-2 centres such as Torino, Catania, Bari and Legnaro.]
Data are distributed over a federated network called Grid, which is hierarchically organized in Tiers.
3. Computing infrastructure @INFN
Torino
*V.M. = virtual machine
[Diagram: the Torino cloud infrastructure hosts several kinds of virtual farms, backed by the legacy Tier-2 data storage and by a new-generation cloud storage.]
● Grid node (job submission / data retrieval): batch processes → submitted jobs are queued and executed as soon as there are enough free resources; output is stored on Grid storage asynchronously.
● ALICE PROOF facility (continuous 2-way communication): interactive processes → all resources are allocated at the same time; job splitting is dynamic and results are returned immediately to the client.
● Generic virtual farms (remote login): VMs can be added dynamically and removed as needed; the end user doesn't know how his/her farm is physically structured.
5. Introduction: Distributed storage
● Aggregation of several storage systems:
○ Several nodes and disks seen as one pool in the
same LAN (Local Area Network)
○ Many pools aggregated geographically through
WAN → cloud storage (Wide Area Network)
○ Concurrent access by many clients is optimized
→ “closest” replica
[Diagram: clients access their site's LAN pool (Site 1, Site 2, ...); the pools are geo-replicated across sites over the WAN.]
Network filesystems are the backbone of these infrastructures
6. Why distribute the storage?
● Local disk pools:
○ several disks: no single hard drive can be big enough →
aggregate disks
○ several nodes: looking up and serving data takes some
number crunching and network bandwidth → distribute the load
○ client scalability → serve many clients
○ on local pools, filesystem operations (r, w, mkdir, etc.) are
synchronous
● Federated storage (scale is geographical):
○ single site cannot contain all data
○ move job processing close to the data, not vice versa
→ distributed data ⇔ distributed computing
○ filesystem operations are asynchronous
7. Distributed storage solutions
● Every distributed storage has:
○ a backend which aggregates disks
○ a frontend which serves data over a network
● Many solutions:
○ Lustre, GPFS, GFS → popular in the Grid world
○ stackable, e.g.: aggregate with Lustre, serve with
NFS (see the sketch below)
● NFS by itself is not a distributed storage → it does not
aggregate disks, it only serves them over the network
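As a rough illustration of the "stackable" idea (hostnames and paths below are made up, not the Torino configuration): a mount point already aggregated by another filesystem can be re-exported to clients with plain NFS:
# Server side: re-export an already-aggregated mount point via NFS
> echo "/mnt/pool 192.168.0.0/24(ro,no_subtree_check)" >> /etc/exports
> exportfs -ra
# Client side: mount the re-exported pool as ordinary NFS
> mount -t nfs storage-server:/mnt/pool /mnt/data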
8. Levels of aggregation in Torino
● Hardware aggregation (RAID) of hard drives → virtual block devices
(LUN: logical unit number)
● Software aggregation of block devices → each LUN is aggregated
using Oracle Lustre:
○ a separate server keeps the "file information" (MDS: metadata server)
○ one or more servers attached to the block devices (OSS: object
storage servers)
○ quasi-vertical scalability → the "master" server (i.e. the MDS) is a
bottleneck; more can be added (hard & critical work!)
● Global federation → the local filesystem is exposed through xrootd:
○ Torino's storage is part of a global federation
○ used by the ALICE experiment @ CERN
○ a global, external "file catalog" knows whether a file is in Torino or not
9. What is GlusterFS
● Open source, distributed network filesystem claiming to scale
up to several petabytes and to handle many clients
● Horizontal scalability → the workload is distributed across "bricks"
(see the volume-creation sketch after this list)
● Reliability:
○ elastic management → maintenance operations are
online
○ can add, remove, replace without stopping service
○ rebalance → when a new "brick" is added, data are
redistributed to keep an even spread across all bricks
○ self-healing on "replicated" volumes → form of automatic
failback & failover
12. Preliminary studies
● Verify compatibility of GlusterFS precompiled
packages (RPMs) on CentOS 5 and 6 for the
production environment
● Packages not available for development
versions: new functionalities tested from source
code (e.g. Object storage)
● Test on virtual machines (first on local
VirtualBox then on INFN Torino OpenNebula
cloud) http://opennebula.org/
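For reference, a hedged sketch of how the precompiled packages would be installed on a CentOS test VM, assuming a yum repository providing the GlusterFS 3.3 RPMs is already configured:
# Install the GlusterFS server and FUSE client packages
> yum install -y glusterfs glusterfs-server glusterfs-fuse
# Start the management daemon and enable it at boot
> service glusterd start
> chkconfig glusterd on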
13. Types of benchmarks
● Generic stress benchmarks conducted on:
○ Super distributed prototype
○ Pre-existing production volumes
● Specific stress benchmarks conducted on
particular GlusterFS volume types
(e.g. the replicated volume)
● Application specific tests:
○ high-energy physics analysis running on ROOT/PROOF
14. Note
● Tests conducted in two different
circumstances:
a. a storage volume built for the sole purpose of testing:
for the benchmarks, this volume performs worse than
the infrastructure (production) ones
b. production volumes were certainly subject to
interference from concurrent processes
"Why perform these tests?"
15. Motivations
● Verify consistency of the "release notes":
→ test all the different volume types:
○ replicated
○ striped
○ distributed
● Test GlusterFS in a realistic environment
→ build a prototype as similar as possible to
production infrastructure
16. Experimental setup
● GlusterFS v3.3 turned out to be stable after tests
conducted both on VirtualBox and OpenNebula VMs
● Next step: build an experimental "super distributed"
prototype: a realistic testbed environment consisting of:
○ 40 HDDs [500 GB each] → ~20 TB (1 TB ≃ 10^12 B)
○ GlusterFS installed on every hypervisor
○ each hypervisor mounted 2 HDDs → 1 TB per hypervisor
○ all the hypervisors were connected to each other (LAN)
● Software used for benchmarks: bonnie++
○ very simple to use read/write benchmark for disks
○ http://www.coker.com.au/bonnie++/
17. Striped volume
● used in high concurrency environments accessing
large files (in our case ~10 GB);
● useful to store large data sets, if they have to be
accessed from multiple instances.
(source: www.gluster.org)
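For illustration (hypothetical names), a striped volume is created by passing the stripe count to gluster volume create; each file is then split into chunks spread over the bricks:
# Stripe every file across 2 bricks
> gluster volume create striped-vol stripe 2 server1:/export/brick1 server2:/export/brick1
> gluster volume start striped-vol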
18. Striped volume / results
striped — Sequential Write per Blocks: 38.6 ± 1.3 MB/s; Sequential Rewrite: 23.0 ± 3.6 MB/s; Sequential Read per Blocks: 44.7 ± 1.3 MB/s (values are average ± std. deviation)
19. Striped volume / comments
● Each test is repeated 10 times; the software used is bonnie++ v1.96
● Size of the written files [MB]: at least double the machine RAM size,
although GlusterFS doesn't have any sort of file cache
> for i in {1..10}; do bonnie++ -d$SOMEPATH -s5000 -r2500 -f; done;
(-d: target directory, -s: file size in MB, -r: machine RAM size in MB, -f: skip the per-character I/O tests)
● The striped volume has the second-best write (per blocks) result,
and the most stable one (lowest std. deviation)
20. Replicated volume:
● used where high-availability and high-reliability are
critical
● main task → provide redundancy: data availability matters
more than high I/O performance
● requires a great deal of resources, both disk space and
CPU (especially during the self-healing procedure)
(source: www.gluster.org)
21. Replicated volume:
● Self-healing feature: given N redundant servers, if at most
N-1 of them crash → services keep running on the volume;
once the crashed servers are restored, they get
synchronized with the one(s) that didn't crash
● The self-healing feature was tested by turning off servers
(even abruptly!) during I/O processes
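A sketch of the corresponding setup (hypothetical names): a 2-way replicated volume, plus the command used to inspect the self-healing state after a crashed server comes back:
# Keep two synchronized copies of every file
> gluster volume create repl-vol replica 2 server1:/export/brick1 server2:/export/brick1
> gluster volume start repl-vol
# After a crashed server comes back, list the files being resynchronized
> gluster volume heal repl-vol info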
22. Replicated / results
replicated — Sequential Write per Blocks: 35.5 ± 2.5 MB/s; Sequential Rewrite: 19.1 ± 16.1 MB/s; Sequential Read per Blocks: 52.2 ± 7.1 MB/s (values are average ± std. deviation)
23. Replicated / comments
● Low rates in write and the best result in read →
writes need to be synchronized, read throughput
benefits from multiple sources
● very important in building stable volumes in critical
nodes
● "Self healing" feature worked fine: uses all available
cores during resynchronization process, and it does
it online (i.e. with no service interruption, only
slowdowns!)
24. Distributed volume:
● Files are spread across the bricks in a fashion that
ensures uniform distribution
● Pure distributed volume only if redundancy is not
required or lies elsewhere (e.g. RAID)
● If no redundancy, disk/server failure can result in
loss of data, but only
some bricks are
affected, not the
whole volume!
(source: www.gluster.org)
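For comparison, a purely distributed volume (the GlusterFS default when neither a stripe nor a replica count is given) just lists the bricks; names below are illustrative:
# Distribute whole files across the bricks (no redundancy, no striping)
> gluster volume create dist-vol server1:/export/brick1 server2:/export/brick1 server3:/export/brick1
> gluster volume start dist-vol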
25. Distributed / results
distributed — Sequential Write per Blocks: 39.8 ± 5.4 MB/s; Sequential Rewrite: 22.3 ± 2.8 MB/s; Sequential Read per Blocks: 52.1 ± 2.2 MB/s (values are average ± std. deviation)
26. Distributed / comments
● Best result in write and the second-best result in read →
a high-performance volume
● Since the volume is not striped, and no high client
concurrency was used, the full potential of GlusterFS is
not exploited → done in subsequent tests
Some other tests were also conducted on different
mixed types of volumes (e.g. striped+replicated)
28. Production volumes
● Tests conducted on two volumes used at INFN
Torino computing center: the VM images repository
and the disk where running VMs are hosted
● Tests executed without interrupting production services
→ results are expected to be slightly influenced by
concurrent computing activities (even if these were not
network-intensive)
29. Production volumes: Imagerepo
[Diagram: the Images Repository stores the virtual machine images (virtual-machine-img-1 ... virtual-machine-img-n) and is mounted over the network by all the hypervisors (Hypervisor 1 ... Hypervisor m).]
30. Production volumes: Vmdir
[Diagram: the I/O streams of the VMs running on the service hypervisors go to a shared GlusterFS volume.]
32. Production volumes / Results (2)
Image Repository — Sequential Write per Blocks: 64.4 ± 3.3 MB/s; Sequential Rewrite: 38.0 ± 0.4 MB/s; Sequential Read per Blocks: 98.3 ± 2.3 MB/s
Running VMs volume — Sequential Write per Blocks: 47.6 ± 2.2 MB/s; Sequential Rewrite: 24.8 ± 1.5 MB/s; Sequential Read per Blocks: 62.7 ± 0.8 MB/s
(values are average ± std. deviation)
● Imagerepo is a distributed volume (GlusterFS → 1 brick)
● The running VMs volume is a replicated volume → worse
performance, but the single point of failure is eliminated by
replicating both disks and servers
● Both volumes perform better than the testbed ones
→ better underlying hardware resources
33. PROOF test
● PROOF: ROOT-based framework for interactive
(non-batch, unlike Grid) physics analysis, used
by ALICE and ATLAS, officially part of the
computing model
● Simulates a real use case (not an artificial one), with a
storage made of 3 LUNs (each over a RAID 5) of
17 TB each, in distributed mode
● many concurrent accesses: GlusterFS
scalability is extensively exploited
34. PROOF test / Results
Concurrent processes → throughput [MB/s]:
60 → 473
66 → 511
72 → 535
78 → 573
84 → 598
96 → 562
108 → 560
● Optimal range of concurrent accesses: 84-96
● Plateau beyond optimal range
35. Conclusions and possible
developments
● GlusterFS v3.3.1 was found to be stable and to satisfy
all the prerequisites required of a network filesystem
→ the upgrade was performed and is currently in use!
● Perform further tests (e.g. on different use cases)
● Follow the next developments in GlusterFS v3.4.x
→ likely improvements and integration with QEMU/KVM
(sketched below the link)
http://www.gluster.org/2012/11/integration-with-kvmqemu
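As the linked post describes, QEMU (from version 1.3, when built with GlusterFS support) can access images on a GlusterFS volume directly through libgfapi, bypassing the FUSE mount; a hedged sketch with illustrative names:
# Create a VM disk image directly on a GlusterFS volume via libgfapi
> qemu-img create gluster://server1/vmdir/vm-disk.img 10G
# Boot a VM reading the image over libgfapi instead of a FUSE mount
> qemu-system-x86_64 -m 1024 -drive file=gluster://server1/vmdir/vm-disk.img,if=virtio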
36. Thanks for your attention
Thanks to:
● Prof. Massimo Masera
● Stefano Bagnasco
● Dario Berzano
40. Striped + Replicated volume:
● it stripes data across replicated bricks in the
cluster;
● striped replicated volumes should be used in highly
concurrent environments where there is parallel access
to very large files and performance is critical
(see the sketch below);
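A sketch with hypothetical names: both counts are given, and the number of bricks must be a multiple of stripe × replica:
# 2-way stripe over 2-way replication → 4 bricks
> gluster volume create strrep-vol stripe 2 replica 2 server1:/export/brick1 server2:/export/brick1 server3:/export/brick1 server4:/export/brick1
> gluster volume start strrep-vol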
41. Striped + replicated / results
striped+replicated — Sequential Output per Blocks: 31.0 ± 0.3 MB/s; Sequential Rewrite: 18.4 ± 4.7 MB/s; Sequential Input per Blocks: 44.5 ± 1.6 MB/s (values are average ± std. deviation)
42. Striped + replicated / comments
● Tests on this volume always involved one I/O process at
a time, so it is not surprising that a volume type designed
for highly concurrent environments appears to be less
performant.
● It still achieves decent I/O rates.
43. Imagerepo / results
imagerepo — Sequential Output per Blocks: 98.3 ± 3.3 MB/s; Sequential Rewrite: 38.0 ± 0.4 MB/s; Sequential Input per Blocks: 64.4 ± 2.3 MB/s (values are average ± std. deviation)
44. Imagerepo / comments
● The input and output (per block) tests gave higher values
than the previous tests, due to the greater availability of
resources.
● Imagerepo is the repository where the virtual machine
images, ready to be cloned and started in vmdir, are stored.
● It is very important that this repository is always up, in order
to avoid data loss, so creating a replicated repository is
recommended.
45. Vmdir / results
vmdir — Sequential Output per Blocks: 47.6 ± 2.2 MB/s; Sequential Rewrite: 24.8 ± 1.5 MB/s; Sequential Input per Blocks: 62.7 ± 0.8 MB/s (values are average ± std. deviation)
46. vmdir / comments
● These results are worse than imagerepo's, but still better
than the first three (test-volume) ones.
● It is a volume shared by two servers with the 5 machines
hosting the virtual machine instances, so it is very
important that this volume doesn't crash.
● It's the best candidate to be a
replicated+striped+distributed volume.