Improving File Transfer Service ▬ FTS3
September 2015
Author:
Hamza Zafar
Email: hamza.zafar@cern.ch
Supervisor(s):
Oliver Keeble
Alejandro Alvarez
CERN openlab Summer Student Report 2015
Project Specification
The experiments at CERN generate colossal amounts of data. The data centre stores it and sends it
around the world for analysis. In the first run of the LHC, 30 petabytes of data were produced
annually, and even larger volumes are expected during the second run [1].
To store and process this data, CERN relies on a grid infrastructure known as the WLCG (Worldwide
LHC Computing Grid), which consists of about 170 collaborating computing centres in 42 countries.
One of the major challenges at CERN is to globally replicate and distribute the data coming from the
colliders across the WLCG infrastructure. To address this problem, a file transfer service (FTS3)
has been developed at CERN for bulk transfers of physics data. In this manner, real-time LHC data is
not only distributed and replicated across the WLCG, but also made available to a community of
~8000 physicists around the globe.
The Improving the File Transfer Service project is geared towards effectively utilizing the available
network resources as well as introducing new algorithms into the FTS3 scheduler so that it can make
more intelligent decisions when scheduling file transfers.
Abstract
This project deals with two aspects of improving the file transfer service, FTS3. The first is the
selection of the best source site for file transfers. Since files are replicated at different sites, selecting
the best source site based on network throughput and success rate can have a major impact on FTS3.
The second is maximizing file throughput across the WLCG network by increasing the TCP buffer
sizes. TCP is the most widely used transport-layer protocol for data transfers; it was originally
designed with a focus on reliability and long-term fairness. On high-bandwidth networks, system
administrators have to manually tune the TCP configuration, and some of these settings have a major
impact on throughput. The TCP buffer size is one such setting: it places a limit on the TCP congestion
window size. With the release of Linux kernel 2.6, a new feature, "Linux TCP auto-tuning", was
introduced, which selects the TCP buffer sizes based on system resource usage. Another way to
increase the TCP buffer size is to use the setsockopt system call. Since FTS3 uses the gridFTP
protocol, we also have the flexibility to set the TCP buffer sizes manually. This project evaluates the
pros and cons of the different techniques for setting TCP buffer sizes.
Keywords: FTS3, TCP, Linux TCP Auto-Tuning
Table of Contents
1 Introduction
  1.1 WLCG Architecture
  1.2 Use Cases of FTS3
  1.3 FTS3 @ CERN
2 Improving the FTS3 Scheduler
  2.1 Current Scheduling Behaviour for multiple replicas
  2.2 New Scheduling Behaviour for multiple replicas
  2.3 Caching Database Queries
3 Effect of TCP configurations on throughput
  3.1 TCP Optimal Buffer Size
  3.2 Performance Evaluation of FTS3 transfers with and without Linux auto-tuning
4 Bibliography
List of Figures
Figure 1: Tiers in WLCG
Figure 2: FTS3 workflow
Figure 3: Transfer request specifying multiple replicas
Figure 4: List of scheduling algorithms with their selection criteria
Figure 5: Transfer request specifying selection strategy
Figure 6: Activity priorities for ATLAS
Figure 7: FTS3 caching layer
Figure 8: Effect of large buffer sizes on throughput
Figure 9: Graph for file transfer when Linux auto-tuning is in action
Figure 10: Graph for file transfer when manually setting a 16MB Buffer (32 MB allocated)
Figure 11: Graph for file transfers when Linux auto-tuning is in action
Figure 12: Graph for file transfers when manually setting a 32MB Buffer (64 MB allocated)
1 Introduction
FTS3 [2] is one of the projects critical to data management at CERN [3]. It is the main service for
distributing the majority of LHC [4] data across the WLCG [5] infrastructure. It provides reliable
bulk transfers of files from one WLCG site to another, while allowing the participating sites to
control their network resource usage. FTS3 is a mature service that has been running for more than
two years at CERN.
1.1 WLCG Architecture
WLCG stands for WorldWide LHC Computing Grid; it is a collaboration of more than 170
computing centres in 42 countries. The mission of WLCG is to store, analyse and replicate LHC
data. Figure 1 shows the architecture of WLCG, it consist of three tiers 0, 1 and 2. These tiers are
made up of several computing centres. Tier 0 is the CERN’s datacentre, which is connected to 13
tier 1 sites with 10Gbps links. Tier 2 sites are connected using 1Gbps links. FTS3 service plays a
vital role in moving data across this complex mesh of computing centres.
1.2 Use Cases of FTS3
- An individual or small team can access the web interface to FTS to schedule transfers between
storage systems. They can browse the contents of the storage, invoke and manage transfers, and
leave FTS to do the rest.
- A team's data manager can use the FTS command line interface to schedule bulk transfers
between storage systems.
- A data manager can install an FTS service for local users. The service is equipped with advanced
monitoring and debugging capabilities which enable her to give support to her users.
- Maintainers of frameworks which provide higher-level functionality can delegate responsibility
for transfer management to FTS by integrating it using the various programming interfaces
available, including a REST API. Their users thus continue to use a familiar interface while
profiting from the power of FTS transfer management.
Figure 1: Tiers in WLCG
1.3 FTS3 @ CERN
The four major experiments at CERN are ATLAS, CMS, LHCb and ALICE. The first three use FTS
for file transfers. Figure 2 shows the general workflow of the FTS3 service. The clients (CMS,
ATLAS and LHCb) send file transfer requests to FTS3, and FTS3 uses the gridFTP [6] protocol to
initiate third-party transfers between storage endpoints. FTS3 supervises the transfers between the
storage endpoints and finally archives the job status. The fts-transfer-status command provided by
the FTS3 command line interface can be used to inquire about the job status. Alternatively, clients
can also use the FTS3 web interface for transfer management and monitoring. On average, FTS3
transfers 15 petabytes of data per month.
2 Improving the FTS3 Scheduler
To handle the incoming transfer jobs, FTS3 maintains a separate queue for each link between two
WLCG endpoints. Files are usually replicated across different sites in the WLCG. If a client (ATLAS,
CMS or LHCb) submits a request for transferring a file with multiple replicas, the FTS3 scheduler is
responsible for selecting the best source site. Figure 3 shows a transfer request specifying source
sites for multiple replicas.
2.1 Current Scheduling Behaviour for multiple replicas
For each site, the FTS3 database maintains a count of the pending files in its queue. It also contains
information about the throughput and success rate of previous transfers. In order to choose the best
replica, the current FTS3 scheduler queries only the number of pending files: the site with the
minimum number of pending files in its queue is chosen as the source site. This approach is not
efficient because factors such as throughput and success rate are completely ignored.
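To make this behaviour concrete, the following is a minimal sketch of the current selection rule; it is
not the actual FTS3 implementation (which is in C++ and backed by the FTS3 database), and the
pending_files lookup is a hypothetical stand-in for the corresponding database query.

def choose_source_current(candidate_sites, pending_files):
    """Current behaviour: pick the replica whose source site has the
    fewest files already queued, ignoring throughput and success rate.

    candidate_sites -- list of site names offering a replica
    pending_files   -- dict mapping site name -> number of queued files
    """
    return min(candidate_sites, key=lambda site: pending_files.get(site, 0))


# Example: site B wins because it has the shortest queue,
# even if its link is slow or unreliable.
print(choose_source_current(["A", "B", "C"], {"A": 120, "B": 15, "C": 60}))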
Figure 2: FTS3 workflow
Figure 3: Transfer request specifying multiple replicas
2.2 New Scheduling Behaviour for multiple replicas
We have developed a number of algorithms to address the shortcomings of the FTS3 scheduling
decisions for multiple replicas. In addition to the number of pending files in the queue, the new
algorithms also consider the throughput and success rate when making a scheduling decision. The
algorithms are listed in the table below (Figure 4).

Algorithm                Selection Criteria
orderly                  Selects the first site mentioned in the transfer request
queue / auto             Selects the source site with the least number of pending files
success                  Selects the source site with the highest success rate
throughput               Selects the source site with the highest total throughput
file-throughput          Selects the source site with the highest per-file throughput
pending-data             Selects the source site with the minimum amount of pending data
waiting-time             Selects the source site with the shortest waiting time
waiting-time-with-error  Selects the source site with the shortest waiting time with error
duration                 Selects the source site with the earliest finish time

Figure 4: List of scheduling algorithms with their selection criteria

FTS3 clients can now specify their selection strategy in the transfer request, and in this way they can
control the behaviour of the FTS3 scheduler. Figure 5 shows a transfer request specifying the
selection strategy.
Figure 5: Transfer request specifying selection strategy
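Since Figures 3 and 5 are not reproduced in this text-only version, the sketch below only illustrates the
idea: a multi-replica submission pairs several candidate source URLs with one destination and names
the desired strategy. The field names (files, sources, destinations, strategy) are assumptions that
approximate the REST submission format and are not guaranteed to match the real schema.

# Illustrative only: field names approximate the FTS3 REST submission
# format and may differ from the real schema.
transfer_job = {
    "files": [
        {
            # Several replicas of the same file, held at different WLCG sites.
            "sources": [
                "gsiftp://storage.site-a.example/atlas/data/file.root",
                "gsiftp://storage.site-b.example/atlas/data/file.root",
            ],
            "destinations": [
                "gsiftp://storage.site-c.example/atlas/data/file.root",
            ],
        }
    ],
    # Hypothetical parameter naming which algorithm from the table above
    # the scheduler should apply when ranking the candidate source sites.
    "params": {"strategy": "waiting-time-with-error"},
}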
- Calculation of Pending Data:
FTS3 maintains a priority configuration for each client's activities; Figure 6 shows the
configuration for ATLAS. If a newly submitted job has a higher priority than the jobs
waiting in the queue, it takes precedence and its transfer is started immediately.
Therefore, in order to calculate the amount of pending data in the queue, we aggregate the
data of all jobs whose priority is greater than or equal to the priority of the newly submitted
job.
- Calculation of Waiting Time:
The waiting time is defined as the time a transfer request spends in the FTS3 queue. It is
calculated as follows:
Waiting Time = Pending Data / Throughput
- Calculation of Waiting Time with Error:
The error is the predicted amount of data that will have to be resent because of transfer
failures. It is calculated as follows:
Failure Rate = 100 - Success Rate
Error = Pending Data * (Failure Rate / 100)
Waiting Time With Error = (Pending Data + Error) / Throughput
Figure 6: Activity priorities for ATLAS
- Calculation of Finish Time:
The algorithm named "duration" ranks the source sites by total finish time. The finish time
is defined as the time it takes to complete the file transfer, including the waiting time in the
FTS3 queue. A short code sketch of these four calculations follows below.
Finish Time = Waiting Time With Error + Submitted File Size / Per File Throughput
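The following small sketch restates the metrics above as code, assuming throughput in bytes per
second, data sizes in bytes and the success rate as a percentage; it illustrates the definitions and is not
the actual FTS3 implementation.

def pending_data(queued_jobs, new_job_priority):
    """Aggregate the data (bytes) of queued jobs whose priority is greater
    than or equal to the new job's, since only those run before it."""
    return sum(size for size, priority in queued_jobs
               if priority >= new_job_priority)


def waiting_time(pending, throughput):
    """Expected time (s) the request spends in the FTS3 queue."""
    return pending / throughput


def waiting_time_with_error(pending, throughput, success_rate):
    """Waiting time (s) including the data predicted to need resending."""
    failure_rate = 100.0 - success_rate      # percent
    error = pending * failure_rate / 100.0   # bytes expected to be resent
    return (pending + error) / throughput


def finish_time(pending, throughput, success_rate,
                submitted_file_size, per_file_throughput):
    """Waiting time with error plus the transfer time of the new file."""
    return (waiting_time_with_error(pending, throughput, success_rate)
            + submitted_file_size / per_file_throughput)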
2.3 Caching Database Queries
The new FTS3 scheduler depends on the number of pending files and on the throughput and success
rate information stored in the FTS3 database, so it issues many queries and increases the load on the
database. To overcome this problem, we have implemented a caching layer between the FTS3
scheduler and the database. In order to maximize performance, we maintain a separate cache for each
thread. An expiration timer is configured for each cache entry; the default entry expiration time is
5 minutes. We also ran into situations in which the cache held a large number of expired entries, so,
to free the memory, we added a cache clean-up timer: when it expires, the FTS3 caching module
deletes all expired entries from the cache. The default clean-up interval is 30 minutes. It should be
noted that the maximum memory the cache can consume is estimated to be less than a megabyte. In
the future, distributed memory caches, e.g. memcached [7] and Redis [8], could also be integrated
into the caching layer.
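As an illustration of this approach (not the actual FTS3 C++ implementation), a per-thread cache
with per-entry expiration and a periodic clean-up could be sketched as follows; the names are
invented for the example, while the 5-minute and 30-minute defaults mirror the description above.

import threading
import time

ENTRY_TTL = 5 * 60          # cache entry expiration: 5 minutes
CLEANUP_INTERVAL = 30 * 60  # cache clean-up timer: 30 minutes

_thread_local = threading.local()  # one independent cache per scheduler thread


def _cache():
    if not hasattr(_thread_local, "cache"):
        _thread_local.cache = {}  # key -> (value, expiry timestamp)
        _thread_local.next_cleanup = time.time() + CLEANUP_INTERVAL
    return _thread_local.cache


def cached_query(key, run_query):
    """Return a cached database result, refreshing it after ENTRY_TTL and
    dropping all expired entries whenever the clean-up timer fires."""
    now = time.time()
    cache = _cache()

    # Periodic clean-up of expired entries to free memory.
    if now >= _thread_local.next_cleanup:
        for k in [k for k, (_, exp) in cache.items() if exp <= now]:
            del cache[k]
        _thread_local.next_cleanup = now + CLEANUP_INTERVAL

    entry = cache.get(key)
    if entry is None or entry[1] <= now:
        value = run_query()  # fall through to the FTS3 database
        cache[key] = (value, now + ENTRY_TTL)
        return value
    return entry[0]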
Figure 7: FTS3 caching layer

3 Effect of TCP configurations on throughput
TCP is the most widely used transport-layer protocol for data transfers; it was originally designed
with a focus on reliability and long-term fairness. On high-bandwidth networks, system
administrators have to manually tune the TCP configuration, and some of these settings have a major
impact on throughput. The TCP buffer size is one such setting, since it limits the size of the TCP
congestion window.
The computing centres in the WLCG infrastructure are linked by high-bandwidth, high-latency
networks, so we focus on effectively utilizing the available network resources.
Our end goal is to increase the throughput of FTS3 file transfers. In this section we compare and
contrast the impact of different TCP tuning methods on the throughput of FTS3 file transfers.
3.1 TCP Optimal Buffer Size
TCP maintains a "congestion window" to determine how many packets can be in flight at one time.
This implies that the larger the congestion window, the higher the throughput. The kernel enforces a
limit on the maximum size of the TCP congestion window; by default, on most Linux distributions,
this limit is 4 MB, which is far too small for high-bandwidth links. System administrators can edit
the /etc/sysctl.conf file to change the default settings. The optimal TCP buffer size can be calculated
using the following formula:
Optimal Buffer Size = 2 * Bandwidth * Delay
We can use the ping command to measure the delay between two endpoints. Since ping reports the
Round Trip Time (RTT), which is twice the one-way delay, the formula reduces to:
Optimal Buffer Size = Bandwidth * RTT
For example, if the ping time is 200 ms and the network bandwidth is 1 Gbps, then the optimal buffer
is 25.6 MB, which is far larger than the default setting:
Optimal Buffer Size = (1 Gbit/s / 8 bits per byte) * 0.200 s = 25.6 MB
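The worked example can be reproduced with a few lines of arithmetic; this small helper simply
evaluates Bandwidth * RTT, taking 1 Gbit/s as 1024 Mbit/s as in the example above.

def optimal_buffer_mb(bandwidth_gbps, rtt_seconds):
    """Bandwidth-delay product in MB (1 Gbit/s taken as 1024 Mbit/s)."""
    bytes_per_second = bandwidth_gbps * 1024 * 1024 * 1024 / 8
    return bytes_per_second * rtt_seconds / (1024 * 1024)


print(optimal_buffer_mb(1, 0.200))  # -> 25.6 (MB), matching the example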
System administrators have to add or modify the following settings in the /etc/sysctl.conf file:
# increase max memory for sockets to 32MB
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
# increase Linux autotuning TCP buffer limit to 32MB
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
There are now two ways to make use of the increased TCP buffer limits:
- Linux TCP Auto-Tuning:
Linux TCP auto-tuning automatically adjusts the socket buffer sizes to balance TCP
performance against the system's memory usage. TCP auto-tuning is enabled by default in
Linux releases after versions 2.6.6 and 2.4.16. For auto-tuning, the maximum send and
receive buffer limits are specified by the net.ipv4.tcp_wmem and net.ipv4.tcp_rmem
parameters respectively.
- Manually setting the socket buffer sizes:
Application programmers can set the socket buffer sizes using the setsockopt system call.
The net.core.wmem_max and net.core.rmem_max parameters impose an upper limit on the
amount of memory that can be requested for the send and receive buffers respectively, and
the Linux kernel allocates double the amount of memory requested. It should be noted that
setting the socket buffer sizes manually disables Linux auto-tuning for that socket.
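For illustration, the sketch below shows the manual mechanism in Python; FTS3 and gridFTP are
implemented in C/C++, so this is only a demonstration of the setsockopt behaviour described above,
not of their code. Note how Linux reports back roughly the doubled allocation.

import socket

REQUESTED = 16 * 1024 * 1024  # request 16 MB; the kernel allocates ~32 MB

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Setting the buffers explicitly disables receive-buffer auto-tuning for
# this socket; the values are capped by net.core.rmem_max / wmem_max.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, REQUESTED)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, REQUESTED)

print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))  # ~2x REQUESTED on Linux
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))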
3.2 Performance Evaluation of FTS3 transfers with and without Linux
auto-tuning
During our initial testing, we transferred files from CERN to Australia. Our calculation of the
optimal buffer size suggested a 37 MB buffer, whereas the configured maximum TCP buffer size on
our system was 4 MB. We therefore increased the maximum buffer limit to 37 MB and measured the
throughput. Figure 8 shows the throughput with the system configured with a 4 MB and with a
37 MB maximum TCP buffer. Note that no buffer size was passed to the fts-transfer-submit command
(i.e. the buffer sizes were not set with the setsockopt system call); TCP auto-tuning took care of the
socket memory allocation in both cases. It is evident that by increasing the default limit on the
maximum TCP buffer size, the congestion window can open further, which results in higher
throughput.
Figure 8: Effect of large buffer sizes on throughput
Now the question arises: should FTS3 rely on Linux auto-tuning, or should users pass the buffer size
to the fts-transfer-submit command?
To answer this question, we transferred files from CERN to Tokyo. The number of streams was set
to 1, and the endpoints at both sites were configured with the 32 MB optimal buffer limit. The
receiver's advertised window and the CWND size were recorded with tcpdump and the Linux ss
utility respectively. Figure 9 shows the results when Linux auto-tuning is in action, and Figure 10
shows the results when the buffer size is passed to the fts-transfer-submit command. When we set the
buffer sizes manually, Linux allocates twice the amount requested; therefore, for a fair comparison
with auto-tuning, we request half of the 32 MB available to auto-tuning, i.e. a 16 MB request that the
kernel doubles to 32 MB.
[Figure 8 chart: throughput (MB/sec) vs. time (seconds) for the 4 MB and 37 MB buffer configurations]
[Figure 9 chart: Rcv-window, throughput and CWND (data in MB) vs. time (seconds)]
Figure 9: Graph for file transfer when Linux auto-tuning is in action
[Figure 10 chart: Rcv-window, throughput and CWND (data in MB) vs. time (seconds)]
With Linux auto-tuning the transfer time is 63 seconds, whereas the transfer time with manually set
buffer sizes is 59 seconds. Since both transfers reach the same maximum throughput, we can
conclude that manually setting the buffer sizes has no advantage over Linux auto-tuning.
Auto-tuning is also a safer option than manually setting the buffer sizes, because it can dynamically
resize the TCP buffers based on network performance and system resource usage.
We now shift our focus to comparing the effect of using multiple streams, with and without
auto-tuning. We transferred files from CERN to Tokyo with different numbers of streams. Figure 11
shows the results when Linux auto-tuning is in action, and Figure 12 shows the results when the
buffer size is manually set to 32 MB. Comparing the single-stream transfers (auto-tuning vs.
manually set buffer), the manually set buffer achieves higher throughput. This is because auto-tuning
can grow the CWND only up to 32 MB, whereas when the buffer size is set manually to 32 MB the
kernel allocates 64 MB, so the CWND can grow up to 64 MB. With 2 and 4 streams the throughput
is almost the same with and without auto-tuning. It is also evident from the graphs that increasing
the number of streams fills the pipe more quickly.
Figure 10: Graph for file transfer when manually setting a 16MB Buffer (32 MB allocated)
[Figure 11 chart: throughput (MB/sec) vs. time (seconds) for 1, 2 and 4 streams]
Figure 11: Graph for file transfers when Linux auto-tuning is in action
Figure 12: Graph for file transfers when manually setting a 32MB Buffer (64 MB allocated)
With this work we conclude that the operating system's default maximum buffer sizes are too small
for the WLCG's high-bandwidth network; the kernel-enforced limits on the TCP buffer sizes should
be increased. Since FTS3 performs third-party file transfers, there is currently no mechanism to
measure the RTT between two remote storage endpoints, and hence no way to calculate the
bandwidth-delay product unless we have access to the storage endpoints. In the future, RTT
information could be obtained from the WLCG perfSONAR project [9]. We have also seen that there
is no difference in throughput whether we use Linux auto-tuning or set the buffer sizes explicitly. All
we have to do is increase the maximum TCP buffer size on the storage endpoints and let Linux
auto-tuning decide the optimal buffer sizes.
Future work includes the support of distributed memory caching techniques in the FTS3 caching
layer and the performance evaluation of single file transfers with multiple streams vs. multiple file
transfers with a single stream.
[Figure 12 chart: throughput (MB/sec) vs. time (seconds) for 1, 2 and 4 streams]
4 Bibliography
[1] LHC Season 2. [Online]. http://home.web.cern.ch/about/updates/2015/06/lhc-season-2-first-physics-13-tev-start-tomorrow
[2] M. Salichos, M. K. Simon, O. Keeble, A. A. Ayllon, "FTS - New Data Movement Service for WLCG".
[3] CERN. [Online]. http://home.web.cern.ch/
[4] Large Hadron Collider. [Online]. http://home.web.cern.ch/topics/large-hadron-collider
[5] Worldwide LHC Computing Grid. [Online]. http://wlcg.web.cern.ch/
[6] GridFTP. [Online]. https://en.wikipedia.org/wiki/GridFTP
[7] Memcached. [Online]. http://memcached.org/
[8] Redis. [Online]. http://redis.io/
[9] WLCG perfSONAR. [Online]. http://maddash.aglt2.org/maddash-webui/