Jan 28th, 2013 - 10:00 am
UC Davis
Title: Analyzing Data Movements and Identifying Techniques for Next-generation Networks
Abstract: The large bandwidth provided by today’s networks requires careful evaluation in order to eliminate system overheads and to bring the anticipated high performance to the application layer. As part of the Advanced Networking Initiative (ANI) project, we have conducted a large number of experiments in the initial evaluation of the 100Gbps network prototype.
We needed intense fine-tuning, in both the network and application layers, to take advantage of the higher network capacity. Instead of explicitly improving every application each time we change the underlying link technology, we require novel data movement mechanisms and abstraction layers for end-to-end processing of data. Based on our experience with the 100Gbps network, we have developed an experimental prototype called MemzNet: Memory-mapped Zero-copy Network Channel. MemzNet defines new data access methods in which applications map memory blocks for remote data, in contrast to the send/receive semantics. In one of the early demonstrations of 100Gbps network applications, we used an initial implementation of MemzNet that aggregates files into blocks and provides dynamic data channel management. We observed that MemzNet outperformed the current state-of-the-art file-centric data transfer tools, in both performance and efficiency, for the transfer of climate datasets with many small files. In this talk, I will mainly describe our experience in the 100Gbps tests and present results from the 100Gbps demonstration. I will also briefly explain the ANI testbed environment and highlight future research plans.
Bio: Mehmet Balman is a researcher working as a computer engineer in the Computational Research Division at Lawrence Berkeley National Laboratory. His recent work particularly deals with efficient data transfer mechanisms, high-performance network protocols, bandwidth reservation, network virtualization, and scheduling and resource management for large-scale applications. He received his doctoral degree in computer science from Louisiana State University (LSU) in 2010. He has several years of industry experience as a system administrator and R&D specialist at various software companies before joining LSU. He also worked as a summer intern at Los Alamos National Laboratory.
1. Experiences with 100Gbps Network Applications
Analyzing Data Movements and Identifying Techniques for Next-generation High-bandwidth Networks
Mehmet Balman
Computational Research Division
Lawrence Berkeley National Laboratory
2. Outline
Climate data as a typical science scenario
• 100Gbps Climate100 demo
MemzNet: Memory-mapped Zero-copy Network Channel
Advanced Networking Initiative (ANI) Testbed
• 100Gbps experiments
Future research plans
3. 100Gbps networking has finally arrived!
Applications’ Perspective
Increasing the bandwidth is not sufficient by itself; we need careful evaluation of high-bandwidth networks from the applications’ perspective.
The 1Gbps-to-10Gbps transition (10 years ago): applications did not run 10 times faster just because more bandwidth was available.
4. The need for 100Gbps
Modern science is data-driven and collaborative in nature.
• The largest collaborations are the most likely to depend on distributed architectures.
• LHC (distributed architecture): data generation, distribution, and analysis.
• The volume of data produced by genomic sequencers is rising exponentially.
• In climate science, researchers must analyze observational and simulation data located at facilities around the world.
5. ANI 100Gbps Demo (late 2011 – early 2012)
• 100Gbps demo by ESnet and Internet2
• Application design issues and host tuning strategies to scale to 100Gbps rates
• Visualization of remotely located data (cosmology)
• Data movement of large datasets with many files (climate analysis)
7. Data distribution for climate science
How can scientific data movement and analysis between geographically disparate supercomputing facilities benefit from high-bandwidth networks?
• Local copies: data files are copied into temporary storage in HPC centers for post-processing and further climate analysis.
11. Climate Data over 100Gbps
• Data volume in climate applications is increasing exponentially.
• An important challenge in managing ever-increasing data sizes in climate science is the large variance in file sizes.
• Climate simulation data consists of a mix of relatively small and large files, with an irregular file-size distribution in each dataset.
• Many small files
12. lots-of-small-files problem!
file-centric tools?
[Diagram: per-file request/response exchanges; RPC: request data → send data; FTP: request a file → send file]
• Concurrent transfers
• Parallel streams
14. MemzNet: memory-mapped zero-copy
network channel
[Diagram: on each end, front-end threads access memory blocks while network threads move the blocks between hosts]
Memory caches are logically mapped between the client and the server.
15. Advantages
• Decoupling I/O and network operations (see the sketch after this list)
• front-end (I/O processing)
• back-end (networking layer)
• Not limited by the characteristics of the file sizes
• On-the-fly tar approach: bundling and sending many files together
• Dynamic data channel management: can increase/decrease the parallelism level in both the network communication and the I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants).
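A minimal sketch (not MemzNet’s actual code) of this decoupled design: front-end threads fill memory blocks while back-end threads drain them through a shared queue, so the parallelism level on either side can change without touching the data channel. All names and sizes here are illustrative.

#include <pthread.h>
#include <stdlib.h>

#define NBLOCKS 8            /* in-flight block slots */
#define BLOCK_SIZE (4 << 20) /* 4MB blocks, as in the demo */

typedef struct {
    char *blocks[NBLOCKS];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} block_queue;

static block_queue q = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER,
    .not_full = PTHREAD_COND_INITIALIZER,
};

/* front-end thread: produce blocks (I/O side) */
static void *front_end(void *arg) {
    (void)arg;
    for (;;) {
        char *blk = malloc(BLOCK_SIZE);
        /* ... read file data into blk ... */
        pthread_mutex_lock(&q.lock);
        while (q.count == NBLOCKS)
            pthread_cond_wait(&q.not_full, &q.lock);
        q.blocks[q.tail] = blk;
        q.tail = (q.tail + 1) % NBLOCKS;
        q.count++;
        pthread_cond_signal(&q.not_empty);
        pthread_mutex_unlock(&q.lock);
    }
    return NULL;
}

/* back-end thread: consume blocks (network side) */
static void *back_end(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q.lock);
        while (q.count == 0)
            pthread_cond_wait(&q.not_empty, &q.lock);
        char *blk = q.blocks[q.head];
        q.head = (q.head + 1) % NBLOCKS;
        q.count--;
        pthread_cond_signal(&q.not_full);
        pthread_mutex_unlock(&q.lock);
        /* ... stream blk over a TCP socket ... */
        free(blk);
    }
    return NULL;
}

int main(void) {
    pthread_t fe, be;
    pthread_create(&fe, NULL, front_end, NULL); /* more threads = more I/O parallelism */
    pthread_create(&be, NULL, back_end, NULL);  /* more threads = more network parallelism */
    pthread_join(fe, NULL); /* threads loop forever in this sketch */
    return 0;
}

Because the two sides meet only at the queue, adding or removing threads on either side changes the parallelism level without closing and reopening the connection.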
16. Advantages
The synchronization of the memory cache is accomplished based on the tag header (an illustrative layout follows below).
• Application processes interact with the memory blocks.
• Enables out-of-order and asynchronous send/receive.
MemzNet is not file-centric: bookkeeping information is embedded inside each block.
• Can increase/decrease the number of parallel streams without closing and reopening the data channel.
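The slides do not give the tag layout, but a hypothetical header along these lines shows how bookkeeping can travel with each block, making out-of-order delivery safe (field names are illustrative, not MemzNet’s actual format):

#include <stdint.h>

typedef struct {
    uint64_t file_id; /* which logical file the payload belongs to */
    uint64_t offset;  /* byte offset of the payload within that file */
    uint32_t length;  /* number of payload bytes following the header */
    uint32_t flags;   /* e.g., a last-block-of-file marker */
} block_tag;

Any stream can carry any block: the receiver needs only the tag, not per-file connection state, to place the payload.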
18. 100Gbps Demo
• CMIP3 data (35TB) from the GPFS filesystem at NERSC
• Block size: 4MB
• Each block’s data section was aligned according to the system page size (see the sketch after this list).
• 1GB cache at both the client and the server
• At NERSC, 8 front-end threads on each host read data files in parallel.
• At ANL/ORNL, 4 front-end threads processed received data blocks.
• 4 parallel TCP streams (four back-end threads) were used for each host-to-host connection.
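A small sketch of the numbers on this slide: a 1GB cache carved into page-aligned 4MB blocks (the sizes come from the slide; the allocation scheme itself is an assumption):

#include <stdlib.h>
#include <unistd.h>

#define CACHE_SIZE (1UL << 30) /* 1GB cache per side */
#define BLOCK_SIZE (4UL << 20) /* 4MB blocks */

int main(void) {
    long pagesize = sysconf(_SC_PAGESIZE);
    void *cache;
    if (posix_memalign(&cache, (size_t)pagesize, CACHE_SIZE) != 0)
        return 1;
    /* Block i starts at (char *)cache + i * BLOCK_SIZE; since BLOCK_SIZE
       is a multiple of the page size, every block's data section stays
       page-aligned, as described on the slide. */
    size_t nblocks = CACHE_SIZE / BLOCK_SIZE; /* 256 blocks */
    (void)nblocks;
    free(cache);
    return 0;
}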
20. Framework for the Memory-mapped
Network Channel
Memory caches are logically mapped between the client and the server.
21. MemzNet’s Features
Data files are aggregated and divided into simple blocks. Blocks are tagged and streamed over the network. Each data block’s tag includes information about the content inside.
Decouples disk and network I/O operations, so read/write threads can work independently.
Implements a memory cache management system that is accessed in blocks. These memory blocks are logically mapped to the memory cache that resides at the remote site. (A receive-side sketch follows below.)
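To make the tagged-block idea concrete, here is a hypothetical receive loop: a back-end thread reads a tag, then the payload, and places the block into the local cache using only the tag. recv_full() and the wire format are assumptions, not MemzNet’s actual protocol.

#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>

typedef struct {
    uint64_t file_id, offset;
    uint32_t length, flags;
} block_tag;

/* read exactly n bytes from the socket */
static int recv_full(int sock, void *buf, size_t n) {
    char *p = buf;
    while (n > 0) {
        ssize_t r = recv(sock, p, n, 0);
        if (r <= 0)
            return -1;
        p += r;
        n -= r;
    }
    return 0;
}

/* one back-end receive loop; nslots is the number of cache blocks */
int receive_blocks(int sock, char *cache, size_t block_size, size_t nslots) {
    block_tag tag;
    size_t slot = 0;
    while (recv_full(sock, &tag, sizeof tag) == 0) {
        if (tag.length > block_size)
            return -1; /* malformed block */
        if (recv_full(sock, cache + slot * block_size, tag.length) != 0)
            return -1;
        /* front-end threads later write this payload to file_id at
           offset; the network thread keeps no file-level state */
        slot = (slot + 1) % nslots;
    }
    return 0;
}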
25. MemzNet’s Performance
[Chart: throughput of GridFTP vs. MemzNet, in the 100Gbps demo and on the ANI Testbed; TCP buffer size set to 50MB]
26. Challenge?
• High bandwidth brings new challenges!
• We need a substantial amount of processing power and the involvement of multiple cores to fill a 40Gbps or 100Gbps network.
• Fine-tuning, in both the network and application layers, is needed to take advantage of the higher network capacity.
• Incremental improvement in current tools?
• We cannot expect every application to be tuned and improved every time we change the link technology or speed.
27. MemzNet
• MemzNet: Memory-mapped Network Channel
• High-performance data movement
MemzNet is an initial effort to put a new layer between the application and the transport layer (an illustrative interface follows below).
• The main goal is to define a network channel that applications can use directly, without the burden of managing/tuning the network communication.
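One way to picture such a layer is as a small channel API that hides the tuning behind a handful of calls. This interface is purely illustrative, not MemzNet’s published API:

/* hypothetical channel interface (illustrative only) */
#include <stddef.h>

typedef struct memz_channel memz_channel;

memz_channel *memz_open(const char *host, int port);    /* establish the data channel */
void         *memz_map(memz_channel *c, size_t nbytes); /* map the shared block cache */
int           memz_sync(memz_channel *c);               /* synchronize cache with the peer */
void          memz_close(memz_channel *c);

The application reads and writes the mapped cache; stream counts, buffer sizes, and retransmission stay inside the channel implementation.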
28. Framework for the Memory-mapped
Network Channel
Memory caches are logically mapped between the client and the server.
30. Initial Tests
• Many TCP sockets oversubscribe the network and cause performance degradation.
• Host system performance could easily be the bottleneck.
• TCP/UDP buffer tuning, using jumbo frames, and interrupt coalescing (see the sketch after this list).
• Multi-core systems: IRQ binding is now essential for maximizing host performance.
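As a concrete example of the buffer tuning mentioned above, a per-socket version might look like this (the 50MB value matches the tests reported later in the deck; setsockopt() is the standard mechanism, but this exact helper is illustrative):

#include <sys/socket.h>

int tune_buffers(int sock) {
    int buf = 50 * 1024 * 1024; /* 50MB, the TCP buffer size used in the tests */
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &buf, sizeof buf) < 0)
        return -1;
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &buf, sizeof buf) < 0)
        return -1;
    return 0;
}

Note that the kernel caps these values at net.core.rmem_max / net.core.wmem_max, which is why the sysctl settings on the NIC/TCP tuning slide matter.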
31. Many Concurrent Streams
(a) Total throughput vs. the number of concurrent memory-to-memory transfers; (b) interface traffic, packets per second (blue) and bytes per second, over a single NIC with different numbers of concurrent transfers. Three hosts, each with 4 available NICs, and a total of 10 10Gbps NIC pairs were used to saturate the 100Gbps pipe in the ANI Testbed. 10 data movement jobs, each corresponding to a NIC pair, started simultaneously at source and destination. Each peak represents a different test; 1, 2, 4, 8, 16, 32, and 64 concurrent streams per job were initiated for 5-minute intervals (e.g., at concurrency level 4 there are 40 streams in total).
32. Effects of many concurrent streams
ANI Testbed 100Gbps (10x10G NICs, three hosts): interrupts/CPU vs. the number of concurrent transfers [1, 2, 4, 8, 16, 32, 64 concurrent jobs, 5-minute intervals]; TCP buffer size is 50MB.
33. Parallel Streams - 10Gbps
ANI Testbed 10Gbps connection: interface traffic vs. the number of parallel streams [1, 2, 4, 8, 16, 32, 64 streams, 5-minute intervals]; TCP buffer size is set to 50MB.
34. Parallel Streams – 40Gbps
ANI Testbed 40Gbps (4x10G NICs, single host): interface traffic vs. the number of parallel streams [1, 2, 4, 8, 16, 32, 64 streams, 5-minute intervals]; TCP buffer size is set to 50MB.
35. Parallel Streams – 40Gbps
ANI Testbed 40Gbps (4x10G NICs, single host): throughput vs. the number of parallel streams [1, 2, 4, 8, 16, 32, 64 streams, 5-minute intervals]; TCP buffer size is set to 50MB.
36. Host tuning
With proper tuning, we achieved 98Gbps using only 3 sending hosts, 3 receiving hosts, 10 10GE NICs, and 10 TCP flows.
Image source: “Experiences with 100Gbps Network Applications,” in Proceedings of the Fifth International Workshop on Data-Intensive Distributed Computing, 2012.
37. NIC/TCP Tuning
• We are using Myricom 10G NICs (100Gbps testbed)
• Download the latest driver/firmware from the vendor site
• The version of the driver in RHEL/CentOS is fairly old
• Enable MSI-X
• Increase txqueuelen:
/sbin/ifconfig eth2 txqueuelen 10000
• Increase interrupt coalescing:
/usr/sbin/ethtool -C eth2 rx-usecs 100
• TCP tuning:
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.core.netdev_max_backlog = 250000
From “Experiences with 100Gbps Network Applications,” in Proceedings of the Fifth International Workshop on Data-Intensive Distributed Computing, 2012.
38. 100Gbps = It’s full of frames !
• Problem:
• Interrupts are very expensive
• Even with jumbo frames and driver optimization, there are still too many interrupts.
• Solution:
• Turn off Linux irqbalance (chkconfig irqbalance off)
• Use /proc/interrupts to get the list of interrupts
• Dedicate an entire processor core to each 10G interface
• Use /proc/irq/<irq-number>/smp_affinity to bind rx/tx queues to a specific core (a small sketch follows below)
From “Experiences with 100Gbps Network Applications,” in Proceedings of the Fifth International Workshop on Data-Intensive Distributed Computing, 2012.
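A tiny sketch of the smp_affinity step described above: write a CPU mask into /proc/irq/<irq-number>/smp_affinity (the IRQ number and mask are placeholders; this helper is illustrative, not from the paper):

#include <stdio.h>

int bind_irq(int irq, unsigned int cpu_mask) {
    char path[64];
    snprintf(path, sizeof path, "/proc/irq/%d/smp_affinity", irq);
    FILE *f = fopen(path, "w"); /* requires root */
    if (!f)
        return -1;
    fprintf(f, "%x\n", cpu_mask); /* e.g., 0x4 pins the IRQ to core 2 */
    return fclose(f);
}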
39. Host Tuning Results
[Bar chart: throughput (Gbps, 0–45) with and without tuning, for interrupt coalescing (TCP), interrupt coalescing (UDP), IRQ binding (TCP), and IRQ binding (UDP)]
Image source: “Experiences with 100Gbps Network Applications,” in Proceedings of the Fifth International Workshop on Data-Intensive Distributed Computing, 2012.
40. Initial Tests
• Many TCP sockets oversubscribe the network and cause performance degradation.
• Host system performance could easily be the bottleneck.
• TCP/UDP buffer tuning, using jumbo frames, and interrupt coalescing.
• Multi-core systems: IRQ binding is now essential for maximizing host performance.
41. End-to-end Data Movement
• Tuning parameters
• Host performance
• Multiple streams on the host systems
• Multiple NICs and multiple cores
• Effect of the application design
43. MemzNet
• MemzNet: Memory-mapped Network Channel
• High-performance data movement
MemzNet is an initial effort to put a new layer between the application and the transport layer.
• The main goal is to define a network channel that applications can use directly, without the burden of managing/tuning the network communication.
44. Framework for the Memory-mapped
Network Channel
Memory caches are logically mapped between the client and the server.
45. Related Work
• Luigi Rizzo’s netmap
• proposes a new API to send/receive data over the network
• http://info.iet.unipi.it/~luigi/netmap/
• RDMA programming model
• MemzNet can be used for RDMA-enabled transfers
• memory-management middleware
• GridFTP popen + reordering?
46. Future Work
• Integrate MemzNet with scientific data formats for remote data analysis
• Provide a user library for filtering and subsetting of data.
47. Testbed Results
http://www.es.net/RandD/100g-testbed/
• Proposal process
• Testbed Description, etc.
• http://www.es.net/RandD/100g-testbed/results/
• RoCE (RDMA over Converged Ethernet) tests
• RDMA implementation (RFPT100)
• 100Gbps demo paper
• Efficient Data Transfer Protocols
• MemzNet (Memory-mapped Zero-copy Network Channel)
48. Acknowledgements
Collaborators:
Eric Pouyoul, Yushu Yao, E. Wes Bethel, Burlen Loring, Prabhat, John Shalf, Alex Sim, Arie Shoshani, and Brian L. Tierney
Special Thanks:
Peter Nugent, Zarija Lukic, Patrick Dorn, Evangelos Chaniotakis, John Christman, Chin Guok, Chris Tracy, Lauren Rotman, Jason Lee, Shane Canon, Tina Declerck, Cary Whitney, Ed Holohan, Adam Scovel, Linda Winkler, Jason Hill, Doug Fuller, Susan Hicks, Hank Childs, Mark Howison, Aaron Thomas, John Dugan, Gopal Vaswani
49. NDM 2013 Workshop (tentative)
The 3rd International Workshop on Network-aware Data Management, to be held in conjunction with the IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC’13)
http://sdm.lbl.gov/ndm
Editing a book (CRC Press): NDM research frontiers (tentative)