MemzNet: Memory-Mapped Zero-Copy Network Channel -- Streaming Exascale Data over 100Gbps Networks
1. Streaming Exa-scale Data over 100Gbps Networks
Mehmet Balman
Computational Research Division
Lawrence Berkeley National Laboratory
Collaborators: Eric Pouyoul, Yushu Yao, E. Wes Bethel, Burlen Loring, Prabhat, John Shalf, Alex Sim, Arie Shoshani, Dean N. Williams, Brian L. Tierney
2. Outline
• A recent 100Gbps demo by ESnet and Internet2 at SC11
• One of the applications: data movement of large datasets with many files (Scaling the Earth System Grid to 100Gbps Networks)
3. Climate Data Distribution
• ESG data nodes
• Data replication in the ESG Federation
• Local copies: data files are copied into temporary storage in HPC centers for post-processing and further climate analysis.
4. Climate Data over 100Gbps
• Data volume in climate applications is increasing exponentially.
• An important challenge in managing ever-increasing data sizes in climate science is the large variance in file sizes.
• Climate simulation data consists of a mix of relatively small and large files, with an irregular file-size distribution in each dataset.
• Many small files
5. Keep the Data Channel Full
[Diagram: request/response patterns; RPC (request / send data) vs. FTP (request a file / send file, repeated per file)]
• Concurrent transfers
• Parallel streams (see the sketch below)
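To make the two bullets concrete, here is a minimal C sketch (not MemzNet code; the host address, port, and sizing are assumptions for illustration) of the parallel-streams idea: striping one transfer across several TCP connections so that stalls on any single stream do not leave the pipe idle.

/* Sketch: stripe one logical transfer across NSTREAMS TCP connections.
 * Host/port are placeholders; error handling is trimmed for brevity. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NSTREAMS 4
#define CHUNK    (4 * 1024 * 1024)   /* 4MB writes, mirroring the demo's block size */

struct stream { int fd; const char *buf; size_t len; };

static void *sender(void *arg)
{
    struct stream *s = arg;
    size_t off = 0;
    while (off < s->len) {           /* each stream pushes its own slice */
        ssize_t n = write(s->fd, s->buf + off,
                          s->len - off > CHUNK ? CHUNK : s->len - off);
        if (n <= 0) break;
        off += (size_t)n;
    }
    return NULL;
}

int main(void)
{
    const char *host = "198.51.100.10";   /* placeholder data-node address */
    size_t total = (size_t)NSTREAMS * CHUNK;
    char *data = malloc(total);
    pthread_t tid[NSTREAMS];
    struct stream st[NSTREAMS];

    for (int i = 0; i < NSTREAMS; i++) {
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port   = htons(5000) };  /* placeholder port */
        inet_pton(AF_INET, host, &addr.sin_addr);
        st[i].fd = socket(AF_INET, SOCK_STREAM, 0);
        if (connect(st[i].fd, (struct sockaddr *)&addr, sizeof addr) != 0) {
            perror("connect");
            return 1;
        }
        st[i].buf = data + (size_t)i * CHUNK;   /* stripe i of the buffer */
        st[i].len = CHUNK;
        pthread_create(&tid[i], NULL, sender, &st[i]);
    }
    for (int i = 0; i < NSTREAMS; i++)
        pthread_join(tid[i], NULL);
    free(data);
    return 0;
}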
6. The lots-of-small-files problem! File-centric tools?
• Not necessarily high-speed (same distance)
  - Latency is still a problem (quantified below)
[Diagram: request a dataset / send data; the request pattern is the same over a 100Gbps pipe as over a 10Gbps pipe]
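A back-of-the-envelope example (numbers assumed for illustration, not from the slides) shows why: at a 50 ms round-trip time, a request-per-file protocol spends 100,000 x 50 ms, roughly 83 minutes, on request latency alone for a 100,000-file dataset, regardless of whether the pipe is 10Gbps or 100Gbps; requesting the whole dataset at once pays that round trip only once.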
7. Framework for the Memory-Mapped Network Channel
Memory caches are logically mapped between client and server.
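One way to read "logically mapped": the client's and server's caches need not mirror each other slot for slot; blocks carry tags, and matching happens by tag rather than by buffer address. A hedged C sketch of such a tagged block cache follows (field names and layout are assumptions, not MemzNet's actual structures):

/* Sketch: a block cache whose contents are logically (not physically)
 * mapped between client and server. Blocks are matched by a transfer-wide
 * ID, so a block can occupy any free slot on either end. */
#include <stdint.h>

#define BLOCK_SIZE (4u * 1024 * 1024)   /* 4MB, as used in the SC11 demo */
#define NBLOCKS    256                  /* 256 x 4MB = the demo's 1GB cache */

struct block_hdr {
    uint64_t block_id;  /* global tag; identifies the block on both ends */
    uint64_t file_id;   /* which file of the bundled dataset this is from */
    uint64_t offset;    /* byte offset of the payload within that file */
    uint32_t length;    /* valid payload bytes (may be < BLOCK_SIZE) */
};

struct block {
    struct block_hdr hdr;
    char payload[BLOCK_SIZE];           /* page-aligned in a real allocator */
};

struct block_cache {
    struct block *slot[NBLOCKS];        /* same logical pool on each side */
};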
9. Advantages
• Decoupling I/O and network operations:
  • front-end (I/O processing)
  • back-end (networking layer)
• Not limited by the characteristics of the file sizes: an on-the-fly tar approach bundles many files together and sends them as one stream.
• Dynamic data channel management: the parallelism level of both the network communication and the I/O read/write operations can be increased or decreased without closing and reopening the data channel connection (as is done in regular FTP variants); see the sketch after this list.
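Because front-end and back-end threads meet only at a shared block queue, the parallelism level is not tied to the lifetime of any connection. A hedged C sketch of how back-end sender threads could be grown or retired on a live channel (the queue itself is elided; all names are illustrative, not MemzNet's API):

/* Sketch: back-end workers all drain one shared queue, so adding or
 * removing a worker changes parallelism without touching connections. */
#include <pthread.h>

struct channel {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;   /* also used to wake workers for retirement */
    int             nthreads;   /* current back-end parallelism level */
    int             retire;     /* workers asked to exit at next wakeup */
    /* shared filled-block queue elided; see the sketch after slide 13 */
};

static void *backend_worker(void *arg)
{
    struct channel *ch = arg;
    for (;;) {
        pthread_mutex_lock(&ch->lock);
        while (ch->retire == 0 /* && queue is empty */)
            pthread_cond_wait(&ch->nonempty, &ch->lock);
        if (ch->retire > 0) {                 /* asked to scale down: exit, */
            ch->retire--;                     /* leaving the connections and */
            pthread_mutex_unlock(&ch->lock);  /* queue untouched */
            return NULL;
        }
        /* otherwise: pop a filled block, unlock, write() it out */
        pthread_mutex_unlock(&ch->lock);
    }
}

void set_parallelism(struct channel *ch, int target, pthread_t *tids)
{
    pthread_mutex_lock(&ch->lock);
    while (ch->nthreads < target)             /* grow: spawn another drain */
        pthread_create(&tids[ch->nthreads++], NULL, backend_worker, ch);
    if (ch->nthreads > target) {              /* shrink: retire the excess */
        ch->retire += ch->nthreads - target;
        ch->nthreads = target;
        pthread_cond_broadcast(&ch->nonempty);
    }
    pthread_mutex_unlock(&ch->lock);
}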
11. The SC11 100Gbps Demo
• CMIP3 data (35TB) from the GPFS filesystem at NERSC
• Block size: 4MB
• Each block's data section was aligned according to the system page size.
• 1GB cache at both the client and the server (see the allocation sketch after this list)
• At NERSC, 8 front-end threads on each host for reading data files in parallel.
• At ANL/ORNL, 4 front-end threads for processing received data blocks.
• 4 parallel TCP streams (four back-end threads) were used for each host-to-host connection.
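The demo's cache geometry (1GB of 4MB blocks, data sections on page boundaries) implies 256 blocks per endpoint. A minimal sketch, assuming a POSIX posix_memalign-based allocator (MemzNet's actual allocator may differ):

/* Sketch of the demo's cache geometry: 1GB of page-aligned 4MB blocks. */
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE (4UL * 1024 * 1024)          /* 4MB block size */
#define CACHE_SIZE (1UL * 1024 * 1024 * 1024)   /* 1GB cache per endpoint */
#define NBLOCKS    (CACHE_SIZE / BLOCK_SIZE)    /* = 256 blocks */

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    void *blocks[NBLOCKS];

    for (size_t i = 0; i < NBLOCKS; i++) {
        /* posix_memalign puts each block on a page boundary, so the data
         * section is page-aligned as the slide describes */
        if (posix_memalign(&blocks[i], (size_t)pagesize, BLOCK_SIZE) != 0) {
            fprintf(stderr, "posix_memalign failed at block %zu\n", i);
            return 1;
        }
    }
    printf("%lu blocks of %lu bytes, page size %ld\n",
           (unsigned long)NBLOCKS, BLOCK_SIZE, pagesize);

    for (size_t i = 0; i < NBLOCKS; i++)
        free(blocks[i]);
    return 0;
}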
13. MemzNet: Memory-Mapped Zero-Copy Network Channel
[Diagram: front-end threads (access to memory blocks) <-> memory blocks <-> network threads (memory to memory) <-> memory blocks <-> front-end threads (access to memory blocks)]
Memory caches are logically mapped between client and server.
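A hedged sketch of the pipeline the diagram shows, under the assumption that front-end and back-end threads communicate only through two bounded queues (free blocks and filled blocks); the queue and function names are illustrative, not MemzNet's implementation:

/* Sketch: front-end threads fill memory blocks, network (back-end)
 * threads drain them; the two sides share nothing but the queues. */
#include <pthread.h>
#include <stddef.h>

#define QCAP 256                      /* matches a 1GB cache of 4MB blocks */

struct queue {                        /* bounded FIFO of block pointers */
    void           *items[QCAP];
    size_t          head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  nonempty, nonfull;
};

static void queue_push(struct queue *q, void *block)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QCAP)
        pthread_cond_wait(&q->nonfull, &q->lock);
    q->items[q->tail] = block;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

static void *queue_pop(struct queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->nonempty, &q->lock);
    void *block = q->items[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->nonfull);
    pthread_mutex_unlock(&q->lock);
    return block;
}

struct channel { struct queue free_q, filled_q; };

/* Front-end thread (sender side): reads file data into free blocks. */
static void *frontend(void *arg)
{
    struct channel *ch = arg;
    for (;;) {
        void *blk = queue_pop(&ch->free_q);   /* reuse a drained block */
        /* read() the next chunk into blk and tag it with a block ID */
        queue_push(&ch->filled_q, blk);       /* hand off to the network */
    }
}

/* Back-end thread: ships filled blocks over its own TCP stream. */
static void *backend(void *arg)
{
    struct channel *ch = arg;
    for (;;) {
        void *blk = queue_pop(&ch->filled_q);
        /* write() header + payload to this thread's socket */
        queue_push(&ch->free_q, blk);         /* recycle the block */
    }
}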
14. ANI Middleware Testbed
[Diagram: ANI Middleware Testbed topology. NERSC and ANL each connect to the dedicated ANI 100G network through an ANI 100G router, and to ESnet through their site routers (NERSC testbed router nersc-mr2; 10G links). Data hosts nersc-diskpt-1/2/3 and anl-mempt-1/2/3 each expose 4x10GE (MM) data interfaces (eth2-5) built from mixes of 2x10G Myricom, 2x10G Chelsio, and 2x10G Mellanox NICs plus 4x/6x10G HotLava cards, and a 1GE management interface (eth0) reached via site switches (nersc-C2940, anl-C2940) and aggregation switches (nersc-asw1, anl-asw1); application hosts nersc-app and anl-app sit on the 1GE management network.]
Note: ANI 100G routers and the 100G wave are available until summer 2012; testbed resources after that are subject to funding availability. (Updated December 11, 2011)
SC11 100Gbps demo
15. Many TCP Streams
(a) Total throughput vs. the number of concurrent memory-to-memory transfers; (b) interface traffic, in packets per second (blue) and bytes per second, over a single NIC with different numbers of concurrent transfers. Three hosts, each with 4 available NICs, for a total of 10 10Gbps NIC pairs, were used to saturate the 100Gbps pipe in the ANI Testbed. Ten data-movement jobs, each corresponding to a NIC pair, were started simultaneously at source and destination. Each peak represents a different test: 1, 2, 4, 8, 16, 32, and 64 concurrent streams per job were initiated for 5-minute intervals (e.g., at concurrency level 4 there are 40 streams in total).
16. Effects of Many Streams
ANI testbed, 100Gbps (10 x 10Gbps NICs, three hosts): interrupts per CPU vs. the number of concurrent transfers (1, 2, 4, 8, 16, 32, and 64 concurrent jobs, 5-minute intervals); the TCP buffer size is 50MB.
17. MemzNet's Performance
[Plots: throughput of GridFTP vs. MemzNet, in the SC11 demo and on the ANI Testbed; the TCP buffer size is set to 50MB.]
19. Acknowledgements
Peter Nugent, Zarija Lukic, Patrick Dorn, Evangelos Chaniotakis, John Christman, Chin Guok, Chris Tracy, Lauren Rotman, Jason Lee, Shane Canon, Tina Declerck, Cary Whitney, Ed Holohan, Adam Scovel, Linda Winkler, Jason Hill, Doug Fuller, Susan Hicks, Hank Childs, Mark Howison, Aaron Thomas, John Dugan, Gopal Vaswani