Cisco usNIC: how it works, how it is used in Open MPI

1. Cisco Userspace NIC (usNIC)
Jeff Squyres, Cisco Systems, Inc.
November 7, 2013
2. Yes, we sell servers now
3. Record-setting Intel UCS servers
• Cisco UCS: Ivy Bridge 1U and 2U servers
• Cisco VIC: ultra-low-latency 2 x 10Gb Ethernet (yes, really!)
• Cisco Nexus: 10/40Gb top-of-rack and core switches
4. Cisco UCS: many server form factors, one system. Industry-leading compute without compromise.
Rack servers:
• UCS C220 M3: ideal for HPC compute-intensive applications (2 socket)
• UCS C240 M3: perfect as HPC cluster head nodes or IO nodes (2 socket)
• UCS C420 M3: 4-socket rack server for large-memory compute workloads
Blade servers:
• UCS B200 M3: blade form factor, 2-socket
• UCS B420 M3: 4-socket blade for large-memory compute workloads
5. Worldwide x86 server blade market share: market appetite for innovation fuels UCS growth
• UCS is impacting the growth of established vendors like HP; legacy offerings are flat-lining or in decline
• Cisco growth is out-pacing the market; UCS is #2 and climbing
• Customers have shifted 19.3% of the global x86 blade server market to Cisco, and over 26% in the Americas
• Demand for data center innovation has vaulted the Cisco Unified Computing System (UCS) to the #2 leader in the fast-growing segment of the x86 server market
(Source: IDC Worldwide Quarterly Server Tracker, Q1 2013 revenue share, May 2013)
6. UCS benchmark world records: best CPU performance, best virtualization & cloud performance, best database performance, best enterprise application performance, best enterprise middleware performance, and best HPC performance (16, 8, 9, 18, 14, and 15 world records across the six categories)
7. One wire to rule them all:
• Commodity traffic (e.g., ssh)
• Cluster / hardware management
• File system / IO traffic
• MPI traffic
10G or 40G, with real QoS
8. Cisco Nexus: years of experience rolled into dependable solutions
Low-latency, high-density 10 / 40Gb switches:
• Nexus 3548: 190ns port-to-port latency (L2 and L3); created for HPC / HFT; 48 10Gb / 12 40Gb ports
• Nexus 6004: 1us port-to-port latency; 384 10Gb / 96 40Gb ports
9. Spine / leaf fabrics
Characteristics:
• 3 hops
• Low oversubscription (down to non-blocking)
• < ~3.5 usecs, depending on config and workload
• 10G or 40G capable
• Spine: 4 to 16 wide
• Leaf: determined by spine density

  Fabric        Spine - Leaf  Port scale     Oversub  Latency                   Spines  Leafs
  10G fabric    6004 - 6001   18,432 x 10G   3:1      ~3 usecs, cut-through     16      384
  40G fabric    6004 - 6004    7,680 x 40G   5:1      ~3 usecs, cut-through     16      96
  Mixed fabric  6004 - 6001    4,680 x 10G   3:1      ~3 usecs, S&F             4       96
  10G fabric    6004 - 3548   12,288 x 10G   3:1      ~1.5 usecs, cut-through   16      384
  40G fabric    6004 - 3548    1,152 x 40G   1:1      ~1.5 usecs, cut-through   6       96
  Mixed fabric  6004 - 3548    3,072 x 10G   3:1      ~1.5 usecs, S&F           4       96

…many other configurations are also possible
10. Two-layer spine / leaf fabrics
Characteristics:
• Two spine layers (Spine2 over Spine1)
• 3 hops within a pod; 5 hops for DC east-west traffic
• Low oversubscription (down to non-blocking)
• < ~3.5 usecs, depending on config and workload
• 10G or 40G capable

  Fabric        Spine2 - Spine1 - Leaf  Port scale     Oversub  Latency                       Spine2  Spine1  Leafs
  10G fabric    6004 - 6004 - 6001      55,296 x 10G   3:1      ~3-5 usecs, cut-through       48      16 x 6  192
  40G fabric    6004 - 6004 - 6004      23,040 x 40G   5:1      ~3-5 usecs, cut-through       48      16      48
  Mixed fabric  6004 - 6004 - 6001      18,432 x 10G   3:1      ~3-5 usecs, S&F               32      4 x 8   48
  10G fabric    6004 - 6004 - 3548      24,576 x 10G   2:1      ~1.5-3.5 usecs, cut-through   32      16 x 4  192
  40G fabric    6004 - 6004 - 3548       2,304 x 40G   1:1      ~1.5-3.5 usecs, cut-through   24      6 x 8   48
  Mixed fabric  6004 - 6004 - 3548       9,216 x 10G   2:1      ~1.5-3.5 usecs, S&F           24      6 x 8   48
11. [Section divider; no text content]
12. Cisco usNIC
• Direct access to NIC hardware from Linux userspace (operating system bypass), via the Linux Verbs API (UD)
• Utilizes the Cisco Virtual Interface Card (VIC) for ultra-low Ethernet latency:
  - 2nd-generation 80Gbps Cisco ASIC
  - 2 x 10Gbps Ethernet ports (2 x 40Gbps coming …soon…)
  - PCI and mezzanine form factors
• Half-round-trip (HRT) ping-pong latencies (Intel E5-2690 v2 servers):
  - Raw back-to-back: 1.57μs
  - MPI back-to-back: 1.85μs
  - Through MPI + Nexus 3548: 2.05μs
These numbers keep going down.
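To make "via the Linux Verbs API (UD)" concrete, here is a minimal sketch (not Cisco's code) of opening a verbs device and creating an unreliable-datagram queue pair through libibverbs; the device choice, queue depths, and error handling are simplified assumptions:

    /* Sketch: open the first verbs device and create a UD queue pair,
     * the QP type usNIC uses. */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        struct ibv_device **devs = ibv_get_device_list(NULL);
        if (devs == NULL || devs[0] == NULL) {
            fprintf(stderr, "no verbs devices found\n");
            return 1;
        }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);
        struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);

        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .cap     = { .max_send_wr = 64, .max_recv_wr = 64,
                         .max_send_sge = 1, .max_recv_sge = 1 },
            .qp_type = IBV_QPT_UD,          /* unreliable datagram */
        };
        struct ibv_qp *qp = ibv_create_qp(pd, &attr);
        if (qp != NULL)
            printf("UD QP %u on device %s\n", qp->qp_num,
                   ibv_get_device_name(devs[0]));

        if (qp) ibv_destroy_qp(qp);
        ibv_destroy_cq(cq);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }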
13. TCP/IP vs. usNIC software stacks
• TCP/IP path: application → userspace sockets library → kernel TCP stack → general Ethernet driver (Cisco VIC driver) → Cisco VIC hardware
• usNIC path: application → userspace verbs library → Cisco VIC hardware
• The verbs IB core and the Cisco usNIC driver in the kernel are used only for bootstrapping and setup; the send and receive fast path goes straight from userspace to the VIC
14. The MPI fast path
• MPI directly injects L2 frames into the network, through the userspace verbs library
• MPI receives L2 frames directly from the VIC
15. [Diagram: two MPI processes, each with a queue pair (QP) on the VIC, an SR-IOV NIC; the x86 chipset's VT-d I/O MMU sits between host memory and the VIC; a classifier on the VIC steers inbound L2 frames to the right QP, and outbound L2 frames flow from the QPs to the network]
16. [Diagram: the VIC exposes one Physical Function (PF) per physical port, each with its own MAC address (e.g., aa:bb:cc:dd:ee:ff and aa:bb:cc:dd:ee:fe); each PF fronts a pool of Virtual Functions (VFs), and queue pairs (QPs) are allocated on the VFs]
17. [Diagram: each MPI process maps, through the Intel IO MMU, to QPs on VFs behind a PF (MAC) and its physical port on the VIC]
18. The IOMMU
• Used for physical ↔ virtual memory translation: the VIC issues virtual addresses, and the Intel IO MMU translates them to the physical RAM pages of the userspace process
• The usnic verbs driver programs (and deprograms) the IOMMU
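The IOMMU (de)programming is driven by verbs memory registration. A minimal sketch, assuming a protection domain pd created as in the earlier QP example (the helper name is invented):

    #include <stdlib.h>
    #include <infiniband/verbs.h>

    /* Registering a buffer is the moment the usnic verbs driver pins
     * the pages and installs IOMMU mappings, so the VIC's virtual
     * addresses resolve to this process's physical RAM. */
    static struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
    {
        void *buf = malloc(len);
        if (buf == NULL)
            return NULL;
        return ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    }

    /* ibv_dereg_mr() later undoes the mapping ("deprograms" the IOMMU). */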
19. Device naming
• For the purposes of this talk, let's assume that each physical port has one Linux ethX device
• Each ethX device corresponds to a PF
• Each usnic_Y device corresponds to an ethX device
• Example (one VIC): physical port 0 → eth4 / usnic_0; physical port 1 → eth5 / usnic_1
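A short sketch of how an application might discover the usnic_Y verbs devices named above; the "usnic_" prefix check simply mirrors the naming on this slide:

    #include <stdio.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int n = 0, i;
        struct ibv_device **devs = ibv_get_device_list(&n);

        /* Print every verbs device whose name looks like usnic_Y. */
        for (i = 0; i < n; ++i) {
            const char *name = ibv_get_device_name(devs[i]);
            if (strncmp(name, "usnic_", 6) == 0)
                printf("found usNIC verbs device: %s\n", name);
        }
        ibv_free_device_list(devs);
        return 0;
    }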
20. [hwloc lstopo diagram, dated Thu Nov 7 10:58:23 2013: a 2-socket Intel Xeon E5-2690 ("Sandy Bridge") machine, 8 cores and 64GB per socket (128GB total), 2 hardware threads per core; four Intel 1GbE ports (eth0-eth3) on NUMA node 0 and the disk (sda) on NUMA node 1; one dual-port VIC per NUMA node: eth4/usnic_0 and eth5/usnic_1 on node 0, eth6/usnic_2 and eth7/usnic_3 on node 1]
21. [The same lstopo topology diagram as the previous slide, repeated]
22. Open MPI's layered architecture (top to bottom):
• Application
• Open MPI layer (OMPI)
• Point-to-point messaging layer (PML)
• Byte Transfer Layer (BTL)
• Operating system
• Hardware
23. How a message flows
• MPI_Send / MPI_Recv (etc.) call into the OB1 PML
• OB1 drives one usnic BTL module per device: /dev/usnic_0 and /dev/usnic_1 (VIC 0); /dev/usnic_2 and /dev/usnic_3 (VIC 1)
24. The "usnic" BTL
• Byte Transfer Layer: point-to-point transfer plugins in the OMPI layer; no protocol is assumed / required
• Uses unreliable datagram (UD) verbs
• Handles all fragmentation and re-assembly (vs. the PML)
• Retransmissions and ACKs are handled in software, with a sliding-window retransmission scheme (see the sketch below)
• Direct inject / direct receive of L2 Ethernet frames
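A minimal sketch of the sliding-window bookkeeping a software reliability layer over UD needs; the names, window size, and ACK details are invented for illustration and are not the actual BTL code:

    #include <stdbool.h>

    #define WINDOW_SIZE 4096

    struct window {
        unsigned base;                 /* lowest un-ACKed sequence number */
        unsigned next;                 /* next sequence number to send    */
        bool     acked[WINDOW_SIZE];   /* ring of per-slot ACK flags      */
    };

    /* True if another fragment may be sent without overrunning the
     * receiver's window (unsigned arithmetic handles wraparound). */
    static bool can_send(const struct window *w)
    {
        return w->next - w->base < WINDOW_SIZE;
    }

    /* Record an ACK and slide the window forward past every
     * contiguously-ACKed slot. */
    static void handle_ack(struct window *w, unsigned seq)
    {
        if (seq - w->base >= WINDOW_SIZE)
            return;                           /* stale or bogus ACK */
        w->acked[seq % WINDOW_SIZE] = true;
        while (w->base != w->next && w->acked[w->base % WINDOW_SIZE]) {
            w->acked[w->base % WINDOW_SIZE] = false;
            w->base++;                        /* slide */
        }
    }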
25. BTL queues
• One BTL module for each usNIC verbs device
• Each module has two UD queue pairs:
  - a priority QP for small and control packets
  - a data QP for up-to-MTU-sized data packets
• Each QP has its own CQ; the QPs may or may not be on the same VF
• The overall BTL glue polls the CQs for each device: first the priority CQs, then the data CQs (see the polling sketch below)
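The polling order translates to something like the following sketch; ibv_poll_cq() is the real verbs call, while poll_device() and handle_completion() are invented for illustration:

    #include <infiniband/verbs.h>

    #define MAX_WC 32

    static void handle_completion(struct ibv_wc *wc)
    {
        (void)wc;  /* real code dispatches on wc->opcode / wc->status */
    }

    static void poll_device(struct ibv_cq *priority_cq,
                            struct ibv_cq *data_cq)
    {
        struct ibv_wc wc[MAX_WC];
        int i, n;

        /* Drain the priority CQ first: small / control packets. */
        while ((n = ibv_poll_cq(priority_cq, MAX_WC, wc)) > 0)
            for (i = 0; i < n; ++i)
                handle_completion(&wc[i]);

        /* Then the data CQ: MTU-sized data packets. */
        while ((n = ibv_poll_cq(data_cq, MAX_WC, wc)) > 0)
            for (i = 0; i < n; ++i)
                handle_completion(&wc[i]);
    }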
26. Where does the MPI latency come from?
• "Raw" latency (no MPI, no verbs) is 1.57μs
• MPI latency back-to-back on Sandy Bridge is 1.85μs
• Verbs is responsible for about 80ns of the difference (not related to the MPI API)
• All the rest of OMPI is only about 200ns
• Breakdown: 1.57μs raw + ~80ns verbs + ~200ns MPI ≈ 1.85μs
27. Deferred and piggy-backed ACKs
• Immediate: process B ACKs each message from process A as it arrives (ACK N)
• Deferred: B waits and sends one ACK (e.g., ACK N+2) covering several messages
• Deferred + piggybacked: the ACK rides inside a data message headed back the other way (Msg+ACK N+2), so no standalone ACK packet is needed (sketched below)
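A tiny sketch of the piggybacking idea: when a data fragment is already headed back to the peer, fold the pending ACK into its header instead of sending a standalone ACK packet. All field names here are invented:

    #include <stdbool.h>

    struct frag_hdr {
        unsigned seq;       /* sequence number of this fragment */
        unsigned ack_seq;   /* sequence number being ACKed      */
        bool     has_ack;   /* piggybacked ACK present?         */
    };

    static void fill_header(struct frag_hdr *h, unsigned seq,
                            bool ack_pending, unsigned ack_seq)
    {
        h->seq     = seq;
        h->has_ack = ack_pending;              /* piggyback if one is owed */
        h->ack_seq = ack_pending ? ack_seq : 0;
    }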
28. Normal send path
1. Host writes the WQ descriptor
2. Host writes the WQ index to the VIC via PIO
3. VIC reads the WQ descriptor; only now does it have the buffer address
4. VIC reads the packet buffer from RAM
5. VIC sends the buffer on the wire
29. Optimized send path
1. Host writes the WQ descriptor
2. Host writes the index + an encoded buffer address to the VIC via PIO
3. VIC reads the WQ descriptor, but since the PIO write already carried the buffer address, it can read the packet buffer from RAM in parallel
4. VIC sends on the wire ~400ns sooner
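Conceptually, the difference between the two send paths is what the single PIO (doorbell) write carries. The register layout and address encoding below are invented for illustration; the real VIC interface differs:

    #include <stdint.h>

    /* Normal path: the PIO write carries only the ring index, so the
     * VIC must DMA-read the WQ descriptor before it can fetch the
     * packet buffer. */
    static inline void ring_doorbell(volatile uint32_t *db, uint32_t index)
    {
        *db = index;
    }

    /* Optimized path: pack an encoded buffer address into the same
     * write, so the VIC can start fetching the packet in parallel
     * with the descriptor read (the ~400ns win on this slide). */
    static inline void ring_doorbell_addr(volatile uint64_t *db,
                                          uint32_t index, uint64_t addr)
    {
        *db = ((uint64_t)index << 48) | (addr & 0xffffffffffffULL);
    }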
30. Keep the priority receive queue short
• Minimize the length of the priority receive queue: using 2048 different receive buffers is 200ns worse than using 64
• This is the result of an IOMMU cache effect: a small buffer pool keeps the VIC's address translations resident
• So we scale the length of the priority RQ with the number of processes in the job (policy sketched below)
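The scaling policy from the last bullet might look like this sketch; the per-process factor, floor, and cap are invented, with the 64-vs-2048 comparison above motivating the cap:

    /* Size the priority receive queue with the job, not at maximum:
     * a short RQ keeps the VIC's IOMMU translations cache-resident. */
    static int priority_rq_length(int nprocs)
    {
        int len = 8 * nprocs;         /* a few buffers per peer        */
        if (len < 64)
            len = 64;                 /* floor for small jobs          */
        if (len > 2048)
            len = 2048;               /* cap: avoid IOMMU cache misses */
        return len;
    }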
31. Use fastpaths wherever possible
• Be friendly to the optimizer and the instruction cache
• Made a noticeable difference (!)

    if (fastpathable)
        do_it_inline();
    else
        call_slower_path();
32. [The lstopo diagram again, with an annotation over a set of cores: "MPI processes running on these cores…"]
33. [The same diagram, with the annotation over the NUMA-local VIC ports (usnic_0 / usnic_1): "Only use these usNIC devices for short messages"]
34. [The same diagram, annotated: "Use ALL usNIC devices for long messages", i.e., long messages are spread across usnic_0 through usnic_3]
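Slides 32-34 describe a NUMA-aware device-selection policy: short messages stay on NUMA-local usNIC devices, long messages may use all of them. A minimal sketch of that policy, with the cutoff and struct invented for illustration:

    #include <stdbool.h>
    #include <stddef.h>

    #define SHORT_MSG_MAX 8192        /* illustrative cutoff, not the
                                         BTL's actual threshold */

    struct usnic_dev { int numa_node; };

    /* Short messages stay on devices local to the sending process's
     * NUMA node; long messages may use every device. */
    static bool device_eligible(const struct usnic_dev *dev,
                                int my_numa_node, size_t msg_len)
    {
        if (msg_len <= SHORT_MSG_MAX)
            return dev->numa_node == my_numa_node;
        return true;
    }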
35. Open source
• Everything above the firmware is open source
• Open MPI: distributing Cisco Open MPI 1.6.5; upstream in Open MPI 1.7.3
• Libibverbs plugin
• Verbs kernel module
36. Benchmark configuration
Hardware:
• Cisco UCS C220 M3 rack server
• Intel E5-2690 processor: 2.9 GHz (3.3 GHz Turbo), 2 sockets, 8 cores/socket
• 1600 MHz DDR3 memory: 8 GB x 16, 128 GB installed
• Cisco VIC 1225 with ultra-low-latency networking (usNIC driver)
• Cisco Nexus 3548: 48-port 10 Gbps ultra-low-latency Ethernet switch
Software:
• OS: CentOS 6.4; kernel: 2.6.32-358.el6.x86_64 (SMP)
• NetPIPE (ver 3.7.1)
• Intel MPI Benchmarks (ver 3.2.4)
• High Performance Linpack (ver 2.1)
• Other: Intel C Compiler (ver 13.0.1), Open MPI (ver 1.6.5), Cisco usNIC (1.0.0.7x)
37. [Chart: Cisco usNIC latency (usecs) and throughput (Mbps) vs. message size, 1 byte to 8MB]
• 2.05 usecs latency for small messages
• 9.3 Gbps throughput
38. PingPing and PingPong latency track together!
[Chart: PingPong and PingPing latency (usecs) and throughput (MB/s) vs. message size, 4 bytes to 4MB]
• 2.05 usecs PingPong latency
• 2.10 usecs PingPing latency
39. Full bi-directional performance for both Exchange and SendRecv
[Chart: SendRecv and Exchange latency (usecs) and throughput (MB/s) vs. message size, 4 bytes to 4MB]
• 2.11 usecs SendRecv latency
• 2.58 usecs Exchange latency
40. High Performance Linpack (HPL) scaling
GFLOPS = FLOPS/cycle x number of CPU cores x frequency (GHz)
E5-2690 node max: 8 FLOPS/cycle x 16 cores x 3.3 GHz = 422 GFLOPS
• Single-node HPL score (16 cores): 340.51 GFLOPS*
• 32-node HPL score (512 cores): 9,773.45 GFLOPS
• Efficiency based on the single-machine score: 9,773.45 / (340.51 x 32) x 100 = 89.69%

  # of CPU cores:  16      32      64       128      256      512
  GFLOPS:          340.51  673.68  1271.14  2647.09  5258.27  9773.45

* Score may improve with additional compiler settings or newer compiler versions