MOSSCon 2013, Cisco Open Source talk
1. © 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public 1
Open Source for Cisco
High Performance
Computing
Dr. Jeffrey M. Squyres
2.
1. Who am I?
2. Cisco and Open Source
3. My Open Source work at Cisco
4.
Me
Technical Lead at
Cisco Systems
Server division, VIC group
5.
I am not in
marketing
6.
I cannot fix
your Linksys
router for you
(perhaps you should try
DD-WRT)
7.
I write code
Lots of code
All day
Every day
10.
Open MPI
Hardware Locality (hwloc)
OpenFabrics
Linux kernel
Vast majority
of my work
is here
I’ve made
minor contributions
to these other 3
12.
Undergrad, grad Post doc
13.
LAM/MPI
I inherited this
I founded this
14.
PACX-MPI
LAM/MPI
LA-MPI
FT-MPI
Sun CT 6
Project founded
in 2003,
merging multiple
open source
MPI projects
15.
PACX-MPI
LAM/MPI
LA-MPI
FT-MPI
Sun CT 6
Me
18.
Us
19.
Differences
=
Good
20.
You write
to your own
level of
expectations
21.
You write
to their
level of
expectations
22.
You write
your best code
24.
More than just the 4 projects
I participate in
26.
After the clouds part,
you are left with…
27.
Major contribution
to
30.
Why does Cisco
do Open Source?
• Stand on the shoulders of giants
• Become part of the community
• Contribute to tools / ecosystem that we all use
• Sell more products
32.
We
are
elevated
We
elevate
others
33.
Circle of
trust
you
34.
Circle of
trust
This is where
you need
to be
35.
Insert FOSS
project
name here
This is where
you need
to be
38.
Pretend this is
a really gross
picture of a leech
Just say no
to leeches!
39.
Let’s be clear here…
EVERYONE
(that’s kinda the point, right?)
41.
I don’t contribute to every
piece of FOSS I use
Do you?
42.
Those who can,
should
43.
Big companies can contribute
44.
Individuals can contribute
45.
Small organizations can contribute
46.
Those who can,
should
48.
A giant
49.
You, standing
on the giant’s
shoulders…
50.
…in the
circle
of trust
51.
…contributing
to the community
52.
Cisco UCS
blade server
Cisco Nexus
7000 router
55.
Using supercomputers to solve
real world problems that are
TOO BIG
for laptops, desktops,
or individual servers
56.
Supercomputer
=
(Many) Racks of (commodity)
high-end servers
(this is one definition; there are others)
57.
Rack of
36 1U
servers
58.
Computational problem
Input Output
Take your computational problem…
59.
…and split it up!
Computational problem
Input Output
60.
Computational problem
Input Output
Distribute the input data
across a bunch of servers
61.
Input Output
Use the network between servers
to communicate / coordinate
63.
Message Passing Interface (MPI)
middleware is used for this communication
Input Output
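The split, distribute, and combine idea from the last few slides can be sketched in plain Python. This is only an illustration: the "servers" are ordinary list chunks, there is no network and no real MPI call, and the function names (`split_evenly`, `partial_sum_of_squares`, `run`) are made up for this example.

```python
# Illustrative only: list chunks stand in for servers; no MPI or network is involved.

def split_evenly(data, num_workers):
    """Split data into num_workers contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(data), num_workers)
    chunks, start = [], 0
    for rank in range(num_workers):
        size = base + (1 if rank < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def partial_sum_of_squares(chunk):
    """The work one 'server' would do on its share of the input."""
    return sum(x * x for x in chunk)


def run(data, num_workers):
    chunks = split_evenly(data, num_workers)                  # distribute the input
    partials = [partial_sum_of_squares(c) for c in chunks]    # each worker computes
    return sum(partials)                                      # combine partial results


if __name__ == "__main__":
    data = list(range(1, 101))
    # The split result should match the single-worker result.
    print(run(data, 4) == run(data, 1))
```

In a real HPC application the combine step is where MPI middleware would carry the partial results over the network between servers.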
64.
Computational problem
One
processor
hour
1 processor = …a long time…
65.
Computational problem, divided into 21 blocks of one processor hour each
21 processors = ~1 hour (!)
Disclaimer: scaling is rarely perfect
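A rough way to see why scaling is rarely perfect is to assume some fraction of the work cannot be split up. The sketch below uses Amdahl's law, which is not named in the slides, and the 5% serial fraction is an arbitrary assumption chosen for illustration.

```python
# Sketch under assumptions: Amdahl's law with a made-up serial fraction.

def ideal_runtime(total_hours, processors):
    """Perfect scaling: the work divides evenly with no overhead."""
    return total_hours / processors


def amdahl_runtime(total_hours, processors, serial_fraction):
    """Runtime when serial_fraction of the work cannot be parallelized."""
    serial = total_hours * serial_fraction
    parallel = total_hours * (1.0 - serial_fraction)
    return serial + parallel / processors


if __name__ == "__main__":
    print(ideal_runtime(21, 21))          # the slide's 21 processor-hours on 21 processors
    print(amdahl_runtime(21, 21, 0.05))   # same job with an assumed 5% serial part
```

With those assumed numbers the job takes about two hours instead of one, even though 95% of it parallelizes.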
66.
This communication
may happen a LOT
It therefore needs
to be FAST
67.
Source
server
Destination
server
69.
Source server and destination server each run the same stack:
HPC application
MPI middleware
TCP stack
NIC driver
NIC hardware
Switch, Port A to Port B: 200 nanoseconds
Wire: speed of light, c = 299,792,458 m/s
Software stack on each server: ~8-40 microseconds
(~8 microseconds on modern hardware, ~40 microseconds on older hardware)
Total: ~17 – 81 microseconds
70.
YES
71.
• Intel Core i7 E5-2690 with turbo boost (3.5-3.8GHz)
  “Sandy Bridge” 32nm processor
• LinX v0.6.4 (Linpack v10.3.4.007) benchmark
  Measures floating point operations per second
• 81.34 Gflops
  That’s 81,340,000,000 floating point operations per second
  17μs ≈ 1,382,780 floating point operations
  81μs ≈ 6,588,540 floating point operations
Conclusion: yes, we absolutely care about 17-81μs!
Source: http://www.anandtech.com/show/4503/sandy-bridge-memory-scaling-choosing-the-best-ddr3
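The cost of latency in lost computation is a single multiplication: operations per second times seconds spent waiting. The sketch below takes the slide's 81.34 Gflops figure at face value and uses integer arithmetic so nothing is rounded; the function name is made up for this example.

```python
# Sketch: how many floating point operations fit into one network latency.
FLOPS_PER_SECOND = 81_340_000_000   # 81.34 Gflops, as quoted on the slide
MICROSECONDS_PER_SECOND = 1_000_000


def flops_during(latency_us, flops_per_second=FLOPS_PER_SECOND):
    """Floating point operations the processor could complete in latency_us microseconds."""
    return flops_per_second * latency_us // MICROSECONDS_PER_SECOND


if __name__ == "__main__":
    for latency_us in (17, 81):
        print(latency_us, "us:", flops_during(latency_us), "floating point operations")
```

Multiplying it out gives roughly 1.4 million operations at 17μs and about 6.6 million at 81μs, so every message costs a processor millions of operations it could otherwise have done.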
72.
HPC apps can do a LOT of computation
during network communication
Latency
73.
Hardware is faster than software.
The sooner software can
hand off to hardware, the better.
77.
(Same source/destination stack diagram and timings as on slide 69)
Can’t do much about the speed of light :(
Fastest Ethernet switches today are about 200ns
(they’ll probably get a little faster over time)
8-40us is, by far, the biggest chunk of time
Reduce this!
79.
HPC application
MPI middleware
TCP stack
NIC driver
NIC hardware
What if we can skip some of these layers?
Who needs TCP? Raw L2 Ethernet frames, baby!
Who needs the operating system driver?
Let MPI talk directly to the NIC hardware
80.
Linux userspace
application
Linux kernel
Cisco VIC
hardware
Can I see
the hardware?
Please?
81.
No.
82.
Can I see
the OpenFabrics
hardware?
83.
Sure!
84.
Yay!
85.
• Coalition of network vendors
• Successfully upstreamed “OS bypass for networking” into Linux
• http://www.openfabrics.org
86.
HPC application
MPI middleware
VIC driver
Cisco VIC
Our project: enabling this MPI direct-to-hardware communication
on Cisco servers with the Cisco Virtual Interface Card (VIC)
in Linux.
Everything above the firmware will be
open source.
87.
Userspace:
  Application
  MPI library
  Userspace sockets library
Kernel:
  TCP / IP stack
  Cisco VIC driver
Cisco VIC hardware
88.
Userspace:
  Application
  MPI library
  Userspace verbs library
Kernel:
  Verbs IB core
  Cisco USNIC driver
Cisco VIC hardware
Diagram labels:
  Bootstrapping and setup
  Send and receive fast path
  Cisco code here
89.
Hardware Locality (hwloc)
90.
• Query your server’s topology
• NUMA nodes
  Including memory
• Processor sockets
• L3, L2, L1 caches
  Instruction and data
• Cores
• Hyperthreads
• PCI devices
Machine (128GB)
  NUMANode P#0 (64GB)
    Socket P#0
      L3 (20MB)
        L2 (256KB), L1d (32KB), L1i (32KB): Core P#0 (PU P#0, PU P#16)
        L2 (256KB), L1d (32KB), L1i (32KB): Core P#1 (PU P#1, PU P#17)
        L2 (256KB), L1d (32KB), L1i (32KB): Core P#2 (PU P#2, PU P#18)
        L2 (256KB), L1d (32KB), L1i (32KB): Core P#3 (PU P#3, PU P#19)
        L2 (256KB), L1d (32KB), L1i (32KB): Core P#4 (PU P#4, PU P#20)
        L2 (256KB), L1d (32KB), L1i (32KB): Core P#5 (PU P#5, PU P#21)
        L2 (256KB), L1d (32KB), L1i (32KB): Core P#6 (PU P#6, PU P#22)
        L2 (256KB), L1d (32KB), L1i (32KB): Core P#7 (PU P#7, PU P#23)
    PCI 8086:1521 (eth0)
    PCI 8086:1521 (eth1)
    PCI 8086:1521 (eth2)
    PCI 8086:1521 (eth3)
    PCI 1137:0043 (eth4)
    PCI 1137:0043 (eth5)
    PCI 102b:0522
  NUMANode P#1 (64GB)
    Socket P#1
      L3 (20MB)
        L2 (256KB), L1d (32KB), L1i (32KB): Core P#0 (PU P#8, PU P#24)
        L2 (256KB), L1d (32KB), L1i (32KB): Core P#1 (PU P#9, PU P#25)
        L2 (256KB), L1d (32KB), L1i (32KB): Core P#2 (PU P#10, PU P#26)
        L2 (256KB), L1d (32KB), L1i (32KB): Core P#3 (PU P#11, PU P#27)
        L2 (256KB), L1d (32KB), L1i (32KB): Core P#4 (PU P#12, PU P#28)
        L2 (256KB), L1d (32KB), L1i (32KB): Core P#5 (PU P#13, PU P#29)
        L2 (256KB), L1d (32KB), L1i (32KB): Core P#6 (PU P#14, PU P#30)
        L2 (256KB), L1d (32KB), L1i (32KB): Core P#7 (PU P#15, PU P#31)
    PCI 1000:005b (sda, sdb)
    PCI 1137:0043 (eth6)
    PCI 1137:0043 (eth7)
Indexes: physical
Date: Mon Jan 28 10:51:26 2013
92.
• Output formats supported:
  PDF, JPG, PNG, TIFF, FIG, …
  Text (for console windows)
  Curses
  XML
• Great for feeding into scripts!
93.
• hwloc-bind socket:0.core:2 command
  Bind command to core 2 on socket 0
• hwloc-bind --get
  Print a bitmap of your current bindings
• hwloc-bind --get | hwloc-calc -p -H socket.core
  Print something more readable than a bitmap
• hwloc-bind --get | hwloc-calc -p -H socket.core.pu
  Even show the hardware threads
• hwloc-ps [-a]
  Show where processes are bound
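If you want to read the bitmap from `hwloc-bind --get` inside a script without piping it through `hwloc-calc`, a small decoder is enough. This sketch assumes the bitmap is printed as one or more comma-separated 32-bit hexadecimal words, most significant first; it does not handle "infinite" bitmaps, and the function name is made up for this example.

```python
# Sketch under assumptions: decodes a finite hwloc-style hex bitmap into bit indexes.

def parse_hwloc_bitmap(text):
    """Return the sorted indexes of the bits set in a bitmap such as '0x00040004'."""
    value = 0
    for word in text.strip().split(","):
        # Assumption: each comma-separated word is 32 bits, most significant word first.
        value = (value << 32) | int(word, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


if __name__ == "__main__":
    # Bits 2 and 18 would correspond to PU P#2 and PU P#18,
    # the two hardware threads of core 2 on socket 0 in the lstopo output.
    print(parse_hwloc_bitmap("0x00040004"))
```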
94.
• Get a tree data structure representing the topology
• Many API calls for manipulating / traversing the tree
• Typical actions:
  Get, set processor and memory bindings
  React to cache sizes
• …everything you can do in the CLI, and more
95.
• Verify the internal topology of your server
  How much memory do you have?
  Where is that memory?
  What processor(s) are local to that memory?
  How big are your L1, L2, L3 caches?
• Verify your internal PCI devices
  Distinguish ethX devices from each other
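Questions like these come down to walking the topology tree. The sketch below is not the hwloc API (that is a C library); it is a hand-built Python stand-in for the two-socket machine from the lstopo output, with caches and PCI devices left out for brevity. All class and function names are made up for this example.

```python
# Sketch only: a hand-built stand-in for the topology tree, not the hwloc C API.
from dataclasses import dataclass, field


@dataclass
class Obj:
    kind: str                 # "Machine", "NUMANode", "Socket", "Core", or "PU"
    index: int = -1           # physical index (P#); -1 when not applicable
    memory_gb: int = 0        # only set on NUMANode objects here
    children: list = field(default_factory=list)


def build_slide_topology():
    """Rebuild the 2 NUMA node x 8 core x 2 PU layout shown in the lstopo output."""
    machine = Obj("Machine")
    for node in range(2):
        numa = Obj("NUMANode", node, memory_gb=64)
        socket = Obj("Socket", node)
        for core in range(8):
            first_pu = node * 8 + core
            pus = [Obj("PU", first_pu), Obj("PU", first_pu + 16)]
            socket.children.append(Obj("Core", core, children=pus))
        numa.children.append(socket)
        machine.children.append(numa)
    return machine


def walk(obj):
    """Depth-first traversal of the tree."""
    yield obj
    for child in obj.children:
        yield from walk(child)


def total_memory_gb(machine):
    return sum(o.memory_gb for o in walk(machine) if o.kind == "NUMANode")


def pus_local_to(numa):
    return sorted(o.index for o in walk(numa) if o.kind == "PU")


if __name__ == "__main__":
    machine = build_slide_topology()
    print("Total memory (GB):", total_memory_gb(machine))
    for numa in machine.children:
        print("NUMANode", numa.index, "local PUs:", pus_local_to(numa))
```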
96.
• Bind services to specific cores
  Ensure related services are on the same NUMA node
  Put non-essential services on core 0 (e.g., NTP)
• Bind server-related services
  Apache, Bind, NFS, …etc.
  Increase performance by not letting them migrate
  Keeps memory local, less inter-NUMA-node traffic
Figure labels: NTP etc., Apache, NFS
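A placement plan like this can be written down as data and turned into `hwloc-bind` command lines. In this sketch only the `socket:S.core:C` syntax comes from the slides; the service names and the placements are hypothetical, and it assumes one socket per NUMA node, as in the lstopo output.

```python
# Sketch under assumptions: hypothetical services and placements.
# Only the "hwloc-bind socket:S.core:C command" form comes from the slides.
PLAN = {
    "ntp-service": (0, 0),   # non-essential service parked on core 0
    "web-service": (1, 1),   # related services kept on the same socket
    "nfs-service": (1, 2),
}


def bind_command(socket, core, command):
    """Build the hwloc-bind command line that pins command to one core."""
    return f"hwloc-bind socket:{socket}.core:{core} {command}"


def same_socket(plan, *services):
    """True when all named services are planned onto the same socket."""
    return len({plan[name][0] for name in services}) == 1


if __name__ == "__main__":
    for name, (socket, core) in PLAN.items():
        print(bind_command(socket, core, name))
    print(same_socket(PLAN, "web-service", "nfs-service"))
```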
97.
“Open source is good.
Open source works.