1. InfiniBand? Problems? Do you care?
Christian Kniep / Jan Wender
science + computing ag
IT services for sophisticated computer environments
Tübingen | München | Berlin | Düsseldorf
2. Agenda
This is an interactive session!
▪ Who is on the podium?
▪ Living Histogram?
▪ Getting some statistics
▪ Living Histogram
▪ Existing Monitoring Solutions
▪ Discussion
▪ Quick and Dirty Analysis
▪ Conclusions
BoF InfiniBand | 2012-06-19 © 2012 science + computing ag
4. science + computing at a glance
Founding Year:   1989
Locations:       Tübingen, München, Berlin, Düsseldorf
Employees:       270
Shareholder:     Bull S.A. (100%)
Revenue 10/11:   27 million euros
Partners:        Daikin Industries, Japan
                 NICE srl, Italy
                 Exa Corporation, USA
                 Platform Computing, Canada
5. Living Histogram?
Brian L. Joiner, International Statistical Review / Revue Internationale de Statistique, Vol. 43, No. 3. (Dec.,1975), pp. 339-340.
6. Living Histogram
Size of Fabric
▪ <10
▪ <50
▪ <500
▪ >500
7. Living Histogram
Switch Structure
▪ Switch size
  ▪ Single switch (mlx4036, qlogic12300)
  ▪ Modular switch (mlx5600, qlogic12800)
▪ Amount
  ▪ few
  ▪ many
8. Living Histogram
Focus
▪ Stability
  ➡ maintenance cost
▪ High performance
  ➡ extremely optimized
9. Living Histogram
Type of Use
▪ Cluster Purpose
  ▪ Single Purpose Cluster
  ▪ Multi Purpose Cluster
▪ Usage
  ▪ One Job at a time
  ▪ Multiple Jobs
10. Living Histogram
Kind/Amount of Problems
▪ Impact
  ▪ minor
  ▪ major
▪ Amount
  ▪ few
  ▪ many
11. Living Histogram
Problem solving
▪ Iterative
  ➡ reseat / reboot
▪ Analytic
  ➡ dig into the problem
  ➡ try to wipe it out
12. Monitoring Solutions
stable (but not useful to admins?)
▪ infiniband-diags
  ▪ ibcheckerrors
  ▪ ibdiagpath
▪ plugin to non-IB systems
  ▪ nagios
  ▪ collectl
▪ hardware vendor suites
  ▪ Unified Fabric Manager (Mellanox)
  ▪ InfiniBand Fabric Suites (QLogic)
unstable (individually carved)
▪ wrapper of infiniband-diags
  ▪ INAM (Ohio State University)
  ▪ QNIB
  ▪ .....
▪ not listed stuff
  ▪ ...
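To make the "wrapper of infiniband-diags" idea concrete, here is a minimal sketch of what such a wrapper might do: read port error counters and flag those above a threshold. The sample text is only assumed to resemble the `name:....value` layout that counter tools print, and the threshold values and function names are hypothetical, chosen purely for illustration.

```python
import re

# Assumed input layout: "<CounterName>:....<value>", one counter per line.
COUNTER_RE = re.compile(r'^(?P<name>\w+):\.*(?P<value>\d+)$')

# Hypothetical sample, not taken from a real fabric.
SAMPLE = """\
SymbolErrorCounter:.............12
LinkDownedCounter:..............0
PortRcvErrors:..................3
"""

# Hypothetical thresholds; sensible values depend on the site.
THRESHOLDS = {
    "SymbolErrorCounter": 10,
    "LinkDownedCounter": 1,
    "PortRcvErrors": 10,
}


def parse_counters(text):
    """Return {counter name: value} for every line that matches the layout."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_RE.match(line.strip())
        if match:
            counters[match.group("name")] = int(match.group("value"))
    return counters


def exceeded(counters, thresholds):
    """Return the counters whose value reached or passed their threshold."""
    return {
        name: value
        for name, value in counters.items()
        if name in thresholds and value >= thresholds[name]
    }


if __name__ == "__main__":
    print(exceeded(parse_counters(SAMPLE), THRESHOLDS))
```

With the sample above, only `SymbolErrorCounter` is reported. A result like this could then be handed to a non-IB system such as nagios as a plugin status.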
14. Modular Switches
switchguid=0xac1(ac1)          # Spine 1
Switch  36 "S-ac1"             # "A1" enhanced port 0 lid 11 lmc 0
[1]     "S-bc1"[1]             # "B1" lid 21 4xQDR
[2]     "S-bc2"[1]             # "B2" lid 22 4xQDR
[3]     "S-bc3"[1]             # "B3" lid 23 4xQDR
switchguid=0xac2(ac2)          # Spine 2
Switch  36 "S-ac2"             # "A2" enhanced port 0 lid 12 lmc 0
[1]     "S-bc1"[2]             # "B1" lid 21 4xQDR
[2]     "S-bc2"[2]             # "B2" lid 22 4xQDR
[3]     "S-bc3"[2]             # "B3" lid 23 4xQDR
switchguid=0xbc1(bc1)          # Line 1
Switch  36 "S-bc1"             # "B1" enhanced port 0 lid 21 lmc 0
[1]     "S-ac1"[1]             # "A1" lid 11 4xQDR
[2]     "S-ac2"[1]             # "A2" lid 12 4xQDR
[3]     "H-1"[1](f1)           # "Host1" lid 101 4xQDR
switchguid=0xbc2(bc2)          # Line 2
Switch  36 "S-bc2"             # "B2" enhanced port 0 lid 22 lmc 0
[1]     "S-ac1"[2]             # "A1" lid 11 4xQDR
[2]     "S-ac2"[2]             # "A2" lid 12 4xQDR
[3]     "H-2"[1](f2)           # "Host2" lid 102 4xQDR
switchguid=0xbc3(bc3)          # Line 3
Switch  36 "S-bc3"             # "B3" enhanced port 0 lid 23 lmc 0
[1]     "S-ac1"[3]             # "A1" lid 11 4xQDR
[2]     "S-ac2"[3]             # "A2" lid 12 4xQDR
[3]     "H-3"[1](f3)           # "Host3" lid 103 4xQDR
[Diagram: Chassis1 with Spine1, Spine2 and Line1, Line2, Line3; Host1, Host2, Host3 below]
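The listing shows a modular switch as five separate 36-port switches, so the chassis structure has to be recovered from the links. The following is a rough sketch of one way to do that, not a tool from the talk: it parses a topology text like the one above and labels a switch as "line" if it has a host attached and as "spine" otherwise. The function names are hypothetical, and the rule relies on the assumption that host nodes carry an `H-` name prefix, as they do in the listing.

```python
import re
from collections import defaultdict

# Topology text copied from the slide (tab debris removed).
TOPOLOGY = """\
switchguid=0xac1(ac1)   # Spine 1
Switch 36 "S-ac1"       # "A1" enhanced port 0 lid 11 lmc 0
[1] "S-bc1"[1]          # "B1" lid 21 4xQDR
[2] "S-bc2"[1]          # "B2" lid 22 4xQDR
[3] "S-bc3"[1]          # "B3" lid 23 4xQDR
switchguid=0xac2(ac2)   # Spine 2
Switch 36 "S-ac2"       # "A2" enhanced port 0 lid 12 lmc 0
[1] "S-bc1"[2]          # "B1" lid 21 4xQDR
[2] "S-bc2"[2]          # "B2" lid 22 4xQDR
[3] "S-bc3"[2]          # "B3" lid 23 4xQDR
switchguid=0xbc1(bc1)   # Line 1
Switch 36 "S-bc1"       # "B1" enhanced port 0 lid 21 lmc 0
[1] "S-ac1"[1]          # "A1" lid 11 4xQDR
[2] "S-ac2"[1]          # "A2" lid 12 4xQDR
[3] "H-1"[1](f1)        # "Host1" lid 101 4xQDR
switchguid=0xbc2(bc2)   # Line 2
Switch 36 "S-bc2"       # "B2" enhanced port 0 lid 22 lmc 0
[1] "S-ac1"[2]          # "A1" lid 11 4xQDR
[2] "S-ac2"[2]          # "A2" lid 12 4xQDR
[3] "H-2"[1](f2)        # "Host2" lid 102 4xQDR
switchguid=0xbc3(bc3)   # Line 3
Switch 36 "S-bc3"       # "B3" enhanced port 0 lid 23 lmc 0
[1] "S-ac1"[3]          # "A1" lid 11 4xQDR
[2] "S-ac2"[3]          # "A2" lid 12 4xQDR
[3] "H-3"[1](f3)        # "Host3" lid 103 4xQDR
"""

SWITCH_RE = re.compile(r'^Switch\s+\d+\s+"(?P<node>[^"]+)"')
LINK_RE = re.compile(r'^\[(?P<port>\d+)\]\s+"(?P<peer>[^"]+)"\[(?P<peer_port>\d+)\]')


def parse_links(text):
    """Return {switch: {local port: (peer node, peer port)}}."""
    links = defaultdict(dict)
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        switch = SWITCH_RE.match(line)
        if switch:
            current = switch.group("node")
            continue
        link = LINK_RE.match(line)
        if link and current is not None:
            links[current][int(link.group("port"))] = (
                link.group("peer"),
                int(link.group("peer_port")),
            )
    return dict(links)


def classify(links):
    """Label a switch 'line' if any peer is a host (assumed 'H-' prefix)."""
    roles = {}
    for switch, ports in links.items():
        has_host = any(peer.startswith("H-") for peer, _ in ports.values())
        roles[switch] = "line" if has_host else "spine"
    return roles


if __name__ == "__main__":
    for switch, role in sorted(classify(parse_links(TOPOLOGY)).items()):
        print(switch, role)
```

The heuristic is deliberately simple: a line card with no host plugged in would be mislabeled as a spine, so a real tool would need further information, for example vendor-specific node descriptions.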
22. Discussion - Quick Analysis
Fabric size
▪ small -> easy as pie?
▪ big -> critical mass for real analysis?
Switch structure
▪ what is your routing algorithm?
Focus
▪ 80:20 rule?
  performance
  maintenance
Type of use
▪ willing/forced to share
Problem type / amount
▪ runs smoothly enough
Problem solving
▪ learning curve starts steep
29. Discussion - Conclusions
Monitoring
▪ what approach?
Do we scare you?
▪ not intending to spread Fear, Uncertainty and Doubt
Our conclusions
Your conclusions
34. Thank you for your attention and participation!
science + computing ag
www.science-computing.de
Phone: +49 (0)7071 9457 - 0
Email: info@science-computing.de