Computer Performance Microscopy with SHIM


  1. Computer Performance Microscopy with SHIM. Kathryn McKinley (Microsoft Research), Steve Blackburn (Australian National University), Xi Yang (Australian National University).
  2. 4 μops. Intel i7-4770, 3.4 GHz.
  3. Benchmark IPC. [Figure: IPC per benchmark; y-axis IPC, 0 to 4.] Lusearch is a DaCapo benchmark based on the widely used open source search engine framework Lucene. Plenty of room here!
  4. Interrupt-Driven Profilers. Sampling at default 1 kHz, maximum 100 kHz.
  5. Method IPC: Lusearch. [Figure: IPC (y-axis 0 to 1.6) of the top 10 methods (74% of total execution time), measured at default 1 kHz, maximum 100 kHz, and SHIM 10 MHz.]
  6. Sampling IPC. Two counters: C (cycles) and R (retired instructions), sampled over time as (R0, C0), (R1, C1), (R2, C2), (R3, C3), giving IPC1, IPC2, IPC3. IPC_t = (R_t - R_{t-1}) / (C_t - C_{t-1}). IPC is a high-frequency signal.
  7. Sampling Lusearch IPC. [Figure: three panels, each with y-axis IPC, 0 to 2.5: SHIM 10 MHz, maximum 100 kHz, default 1 kHz.]
  8. #define DEFAULT_MAX_SAMPLE_RATE 100000 /* * perf samples are done in some very critical code paths (NMIs). * If they take too much CPU time, the system can lock up and not * get any real work done. This will drop the sample rate when [Table: profilers, SHIM, and simulators compared on HiFi, handy, and online; marks as extracted: ✓✗ ✓ ✗ ✗ ✓✓ ✓ ✓]
  9. Insight
  10. Hardware and Software Generate Signals. Hardware signals: hardware performance counters. Software signals: memory locations. [Figure: the code A(x) { x.y = B(); x.z = C(); } shown as A(), B(), C() along a time axis.]
  11. Signals. [Table: hardware signals and software signals against hardware and software, with entries counters and tags; marks: ✓ ✓ ✓ ✓]
  12. Observe Signals From Another Hardware Context
  13. SHIM design
  14. Observe Global Counters. Signal: LLC misses per cycle. while (true): for counter in LLC misses, cycles: buf[i++] = readCounter(counter)
  15. Observe Local Counters. while (true): for counter in HT2 SHIM, Core, Cycles: buf[i++] = readCounter(counter); HT1 IPC = Core IPC - HT2 SHIM IPC. [Figure: HT1 and HT2 panels showing HT1 IPC, Core IPC, and HT2 SHIM IPC, each with y-axis 0 to 4.]
  16. Correlate Hardware and Software Signals. while (true): for counter in HT2 SHIM, Core, cycles: buf[i++] = readCounter(counter); tid = thread on HT1; buf[i++] = tid.method; [Figure: HT1 IPC, Core IPC, and HT2 SHIM IPC (each y-axis 0 to 4) alongside the HT1 stack with methods A(), B(), C() at samples 1, 2, 3.]
  17. Fidelity
  18. Raw Samples. [Figure: % of samples (log scale) against IPC (log scale).]
  19. Problem: Samples Are Not Atomic. Counters: C (cycles) and R (retired instructions), sampled over time as (R0, C0), (R1, C1), (R2, C2), (R3, C3), giving IPC1, IPC2, IPC3. IPC_t = (R_t - R_{t-1}) / (C_t - C_{t-1}). Marks on the three IPC values: ✗✓ ✓
  20. Solution: Use Clock As Ground Truth. Each sample reads the cycle counter at its start (Cs) and end (Ce): (Cs0, R0, C0, Ce0), (Cs1, R1, C1, Ce1), (Cs2, R2, C2, Ce2), (Cs3, R3, C3, Ce3), giving IPC1, IPC2, IPC3 (marks: ✗✓ ✓). CPC_t = (Ce_t - Ce_{t-1}) / (Cs_t - Cs_{t-1}); this should be 1! CPC1 = 1.0 +/- 1%, CPC2 = 1.0 +/- 1%, CPC3 != 1.0 +/- 1%. while (true): buf[i++] = readCycle(); // read Cs for counter in HT2 SHIM, Core, cycles: buf[i++] = readCounter(counter); buf[i++] = readCycle(); // read Ce tid = thread on HT1 buf[i++] = tid.method;
  21. Filter Lusearch Samples. [Figure: % of samples (log scale); series: raw IPC, raw CPC, filtered IPC, filtered CPC in [0.99, 1.01].]
  22. Overheads
  23. Software Signal, Other Core. Observe method and loop IDs. [Figure: overhead normalized to without SHIM (y-axis 0 to 4), labelled 30 cycles and 1213 cycles.] Overheads are from write invalidate transactions. 3 MHz: more than an order of magnitude better than 'maximum'. 113 MHz: more than three orders of magnitude better than 'maximum'.
  24. Software Signal, Same Core. Observe method and loop IDs. [Figure: overhead normalized to without SHIM (y-axis 0 to 2.5), labelled 15 cycles and 1505 cycles.] Overheads are from sharing the core resources.
  25. Hardware and Software Signals, Same Core. Correlate IPC with method and loop IDs. [Figure: overhead normalized to without SHIM (y-axis 0 to 1.8), labelled 495 cycles.]
  26. Reducing Overheads • Bursty sampling • SMT priorities • Heterogeneous multicore • Globally visible per-thread performance counters
  27. Conclusion • High frequency sampling is important • SHIM observes signals directly, low overhead • Cycles per cycle filters samples • Opportunities for hardware analysis • Opportunities for hardware design. Questions? https://github.com/ShimProfiler/SHIM
  28. Backup Slides
  29. 100 kHz (10 μs): high or low?
  30. 10 μs is not bad
  31. 10 μs is not bad? 25 μs! Simple Address Book *Name: Xi YANG *Email: xi.yang@anu.edu.au
  32. 100 kHz (10 μs) won't see this: the 25 μs life of address_book.SerializeToOstream(&output). Sampling at 5 MHz, 608 cycles
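
The IPC formula on slides 6 and 19 and the cycles-per-cycle (CPC) filter on slide 20 can be illustrated with a minimal Python sketch. It is not the SHIM implementation: the `Sample` record, the `ipc_with_cpc_filter` function, and the counter values are made up for illustration, and the 1% tolerance is taken from the "1.0 +/- 1%" on slide 20. Each sample holds the cycle counter read first (Cs), the retired-instruction and cycle counters (R, C), and the cycle counter read last (Ce). If the sampling loop is delayed between reads, Ce advances more than Cs and CPC moves away from 1, so the interval is rejected.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    cs: int  # cycle counter read at the start of the sample (Cs)
    r: int   # retired-instruction counter (R)
    c: int   # cycle counter used for the IPC computation (C)
    ce: int  # cycle counter read at the end of the sample (Ce)


def ipc_with_cpc_filter(samples, tolerance=0.01):
    """Return (ipc, cpc, keep) for each pair of consecutive samples.

    ipc = (R_t - R_{t-1}) / (C_t - C_{t-1})
    cpc = (Ce_t - Ce_{t-1}) / (Cs_t - Cs_{t-1}), expected to be 1
    keep is True when cpc lies within 1.0 +/- tolerance.
    """
    results = []
    for prev, cur in zip(samples, samples[1:]):
        d_cycles = cur.c - prev.c
        d_start = cur.cs - prev.cs
        if d_cycles <= 0 or d_start <= 0:
            continue  # counters did not advance; no ratio can be formed
        ipc = (cur.r - prev.r) / d_cycles
        cpc = (cur.ce - prev.ce) / d_start
        keep = abs(cpc - 1.0) <= tolerance
        results.append((ipc, cpc, keep))
    return results


# Hypothetical counter values; the last sample was delayed between its reads.
samples = [
    Sample(cs=0, r=0, c=10, ce=20),
    Sample(cs=1000, r=1500, c=1010, ce=1020),
    Sample(cs=2000, r=2300, c=2010, ce=2020),
    Sample(cs=3000, r=4000, c=3010, ce=3400),
]

results = ipc_with_cpc_filter(samples)
for ipc, cpc, keep in results:
    print(f"IPC={ipc:.2f} CPC={cpc:.2f} {'keep' if keep else 'discard'}")
```

With these made-up values the first two intervals have a CPC of exactly 1 and are kept, while the third has a CPC of 1.38 and is discarded, which mirrors the CPC1, CPC2, CPC3 pattern on slide 20.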

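The sampling rates quoted on slides 23, 29, and 32 can be checked with simple arithmetic. The sketch below assumes that the cycle counts on those slides are sampling periods measured at the 3.4 GHz clock from slide 2; that reading is an interpretation of the slides rather than something they state, and the function names are hypothetical.

```python
CLOCK_HZ = 3.4e9        # Intel i7-4770 clock from slide 2
PERF_MAX_HZ = 100_000   # the 'maximum' 100 kHz rate from slide 4


def sample_rate_hz(period_cycles):
    """Sampling rate if one sample is taken every period_cycles cycles."""
    return CLOCK_HZ / period_cycles


def samples_in_event(event_seconds, rate_hz):
    """Average number of samples that land inside an event of this length."""
    return event_seconds * rate_hz


EVENT_SECONDS = 25e-6   # the 25 μs call from slide 32

for cycles in (30, 608, 1213):
    rate = sample_rate_hz(cycles)
    print(f"{cycles:>5} cycles: {rate / 1e6:6.1f} MHz, "
          f"{rate / PERF_MAX_HZ:7.1f}x the 100 kHz maximum, "
          f"{samples_in_event(EVENT_SECONDS, rate):6.1f} samples in 25 us")

print(f"100 kHz: {samples_in_event(EVENT_SECONDS, PERF_MAX_HZ):.1f} samples in 25 us")
```

Under that assumption, 30 cycles corresponds to about 113 MHz and 1213 cycles to about 2.8 MHz, which is consistent with the 113 MHz and 3 MHz figures on slide 23. A 608-cycle period is about 5.6 MHz and places roughly 140 samples inside a 25 μs event, whereas 100 kHz places only 2.5 on average.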