Computer Performance
Microscopy with SHIM
Kathryn McKinley
Microsoft Research
Steve Blackburn
Australian National University
Xi Yang
Australian National University
4 μops
Intel i7-4770, 3.4 GHz
Benchmark IPC
[Figure: IPC per benchmark; y-axis: IPC, 0 to 4]
Lusearch is a DaCapo benchmark based on
the widely used open source search engine framework Lucene.
Plenty of room here!
Interrupt Driven Profilers
Sampling at default 1 KHz, maximum 100 KHz.
Method IPC
Lusearch
top 10 methods (74% total execution time)
[Figure: IPC of the top 10 methods; x-axis: methods 1 to 10; y-axis: IPC, 0 to 1.6; series: default 1 KHz, maximum 100 KHz, SHIM 10 MHz]
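Sampling IPC
Two counters: C (cycles), R (retired instructions)
[Figure: time axis with samples (R0, C0), (R1, C1), (R2, C2), (R3, C3) delimiting the intervals IPC1, IPC2, IPC3]
IPC_t = (R_t - R_{t-1}) / (C_t - C_{t-1})
IPC is a high frequency signal.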
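A minimal sketch of the interval formula IPC_t = (R_t - R_{t-1}) / (C_t - C_{t-1}), assuming the samples are cumulative readings of the two counters. The function name `ipc_series` and the sample values are made up for illustration and are not measurements from the talk.

```python
def ipc_series(samples):
    """Per-interval IPC from cumulative (retired_instructions, cycles) samples."""
    ipcs = []
    for (r_prev, c_prev), (r_cur, c_cur) in zip(samples, samples[1:]):
        cycles = c_cur - c_prev
        if cycles <= 0:
            # Skip degenerate intervals rather than divide by zero.
            continue
        ipcs.append((r_cur - r_prev) / cycles)
    return ipcs


# Hypothetical counter readings (R_t, C_t).
samples = [(0, 0), (200, 100), (250, 200), (650, 300)]
ipcs = ipc_series(samples)
print(ipcs)  # [2.0, 0.5, 4.0]
```

Each IPC value describes only the interval between two neighbouring samples, so a lower sampling rate averages over longer stretches of execution.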
Sampling Lusearch IPC
[Figure: three IPC panels, one per sampling rate: SHIM 10 MHz, maximum 100 KHz, default 1 KHz; y-axis: IPC, 0 to 2.5]
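#define DEFAULT_MAX_SAMPLE_RATE 100000
/*
 * perf samples are done in some very critical code paths (NMIs).
 * If they take too much CPU time, the system can lock up and not
 * get any real work done. This will drop the sample rate when
 * ...
 */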
insight
Hardware and Software
Generate Signals
hardware signals: hardware performance counters
software signals:
A(x) {
    x.y = B();
    x.z = C();
}
[Figure: A(), B(), C() over time; memory locations]
Signals
                      hardware signals   software signals
hardware counters            ✓                  ✓
software tags                ✓                  ✓
Observe Signals From
Another Hardware Context
SHIM design
Observe Global Counters
LLC misses per cycle
while (true):
    for counter in LLC misses, cycles:
        buf[i++] = readCounter(counter)
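The polling loop above can be sketched in Python as follows. This is only an illustration under stated assumptions: `make_fake_reader` is a hypothetical stand-in for `readCounter` (real hardware counters are not read here), the loop is bounded instead of `while (true)`, and the counter names and step sizes are invented.

```python
def make_fake_reader():
    """Hypothetical stand-in for readCounter(): each read advances a counter."""
    state = {"llc_misses": 0, "cycles": 0}
    step = {"llc_misses": 2, "cycles": 100}

    def read_counter(name):
        state[name] += step[name]
        return state[name]

    return read_counter


def observe(read_counter, counters, n_samples):
    """Bounded version of the slide's loop: append raw readings to a flat buffer."""
    buf = []
    for _ in range(n_samples):
        for counter in counters:
            buf.append(read_counter(counter))
    return buf


def misses_per_cycle(buf):
    """Turn [misses, cycles, misses, cycles, ...] into per-interval ratios."""
    misses, cycles = buf[0::2], buf[1::2]
    return [
        (m1 - m0) / (c1 - c0)
        for m0, m1, c0, c1 in zip(misses, misses[1:], cycles, cycles[1:])
    ]


buf = observe(make_fake_reader(), ["llc_misses", "cycles"], n_samples=3)
rates = misses_per_cycle(buf)
print(buf)    # [2, 100, 4, 200, 6, 300]
print(rates)  # [0.02, 0.02]
```

Keeping the loop body to plain reads and appends mirrors the slide: the observer only records raw values, and ratios are computed afterwards.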
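Observe Local Counters
while (true):
    for counter in HT2 SHIM, Core, Cycles:
        buf[i++] = readCounter(counter);
[Figure: hardware threads HT1 and HT2; panels HT1 IPC, Core IPC, and HT2 SHIM IPC; y-axis 0 to 4]
HT1 IPC = Core IPC - HT2 SHIM IPC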
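A small worked example of the relation HT1 IPC = Core IPC - HT2 SHIM IPC, assuming each sample holds cumulative readings of SHIM's own retired instructions, the core-wide retired instructions, and cycles. The tuple layout and the numbers are assumptions for illustration.

```python
def ht1_ipc(prev, cur):
    """prev and cur are cumulative (shim_instructions, core_instructions, cycles)."""
    d_shim = cur[0] - prev[0]
    d_core = cur[1] - prev[1]
    d_cycles = cur[2] - prev[2]
    core_ipc = d_core / d_cycles
    shim_ipc = d_shim / d_cycles
    # Whatever the core retired that SHIM did not retire belongs to HT1.
    return core_ipc - shim_ipc


prev = (0, 0, 0)
cur = (100, 300, 100)   # hypothetical readings one interval later
result = ht1_ipc(prev, cur)
print(result)  # 2.0
```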
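Correlate Hardware and Software Signals
while (true):
    for counter in HT2 SHIM, Core, cycles:
        buf[i++] = readCounter(counter);
    tid = thread on HT1
    buf[i++] = tid.method;
[Figure: panels HT1 IPC, Core IPC, and HT2 SHIM IPC (y-axis 0 to 4) alongside methods A(), B(), C() on the HT1 stack]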
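Once every sample carries a method tag, per-method IPC can be derived by grouping. The sketch below assumes each sample has already been reduced to (method, instructions in the interval, cycles in the interval); that record layout, the function name `per_method_ipc`, and the values are hypothetical.

```python
from collections import defaultdict


def per_method_ipc(samples):
    """Aggregate tagged samples into one IPC value per method."""
    instructions = defaultdict(int)
    cycles = defaultdict(int)
    for method, d_instr, d_cycles in samples:
        instructions[method] += d_instr
        cycles[method] += d_cycles
    # Divide totals, so long intervals weigh more than short ones.
    return {m: instructions[m] / cycles[m] for m in instructions if cycles[m] > 0}


# Hypothetical tagged samples: (method, instruction delta, cycle delta).
samples = [("A", 100, 100), ("B", 300, 100), ("A", 50, 50), ("C", 40, 80)]
table = per_method_ipc(samples)
print(table)  # {'A': 1.0, 'B': 3.0, 'C': 0.5}
```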
Fidelity
Raw Samples
[Figure: distribution of raw samples; axes: IPC (log scale) and % of samples (log scale)]
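Problem: Samples Are Not Atomic
Counters: C (cycles), R (retired instructions)
[Figure: time axis with samples (R0, C0), (R1, C1), (R2, C2), (R3, C3) delimiting the intervals IPC1, IPC2, IPC3, with ✗ ✓ ✓ marks]
IPC_t = (R_t - R_{t-1}) / (C_t - C_{t-1})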
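Solution: Use Clock As Ground Truth
[Figure: time axis where each sample t = 0..3 is bracketed by two clock reads, (Cs_t, R_t, C_t, Ce_t), delimiting the intervals IPC1, IPC2, IPC3, with ✗ ✓ ✓ marks]
CPC1 = 1.0 +/- 1%
CPC2 = 1.0 +/- 1%
CPC3 != 1.0 +/- 1%
CPC_t = (Ce_t - Ce_{t-1}) / (Cs_t - Cs_{t-1})    (this should be 1!)
while (true):
    buf[i++] = readCycle();  // read Cs
    for counter in HT2 SHIM, Core, cycles:
        buf[i++] = readCounter(counter);
    buf[i++] = readCycle();  // read Ce
    tid = thread on HT1
    buf[i++] = tid.method;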
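A minimal sketch of the cycles-per-cycle filter, assuming each sample is a tuple (Cs, R, C, Ce) of cumulative readings and using the 1% tolerance from the slide. The function name `cpc_filter` and the sample values are invented for illustration.

```python
def cpc_filter(samples, tol=0.01):
    """Keep the IPC of intervals whose CPC lies within 1.0 +/- tol.

    Each sample is (cs, r, c, ce): clock at start, retired instructions,
    cycles, clock at end.
    """
    kept = []
    for (cs0, r0, c0, ce0), (cs1, r1, c1, ce1) in zip(samples, samples[1:]):
        cpc = (ce1 - ce0) / (cs1 - cs0)
        if abs(cpc - 1.0) <= tol:
            kept.append((r1 - r0) / (c1 - c0))
    return kept


# Hypothetical samples; in the last one the end clock read is delayed.
samples = [
    (0, 0, 5, 10),
    (1000, 1500, 1005, 1010),
    (2000, 2500, 2005, 2010),
    (3000, 4000, 3005, 3500),
]
kept = cpc_filter(samples)
print(kept)  # [1.5, 1.0]
```

The third interval is dropped because the two clock reads disagree about its length (CPC = 1.49), which signals that the sample was perturbed while it was being taken.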
Filter Lusearch Samples
[Figure: distributions of Lusearch samples; y-axis: % of samples (log scale); series: raw IPC, raw CPC, filtered IPC, filtered CPC in [0.99, 1.01]]
overheads
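Software Signal
Other Core
observe method and loop IDs.
[Figure: normalized to without SHIM; y-axis 0 to 4; configurations: 30 cycles, 1213 cycles]
Overheads are from write invalidate transactions.
3 MHz: more than an order of magnitude better than ‘maximum’
113 MHz: more than three orders of magnitude better than ‘maximum’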
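A rough arithmetic check of the rates quoted for this slide. It assumes, which the slides do not state explicitly, that the sampling rate is the 3.4 GHz clock of the i7-4770 divided by the cycles spent per sample. Under that assumption, 30 and 1213 cycles land close to the 113 MHz and 3 MHz figures.

```python
CLOCK_HZ = 3.4e9  # Intel i7-4770 clock from the earlier slide


def sampling_rate_mhz(cycles_per_sample, clock_hz=CLOCK_HZ):
    """Sampling rate in MHz if one sample is taken every cycles_per_sample cycles."""
    return clock_hz / cycles_per_sample / 1e6


rates = {cycles: sampling_rate_mhz(cycles) for cycles in (30, 1213)}
for cycles, mhz in rates.items():
    print(f"{cycles} cycles per sample: about {mhz:.1f} MHz")
```

Compared with the 100 KHz perf maximum (0.1 MHz), these rates are roughly 28 and 1100 times higher, which matches the "one order" and "three orders of magnitude" wording.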
Software Signal
Same Core
observe method and loop IDs.
[Figure: normalized to without SHIM; y-axis 0 to 2.5; configurations: 15 cycles, 1505 cycles]
Overheads are from sharing the core resources.
Hardware and Software Signals
Same Core
Correlate IPC with method and loop IDs.
[Figure: normalized to without SHIM; y-axis 0 to 1.8; configuration: 495 cycles]
Reducing Overheads
• Bursty sampling
• SMT priorities
• Heterogeneous multicore
• Globally visible per-thread performance counters
Conclusion
• High frequency sampling is important
• SHIM observes signals directly, low overhead
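• Cycles per cycle filters samples
• Opportunities for hardware analysis
• Opportunities for hardware design
Questions?
https://github.com/ShimProfiler/SHIM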
Backup Slides
100 KHz (10 μs)
High or low?
10 μs is not bad
10 μs is not bad?
25 μs!
Simple Address Book
*Name: Xi YANG
*Email: xi.yang@anu.edu.au
100 KHz (10 μs) won’t see this
The 25 μs life of the
address_book.SerializeToOstream(&output) call.
Sampling at 5 MHz, 608 cycles
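A short worked example of why a 10 μs sampling period struggles with a 25 μs event. It assumes evenly spaced samples, so the expected number of samples landing inside an event is simply its duration times the sampling rate. The function name `samples_in_event` is made up for illustration.

```python
def samples_in_event(event_us, rate_hz):
    """Expected number of evenly spaced samples that fall inside an event."""
    return event_us * rate_hz / 1_000_000


for label, rate_hz in [("default 1 KHz", 1_000),
                       ("maximum 100 KHz", 100_000),
                       ("SHIM 5 MHz", 5_000_000)]:
    print(f"{label}: {samples_in_event(25, rate_hz)} samples in 25 μs")
```

At 100 KHz the call receives only two or three samples, too few to show any internal structure, while 5 MHz yields on the order of a hundred.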
Notes
  • I will introduce SHIM, a high-frequency profiler.
  • Many of you have a CPU with this microarchitecture in your laptop.
  • Need strong reasons for choosing lusearch:
    it is similar to Bing and Google.
  • Intrinsic limitations of interrupt-driven profilers.
  • If we increase the frequency 100x more, then we see very interesting pictures.

    Font sizes: 20 for legends and keys, 28 for words, 36 for the title.
  • Transition: let’s see how we build this tool.
  • let’s start with the insights of SHIM
  • software signal: explicit signal and implicit signal
  • Give examples for the matrix.
  • Transition: a hardware context (HC) could be a core or a hyper-thread.
  • Have to explain why HT2 IPC is stable.

    Talk about the size of the profiling loop.
  • software signa
  • Now that we have shown the design of SHIM, we need one more thing: how can we trust those numbers?
  • Existing profilers share the same problem: a low sampling rate.
    Low sampling rate -> 1) can’t observe fine-granularity events, 2) can’t
  • After filtering with the CPC metric, we can trust those samples, and they are in the valid range.

    That is the complete design of our tool; it can be checked out from GitHub.
  • We are going to show a few simple examples and overheads
  • change fonts

    method and loop IDs are very high frequency signals
  • SMT priority isn’t har
  • put url here
  • Computer Performance Microscopy with SHIM

    1. Computer Performance Microscopy with SHIM. Kathryn McKinley, Microsoft Research; Steve Blackburn, Australian National University; Xi Yang, Australian National University
    2. 4 μops. Intel i7-4770, 3.4 GHz
    3. Benchmark IPC [figure; y-axis: IPC, 0 to 4]. Lusearch is a DaCapo benchmark based on the widely used open source search engine framework Lucene. Plenty of room here!
    4. Interrupt Driven Profilers. Sampling at default 1 KHz, maximum 100 KHz.
    5. Method IPC, Lusearch: top 10 methods (74% total execution time) [figure; x-axis: methods 1 to 10; y-axis: IPC, 0 to 1.6; series: default 1 KHz, maximum 100 KHz, SHIM 10 MHz]
    6. Sampling IPC. Two counters: C (cycles), R (retired instructions). Samples (R0, C0), (R1, C1), (R2, C2), (R3, C3) along the time axis delimit IPC1, IPC2, IPC3. IPC_t = (R_t - R_{t-1}) / (C_t - C_{t-1}). IPC is a high frequency signal.
    7. Sampling Lusearch IPC [figure; three IPC panels, y-axis 0 to 2.5: SHIM 10 MHz, maximum 100 KHz, default 1 KHz]
    8. #define DEFAULT_MAX_SAMPLE_RATE 100000 /* perf samples are done in some very critical code paths (NMIs). If they take too much CPU time, the system can lock up and not get any real work done. This will drop the sample rate when ... */ [Table: profilers, SHIM, simulators against HiFi, handy, online; marks ✓✗ ✓ ✗ ✗ ✓✓ ✓ ✓]
    9. insight
    10. Hardware and Software Generate Signals. Hardware signals: hardware performance counters. Software signals: A (x){ x.y = B(); x.z = C(); } [figure: A(), B(), C() over time; memory locations]
    11. Signals. Hardware signals and software signals; hardware counters and software tags: ✓ ✓ ✓ ✓
    12. Observe Signals From Another Hardware Context
    13. SHIM design
    14. Observe Global Counters. LLC misses per cycle. while (true): for counter in LLC misses, cycles: buf[i++] = readCounter(counter)
    15. Observe Local Counters. while (true): for counter in HT2 SHIM, Core, Cycles: buf[i++] = readCounter(counter); HT1 IPC = Core IPC - HT2 SHIM IPC [figure: HT1 and HT2; panels HT1 IPC, Core IPC, HT2 SHIM IPC, y-axis 0 to 4]
    16. Correlate Hardware and Software Signals. while (true): for counter in HT2 SHIM, Core, cycles: buf[i++] = readCounter(counter); tid = thread on HT1; buf[i++] = tid.method; [figure: panels HT1 IPC, Core IPC, HT2 SHIM IPC, y-axis 0 to 4; methods A(), B(), C() on the HT1 stack]
    17. Fidelity
    18. Raw Samples [figure; axes: IPC (log scale) and % of samples (log scale)]
    19. Problem: Samples Are Not Atomic. Counters: C (cycles), R (retired instructions). Samples (R0, C0), (R1, C1), (R2, C2), (R3, C3) along the time axis delimit IPC1, IPC2, IPC3, with ✗ ✓ ✓ marks. IPC_t = (R_t - R_{t-1}) / (C_t - C_{t-1})
    20. Solution: Use Clock As Ground Truth. Each sample t = 0..3 is (Cs_t, R_t, C_t, Ce_t) along the time axis, delimiting IPC1, IPC2, IPC3, with ✗ ✓ ✓ marks. CPC1 = 1.0 +/- 1%, CPC2 = 1.0 +/- 1%, CPC3 != 1.0 +/- 1%. CPC_t = (Ce_t - Ce_{t-1}) / (Cs_t - Cs_{t-1}), this should be 1! while (true): buf[i++] = readCycle(); // read Cs; for counter in HT2 SHIM, Core, cycles: buf[i++] = readCounter(counter); buf[i++] = readCycle(); // read Ce; tid = thread on HT1; buf[i++] = tid.method;
    21. Filter Lusearch Samples [figure; y-axis: % of samples (log scale); series: raw IPC, raw CPC, filtered IPC, filtered CPC in [0.99, 1.01]]
    22. overheads
    23. Software Signal, Other Core: observe method and loop IDs. [figure: normalized to without SHIM, y-axis 0 to 4; 30 cycles, 1213 cycles] Overheads are from write invalidate transactions. 3 MHz: more than an order of magnitude better than ‘maximum’. 113 MHz: more than three orders of magnitude better than ‘maximum’.
    24. Software Signal, Same Core: observe method and loop IDs. [figure: normalized to without SHIM, y-axis 0 to 2.5; 15 cycles, 1505 cycles] Overheads are from sharing the core resources.
    25. Hardware and Software Signals, Same Core: correlate IPC with method and loop IDs. [figure: normalized to without SHIM, y-axis 0 to 1.8; 495 cycles]
    26. Reducing Overheads • Bursty sampling • SMT priorities • Heterogeneous multicore • Globally visible per-thread performance counters
    27. Conclusion • High frequency sampling is important • SHIM observes signals directly, low overhead • Cycles per cycle filters samples • Opportunities for hardware analysis • Opportunities for hardware design. Questions? https://github.com/ShimProfiler/SHIM
    28. Backup Slides
    29. 100 KHz (10 μs): High or low?
    30. 10 μs is not bad
    31. 10 μs is not bad? 25 μs! Simple Address Book *Name: Xi YANG *Email: xi.yang@anu.edu.au
    32. 100 KHz (10 μs) won’t see this. The 25 μs life of the address_book.SerializeToOstream(&output) call. Sampling at 5 MHz, 608 cycles
