To hit Ruby 3x3, we must first figure out **what** we're going to measure and **how** we're going to measure it, so that the measurement reflects what we actually want. I'll cover some standard definitions used when benchmarking dynamic languages, as well as the tradeoffs that must be made when benchmarking. I'll look at some of the possible benchmarks that could be considered for Ruby 3x3, and evaluate what each is good at measuring and what it is less good at measuring, to help the Ruby community decide what the 3x goal will be measured against.
8. Definition
Benchmark: a comparison of measurements. For example:
– Comparing the execution time of different interpreters or options.
– Comparing the execution time of algorithms.
– Comparing the accuracy of different machine learning algorithms.
11. Microbenchmarks
A very small program written to explore the performance of one aspect of the system under test. (A sketch follows below.)
Pros
– Often easy to set up and run.
– Targeted to a particular aspect.
– Fast acquisition of data.
Cons
– Exaggerates effects.
– Not typically generalizable.
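As an illustration, a minimal microbenchmark sketch using the stdlib Benchmark module (the target here, string interpolation, is chosen arbitrarily):

require 'benchmark'

# Measures exactly one narrow aspect of the system (string interpolation)
# and nothing else.
puts Benchmark.measure {
  1_000_000.times { |i| "iteration #{i}" }
}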
12. Full Applications
Benchmarking a whole application.
Pros
– Immediate and obvious real-world impact!
Cons
– Small effects can be swamped by natural application variance.
– Can be complicated to set up, or slow to run!
13. Application Kernel
A particular part of an application, extracted for the express purpose of constructing a benchmark.
Pros
– Tight connection to real-world code.
– Typically more generalizable.
Cons
– Difficult to know how much of an application should be included vs. mocked.
14. Pitfalls in benchmark design
Un-Ruby-like code:
– Code that looks like another language. (“You can write FORTRAN in any language.”)
– Code that never produces garbage.
– Code without exceptions.
15. Pitfalls in benchmark design
Input data is a key part of many benchmarks: watch out for weird input data!
Imagine an MP3 compressor benchmark whose inputs are:
1. Silence: weird, because most MP3s are not silence.
2. White noise: weird, because most MP3s have some structure.
– Weird inputs reduce the generalizability of the results!
16. The Art of Benchmarking
What do you run? What do you measure?
26. An aside on misleading with speedup.
Speedup: a ratio computed between a baseline and an experimental time measurement.
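As a worked example (with illustrative numbers): if the baseline takes 10.0s and the experiment takes 2.5s, then

speedup = baseline time / experimental time = 10.0 / 2.5 = 4.0x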
28. An aside on misleading with speedup.
“He who controls the baseline controls the speedup.”
29. An aside on misleading with speedup.
“Our parallelization system shows linear speedup as the number of threads increases.”
30. An aside on misleading with speedup.
[Chart: “Speedup” vs. thread count. Bars at 1, 2, 4, and 8 threads rise roughly linearly toward 8x.]
31. An aside on misleading with speedup.
Measurement                   Time (s)
Original Sequential Program   10.0
Parallelized, 1 thread        100.0
Parallelized, 2 threads       50.0
Parallelized, 4 threads       25.0
Parallelized, 8 threads       12.5
This is the distinction between relative speedup and absolute speedup (worked through in the sketch below).
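A short sketch computing both ratios from the table above:

# Relative vs. absolute speedup, using the numbers from the table.
sequential = 10.0                                  # original sequential program
parallel   = { 1 => 100.0, 2 => 50.0, 4 => 25.0, 8 => 12.5 }

parallel.each do |threads, time|
  relative = parallel[1] / time                    # baseline: the 1-thread parallel run
  absolute = sequential / time                     # baseline: the sequential program
  puts format("%d threads: relative %.1fx, absolute %.2fx", threads, relative, absolute)
end
# At 8 threads the relative speedup is a shiny 8.0x, but the absolute
# speedup is 0.8x: still slower than the original sequential program!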
33. Both of these are valid benchmarks!
$ cat test.rb
...
puts Benchmark.measure {
  1_000_000.times {
    compute_foo()
  }
}
$ for i in `seq 1 10`; do ruby test.rb; done;

vs.

...
10.times {
  puts Benchmark.measure {
    1_000_000.times {
      compute_foo()
    }
  }
}

But they’re going to measure (and may encourage the optimization of) two different things!
34. Definition
Warmup: the time from application start until it hits peak performance.
[Chart: “Time per Iteration (s)” over 11 iterations: 100, 64, 69, 36, 25, 30, 25, 26, 25, 26, 25. The time drops steeply, then stabilizes around 25s.]
35. When has warmup finished?
This is surprisingly hard to answer precisely. Despite this, even knowing warmup exists is important: it allows us to choose methodologies that can accommodate the possibility!
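One common accommodation, sketched below (compute_foo stands in for the benchmark body, as in the earlier snippets, and the cutoff of 5 iterations is an arbitrary judgement call):

require 'benchmark'

WARMUP_ITERATIONS   = 5    # arbitrary: picking this number is itself a judgement call
MEASURED_ITERATIONS = 10

times = (WARMUP_ITERATIONS + MEASURED_ITERATIONS).times.map do
  Benchmark.realtime { compute_foo() }
end

# Print every iteration so stabilization is visible, but score only the tail.
times.each_with_index { |t, i| puts format("iteration %2d: %.3fs", i + 1, t) }
steady = times.drop(WARMUP_ITERATIONS)
puts format("steady-state mean: %.3fs", steady.inject(:+) / steady.size)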
36. Definition
Run-to-Run Variance: the observed effect that identical runs do not have identical times.
$ for i in `seq 1 5`; do ruby -I../../lib/ string-equal.rb --loopn 1 1000; done;
1.347334558
1.348350632
1.30690478
1.314764977
1.323862345
37. Methodology
An incomplete list of decisions that need to be made when developing a benchmarking methodology:
1. Does your methodology account for warmup?
2. How are you accounting for run-to-run variance?
3. How are you accounting for the effects of the garbage collector?
38. Pitfalls in benchmark design
Accounting for warmup often means producing intermediate scores, so you can see when they stabilize.
If you aren’t accounting for warmup, you may find that you miss out on peak performance.
39. Pitfalls in benchmark design
Account for run-to-run variance by running multiple times and presenting confidence intervals! (A sketch follows below.)
Be sure your methodology doesn’t encourage wild variations in performance, though!
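A sketch of reporting a confidence interval from repeated runs, using the five times from the run-to-run variance example above (the 1.96 z-value assumes normally distributed times and is a simplification for samples this small; a t-value would be more honest):

times = [1.347334558, 1.348350632, 1.30690478, 1.314764977, 1.323862345]

n        = times.size
mean     = times.inject(:+) / n
variance = times.inject(0.0) { |s, t| s + (t - mean)**2 } / (n - 1)
std_err  = Math.sqrt(variance / n)

half_width = 1.96 * std_err  # approximate 95% interval
puts format("%.3fs +/- %.3fs (95%% CI)", mean, half_width)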
42. Garbage Collector Impact
Garbage collector impact can make benchmarks incredibly difficult to compare:
– The Ruby+OMR Preview uses the OMR GC technology, including a change to move off-heap data on heap.
– A side effect of this is that it’s crazy difficult to compare against the default Ruby: there’s an entirely different set of data on the heap!
– If heap size adapts to machine memory, you’ll need to figure out how to lock it down to give good comparisons across machines.
[Diagram: a string with a malloc’d off-heap buffer vs. a string with an on-heap OMRBuffer.]
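At minimum, it helps to make GC activity visible next to your scores. A sketch for CRuby (GC.stat is CRuby-specific and its keys vary by version; run_benchmark_body is a hypothetical stand-in for your workload):

require 'benchmark'

GC.start                      # start from a roughly known heap state
before  = GC.stat
elapsed = Benchmark.realtime { run_benchmark_body() }
after   = GC.stat

puts format("time: %.3fs", elapsed)
puts "GC runs during measurement: #{after[:count] - before[:count]}"

CRuby also exposes environment variables (e.g. RUBY_GC_HEAP_INIT_SLOTS) that can help pin initial heap sizing across machines.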
45. User Error
$ time ruby their_implementation.rb 100000
real 0m10.003s
user 0m08.001s
sys 0m02.007s
$ time ruby my_implementation.rb 10000
real 0m1.003s
user 0m0.801s
sys 0m0.206s
10x speedup!
(The user error above: the two commands were run with different inputs, 100000 vs. 10000.)
Pro Tip: Use a harness that keeps you out of the benchmarking process. Aim for reproducibility!
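One such harness in the Ruby ecosystem is the benchmark-ips gem, which handles warmup and reports results with error bars. A sketch (their_implementation and my_implementation are hypothetical stand-ins):

require 'benchmark/ips'

Benchmark.ips do |x|
  x.report("theirs") { their_implementation(100_000) }
  x.report("mine")   { my_implementation(100_000) }   # same input this time!
  x.compare!
end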
48. Other Hardware Effects to watch for!
TurboBoost (and similar): frequency scaling based on... the season? The location? The rack? No: CPU temperature.
Even in the cloud! [1]
[1]: http://www.brendangregg.com/blog/2014-09-15/the-msrs-of-ec2.html
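A defensive sketch: warn yourself when turbo is active (this assumes Linux with the intel_pstate driver; the sysfs path does not exist elsewhere):

# no_turbo == "0" means TurboBoost is enabled (Linux + intel_pstate only).
no_turbo = "/sys/devices/system/cpu/intel_pstate/no_turbo"
if File.exist?(no_turbo) && File.read(no_turbo).strip == "0"
  warn "TurboBoost is enabled: results may vary with CPU temperature!"
end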
52. Software Pitfalls
– What about your backup service?
– Long sequence of benchmarks… do you have automatic software updates installed?
– Do your system administrators know you are benchmarking?
54. Paranoia is a matter of Effect Sizes
Hardware changes:
– Disable TurboBoost.
– Disable hyperthreading.
Krun tool:
– Set ulimit for heap and stack.
– Reboot machine before execution.
– Monitor dmesg for unexpected output.
– Monitor temperature of machine.
– Disable P-states.
– Set the CPU governor to performance mode.
– Control the perf sample rate.
– Disable ASLR.
– Create a new user account for each run.
http://arxiv.org/pdf/1602.00602v1.pdf
59. Squeezing a Water Balloon
Be sure to measure associated metrics to have a clear-headed view of tradeoffs.
For example, JIT compilation:
– Trades startup speed for peak speed.
– Trades footprint for speed.
60. Benchmarks age!
Benchmarks can be wrung of all their possible performance at some point. Using the same benchmarks for too long can lead to shortsighted decisions driven by old benchmarks.
Idiomatic code evolves in a language, and benchmark use of language features can help drive adoption!
– Be sure to benchmark desirable new language features!
64. Recall: Benchmarks drive change
Thought: choose 9 application kernels that represent what we want from a future CRuby!
• Why 9?
• Too many benchmarks can diffuse effort.
• Also! 3x3 = 9!
¯\_(ツ)_/¯
65. Brainstorming on the nine?
1. Some CPU-intensive applications:
• OptCarrot, neural nets, Monte Carlo tree search, a PSD filter pipeline?
2. Some memory-intensive application:
• A large tree mutation benchmark?
3. A startup benchmark:
• time ruby -e "def foo; '100'; end; puts foo"?
4. Some web application framework benchmarks.
66. Choose a methodology that drives the change we want in CRuby.
Want great performance, but not huge warmup times?
– Only run 5 iterations, and score the last one?
Don’t want to deal with warmup?
– Don’t run iterations: score the first run! (Both options are sketched below.)
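A sketch of how those two scoring rules might look in a harness (compute_foo again stands in for the benchmark body):

require 'benchmark'

times = 5.times.map { Benchmark.realtime { compute_foo() } }

peak_score    = times.last   # “run 5 iterations, score the last one”
startup_score = times.first  # “score the first run”: punishes slow warmup

puts format("peak: %.3fs, startup: %.3fs", peak_score, startup_score)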
I ♥ Error Bars
69. Use the ecosystem!
Add a standard performance harness to RubyGems. This would allow VM developers to sample popular gems and run a perf suite written by gem authors.
With effort, time, and $$$, we could make broad statements about performance impact on the gem ecosystem.
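No such harness exists today; purely as a hypothetical illustration, a gem author might declare a suite something like this (Gem::PerfSuite, MyGem, and SAMPLE_INPUT are invented names):

# Hypothetical API: nothing like this exists in RubyGems today.
Gem::PerfSuite.define do |suite|
  suite.benchmark("parse a typical document") do
    MyGem.parse(SAMPLE_INPUT)
  end
end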
70. Use the ecosystem!
This doesn’t just help VM developers. Gem authors get:
1. Performance tracking enabled for them!
2. Easier performance reporting to VM developers.
OMR is a project trying to create reusable components for building or augmenting language runtimes.
There should be some news soon, so follow us on Twitter.
Please come talk to me about OMR! But I’m not here to talk about OMR right now.
That purple circle hides a big concept! Let’s dig into it.
Benchmarking is this weird combination of art and science that drives me mad. The problem is that benchmarks seem so objective and scientific, but they are filled with judgement calls, and the science is hard!
The art of benchmarking ends up being a long list of questions you have to ask yourself and decisions you have to make, filled with judgement calls.
First off, what do you run?
Sometimes this involves mocking up parts of the normal application flow in such a way as to keep the code isolated.
Imagine how this perturbs the code paths that your interpreter is going to take.
Lots of questions have to be asked when you are benchmarking.
This is equally true of both application developers and those who are developing language runtimes!
CPU time can be pretty misleading in a lot of circumstances: notice that sleep used almost no CPU time, because it didn’t do anything! But it spent a long time running!
It can be important, though, if you’re on a platform that charges by CPU usage!
For example, in a web server, latency would be how long it takes a request to be processed after the request is received.
Typically, speedup is talking about a measurement on the same machine with a software change of some kind, though one can also compute speedups by changing hardware.
I used to be an academic, and I learned while I was there that it’s terribly easy to lie with speedup.
To abuse a quote from Dune: “He who controls the baseline controls the speedup.”
You’ll note that even at 8 threads, the parallel program is slower than the original.
Relative speedup: relative to the 1-thread run.
Absolute speedup: relative to the fastest sequential version!
This point isn’t obvious to everyone.
The first will tend to encourage faster startup: if compute_foo runs quickly, startup costs will dominate the runs on the left side.
Warmup is a really awkward term: many people understand what you mean, but it doesn’t have a great scientific definition. Warmup can occur as code loading happens, caches are warmed up, JIT compilation occurs, operating system threads are scheduled, etc.
Reporting the minimum time, for example.
When trying to measure performance, be aware that benchmarks can act weird! You’ll have to report with a methodology that can handle it!
A 3x degradation of performance can come from having too small a heap.
Imagine you do your benchmark baseline on your couch at home, but then you get to work and find your change has made everything 3x faster!
You benchmark 10 rubies…
But we would like to be able to measure small changes…
Faster code can come at the cost of increased warmup time, increased footprint, etc.
Just because you’re the fastest C89 compiler today doesn’t matter if people are writing C11 code that looks different!
At this point, we go to the wise tenderlove, who reminds us!
Please… whatever you do, though, account for some variance.