Active Benchmarking: Casual Benchmarking: You Benchmark A, But Actually Measure B, and Conclude You've Measured C
1. If possible, configure the benchmark to run for a long duration in a steady state: eg, hours.
2. While the benchmark is running, analyze the performance of all components involved using other
tools, to identify the true limiter of the benchmark.
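The two steps above can be sketched as a small wrapper that starts the analysis tools first, runs the benchmark, then stops the tools. This is a minimal sketch, not a definitive harness: the function name run_with_observers, the log file naming, and the ./my_benchmark placeholder are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def run_with_observers(benchmark_cmd, observer_cmds, log_dir):
    """Run benchmark_cmd while each observer command logs to its own file.

    Returns the benchmark's exit code. Observers are stopped once the
    benchmark finishes (or fails to start).
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    observers = []
    try:
        for i, cmd in enumerate(observer_cmds):
            # Log to a file rather than a pipe, so an hours-long run cannot
            # fill a pipe buffer and stall the observer.
            log = open(log_dir / f"observer{i}.log", "w")
            try:
                proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
            except OSError:
                log.close()
                raise
            observers.append((proc, log))
        return subprocess.run(benchmark_cmd).returncode
    finally:
        for proc, log in observers:
            proc.terminate()
            proc.wait()
            log.close()


# Hypothetical usage; "./my_benchmark" is a placeholder for the benchmark under test:
# run_with_observers(["./my_benchmark"], [["vmstat", "1"], ["iostat", "-x", "1"]], "logs")
```

The logs only help if someone reads them while the benchmark is still running; collecting them and walking away is passive benchmarking with extra files.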
The process of active benchmarking is similar to the performance analysis of any application. One
difference, which can make this process easier, is that you have a known workload to begin analyzing:
the benchmark applied.
Passive Benchmarking
Benchmarks are commonly executed and then ignored until they have completed. That is passive
benchmarking, where the main objective is the collection of benchmark data. Data is not Information.
A telltale sign is when the only technical results presented are the benchmark results. I've seen
countless slide decks, blog posts, and articles that present an impressive bar chart of comparative
results, but then no supporting technical evidence. It's been my job to get to the bottom of many of
these, and I find that they are wrong or misleading almost every time. The primary reason is
that they have been run passively, "fire and forget" style, with no additional analysis, and all problems
were overlooked.
Active Benchmarking
With active benchmarking, you analyze performance while the benchmark is still running (not just after
it's done), using other tools. You can confirm that the benchmark tests what you intend it to, and that
you understand what that is. Data becomes Information. This can also identify the true limiters of the
system under test, or of the benchmark itself.
To perform active benchmarking, you may use any performance analysis tool that your OS provides:
vmstat, iostat, mpstat, sar, top, tcpdump/snoop, perf, bcc+eBPF/DTrace/SystemTap, strace/truss, etc.
You can also follow a performance analysis methodology to guide your usage of these tools. The USE
Method is especially suited for this, since it identifies typical limiters: hardware and software
resources.
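As a rough illustration of how the USE Method guides tool usage: for every resource, check utilization, saturation, and errors. The sketch below only enumerates those checks; the resource list and the name use_checklist are assumptions for illustration, and a real checklist would cover the hardware and software resources your benchmark actually exercises.

```python
from itertools import product

# Assumed example resources; extend with whatever the benchmark touches.
RESOURCES = ["CPUs", "memory", "disks", "network interfaces"]
# The three metric types that give the USE Method its name.
METRICS = ["utilization", "saturation", "errors"]


def use_checklist(resources=RESOURCES):
    """Return every (resource, metric) pair to check while the benchmark runs."""
    return list(product(resources, METRICS))


for resource, metric in use_checklist():
    print(f"{resource}: check {metric}")
```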
When you are shown benchmark results by others, ask:
Did they run other tools while the benchmark was running? Can they provide screenshots?
Can they explain why the benchmark result was X, and not 2X (twice as fast)? ie, what is the
limiting factor?
Ideally, include the limiting factor (or suspected limiting factor) along with the benchmark results. For
example: "the file system result was limited by the CPU speed of the server, and the benchmark being
single-threaded". For evidence, this statement could include a screenshot showing that the benchmark
was single-threaded and CPU-bound: for example, on Linux, using "pidstat -t 1"; on Solaris, using
"prstat -mLc 1".
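The reasoning behind that kind of evidence can be sketched in a few lines: given per-thread CPU percentages for the benchmark process over one interval (for example, as read by eye from per-thread tool output), check whether exactly one thread is busy and near 100%. The function name and both thresholds are assumptions chosen for illustration, not values from any tool.

```python
def looks_single_threaded_cpu_bound(thread_cpu_pct, busy_pct=90.0, idle_pct=5.0):
    """Return True if exactly one thread is active and it is close to 100% CPU.

    thread_cpu_pct maps a thread ID to its %CPU over one sampling interval.
    busy_pct and idle_pct are assumed thresholds, not standard values.
    """
    busy = [tid for tid, pct in thread_cpu_pct.items() if pct >= busy_pct]
    active = [tid for tid, pct in thread_cpu_pct.items() if pct > idle_pct]
    return len(busy) == 1 and len(active) == 1


# One thread pinned at ~99% CPU, one idle helper thread.
print(looks_single_threaded_cpu_bound({4021: 99.0, 4022: 0.3}))
```

A single interval is weak evidence; in practice you would want this to hold across the steady-state portion of the run.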
Apart from analysis while the benchmark runs, you should also analyze its configuration beforehand.
Ideally, the benchmark is open source, allowing you to study the source code, as well as any Makefiles
and compiler options.
Examples
The following are worked examples of active benchmarking, showing the tools used for analysis:
Problem Checklist
Common pitfalls that can be identified using active benchmarking are when the benchmark is:
The most common case is where a benchmark is not really testing what it claims to test, which can be
identified using active benchmarking. Sometimes the results are still useful, now that they can be
interpreted correctly.
Statistical Analysis
This is the statistical analysis of numerical benchmark results after the benchmark has completed. This
is often considered a useful exercise to develop new information from raw benchmark data, for better
understanding results, and for developing confidence. However, if the benchmark results were wrong
or misleading to begin with, statistical analysis can make matters worse. A sound statistical method can
make benchmark results seem trustworthy, when in fact, they are false. New information developed
may also be false, compounding the problem.
The only good outcome, given bad results, is that statistical analysis deems them untrustworthy (eg, the
coefficient of variation, or CoV, is too high), and analysis moves to understanding what went wrong with
the actual benchmark. In
practice, this doesn't happen as much as I'd like. Often, the wrong target has been benchmarked, but the
results are statistically sound.
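For reference, the CoV is the sample standard deviation divided by the mean. A minimal sketch follows, with hypothetical run results; note that a low CoV only says the runs were consistent, not that they measured the intended target.

```python
import statistics


def coefficient_of_variation(samples):
    """Return sample standard deviation divided by the mean (needs >= 2 samples)."""
    mean = statistics.mean(samples)
    if mean == 0:
        raise ValueError("CoV is undefined when the mean is zero")
    return statistics.stdev(samples) / mean


# Hypothetical throughput results from five runs of the same benchmark.
runs = [1020.0, 1005.0, 998.0, 1011.0, 1003.0]
print(f"CoV: {coefficient_of_variation(runs):.3%}")
```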
Statistical analysis is useful after active benchmarking – when you have valid numbers to work with.
iostat first, R later.
Updates
I gave a lightning talk at Surge 2013 titled Benchmarking Gone Wrong, which provides a
memorable anecdote for active benchmarking.