Description
Describe the bug
DoNotOptimize seems to behave unpredictably on ternary conditionals.
I'm trying to benchmark the cost of manually flushing denormals to zero, and I defined a macro like this:
#include <cmath>
#include <cfloat>
#define FLUSHF(x) ((x) = fabsf(x)<FLT_MIN ? 0 : (x))
This macro flushes any denormal float (i.e. any value whose magnitude is below FLT_MIN) to 0, to avoid the performance degradation that denormal numbers cause on x86.
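As a quick sanity check of the macro outside any benchmark, here is a minimal sketch. The helpers flushed and is_denormal are hypothetical names introduced only for this illustration, and the sketch assumes default IEEE 754 float behavior (no -ffast-math, no flush-to-zero CPU mode).

```cpp
#include <cfloat>
#include <cmath>

#define FLUSHF(x) ((x) = fabsf(x) < FLT_MIN ? 0 : (x))

// Hypothetical helper (not part of the benchmark): applies FLUSHF to a copy
// of v and returns the result, so the macro can be checked on plain values.
inline float flushed(float v)
{
    FLUSHF(v);
    return v;
}

// Hypothetical helper: true when v is a denormal (subnormal) float.
inline bool is_denormal(float v)
{
    return std::fpclassify(v) == FP_SUBNORMAL;
}
```

With these, is_denormal(0.999f * FLT_MIN) should be true, flushed(0.999f * FLT_MIN) should be 0, and flushed(FLT_MIN) should leave the value untouched, since FLT_MIN itself is the smallest normal float.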
This macro is used in the following benchmark:
static void
flush_32(benchmark::State& state)
{
    float mem = FLT_MIN;
    for (auto _ : state) {
        benchmark::DoNotOptimize(mem = 0.999f * mem);
        benchmark::DoNotOptimize(FLUSHF(mem));
    }
}
BENCHMARK(flush_32);
When I compile with g++ at -O3, FLUSHF doesn't seem to work correctly: the flush does not happen, and the benchmark runs much slower.
To see this in action, try compiling the following with -O2 and -O3 under both g++ and clang++ and compare the performance (you might need an x86 machine):
#include <benchmark/benchmark.h>
#include <cmath>
#include <cfloat>
#define FLUSHF(x) ((x) = fabsf(x)<FLT_MIN ? 0 : (x))
static void
flush_32(benchmark::State& state)
{
    float mem = FLT_MIN;
    for (auto _ : state) {
        benchmark::DoNotOptimize(mem = 0.999f * mem);
        benchmark::DoNotOptimize(FLUSHF(mem));
    }
}
BENCHMARK(flush_32);
BENCHMARK_MAIN();
$ g++ -O3 bug.cc -lbenchmark
$ ./a.out
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
flush_32         36.9 ns         36.9 ns     18957274
$ g++ -O2 bug.cc -lbenchmark
$ ./a.out
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
flush_32         6.06 ns         6.05 ns    101055135
$ clang++ -O3 bug.cc -lbenchmark
$ ./a.out
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
flush_32         5.30 ns         5.30 ns    102412118
$ clang++ -O2 bug.cc -lbenchmark
$ ./a.out
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
flush_32         5.23 ns         5.23 ns    101888597
It appears that with gcc at -O3 the flush in FLUSHF is partially optimized away, so mem stays denormal and the benchmark runs much slower.
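To illustrate why a missed flush is so costly here, the sketch below runs the same recurrence outside the benchmark and counts the iterations that leave mem in the denormal range. The helper count_denormal_steps is a hypothetical name used only for this illustration, and the sketch assumes default IEEE 754 round-to-nearest with no flush-to-zero mode enabled.

```cpp
#include <cfloat>
#include <cmath>
#include <cstddef>

#define FLUSHF(x) ((x) = fabsf(x) < FLT_MIN ? 0 : (x))

// Hypothetical helper: runs the benchmark's recurrence for n iterations and
// returns how many of them left mem in the denormal (subnormal) range.
inline std::size_t count_denormal_steps(std::size_t n, bool flush)
{
    float mem = FLT_MIN;
    std::size_t denormal_steps = 0;
    for (std::size_t i = 0; i < n; ++i) {
        mem = 0.999f * mem;   // the first product is already below FLT_MIN
        if (flush) {
            FLUSHF(mem);      // denormal -> 0, and 0 stays 0 afterwards
        }
        if (std::fpclassify(mem) == FP_SUBNORMAL) {
            ++denormal_steps;
        }
    }
    return denormal_steps;
}
```

With the flush, mem becomes 0 in the first iteration and stays there. Without it, rounding keeps the product from ever reaching 0, so every iteration multiplies a denormal, which is the slow path this benchmark is meant to avoid.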
Strangely, the flush works fine with double. I also tried the original escape function from the video mentioned in the comment:
template <class Tp>
inline void escape(Tp& value)
{
    asm volatile("" : : "g"(value) : "memory");
}
With escape in place of DoNotOptimize, the flush works correctly. Try compiling the following to see the problem:
#include <benchmark/benchmark.h>
#include <cmath>
#include <cfloat>
#define FLUSH(x) ((x) = fabs(x)<DBL_MIN ? 0 : (x))
#define FLUSHF(x) ((x) = fabsf(x)<FLT_MIN ? 0 : (x))
template <class Tp>
inline void escape(Tp& value)
{
    asm volatile("" : : "g"(value) : "memory");
}
static void
flush_64(benchmark::State& state)
{
    double mem = FLT_MIN;
    for (auto _ : state) {
        benchmark::DoNotOptimize(mem = 0.999 * mem);
        benchmark::DoNotOptimize(FLUSH(mem));
    }
}
BENCHMARK(flush_64);
static void
flush_32(benchmark::State& state)
{
    float mem = FLT_MIN;
    for (auto _ : state) {
        benchmark::DoNotOptimize(mem = 0.999f * mem);
        benchmark::DoNotOptimize(FLUSHF(mem));
    }
}
BENCHMARK(flush_32);
static void
escape_32(benchmark::State& state)
{
    float mem = FLT_MIN;
    for (auto _ : state) {
        benchmark::DoNotOptimize(mem = 0.999f * mem);
        escape(FLUSHF(mem));
    }
}
BENCHMARK(escape_32);
BENCHMARK_MAIN();
$ g++ -O3 bug.cc -lbenchmark
$ ./a.out
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
flush_64        0.606 ns        0.605 ns    995985269
flush_32         37.5 ns         37.5 ns     18310441
escape_32       0.773 ns        0.772 ns    901474201
$ g++ -O2 bug.cc -lbenchmark
$ ./a.out
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
flush_64         5.94 ns         5.94 ns    101705109
flush_32         6.12 ns         6.11 ns    115467957
escape_32        4.65 ns         4.64 ns    152872058
$ clang++ -O3 bug.cc -lbenchmark
$ ./a.out
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
flush_64         5.24 ns         5.24 ns     99614101
flush_32         5.34 ns         5.34 ns    130798955
escape_32        5.35 ns         5.35 ns    129643413
$ clang++ -O2 bug.cc -lbenchmark
$ ./a.out
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
flush_64         5.33 ns         5.24 ns     98349666
flush_32         5.38 ns         5.27 ns    132894039
escape_32        5.44 ns         5.44 ns    129066762
flush_32 under gcc -O3 is the clear outlier: roughly 37 ns, versus about 5 to 6 ns for flush_32 in every other compiler configuration.
Is there a way to fix this problem without manually introducing another non-portable function to escape the optimization?
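For comparison, the escape workaround above passes value as an input-only operand, so the compiler is told the value is read but not that it may change. Below is a minimal sketch of a read-write variant next to it. The name escape_rw is hypothetical, the constraints are GCC/Clang extended asm (so this is just as non-portable as escape), and it is untested against the gcc -O3 case above.

```cpp
// Input-only form, as in the escape function above: value must be
// materialized for the (empty) asm, and the "memory" clobber tells the
// compiler that memory may be read or written.
template <class Tp>
inline void escape(Tp& value)
{
    asm volatile("" : : "g"(value) : "memory");
}

// Hypothetical read-write variant: "+m" keeps value in memory and tells the
// compiler the (empty) asm may have modified it, so later reads of value
// cannot be assumed to see the old contents.
template <class Tp>
inline void escape_rw(Tp& value)
{
    asm volatile("" : "+m"(value) : : "memory");
}
```

Neither function emits any instructions of its own; the only difference is what the compiler is allowed to assume about value after the call.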
System
Which OS, compiler, and compiler version are you using:
- OS: Linux 5.12.12
- Compiler and version: g++ 11.1.0, clang++ 12.0.0
Expected behavior
DoNotOptimize works predictably on ternary conditionals.