Conversation

@mosra mosra commented Apr 21, 2020

Original article: https://pharr.org/matt/blog/2019/11/03/difference-of-floats.html
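
For context, a minimal sketch of the trick from that article -- the rounding error of one product is recovered with an FMA and added back in. The names differenceOfProducts() and cross2() here are only illustrative, not necessarily what the commits in this PR use:

```cpp
#include <cmath>

/* a*b - c*d, with the rounding error of the c*d product recovered via FMA
   and added back, giving a result within about 1.5 ulp of the exact value */
inline float differenceOfProducts(float a, float b, float c, float d) {
    float cd = c*d;
    float err = std::fma(-c, d, cd); /* exact rounding error of c*d */
    float dop = std::fma(a, b, -cd); /* a*b - c*d with a single rounding */
    return dop + err;
}

/* A 2D cross product is a single difference of products; the 3D case
   applies the same helper once per component */
inline float cross2(float ax, float ay, float bx, float by) {
    return differenceOfProducts(ax, by, ay, bx);
}
```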

While this makes 32-bit float cross product precision basically equivalent to a 64-bit calculation cast back to 32-bit, its speed stays roughly halfway between the straightforward 32-bit and 64-bit implementations. Benchmark on Release:

Starting Magnum::Math::Test::VectorBenchmark with 9 test cases...
 BENCH [2]   0.98 ± 0.05   ns cross2Baseline<Float>()@24999x100000 (wall time)
 BENCH [3]   3.44 ± 0.11   ns cross2Baseline<Double>()@24999x100000 (wall time)
 BENCH [4]   1.97 ± 0.08   ns cross2()@24999x100000 (wall time)
 BENCH [5]   2.22 ± 0.11   ns cross3Baseline<Float>()@24999x100000 (wall time)
 BENCH [6]   4.69 ± 0.22   ns cross3Baseline<Double>()@24999x100000 (wall time)
 BENCH [7]   3.32 ± 0.15   ns cross3()@24999x100000 (wall time)
Finished Magnum::Math::Test::VectorBenchmark with 0 errors out of 450000 checks.

However, this holds only on platforms that actually have an FMA instruction. On Emscripten, for example, the code is ten times slower than the baseline implementation, which is not an acceptable tradeoff -- there, simply using doubles to calculate the result is faster. And enabling the more precise variant only on some platforms doesn't seem like a good idea for portability. For the record, benchmark output on Chrome (node.js in the terminal gives similar results):

Starting Magnum::Math::Test::VectorBenchmark with 7 test cases...
 BENCH [2]   2.53 ± 0.34   ns cross2Baseline<Float>()@499x100000 (wall time)
 BENCH [3]   5.18 ± 1.30   ns cross2Baseline<Double>()@499x100000 (wall time)
 BENCH [4]   6.22 ± 0.46   ns cross2()@499x100000 (wall time)
 BENCH [5]   2.73 ± 0.35   ns cross3Baseline<Float>()@499x100000 (wall time)
 BENCH [6]   5.94 ± 0.61   ns cross3Baseline<Double>()@499x100000 (wall time)
 BENCH [7]  28.77 ± 2.40   ns cross3()@499x100000 (wall time)
Finished Magnum::Math::Test::VectorBenchmark with 0 errors out of 7000 checks.
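
The per-platform option mentioned above would boil down to something like the sketch below. This is purely hypothetical -- FP_FAST_FMAF from <cmath> and the compiler-specific __FMA__ define are only rough proxies for "hardware FMA is actually fast here", which is part of why gating on them doesn't look attractive:

```cpp
#include <cmath>

float cross2(float ax, float ay, float bx, float by) {
    #if defined(FP_FAST_FMAF) || defined(__FMA__)
    /* Hardware FMA available: compensated single-precision version */
    float cd = ay*bx;
    float err = std::fma(-ay, bx, cd);
    return std::fma(ax, by, -cd) + err;
    #else
    /* No fast FMA (e.g. Emscripten): computing in doubles and rounding
       back is both faster and about as precise */
    return float(double(ax)*double(by) - double(ay)*double(bx));
    #endif
}
```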

Stashing this aside until I'm clearer on what to do with this. Things to keep an eye on:

mosra added 4 commits April 21, 2020 22:02
Have to do some precision improvements, so a baseline is needed. The
debug perf is beyond awful, actually.
And the Vector3 version is 5% slower in Release, on GCC at least. FFS,
what was I thinking with the gather() things. Nice in user code,
extremely bad in library code.
@mosra mosra mentioned this pull request May 9, 2020