For my new attempt at making a really fast exact Boolean, based on the recent Ember paper, I need very fast 256-bit fixed point signed integer addition, subtraction, multiplication, dot products of 4d vectors, and determinants of 4d matrices, preferably with specializations when the operands known smaller bit lengths (32, 64, 96). I want an implementation that doesn’t require dynamic allocation.
I have several questions:
- Any opinion on how to put this into blenlib – should I try to fit it harmoniously into the current blenlib C++ MathBase family or just do something totally different? Should I even put it in Blenlib or just keep it private to my own boolean implementation files?
- An opinion on whether to do this from scratch or try to import/use/adapt some existing open source approach?
- Does anyone have a hankering for implementing/benchmarking such a library for me or helping me? I am planning on doing this myself but it is a bit of a detour on the way towards the main show.
I’ve looked a little at what might be available as options for #2.
Boost
Blender has Boost 1.8 as a dependency (though some Blender developers don’t like depending on Boost). Boost 1.8 has multiprecision. There are these types defined in the namespace boost::multiprecision:
- int128_t, int256_t, uint128_t, uint256_t
It is a little unclear what implementation method is used for these (the documentation talks about the possibility of a number of backends, including GMP).
This may be a good choice to start out with, since then I don’t have to code anything myself. However there is this Boost issue, which suggests the performance may not be the best. It points to this post by Arthur O’Dwyer.
Abseil
The post by Arthur just cited mentions that the Abseil implementation of uint128_t seems high quality. Blender does not now have a dependency on Abseil, but could be added if really needed. Else, perhaps it could be used for inspiration. Unfortunately there is no 256 bit type though.
GMP
Blender already has this as a dependency, so it is worth a look. The mpn_ low level functions could avoid the use of memory allocation. However, stuff on the web seems to indicate that GMP is slow for fixed-width multiprecision. (However that may be about the higher level stuff, which does allocation, rather than something built on the low level functions.)
CTBignum
The 2022 Nick Bouman paper Multiprecision Arithmetic for Cryptology in C++ states that his code beats the hand-optimized assembly-aided code of GMP by using C++ argument passing that lets the optimizer know how big the arrays are, among other things. The current code is for C++20, which is not yet required by Blender, but the archived earlier versions work with C++17.
Build our own
The Ember paper refers to building their own, using _addcarry_u64 and _mul_u64 intrinsics. We could do the same. By carefully looking at the compiler output using tools like Compiler Explorer, we could avoid assembly language and rely on compiler optimizations. It seems that with Blender’s hardware and compiler constraints, we can count on the compilers having a __uint128_t that might be useful as a building block. This Stack Overflow answer has some useful comments on the subject. This one has an explanation of the different intrinsics for different compilers (sigh).