Fixed-length multiprecision arithmetic

For my new attempt at making a really fast exact Boolean, based on the recent Ember paper, I need very fast 256-bit fixed point signed integer addition, subtraction, multiplication, dot products of 4d vectors, and determinants of 4d matrices, preferably with specializations when the operands known smaller bit lengths (32, 64, 96). I want an implementation that doesn’t require dynamic allocation.

I have several questions:

  1. Any opinion on how to put this into blenlib – should I try to fit it harmoniously into the current blenlib C++ MathBase family or just do something totally different? Should I even put it in Blenlib or just keep it private to my own boolean implementation files?
  2. An opinion on whether to do this from scratch or try to import/use/adapt some existing open source approach?
  3. Does anyone have a hankering for implementing/benchmarking such a library for me or helping me? I am planning on doing this myself but it is a bit of a detour on the way towards the main show.

I’ve looked a little at what might be available as options for #2.

Boost

Blender has Boost 1.8 as a dependency (though some Blender developers don’t like depending on Boost). Boost 1.8 has multiprecision. There are these types defined in the namespace boost::multiprecision:

  • int128_t, int256_t, uint128_t, uint256_t

It is a little unclear what implementation method is used for these (the documentation talks about the possibility of a number of backends, including GMP).

This may be a good choice to start out with, since then I don’t have to code anything myself. However there is this Boost issue, which suggests the performance may not be the best. It points to this post by Arthur O’Dwyer.

Abseil

The post by Arthur just cited mentions that the Abseil implementation of uint128_t seems high quality. Blender does not now have a dependency on Abseil, but could be added if really needed. Else, perhaps it could be used for inspiration. Unfortunately there is no 256 bit type though.

GMP

Blender already has this as a dependency, so it is worth a look. The mpn_ low level functions could avoid the use of memory allocation. However, stuff on the web seems to indicate that GMP is slow for fixed-width multiprecision. (However that may be about the higher level stuff, which does allocation, rather than something built on the low level functions.)

CTBignum

The 2022 Nick Bouman paper Multiprecision Arithmetic for Cryptology in C++ states that his code beats the hand-optimized assembly-aided code of GMP by using C++ argument passing that lets the optimizer know how big the arrays are, among other things. The current code is for C++20, which is not yet required by Blender, but the archived earlier versions work with C++17.

Build our own

The Ember paper refers to building their own, using _addcarry_u64 and _mul_u64 intrinsics. We could do the same. By carefully looking at the compiler output using tools like Compiler Explorer, we could avoid assembly language and rely on compiler optimizations. It seems that with Blender’s hardware and compiler constraints, we can count on the compilers having a __uint128_t that might be useful as a building block. This Stack Overflow answer has some useful comments on the subject. This one has an explanation of the different intrinsics for different compilers (sigh).

9 Likes

While I have no horse in this race math wise, and somewhat strange since I have two hats to wear here, I’ll leave some comments from a platform maintainers perspective that may influence your decisions followed by my personal views on things.

Boost

Boost like GMP is an optional dep, so you can’t count on it being there, furthermore due to it being in the VFX Platform we are somewhat limited version wise, for the rest of this year it will be boost 1.80 no exceptions can be made.

Abseil

We currently do not ship it but, doesn’t look super hard to build, I would have have no objections to adding it as a dep but admins/core would have final say here.

:heavy_plus_sign: has compatible apache 2 license
:heavy_minus_sign: No semantic versioning, expects us to live at head which isn’t great LTS wise

GMP

We already have it as an optional dep, nothing to add to this one.

CTBignum

The same way the VFX Platform hold us on boost 1.80, it also keeps us at C++17 for at least the rest of the year, the draft for 2024 is expected soon, but I’m not sure they’ll touch the C++ standard probably best not to count on it.

Build our own

No work for the platform team, go at it :slight_smile:


on a more personal note mostly based on your findings

You mentioned Boost/GMP have (possible) perf issues, given one of the reasons you picked this paper to implement were its performance characteristics it seems not ideal to put a library with known issues in that department at the base of it.

Abseil is actively maintained, shame about the versioning, but this is my favorite of the bunch though as many big projects like tensorflow depend on it, I didn’t check too closely may have cuda support?

CTBignum is sadly out, the choice here is between shipping an old version with known vulnerabilities/bugs/missing functionality or a c++ standard we can’t use. I can’t say i like either of the choices there, which is sad since it appears to be a very nice library.

Build your own , I certainly won’t stop you, but it doesn’t feel this is a great way to spend your already scarce time. I’d have this a last resort option only.

3 Likes

This topic is relatively new to me so take this with a grain of salt.

  • Regarding Boost and GMP I have nothing to add on top of what @LazyDodo said. Currently, I wouldn’t want to bet on those libraries.
  • Abseil generally seems like a nice library and I certainly wouldn’t be opposed to using it. There are likely other things in there that we could use in the future. That said, it seems like it doesn’t actually solve your use case because it doesn’t have a 256 bit integer type.
  • CTBignum looks like an interesting library. Besides the fact that it is C++20 I have two other concerns:
    • It’s focussed on cryptology which can have different requirements. For example, they explicitly mention constant-time functions which might be at odds with your preference for something that can also be optimized when fewer bits are actually used.
    • It does not seem to have specialized code for 256 bit ints. It seems likely that it is possible to get better performance by having hand-optimized code for this particular case.

Sadly that rules out all the libraries you mentioned so far. Searching for more I found e.g. this library. No idea what it performance characteristics are. It seems simple enough to get to compile in Blender and is MIT licensed. I think it would be a good starting point, so that you can get right into the main part of the implementation.

That said, if we don’t find a library that fits our needs perfectly, it does not seem tooo difficult to implement something ourselves. The scope of the required functions seems fairly manageable (although maybe I have a too simplistic view on this). I agree with @LazyDodo though in that it probably shouldn’t be you who implements that, because it just takes away your time from the more important parts. I wouldn’t mind spending some time on this in the next months, if that helps. That said, it would be useful to just get something working using some simple library like the one linked above before that.

2 Likes

Thanks for the useful information from a platform perspective, LazyDodo. Just to follow up with a couple of things we talked about in chat.

  • Abseil doesn’t have a 256 type, so isn’t really an option now. I mentioned in mainly as something whose implementation could be examined for inspiration, since it was praised as being high quality.

  • There are other github projects that implement 256-bit unsigned arithmetic (usually with cryptography applications in mind), such as GitHub - calccrypto/uint256_t: C++ unsigned 256 bit integer type and GitHub - piggypiggy/fp256: An efficient library for 256 bit integer arithmetic and probably others. It would be interesting to benchmark these, but I need signed arithmetic, so would have to store an extra sign bit somewhere and put sign-testing logic around the basic operations if I were to use these, all of which sounds non-optimal and space-wasting, so would rather not use these solutions out of the box. (I think the same is true for CTBignum.)

[Edit: Jacques said some of the same things in his previous post, which I didn’t see before posting mine. Thanks for you comments Jacques. I agree that for now I will use something simple and off-the-shelf so I can concentrate on the main algorithm, while intending to eventually replace it with something better later.

1 Like

If we have our own implementation, that doesn’t mean we have to implement it from scratch. We can still adapt an existing implementation, or add a patched library in extern/. So it doesn’t have to take a lot of time.

Having a small header only library as an external dependency also doesn’t necessarily make anyone’s life easier. Something like CTBignum I’d be inclined to put in extern/ regardless if we need to make changes to it.

2 Likes