Proposal: C++ SIMD Wrappers

Currently the C++ standard library has an experimental extension for wrapping SIMD; it is part of ISO/IEC TS 19570:2018, Section 9 “Data-Parallel Types”. The wrapper allows using vectorization independently of the platform being optimized for (SSE/AVX/AVX-512). It uses C++ templates to do this. For an introduction, please check
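To give an impression of the TS API, here is a minimal sketch. This assumes a toolchain that ships `<experimental/simd>` (recent libstdc++ does); the header location and the width of `native_simd` vary per implementation.

```cpp
#include <cassert>
#include <experimental/simd>

namespace stdx = std::experimental;

// Scales every lane of a platform-native SIMD register at once; the
// scalar `s` is implicitly broadcast to all lanes.
inline stdx::native_simd<float> scale(stdx::native_simd<float> v, float s)
{
  return v * s;
}

// Horizontal add across all lanes.
inline float sum_lanes(stdx::native_simd<float> v)
{
  return stdx::reduce(v);
}
```

The same source compiles to SSE, AVX or NEON registers depending on the target, which is exactly the platform independence mentioned above.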

Having access to a library that already implements this standard would benefit us, as it makes it easier to optimize parts of our code base. Currently we have several places where we implement a section twice (with and without SSE support). With these wrappers we would still need two implementations (non-SIMD and SIMD), but the SIMD versions would be more readable and support more hardware platforms than SSE alone. There are also a few libraries that implement this ISO standard.

My proposal is to add a BLI_simd.hh that wraps one of the available implementations. When the standard implementation becomes available in the future, we could simply migrate by including the right SIMD header.
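Roughly what such a shim could look like. Everything here is a hypothetical sketch: the macro name, namespace and scalar fallback are invented for illustration, not a concrete design.

```cpp
#include <cassert>

#if defined(BLI_SIMD_USE_STD) /* hypothetical: once std::experimental::simd is usable */
#  include <experimental/simd>
namespace blender {
using float4 = std::experimental::fixed_size_simd<float, 4>;
}
#else
/* Plain scalar fallback so every platform still compiles against the same API. */
namespace blender {
struct float4 {
  float v[4];

  friend float4 operator+(const float4 &a, const float4 &b)
  {
    return {a.v[0] + b.v[0], a.v[1] + b.v[1], a.v[2] + b.v[2], a.v[3] + b.v[3]};
  }
};
}
#endif
```

Callers would only ever see `blender::float4`, so swapping the backing implementation later would not touch call sites.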

Side note: SIMD isn’t a magic bullet. It needs to go hand in hand with data locality for best performance. Achieving good data locality can be difficult in certain areas of Blender, and there SIMD would not add any benefit.

An area where I would like to use it is optimizing modifiers. What is lacking in this proposal is the selection of the library to use; that will be done in a follow-up proposal.

I think it’s fine to have something like this. But you need something a bit higher level built on top of this to write readable code. I think the Embree code is a good example of such an abstraction.

One part is to ensure the new float3 and float4 types use SIMD operations under the hood, to take advantage of 4-wide SIMD. I believe auto-vectorization will already kick in for many cases here, also for our existing C math library.

The other would be to support structure-of-arrays memory layouts, where you can have functions templated to run with 1-, 4-, 8- or 16-wide SIMD. For this you would have bool/int/float/float3 classes with a template parameter that determines the SIMD width.
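A minimal sketch of that width-templated idea (names are invented; the scalar loops stand in for real intrinsics that per-width specializations would provide):

```cpp
#include <cassert>

// Generic N-wide float pack; specializations for N = 4/8/16 would wrap
// the matching SSE/AVX/AVX-512 types instead of a plain array.
template<int N> struct floatN {
  float v[N];
};

template<int N> floatN<N> operator+(const floatN<N> &a, const floatN<N> &b)
{
  floatN<N> r;
  for (int i = 0; i < N; i++) {
    r.v[i] = a.v[i] + b.v[i];
  }
  return r;
}

// An algorithm is written once and instantiated at any supported width.
template<int N> floatN<N> average(const floatN<N> &a, const floatN<N> &b)
{
  floatN<N> sum = a + b;
  for (int i = 0; i < N; i++) {
    sum.v[i] *= 0.5f;
  }
  return sum;
}
```

`average<1>`, `average<4>` and `average<8>` would all come from the same function body, which is the point of templating on the width.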

Having a C++ simd library in Blender would be great.

I was also playing with this idea in the past, and lately again after reading this. It prompted me to add SIMDe to Compiler Explorer to make some experiments easier.

I’m not sure what the right approach is. Lately, I was thinking that it would be good to build a C++ library on top of SIMDe, but I’m not sure. There seem to be two main approaches to implementing such a library:

  1. Provide C++ wrappers for specific CPU architectures. This is what is done in Cycles with the ssei, avxf, etc. types.
  2. Provide C++ wrappers for generic vector types such as packed floats. This seems to be the approach of the experimental simd library linked to by Jeroen.

Having generic vector types is nice and might be the way to go for Blender. However, there will always be the need to use architecture-specific intrinsics when really optimizing something… In most other cases, auto-vectorization might be good enough already.

I experimented with one approach to writing a C++ simd library last year. I had types like float_v<N> and int32_v<N> with specializations for different N. I’m still not sure if that approach was good.

@brecht, I’m not familiar with Embree, can you give me a link to a file that shows how its simd abstraction works?

I think it is important that float3 in BLI_float3.hh is really just struct float3 { float x, y, z; }. It should not have special alignment requirements, and sizeof(float3) should remain 12. Without these requirements, it would be much harder to use this type when interacting with other Blender data. When I have an array of float3, I expect that there are really only 3 floats per element, and not 4. I wrote a bit more about this here.
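These layout constraints can be stated as compile-time checks, so any accidental SIMD over-alignment or hidden padding would break the build rather than silently change the memory layout:

```cpp
#include <cassert>
#include <cstddef>

struct float3 {
  float x, y, z;
};

// Exactly three floats, no hidden fourth component.
static_assert(sizeof(float3) == 12, "float3 must stay 12 bytes");
// No special alignment beyond that of a plain float.
static_assert(alignof(float3) == alignof(float), "no SIMD over-alignment");
// Arrays stay tightly packed, so they can alias raw float buffers.
static_assert(sizeof(float3[2]) == 2 * 12, "arrays must stay tightly packed");
```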

That sounds good.

See for example this function:

The types ending with x have arbitrary width that is determined by a template. So there is vfloatx, vboolx, Vec3vfx, etc. The naming of all this is a bit confusing in that you have to understand what the various prefixes and suffixes mean.

But basically you want to be able to write code that mostly looks like non-SIMD code, with the exception of some utility functions like any, all, reduce and select.

The way this works in Embree is that you have a Vec3&lt;T&gt;, where T can be just a float but also 4, 8 or 16 floats in a SIMD type. The float3 equivalent is then a typedef for Vec3&lt;float&gt;.
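A condensed sketch of that pattern (simplified for illustration, not Embree’s actual code): the same `Vec3<T>` works for a scalar `T` and for a SIMD pack `T`, so one function body can process one point or four points at a time.

```cpp
#include <cassert>

// Generic 3D vector over any element type.
template<typename T> struct Vec3 {
  T x, y, z;
};

template<typename T> Vec3<T> operator+(const Vec3<T> &a, const Vec3<T> &b)
{
  return {a.x + b.x, a.y + b.y, a.z + b.z};
}

// Stand-in for a real 4-wide SIMD float type (like Embree's vfloat4).
struct float4 {
  float v[4];

  friend float4 operator+(const float4 &a, const float4 &b)
  {
    float4 r;
    for (int i = 0; i < 4; i++) {
      r.v[i] = a.v[i] + b.v[i];
    }
    return r;
  }
};

using Vec3f = Vec3<float>;    // the float3 equivalent
using Vec3f4 = Vec3<float4>;  // four 3D points processed at once
```

The utility functions mentioned earlier (any, all, reduce, select) would similarly be overloaded for both the scalar and the pack types, so the calling code reads the same either way.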

Thanks. I’ll read a bit more of the Embree code. At a first glance, I like it as well.

Yay for suggesting a wrapper rather than doing yet another implementation of our own for “reasons”. I’d like to toss OIIO’s simd.h into the hat for consideration: it already covers the bool/int/float types in 4-, 8- and 16-wide sizes, for the no-SSE, SSE…AVX and NEON instruction sets, and we already have it available in the libs we ship in SVN. However…

This does open up a rather different and less pleasant issue: availability of the instructions. Except for Cycles, which gets away with it since it completely isolates the code paths for different architectures in a single compilation unit with a single entry point, the codebase as a whole still targets SSE2, meaning all the goodies of SSE3/4/AVX/AVX2/AVX-512 are simply unavailable to us.

Mixing SSE2 and AVX/AVX2 code (even if you check the CPU flags at runtime) can and will run amok in all kinds of subtle ways, and if you think “we’re smart! that’ll never bite us!”: T55054

We’d need a math library that selects the optimal implementation at runtime. Can this be done? Absolutely! But designing such a thing is well beyond “let’s have a wrapper” (don’t get me wrong, we’d still need a wrapper to efficiently implement it, but it is by no means the biggest issue there is to solve here).
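One common shape for that runtime-selection problem, sketched here as an illustration rather than a concrete proposal: each instruction-set variant lives in its own compilation unit built with its own compiler flags, and a function pointer is bound once based on a CPU check.

```cpp
#include <cassert>

// In a real setup these would live in separate files, each compiled with
// the matching -m flags; here both bodies are identical stand-ins.
static void add_arrays_scalar(const float *a, const float *b, float *r, int n)
{
  for (int i = 0; i < n; i++) {
    r[i] = a[i] + b[i];
  }
}

static void add_arrays_avx2(const float *a, const float *b, float *r, int n)
{
  /* Stand-in; the real unit would be compiled with -mavx2. */
  for (int i = 0; i < n; i++) {
    r[i] = a[i] + b[i];
  }
}

using AddFn = void (*)(const float *, const float *, float *, int);

static AddFn pick_add_impl()
{
#if defined(__GNUC__) && defined(__x86_64__)
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) {
    return add_arrays_avx2;
  }
#endif
  return add_arrays_scalar;
}

// Bound once at startup; every caller goes through the pointer.
static const AddFn add_arrays = pick_add_impl();
```

This avoids ever executing AVX instructions on a CPU that lacks them, which is the class of bug T55054 is about; the hard part the post alludes to is doing this systematically across a whole math library, not in one function.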


Note: This might be a naive suggestion due to my limited knowledge in this area.
Have you considered using Enoki?

I see benefits in having both a low-level template and a higher-level math library. I would say the one requirement would be testability.

@EAW Thanks for the input! An issue I see with Enoki is that it pushes the application to keep its data structures localized (structure of arrays). This is normally a good thing, IMO, as it improves CPU cache utilization; for Blender, however, it could mean two additional transformations (which cost performance) or a huge refactoring before we can actually use it. Structure-of-arrays layouts are typical in high-performance systems (and modern game engines), but they are hard to retrofit into systems where the data is structured in a more domain-specific manner (like Blender is).
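To make the transformation cost concrete (illustrative code only): Blender-style array-of-structs data would have to be shuffled into struct-of-arrays form before a SoA-oriented library could run over it, and shuffled back afterwards. Those two passes are the extra transformations mentioned above.

```cpp
#include <cassert>
#include <vector>

// AoS: how vertex-like data is typically laid out in Blender.
struct float3 {
  float x, y, z;
};

// SoA: what wide SIMD and SoA-oriented libraries want to consume.
struct Points {
  std::vector<float> x, y, z;
};

// Extra pass #1: AoS -> SoA before the SIMD kernel can run.
// (Pass #2 would convert the results back.)
Points to_soa(const std::vector<float3> &verts)
{
  Points p;
  for (const float3 &v : verts) {
    p.x.push_back(v.x);
    p.y.push_back(v.y);
    p.z.push_back(v.z);
  }
  return p;
}
```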

The OpenImageIO implementation follows the ISO standard to a certain degree. The main difference is that OIIO uses more specific types (no generic pack template), so IMO we should also consider it for the low-level API.

For the more generic API: I can understand that in a specific domain this is possible, but looking at some parts of Blender’s code, the domain isn’t that specific. MOD_meshdeform, for example, does a lot of madds that no compiler vectorizes. Having a low-level API would still be useful in these situations. Of course we could still add a high-level interface, but with just one specific user :slight_smile:
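The kind of inner loop meant here looks roughly like the following (an illustrative stand-in, not the actual MOD_meshdeform code): a chain of multiply-adds over gathered values. The indirect indexing makes the access pattern opaque to the auto-vectorizer, which is where hand-applied low-level SIMD would come in.

```cpp
#include <cassert>

// madd chain with a gather: sum += weight * values[indices[i]].
// The indirection through indices[] is what typically defeats
// auto-vectorization of loops like this.
float weighted_sum(const float *values, const int *indices,
                   const float *weights, int n)
{
  float sum = 0.0f;
  for (int i = 0; i < n; i++) {
    sum += weights[i] * values[indices[i]];
  }
  return sum;
}
```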


For what it’s worth, ispc supports x86 SIMD and Arm NEON, and we already include it in our dependencies. This could be an interesting option as well, and could scale a single source file from scalar code to 16-wide AVX-512.