Proposal: CPP SIMD Wrappers

JeroenBakker · July 13, 2020, 8:21am

Currently in stdc there is an experimental extension for wrapping SIMD (std::experimental::simd - cppreference.com). It is part of the ISO/IEC TS 19570:2018 Section 9 “Data-Parallel Types”. The wrapper allows to use vectorizations independent of the platform it can be used to optimize for (SSE/AVX/AVX512). It uses CPP templates to do this. For an introduction please check https://www.youtube.com/watch?v=8khWb-Bhhvs

Having access to a library already implementing this standard would benefit us as it makes it easier to optimize parts of our code base. Currently we have several places where we implement a section twice (non SSE and SSE supported). Having these wrappers we would still need to implement these twice (non SIMD and SIMD), but the SIMD versions would be more readable and support more hardware platforms than only SSE. There are also a few implementations that implement the ISO Standard.

My proposal is to add a BLI_simd.hh that wraps one of the available implementations. When it becomes available in the future we could simply migrate over by including the right SIMD.

Side note: SIMD isn’t a magic bullet. It needs to go hand in hand with address locality for best performance. Striving for Address locality can be difficult in certain areas of blender and SIMD would not add any benefits.

Areas where I would like to use it is to optimize the modifier. What is lacking in this proposal is the selection of the library to use, but that will be done in a follow-up proposal.

brecht · July 13, 2020, 8:45am

I think it’s fine to have something like this. But you need something a bit higher level built on top of this to write readable code. I think the Embree code is a good example of such an abstraction.

One part is to ensure the new float3 and float4 types use SIMD operations under the hood, to take advantage of 4 wide SIMD. I believe auto-vectorization will already happen automatically in many cases here, also for our existing C math library.

The other would be to support structure of array memory layouts, where you can have functions templated to run with 1, 4, 8 or 16 wide SIMD. For this you would have bool/int/float/float3 classes with a template parameter to determine the SIMD width.

jacqueslucke · July 13, 2020, 10:26am

Having a C++ simd library in Blender would be great.

I was also playing with this idea in the past and lately again after reading this. It triggered me to add SIMDe to compiler explorer to make some experiments easier.

I’m not sure what the right approach is. Lately, I was thinking that it would be good to build a C++ library on top of SIMDe, but I’m not sure. There seem to be two main approaches to implementing such a library:

Provide C++ wrappers for specific cpu architectures. This is what is done in Cycles with the seei, avxf, etc. types.
Provide C++ wrappers for generic vector types such as packed floats. This seems to be the approach of the experimental simd library linked to by Jeroen.

Having generic vector types is nice and might be the way to go for Blender. However, there will always be the need to use architecture specific intrinsics when really optimizing something… In most other cases, auto-vectorization might be good enough already.

I experimented with one approach to writing a C++ simd library last year. I had types like float_v<N> and int32_v<N> with specializations for different N. I’m still not sure if that approach was good.

@brecht, I’m not familiar with Embree, can you give me a link to a file that shows how its simd abstraction works?

I think it is important that float3 in BLI_float3.hh is really just struct float3 { float x, y, z; }. It should not have special alignment requirements. Also sizeof(float3) should remain 12. Without these requirements, it would be much harder to use this type when interacting with other Blender data. When I have an array of float3, I expect that there are really only 3 floats, and not 4 per element. I wrote a bit more about this here.

That sounds good.

brecht · July 13, 2020, 11:26am

See for example this function:

github.com

RenderKit/embree/blob/master/kernels/geometry/curve_intersector_ribbon.h#L68


      
          {
            const vfloat<M> num = det(p2-p1,p1-p0);
            const vfloat<M> den2 = dot(p2-p1,p2-p1);
            return std::make_pair(num*num,den2);
          }
          
          /* performs culling against a cylinder */
           template<int M>
           __forceinline vbool<M> cylinder_culling_test(const Vec2vf<M>& p0, const Vec2vf<M>& p1, const Vec2vf<M>& p2, const vfloat<M>& r)
          {
            const std::pair<vfloat<M>,vfloat<M>> d = sqr_point_line_distance<M>(p0,p1,p2);
            return d.first <= r*r*d.second;
          }
          
          template<int M = VSIZEX, typename NativeCurve3ff, typename Epilog>
          __forceinline bool intersect_ribbon(const Vec3fa& ray_org, const Vec3fa& ray_dir, const float ray_tnear, const float& ray_tfar,
                                              const LinearSpace3fa& ray_space, const float& depth_scale,
                                              const NativeCurve3ff& curve3D, const int N,
                                              const Epilog& epilog)
          {
            /* transform control points into ray space */

The types ending with x have arbitrary width that is determined by a template. So there is vfloatx, vboolx, Vec3vfx, etc. The naming of all this is a bit confusing in that you have to understand what the various prefixes and postfixes mean.

But basically you want to be able to write code that mostly looks like non-SIMD code, with the exception of some utility functions like any, all, reduce and select.

The way this works in Embree is that you have a Vec3<T>, where T can be just a float but also 4, 8 or 16 floats in an SIMD type. The float3 equivalent is then typedef Vec3<float>.

jacqueslucke · July 13, 2020, 11:54am

Thanks. I’ll read a bit more of the Embree code. At a first glance, I like it as well.

LazyDodo · July 13, 2020, 1:14pm

Yay for suggesting a wrapper rather than doing yet another implementation on our own for “reasons” , I’d like to toss OIIO’s simd.h into the hat for consideration, it already covers the bool/int/float types in 4,8 and 16 sizes, for no sse,sse…avx and neon instruction set and we have it available in the libs we ship in SVN. however…

This does open up a rather different and less pleasant issue: availability of the instructions. Except for cycles that gets away with it since it completely isolates the codepaths for different architectures in a single compilation unit with a single entry point, the codebase as a whole still targets SSE2 meaning all the goodies of sse3/4/avx/avx2/avx512 are just unavailable to us.

Mixing SSE2 and AVX/2 code (even if you would check the cpu flags at runtime) ~~can~~ will run amok in all kinds of subtle ways and if you think “we’re smart! that’ll never bite us!” T55054

We’d need a math library that would at runtime select the optimal implementation, Can this be done? absolutely! designing such a thing is well beyond “lets have a wrapper” though (don’t get me wrong we’d still need a wrapper to efficiently implement it, but it is by no means the biggest issue there is to solve here)

EAW · July 13, 2020, 2:22pm

Note: This might be a naive suggestion due to my limited knowledge in this area.
Have you considered using Enoki?

JeroenBakker · July 13, 2020, 2:57pm

I see benefits to have both a low level template and a higher level math library. I would say the one requirement would be testability.

@EAW Thanks for the input! An issue that I see with enoki is that it pushes the application to have the data structures localized (structure of arrays). This is IMO normally a good thing as it improves CPU cache utilization, except for blender that could mean 2 additional transformations (that drops performance) or a huge refactoring before we can actually use it. Structure Of Arrays is typically done in high performance systems (and modern game engines), but it is hard to implement in systems where the data is structured in a more domain specific manner (like Blender is).

The OpenImageIO implementation follows the ISO standard to a certain degree. The main difference is that OIIO uses more specific types (no generic pack template), so IMO we should consider this one also for the low level API.

For the more generic API, I can understand that in a specific domain this is possible, looking at some parts of blender code the domain isn’t that specific. MOD_meshdeform for example does a lot of madd that no compiler vectorizes. Having a low level API would still be useful in these situations. Of coursewe could still add a high level interface, but with just one specific user

StefanW · July 13, 2020, 5:37pm

For what it’s worth, ispc supports x86 SIMD and Arm Neon and we already include it in our dependencies. This could be an interesting option as well and could scale a single source file from scalar code to 16 wide AVX512.

aras_p · November 6, 2023, 9:13am

I started to think about SIMD things, and hey ho, thread resurrection! As far as I can tell, there are several approaches:

“make my 3d/4d vectors and matrices use SIMD”. This one is easiest to plug into existing codebase, where data & code flow is not “throughput oriented” (no SoA, branchy code, etc.). Downside, is that expected speedup is pretty much never larger than 3x, and more often is around 2x for the SIMD’ified parts. Still better than nothing though. For this one, you could imagine having 4-way math related data types that easily map to SSE4/NEON (and a scalar fallback for other archs) and call it a day.
“have an N-wide SIMD abstraction, make code process N things at once”. This one can take advantage of e.g. AVX2 (8-wide), or AVX512 (16-wide). Usually requires a lot of changes to existing code, to work well. A whole lot of code inside Blender today is very much “operate on one thing at a time”, and making them operate on N things at once is easy in some parts, medium complexity in others, and very hard in yet others. Crucially, this approach does not try to map “a vector” or “a matrix” to SIMD! Instead of say trying to somehow make “deform a vertex by a bone” use SIMD (as approach 1 would do), it would instead try to do “deform N vertices by a bone” (or alternatively in this particular case, maybe “deform one vertex by N bones”). Whether N should be statically fixed at compile time based on target ISA, or explicitly specified by the code writer, or dynamically dispatched to multiple N versions, is an implementation choice. Any sort of code that is heavy in branches is usually hard.
Similar to 2, but use some “magic tooling” to make branchy code that processes N things at once easier to write. Most prominent approach is ISPC (Blender already has that as indirect dependency, due to OpenImageDenoise). Another, much less battle tested & supported, approach is like CppSPMD.

Today Blender uses no SIMD abstraction, with relatively little use of SSE2 intrinsics directly, that are “translated” to ARM NEON via sse2neon.h. That kinda works, but is a bit ugly to write and read. Most of SIMDified code I’ve seen is somewhere between style 1 and style 2 above.

My own experience is:

Style 1 above is a decent stepping stone to make some parts of existing code “a little bit faster”, but the potential upside is very limited. A lot of the times you’re working with say 3D vectors, so even in “theoretically 100% of code can be SIMD here” is at most 3x faster (and uses 75% of SSE/NEON compute capability; 37% of AVX2 capability).
For style 2 code, there are various existing SIMD libraries, it’s also fairly trivial to write your own that does exactly what you need. You have various styles ranging from “this is N-wide vector with N being CPU SIMD width” (e.g. in astc-encoder astcenc_vecmathlib.h), to “programmer specifies desired N” like highway or libsimdpp. My own preference has been to “N is just machine SIMD width” – if your code is already SIMD friendly, then most/all loops end up “process N items, process next N items, …, process remainder as scalar”, and the code does not really care that much whether N is 4, 8, 16 or just 1 for fallback.
I did have good experience using ISPC in the past, for anything involving more than a couple of branches it makes it much easier to write code for, since it does all the execution mask handling (very much like a GPU programming model really). It also makes it trivial to compile several versions of the code and dispatch the right one. Downside: it’s another language (very much like C, but not exactly C), and it’s an external tool that needs to plug into build system, etc. etc.

(oh, and “auto vectorization” pretty much never works, except in really, really trivial cases – but that requires code to be very SIMD friendly to begin with, at which point it’s super trivial to manually SIMD it already)

Howard_Trickey · November 6, 2023, 11:43am

If we wanted to make some of the mesh code more like your approach 2, what does this say about the current layout of the major data structures? Will they just work or do they have to be changed into things with better alignment? And if the latter, is it worth it to have temporary data structures with better alignment (but more space-wasting) to operate on, and then move back into Mesh data structures later? Do you have any proposed parts of Blender where you have been thinking of applying SIMD, and if so, what does that look like from a data structure point of view?

aras_p · November 6, 2023, 11:59am

I’ve only looked at several places, but my current takeaway is that the Mesh (maybe not BMesh) data is mostly fine – it’s arrays of attributes, nice and easy. But the parts that use them, also use a lot of other data, e.g. I was looking into armature deform code – and it’s branches upon branches, loosely fetching data from some “bone” structures, and making decisions “in the last minute, on one item at a time”. But again, this is highly dependent on exact places of code, blender has a lot of data structures.

In general, when designing for SIMD (or even hopes of auto-vectorization), for any data structure and code, first ask “can this be easily made to work on N things at once, instead of one thing at once?”. In some parts of blender today, the answer is “yeah, easily”, in some other parts it’s “uhh I’d have no idea where to even start lol”

HooglyBoogly · November 6, 2023, 9:39pm

Most of the recent mesh code is written to be compatible with that style of SIMD. At least we’ve been doing our best to write simple hot loops with clear inputs and outputs and minimal branching. Indeed, BMesh is basically the opposite and just doesn’t fit into this paradigm.

And multifunctions in geometry nodes are designed for that sort of SIMD as well. We’ve basically just been waiting on a choice of SIMD wrapper library or related decision) and (maybe?) some nicer way to switch implementations at runtime.

aras_p · December 10, 2023, 12:04pm

FWIF I just looked at using OpenImageIO simd. for some sequencer image processing effects. TL;DR I think using an embedded “out own, much smaller but also without any missing bits we need” SIMD library would be better. Sure you can supply improvements to OIIO SIMD support (which I did), but then you also get into versioning mess. More detail in a comment here #115892 - WIP: VSE: speedup Alpha Over blend with SIMD - blender - Blender Projects

Download

What's New

Blender Studio

Manual

Developers Blog

Documentation

Benchmark

Blender Conference

Development Fund

One-time Donations

Proposal: CPP SIMD Wrappers