Exploring SSE optimization of some bf_blenlib math routines

A few weeks ago I decided to undertake a fun hobby project, just for myself, to see what benefit, if any, using SSE in some of Blender’s math-centric code paths would yield. In short: yes, it’s clear that Blender is leaving some performance on the floor.

However, while the gains are great at the microbenchmark level, they tend not to yield any noticeable benefit in the full scenarios I profiled. This is both surprising and unsurprising, really. Unsurprising because Blender is “big software”, so of course it is doing much, much more than spinning in math-heavy code paths to the point where humans might notice. Still, it’s surprising just how much non-math work a 3d DCC is doing in the profiled scenarios too :slight_smile:

Additionally, without more invasive changes to Blender, performance would still remain lower than optimal due to the storage format used in the math-centric code paths (float[3]).

So I want to put this here to see what the Blender developers think of this effort and to help answer the following questions:
- Is converting the math routines that I outline in my benchmark to SSE interesting to do considering the results I present?
- Generally, what changes are allowed in DNA?

  • Would modifying some float[3] fields to be float[4] be allowed? (Disregard whether we want to do this; I just want to know if it’s possible in a backward-compatible way.)
  • Would adding a bmesh layer like CD_NORMAL_SSE as an experiment be less invasive than changing CD_NORMAL itself? Or would it be a similar amount of work? Would back-compat be better or worse here?

Full benchmark code (GPL 3), the motivating scenarios that were profiled, and the results are here: GitHub - jessey-git/blender_bench: A playground/benchmarking program for testing out alternative implementations of low level math routines within Blender

Interesting! I was wondering about this just a couple of weeks ago, having just seen a presentation on using SIMD and some C++ coding frameworks to make it easier to do so.

It does indeed feel like, for the most part, the performance gains you are getting for the routines get drowned out by the surrounding code in Blender, so a solution that requires using more memory is probably not worth it by itself. (I do wonder whether communication with GPUs might also be improved with other layouts for 3d coordinates - maybe for cache-line alignment reasons.) The cases where SSE variants of the routines are faster without needing to change the API are probably the easiest to argue that we should just take.

There is a “do_versions” mechanism in Blender that allows changing DNA by having the developer write fixup code that runs on load for previous-version files. So I think it would be possible to make the change you are talking about (though it would require quite a large amount of fixup, and one might hesitate to make such a drastic change for that reason). Besides changing DNA, you could also consider changing only the run-time internal data structures that DNA gets converted into - specifically, BMesh. Then you wouldn’t have to worry about the DNA change.
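
For anyone unfamiliar with the idea, here is a toy, self-contained sketch of the “fixup on load” principle (not actual Blender code; the struct names, field layout and version number are invented for the example):

```c
/* Toy illustration of a load-time fixup. The struct names, field layout and
 * version number are invented for this example and are not real Blender DNA. */
#include <string.h>

typedef struct OldVert { float co[3]; } OldVert;
typedef struct NewVert { float co[4]; } NewVert; /* padded for SIMD */

static void do_versions_verts(int file_version,
                              const OldVert *old_verts,
                              NewVert *new_verts,
                              int count)
{
  /* Only files saved before the (made-up) version that changed the layout
   * need the conversion. */
  if (file_version < 293) {
    for (int i = 0; i < count; i++) {
      memcpy(new_verts[i].co, old_verts[i].co, sizeof(float) * 3);
      new_verts[i].co[3] = 0.0f; /* initialize the new padding component */
    }
  }
}
```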

No doubt the most user-noticeable use of SIMD in Blender would be in CPU rendering, in Cycles. I haven’t looked at that code but suspect it already uses SSE sometimes.

Cycles indeed has specialized SSE2/SSE3/SSE4.1/AVX/AVX2 kernels already. However, great care must be taken with mixing code for different architectures; it is a lot less straightforward than you may think.

While Cycles has all of this contained in a single translation unit, which limits the problem somewhat, sometimes stray instructions for the wrong instruction set still get out.

As for whether it’s a good idea or not, that wholly depends on the real-world difference. It’s one thing to shave 30 ns off a function call, but how does that translate to real-world performance? Will users notice?

Also, the fact that sculpting can move verts around at high speed with no issues, while manually dragging one around in edit mode suddenly gives us perf issues, seems to indicate the problem can be better solved in an algorithmic way - and that we somehow already have, just not everywhere…

Yes, Cycles already loads the SSE kernel variants as necessary for CPU.

And yes, with the current usage of float[3] being as pervasive as it is, the marginal speedup gained in that configuration just doesn’t yield favorable end-to-end results when pitted against the rest of the work Blender needs to do. And then it’s difficult to ascertain whether the float[4] or SSE __m128 datatype configurations would yield actual gains without invasive Blender modification too…

Kinda stuck :slight_smile:

Yup, I tangentially hit a bit of this in my Summary section. Without a good end result, the bar would be quite high to make these changes, even the ones that are non-invasive. I listed some criteria I would use to gauge whether it would be a good change or not.

Algorithmic changes are a must here for sure. I was just curious whether some meaningful change could be spotted. Win some, lose some… at least I had fun and learned a bit along the way! Still, it’s hard to walk away from the 1.5x to 1.9x speed increase of, say, calculating normals.

Personally, I think padding 3d vectors with an additional fourth float is not a good idea in most cases. My main reason is that it encourages people to do a “wrong” kind of vectorization. By wrong I mean suboptimal and complex.


Here is a simplified view of why I think so (I originally wrote quite a bit more, but could not finish that yet).

To optimize performance, developers should focus on interleaving the processing of many elements, instead of trying to optimize the latency of processing a single element.

I think that padding 3d vectors with a fourth float encourages developers to do the opposite. It makes it look like the vector can be processed much faster now, when in reality it usually can’t. That is because most non-trivial algorithms do different operations on the x, y and z coordinates. A better approach is to always process multiple vectors at once, as in the sketch below. Also see this guide on optimizing normalization of many vectors.
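
To make this concrete, here is a minimal sketch (not Blender code) of what I mean by processing multiple vectors at once: four vectors are normalized per iteration from a struct-of-arrays layout, and one approximate reciprocal square root serves all four.

```c
/* Minimal sketch of processing four vectors per iteration from a
 * struct-of-arrays layout (x[], y[], z[] are separate arrays).
 * Assumes count is a multiple of 4 and skips zero-length handling. */
#include <xmmintrin.h>

static void normalize_vectors_soa(float *x, float *y, float *z, int count)
{
  for (int i = 0; i < count; i += 4) {
    __m128 vx = _mm_loadu_ps(x + i);
    __m128 vy = _mm_loadu_ps(y + i);
    __m128 vz = _mm_loadu_ps(z + i);

    /* len_sq = x*x + y*y + z*z for four vectors at once. */
    __m128 len_sq = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(vx, vx), _mm_mul_ps(vy, vy)), _mm_mul_ps(vz, vz));

    /* One (approximate) reciprocal square root covers all four vectors. */
    __m128 inv_len = _mm_rsqrt_ps(len_sq);

    _mm_storeu_ps(x + i, _mm_mul_ps(vx, inv_len));
    _mm_storeu_ps(y + i, _mm_mul_ps(vy, inv_len));
    _mm_storeu_ps(z + i, _mm_mul_ps(vz, inv_len));
  }
}
```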

Furthermore, in my opinion, functions that process 3d vectors but require 4d vectors as input (maybe even without telling the caller) have a bad contract and should be avoided.

Padding was a necessity about 10 years ago, when unaligned SSE loads came with a performance penalty. To my knowledge, no currently shipping CPU has that anymore. Still, inflating data structures so that they contain 25% unused ballast needs to be justified by a big performance increase, IMHO.

I think you are talking about aligned vs. unaligned reads/writes here; that problem has mostly gone away on modern architectures. However, you can load 4x 32-bit floats into a single 128-bit SSE register with a single instruction, while if you need to load 3x 32-bit floats into an SSE register you have to split the operation into several loads and combine them at the end. That is not without a performance penalty.
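
As a rough illustration (just a sketch; the exact instruction sequence the compiler emits will vary):

```c
#include <xmmintrin.h>

/* A padded float[4] comes into a register with a single load... */
static __m128 load_float4(const float v[4])
{
  return _mm_loadu_ps(v); /* one unaligned 128-bit load */
}

/* ...while an unpadded float[3] has to be assembled from its components
 * (compilers typically turn this into a couple of loads plus shuffles). */
static __m128 load_float3(const float v[3])
{
  return _mm_set_ps(0.0f, v[2], v[1], v[0]); /* builds {x, y, z, 0} */
}
```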

That being said, given that sculpting manages to work fluidly on a high-density mesh, I’m still convinced that algorithmic changes to edit mode are the preferred way to go here, rather than nitpicking at the micro level (which could probably still be done after the algorithmic changes if needed).

Thanks for the link and other things to mull over, Jacques. I especially like that Intel already did the exercise to show that AoS-to-SoA conversion at runtime is potentially worthwhile (and gave code too :) ). Yes, I did not expect miracles with SSE, especially with all the scenarios I profiled being extremely slow already. I was just curious really and wanted to explore the code base more. I modeled some of what I did on the existing Cycles SSE kernels (at least to double-check the math etc.)

The alignment comment is more about storing things natively in __m128 vs. float[4], rather than the difference between a loada and loadu really. While float[4]s don’t come with an alignment requirement, and loading those directly is indeed fast enough, __m128 would come with a strict 16-byte alignment requirement, which would necessitate better structure design and layout in some places. Good to know that ignoring the loada/loadu difference would probably be fine; float[4] would fit more naturally within existing Blender practices instead of spreading the SSE types around anyhow.
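
To illustrate with a made-up struct (not real DNA): a __m128 member forces 16-byte alignment and tail padding on the containing struct, while a plain float[4] does not.

```c
#include <xmmintrin.h>

/* Both structs are invented for illustration; they are not Blender DNA. */
typedef struct VertPlain {
  float co[4]; /* 4-byte alignment; fits existing struct layout practices */
  int flag;
} VertPlain; /* sizeof == 20 on typical compilers */

typedef struct VertSSE {
  __m128 co; /* forces 16-byte alignment of the whole struct */
  int flag;  /* followed by 12 bytes of tail padding */
} VertSSE; /* sizeof == 32 on typical compilers */
```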

In the meantime I’ll just keep my playground open to anyone wanting to play or develop but won’t send out patches. If proper algorithmic changes end up happening I’ll re-evaluate if the micro improvements show any promise again later.

No doubt there are cases where alignment and padding matter, but I would shy away from padding data structures “just because”. Inflating a core data structure by 25% (possibly reducing cache hit rate) before even having identified a bottleneck is - to me - premature optimization.

I’m not saying SIMD is useless - once you have identified a bottleneck, it can be quite beneficial. In my experience, normal calculations are often a bottleneck, and simply doing four sqrt() operations in parallel (while leaving everything else in scalar code) can already improve things noticeably.
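
As a minimal sketch of that approach (not actual Blender code; the helper and its calling convention are made up), the per-vector logic stays scalar and only the square roots are batched:

```c
/* Minimal sketch: the per-vector logic stays scalar, only the expensive
 * square roots are done four at a time. The function and its calling
 * convention are made up for this example. */
#include <xmmintrin.h>

static void normalize_batch_of_4(float (*vecs)[3])
{
  float len_sq[4];

  /* Scalar part: gather the squared lengths. */
  for (int i = 0; i < 4; i++) {
    const float *v = vecs[i];
    len_sq[i] = v[0] * v[0] + v[1] * v[1] + v[2] * v[2];
  }

  /* SIMD part: four square roots with one instruction. */
  float len[4];
  _mm_storeu_ps(len, _mm_sqrt_ps(_mm_loadu_ps(len_sq)));

  /* Scalar part: apply the reciprocal lengths. */
  for (int i = 0; i < 4; i++) {
    float *v = vecs[i];
    const float inv = (len[i] != 0.0f) ? 1.0f / len[i] : 0.0f;
    v[0] *= inv;
    v[1] *= inv;
    v[2] *= inv;
  }
}
```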

The story in Cycles is a bit different: storing XYZ in a float4 was not only beneficial on Core 2 generation CPUs, but CUDA and OpenCL also natively work with float4. Cycles also has different algorithms and data layout for different architectures - BVH2 for GPUs, BVH4 for SSE2 and BVH8 for recent AVX.