Context/ Who TF is this guy?:
Outside perspective from a different FOSS community that also deals with low-level, hand-optimized libraries (I do High Performance Computing BLAS kernels in C or hand-written assembly in AVX1, AVX2 or AVX512).
Post:
SSE2 is ancient as far as SIMD instruction sets go. SSE3 doesn’t help much, and neither does anything else up to SSE4.1.
If you’re going to give SSE one last hurrah, I’d do it fully. x86-64-v2 is the right way to do so.
Broadly speaking, using GCC to guide minimums tends to help.
In most circles for legacy/life support builds, Nehalem/Westmere is the furthest back I’d go. (Last Intel generation before AVX). The difference between Nehalem and Penryn is negligible, but folks on legacy systems are much more likely to be using Nehalem IME (x58 hexacore Xeons for example can be had for ~15-20USD online)
Unfortunately, uArch and ISA don’t help much. For example, Intel and AMD in their infinite wisdom are still launching CPUs in the “low end”/embedded product segments that only support up to SSE4.2, while other chips in the same family support up to AVX512. These chips then get used by certain vendors for use cases outside of their intended purpose, such as ultra low cost laptops, thin clients, etc.
I’d have to dig into the build/compilation system, but setting your minimum to x86-64-v2 and setting the tuning flag (GCC and LLVM, meaning clang and the Intel/AMD ICX/AOCC versions, use -mtune=[name]) to whatever architecture your current data or a Blender user survey shows is most popular could be a nice bump. (-mtune changes which instruction cost tables the compiler uses, while still only allowing the instructions provided by the -march flag, which would be -march=x86-64-v2.)
Especially if you’re going to use an ‘x86-64-v[something]’ flag, you’ll want some sort of tune flag with it, else you fall back on the generic cost tables which are far from great.
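As a rough illustration of pairing the two flags, here is a minimal Python sketch that assembles a compile command. The helper names (arch_flags, compile_command), the -O2 level and the kernel.c file name are all made up for the example and say nothing about how Blender's build system is actually wired up. Only the -march and -mtune flags themselves come from the discussion above.

```python
# Hypothetical helpers: they only build an argument list, they do not run a compiler.
KNOWN_LEVELS = ("x86-64", "x86-64-v2", "x86-64-v3", "x86-64-v4")


def arch_flags(baseline="x86-64-v2", tune="skylake"):
    """Return the ISA baseline (-march) and the cost-table choice (-mtune)."""
    if baseline not in KNOWN_LEVELS:
        raise ValueError(f"unknown x86-64 level: {baseline}")
    # -march limits which instructions may be emitted,
    # -mtune only changes how the compiler schedules and costs them.
    return [f"-march={baseline}", f"-mtune={tune}"]


def compile_command(source, compiler="gcc", baseline="x86-64-v2", tune="skylake"):
    """Assemble a full (illustrative) compile command as a list of arguments."""
    return [compiler, "-O2", *arch_flags(baseline, tune), "-c", source]


if __name__ == "__main__":
    print(" ".join(compile_command("kernel.c")))
    print(" ".join(compile_command("kernel.c", compiler="clang", tune="znver2")))
```

The point of keeping the two values separate is that the baseline can stay fixed for a whole release series while the tune target follows whatever hardware the user base actually has.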
Hard Data
A trick you can use is to assume that Blender users overlap decently with Steam users, and approximate based on Steam hardware survey data. From there, you can see that only 0.26% of users have SSE4.1 but don’t have SSE4.2.
(The numbers are nearly identical for Linux + x86 macOS + Windows. Click other settings at the bottom of this page: Steam Hardware & Software Survey)
You can also see that only 2.55% have SSE4.2 but don’t have AVX. (And that’s still 96+% of all Windows Steam users.)
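To make the arithmetic explicit, here is a small Python sketch that adds up the two survey figures quoted above. It only measures users lost relative to an SSE4.1 baseline, because no figure is given here for machines that lack SSE4.1 altogether. The function name is made up for the example.

```python
# Shares quoted above from the Steam Hardware Survey, in percent of surveyed users.
SSE41_WITHOUT_SSE42 = 0.26  # has SSE4.1 but not SSE4.2
SSE42_WITHOUT_AVX = 2.55    # has SSE4.2 but not AVX


def share_dropped(steps):
    """Sum the shares lost by raising the baseline through each step."""
    return round(sum(steps), 2)


# Raising the minimum from SSE4.1 to SSE4.2 (roughly x86-64-v2).
dropped_by_sse42 = share_dropped([SSE41_WITHOUT_SSE42])

# Raising it further, to require AVX as well.
dropped_by_avx = share_dropped([SSE41_WITHOUT_SSE42, SSE42_WITHOUT_AVX])

if __name__ == "__main__":
    print(f"SSE4.2 baseline drops {dropped_by_sse42}% of surveyed users")
    print(f"AVX baseline drops {dropped_by_avx}% of surveyed users")
```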
Every “mainstream” CPU from Intel and AMD has had AVX2 or more since Haswell on Intel and Ryzen1 on AMD.
The only exceptions are the embedded chips that I alluded to earlier. But dictating the support model of a project the size and scope of Blender based on the ISA support of chips designed for set-top boxes or consumer firewalls seems a little strange to me.
Realistically I think you’d be fine to mark ~3.5 as v2, with the next major release 4.x as V3.
There will always be some very small/minor SKUs from vendors like Intel/AMD/NV etc. that have poor overlap with major revisions, but keeping them on life support does little to move the project along other than increase tech debt.
If I can suggest: Skip AVX (AKA AVX1).
AVX didn’t add much on Sandy Bridge outside of creating the YMM registers (256 bit), and those were for floats only. AVX1 had no integer support, so for ints you were still stuck on SSE4.2. Ivy Bridge added the F16C instructions for supporting IEEE754 compliant fp16, but they’re so slow you may as well just do it in AVX/SSE4.2 and save the hassle (this is what GCC and clang are doing for C and C++ 23 to support native FP16 as part of the specification, but I digress).
AVX2 added the capability of using the full length of YMM registers for ints and floats, extended the AVX1 VEX encoding options to ints and did some other clever stuff.
There’s a good reason that Bulldozer/Excavator/Sandy bridge/Ivy bridge got skipped by V3: AVX1 was boring and didn’t get anything done.
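To see why an AVX1-only chip lands in v2 rather than v3, here is a minimal Python sketch that classifies a set of CPU feature flags into an x86-64 level. The flag names follow the spelling Linux uses in /proc/cpuinfo (for example pni for SSE3 and abm for LZCNT). That mapping onto the level definitions is my own approximation, and the function names are made up for the example.

```python
# Features each level adds on top of the previous one, in /proc/cpuinfo spelling.
LEVEL_FLAGS = {
    2: {"cx16", "lahf_lm", "popcnt", "pni", "ssse3", "sse4_1", "sse4_2"},
    3: {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave"},
    4: {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"},
}


def x86_64_level(flags):
    """Return the highest level (1 to 4) whose requirements are all present."""
    present = set(flags)
    level = 1
    for candidate in (2, 3, 4):
        if LEVEL_FLAGS[candidate] <= present:
            level = candidate
        else:
            break  # levels are cumulative, so stop at the first gap
    return level


def read_cpu_flags(path="/proc/cpuinfo"):
    """Read the flag set of the first CPU listed (Linux only)."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()


if __name__ == "__main__":
    # An AVX1-only chip: everything in v2 plus AVX, but no AVX2/FMA/BMI.
    avx1_only = LEVEL_FLAGS[2] | {"avx", "xsave"}
    print("AVX1-only chip is level", x86_64_level(avx1_only))
```

Because the levels are cumulative, a chip with AVX but without AVX2, FMA and BMI simply stays at v2, which is the "skipped" behaviour described above.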
Anyway, before this turns into a full on blog post
TL; DR:
Assuming the Blender community’s users roughly overlap with Steam’s, everybody* has SSE4.2.
AVX1 was a beta for AVX2, which actually got things done.
IMO and IME, you could do something like -march=x86-64-v2 for 3.5, then go to x86-64-v3 for 4.0.
TIP: If you’re setting requirements via x86-64-v[something], use -mtune=[some CPU arch] and set [some CPU arch] to be roughly whatever is most common within the Blender community.
None of the x86-64-v[something] levels have vector cost tables, which sometimes leads to people thinking they’re broken. Something to watch out for.
With how similar Intel’s microarchitectures were for many years, -mtune=skylake is a very good bet. Be prepared for some people to accuse you of being Intel shills. If you have more Zen based users, then use -mtune=znver2, at which point assume you’ll be called AMD shills.
Choose neither and performance suffers.