Proposal: Bump minimum CPU requirements for Blender

Hi everyone,
the minimum CPU instruction set required to launch Blender on x86-64, as stated on the requirements page, is currently SSE2. This is very old, and it is also inconsistent with the other items in our minimum requirements:

  • 64-bit quad core CPU with SSE2 support
    – The first released quad-core processors already came with higher instruction support:
    AMD Phenom, 2007, SSE3, SSE4a
    Intel Core 2 Extreme, 2007, SSE4.1

  • Less than 10 years old
    – If this is the basis for minimum requirements, even AVX (Intel Sandy Bridge, 2011; AMD Bulldozer, 2011) would be possible as the minimum instruction level.

Also important to keep in mind is that our minimum GPU requirements are already more recent and therefore the limiting factor when it comes to supported systems.

Considering all these points, increasing the minimum CPU instruction level to SSE4.1 seems natural. This way we would still support CPUs that are 14-15 years old, while the limiting factor would again be the GPUs, which already require a card no older than about 12 years. Further bumps, e.g. to AVX, could be considered at a later point. Other DCC applications already require at least SSE4.2 or even AVX.

Benefits of bumping to a higher instruction level:

  • Improved performance (either via automatic compiler optimizations or actual SSE code) would be possible in various math-heavy parts of Blender.
  • Cycles currently compiles separate kernels for SSE2, SSE3, SSE4.1, AVX and AVX2. Reducing this set would result in less code and shorter compile times.

Feedback is welcome.

Best regards,
Thomas

18 Likes

SSE4.1 feels like the right choice to me. However, this feels more like a change we’d make going from 3.x to 4.x than from 3.4 to 3.5, though I admit it’s not a hill to die on for me.

Dropping the SSE2 kernel will hardly make a noticeable impact for a build with all GPU kernels on, and for local development there’s already WITH_CYCLES_NATIVE_ONLY to keep build times down, so that’s a bit of a non-argument for me.

2 Likes

Although the documentation is weird, I would not compare CPUs with GPUs. GPUs aren’t standardized and fall behind faster than CPUs, which follow a standardized instruction set with selected extensions.

I also agree with LazyDodo: we can discuss it now, but I would wait for a major version bump to actually implement it. Until then it would be good to have some figures on the benefits.

Have you tested how much difference it makes? I can imagine the new mesh code would benefit from the wider registers, but most code wouldn’t. libc functions should already use wider registers when available.

My experience with coding algorithms towards SSE4/AVX is that it can help, but restructuring the code is needed to get good results.

So I’m not against it; it’s just a matter of timing and some figures to actually see the impact.

It might make sense to align with the x86-64 microarchitecture feature levels defined by GCC and LLVM.
Specifically, x86-64-v2 sounds reasonable: CMPXCHG16B, LAHF-SAHF, POPCNT, SSE3, SSE4.1, SSE4.2 and SSSE3.
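
As a rough illustration (my own sketch, not anything from the Blender build system): when a translation unit is built with -march=x86-64-v2, GCC and Clang predefine a macro for each feature in that level, so a simple check can confirm the baseline a binary was actually compiled against.

// Hypothetical check, assuming GCC or Clang; file name and build line are made up:
//   g++ -O2 -march=x86-64-v2 check_baseline.cpp
#include <cstdio>

int main() {
#if defined(__SSE4_2__) && defined(__SSSE3__) && defined(__POPCNT__)
  std::puts("compiled with at least the x86-64-v2 feature set");
#else
  std::puts("baseline below x86-64-v2 (e.g. plain -march=x86-64)");
#endif
  return 0;
}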

5 Likes

oh that is nice, x86-64-v2 sounds like a great target!

Those dates are misleading. For example, I’m still on an Athlon II X2 270 that I bought in 2012 (released 2Q 2011 according to CPU-World) and I don’t see sse4_1 in my /proc/cpuinfo flags, let alone anything AVX-related.

For reference, here’s what the Athlon II X2 270 does support:

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save

…and yes, that is a dual-core CPU but, for my puttering around for various uses (diagramming, pose/perspective references, plans to do some model making for simple 3D games and some video editing, etc.), Blender currently works perfectly well on it.

(My budget’s mostly been going into more and bigger hard drives and, given its relatively small size, I was hoping to at least keep this machine ticking along until the projected opening of the new chip fabs in 2024 or 2025.)

@Jeroen-Bakker
I didn’t run benchmarks yet, but I remember that compiling with higher flags gave some performance improvements in the past for the old Blender Internal render. I agree it would be helpful to get some recent numbers though.

@lukasstockner97 I wasn’t aware of these, x86-64-v2 indeed looks like a good target.

@ssokolow Our minimum requirements are quad-core and less than 10 years old, and my point is that taking these two requirements into account, SSE4.1 availability can be assumed.

I would be more flexible when it comes to the timing of this. Of course there is no point in bumping the requirements just because, but if there are valid arguments (performance, code cleanup, developers who like to introduce SSE4 in their code…) I think this doesn’t need to wait for 4.0. Blender has always been very low on the minimal requirements compared to other software and that is a good thing. But at some point (again, collecting more data first is good) this is the right thing to do and people on older hardware can still use the existing builds. With 3.3 LTS, there is even a version that will get further support and fixes for another two years, no matter if we bump these requirements now or in a year for 4.0.

1 Like

That’s fair.

I do note, however, that my 2011 chip had the same SSE4a as that 2007 Phenom, so it’s possible that there are still some quad-core AMD chips out there without SSE4.1 which won’t be 10 years old for a little while longer.

It was during the period when AMD was in decline after all.

Something to verify at least.

1 Like

Context/ Who TF is this guy?:

Outside perspective from a different FOSS community that also deals with low-level, hand-optimized libraries (I do High Performance Computing BLAS kernels in C or hand-optimized assembly for AVX1, AVX2 or AVX512).
Post:

SSE2 is ancient as far as SIMD instruction sets go; SSE3 doesn’t help much, and everything up to 4.1 doesn’t help much either.

If you’re going to give SSE one last hurrah, I’d do it fully. x86-64-v2 is the right way to do so.

Broadly speaking, using GCC to guide minimums tends to help.

In most circles, for legacy/life-support builds, Nehalem/Westmere is the furthest back I’d go (the last Intel generation before AVX). The difference between Nehalem and Penryn is negligible, but folks on legacy systems are much more likely to be using Nehalem IME (X58 hexacore Xeons, for example, can be had for ~15-20 USD online).

Unfortunately, uArch and ISA don’t line up neatly. For example, Intel and AMD in their infinite wisdom are still launching CPUs in the “low end”/embedded product segments that only support up to SSE4.2, while chips in the same family support up to AVX512. These chips then get used by certain vendors for use cases outside of their intended purposes, such as ultra-low-cost laptops, thin clients, etc.

I’d have to dig into the build/compilation system, but setting your minimum to x86-64-v2 and setting the tuning flag (GCC and LLVM, including the Intel/AMD ICX/AOCC variants, use -mtune=[name]) to whatever architecture your current data or a Blender user survey shows is most popular could be a nice bump. (-mtune changes which instruction cost tables the compiler uses, while still only allowing the instructions permitted by the preceding -march flag, which here would be -march=x86-64-v2.)

Especially if you’re going to use an ‘x86-64-v[something]’ flag, you’ll want some sort of tune flag with it; otherwise you fall back on the generic cost tables, which are far from great.
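
To make the -march/-mtune split concrete, here’s a tiny sketch of my own (file name and build line are made up): -march caps which instructions auto-vectorization may use, while -mtune only changes the cost tables used to schedule them.

// Hypothetical build line: g++ -O2 -march=x86-64-v2 -mtune=skylake -c saxpy.cpp
#include <cstddef>

// Trivially vectorizable; under -march=x86-64-v2 the compiler may emit
// SSE4.x code for this loop, and -mtune=skylake steers how it schedules it.
void saxpy(float a, const float *x, float *y, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    y[i] = a * x[i] + y[i];
  }
}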

Hard Data

A trick you can use is to assume that Blender users overlap reasonably well with Steam users and approximate based on Steam hardware survey data. From there, you can see that only 0.26% of users have SSE4.1 but not SSE4.2.

(The numbers are nearly identical for Linux + x86 macOS + Windows. Click other settings at the bottom of this page: Steam Hardware & Software Survey)

You can also see that only 2.55% of users have SSE4.2 but not AVX (which still leaves 96+% of all Windows Steam users with AVX).

Every “mainstream” CPU from Intel and AMD has had AVX2 or more since Haswell on Intel and Ryzen1 on AMD.

The only exceptions are the embedded chips I alluded to earlier. But dictating the support model of a project the size and scope of Blender based on the ISA support of chips designed for set-top boxes or consumer firewalls seems a little strange to me.

Realistically I think you’d be fine to mark ~3.5 as v2, with the next major release 4.x as V3.

There will always be some very small/minor SKUs from vendors like Intel/AMD/NV etc. that have poor overlap with major revisions, but keeping them on life support does little to move the project along other than increase tech debt.

If I can suggest: Skip AVX (AKA AVX1).

AVX didn’t add much beyond creating the 256-bit YMM registers, and only for floats, on Sandy Bridge. AVX1 had no integer support; for integers you were still stuck on SSE4.2. Ivy Bridge added the F16C instructions for IEEE 754-compliant fp16 conversion, but they’re so slow you may as well just do it in AVX/SSE4.2 and save the hassle (this is what GCC and Clang are doing for C and C++23 to support native FP16 as part of the specification, but I digress).

AVX2 added the capability of using the full length of YMM registers for ints and floats, extended the AVX1 VEX encoding options to ints and did some other clever stuff.

There’s a good reason that Bulldozer/Excavator/Sandy bridge/Ivy bridge got skipped by V3: AVX1 was boring and didn’t get anything done.
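
To make that gap concrete, a small sketch of my own (assuming <immintrin.h> and a translation unit with AVX2 enabled; the build line is hypothetical): 256-bit float adds already exist in AVX1, but the equivalent 256-bit integer add only arrives with AVX2.

// Sketch only; hypothetical build line: g++ -O2 -mavx2 -c ymm_demo.cpp
#include <immintrin.h>

// AVX1: 256-bit YMM math exists for floats (vaddps, Sandy Bridge onwards)...
__m256 add_floats(__m256 a, __m256 b) {
  return _mm256_add_ps(a, b);
}

// ...but the 256-bit integer equivalent (vpaddd on YMM) requires AVX2.
// On AVX1-only hardware you are back to two 128-bit SSE adds instead.
__m256i add_ints(__m256i a, __m256i b) {
  return _mm256_add_epi32(a, b);
}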

Anyway, before this turns into a full on blog post

TL; DR:

Assuming the Blender community’s users roughly overlap with Steam’s, everybody* has SSE4.2.

AVX1 was a beta for AVX2, which actually got things done.

IMO and IME, you could do something like -march=x86-64-v2 for 3.5, then go to x86-64-v3 for 4.0

TIP: If you’re setting requirements via x86-64-v[something], use -mtune=[some CPU arch] and set [some CPU arch] to roughly whatever is most common within the Blender community.
None of the x86-64-v[] levels have vector cost tables, which sometimes leads people to think they’re broken. Something to watch out for.

With how similar Intel was for many years, -mtune=skylake is a very good bet. Be prepared for some people to accuse you of being Intel shills. If you have more Zen-based users, then use -mtune=znver2, at which point assume you’ll be called AMD shills.

Choose neither and performance suffers :sweat_smile:

8 Likes

You’re making it sound like there’s a huge difference between supplying -mtune or not. What are we talking about here: 1-2% depending on the workload, or more like 10%+ across the board?

I think it is wrong to obsolete the Phenom II X6. This is still a capable CPU for a basic user, and the Piledriver 8-core is hardly better as it has 4 FPUs instead of 6.

It has SSE4a and SSE3 but no SSSE3.

For me, there should have been one extra x86 level; leaving out the AVX in the Bulldozer family is a waste.

Think it’s a little early to become arbiters of “this CPU gets in, this CPU is out”, as this all hinges on whether changing the requirement from SSE2 to something else actually leads to tangible performance improvements.

I’m far more likely to throw an old CPU under the bus if the rest of the users get a 30-40% perf boost in some common tasks. If we change the requirement and there’s no tangible benefit, we may as well not do it.

But to make that case, we’d have to collect performance metrics first.

8 Likes

The Phenom II X4 still appears sometimes on European Amazon bestseller lists. On Amazon India they sell tons of old Intel CPUs; the Core 2 Duo still shows up.

I forgot that the Athlon II X4 still shows up in Europe.

If you have a Phenom II X6 on an AM3+ motherboard, it’s not worth it to just move to an 8-core FX, as they are overpriced. You need a new motherboard and RAM.

If you have an AM3 motherboard you cannot just drop in an FX.

Nothing has been decided yet; having some data on the performance first is the way to go. :slight_smile:

I understand your concerns regarding your system, but please keep in mind that we are talking about pretty old hardware, and Blender isn’t a word processor. I can buy a computer for 100€ today, but I cannot expect to run the latest computer games on it either.

There is no intention to artificially exclude old hardware, but if tests show a performance benefit and (looking at the Steam numbers posted above) this would benefit ~98% of users, this is the way to go and just natural software evolution. Blender 3.3, 3.4 etc. will still be there for older systems to run for years to come.

12 Likes

I’m in favor of this proposal, using x86-64-v2 seems like a good compromise.

However, I think we should be writing SIMD register-width-agnostic code that isn’t necessarily using specific intrinsics. With libraries like Google’s Highway, we should be able to switch code paths at runtime to select the best fit for a specific CPU. At that point, performance isn’t one of the benefits of this change, since we could target AVX2 or AVX512 while keeping support for any older CPU.

The remaining arguments are binary size, compile time, and a smaller supported testing surface.
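
For what it’s worth, here is a minimal sketch of the runtime-dispatch idea, written with GCC/Clang’s __builtin_cpu_supports and target attribute rather than Highway itself (the function is made up); a library like Highway essentially packages this pattern and generates the per-target code paths for you.

#include <cstddef>
#include <cstdio>

// Baseline variant; SSE2 is guaranteed by the plain x86-64 baseline.
static float sum_baseline(const float *x, std::size_t n) {
  float s = 0.0f;
  for (std::size_t i = 0; i < n; ++i) s += x[i];
  return s;
}

// Same loop, compiled for AVX2 via the target attribute so the compiler
// may auto-vectorize it with 256-bit instructions.
__attribute__((target("avx2")))
static float sum_avx2(const float *x, std::size_t n) {
  float s = 0.0f;
  for (std::size_t i = 0; i < n; ++i) s += x[i];
  return s;
}

// Runtime selection; this is the part a dispatch library would hide.
static float sum_dispatch(const float *x, std::size_t n) {
  if (__builtin_cpu_supports("avx2")) return sum_avx2(x, n);
  return sum_baseline(x, n);
}

int main() {
  const float data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
  std::printf("%f\n", sum_dispatch(data, 8));
  return 0;
}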

7 Likes

I think agnostic code would be the way to go to make everybody happy.

Now it is a bad time for AMD users to upgrade as AM5 and DDR5 are still very expensive.

Also some countries have 100% import tariffs.

From what I understand this kind of dynamic dispatch mechanism would still require quite a bit of work from our side, to refactor code and mark specific functions to be optimized.

I don’t have a good sense of how much time it would take, but it doesn’t seem like something we just quickly switch on and get a performance improvement the way setting a few build flags might.
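
For reference (my own sketch, not something Blender does today), GCC and newer Clang also offer per-function multi-versioning on x86/ELF targets via the target_clones attribute, which is roughly that “mark specific functions” workflow: the compiler emits several copies plus a resolver that picks one at load time.

#include <cstddef>

// Hypothetical hot function; the attribute requests a baseline copy and an
// AVX2 copy, with an ifunc resolver choosing between them on the target CPU.
__attribute__((target_clones("default", "avx2")))
void scale(float *x, std::size_t n, float s) {
  for (std::size_t i = 0; i < n; ++i) {
    x[i] *= s;
  }
}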

It can be case-dependent, but some toolchains, in the absence of tuning flags, default to Core 2 Quad-era cost tables.

-mtune alone, depending on the chip, can make anywhere from ~2% to ~9% of difference. It comes down to how branchy the code is, how “lucky” you get with hot code overlapping in cache lines, etc.

Some tools like Coz can help with this (GitHub - plasma-umass/coz: Coz: Causal Profiling)

Talk here: “Performance Matters” by Emery Berger - YouTube

(I’ve used it, but no affiliation. It does “catch” things that other profilers don’t. Integration with GPUs isn’t there, so you’d still be doing nvprof, Intel VTune etc. for those use cases.)

I can run some quick experiments on “next generation” x86 CPUs if you’d like. I have access to CPUs with the latest AVX512 extensions, including the newest FP16 instructions that operate on the IEEE types due for official adoption in C/C++23.

Currently cloning the repo, let me know.

-Felix

edit:

[It seems that as a newer account I’m limited in the number of responses to this topic. Does the user community also use mailing lists etc.? Just so I can provide some context/help.]

On the topic of SIMD vector-length-agnostic code, a quick grep of the existing codebase (grep -r addps) shows a few lines of inline assembly in the Eigen SSE subtree, from which you can double-check that Eigen is already being used to generate the different code paths for SSE all the way to AVX512. In other words, you’re already generating code for all of them in the binary / “bloating” the size. (You also have coverage for ARM NEON instructions, for example, on top of CUDA, HIP etc.)

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               GenuineIntel
  Model name:            [Redacted]
    CPU family:          6
    Model:               151
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            2
    CPU max MHz:         6500.0000
    CPU min MHz:         800.0000
    BogoMIPS:            7219.20
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mc
                         a cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss 
                         ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art
                          arch_perfmon pebs bts rep_good nopl xtopology nonstop_
                         tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes6
                         4 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xt
                         pr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_dead
                         line_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowp
                         refetch cpuid_fault cat_l2 invpcid_single cdp_l2 ssbd i
                         brs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriori
                         ty ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep b
                         mi2 erms invpcid rdt_a avx512f avx512dq rdseed adx smap
                          avx512ifma clflushopt clwb intel_pt avx512cd sha_ni av
                         x512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lo
                         ck_detect avx_vnni avx512_bf16 dtherm ida arat pln pts 
                         hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi a
                         vx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes
                          vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcn
                         tdq rdpid movdiri movdir64b fsrm avx512_vp2intersect md
                         _clear serialize pconfig arch_lbr ibt avx512_fp16 flush
                         _l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                  [Redacted]
  L1i:                   [Redacted]
  L2:                    [Redacted]
  L3:                    [Redacted]
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-15
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer
                          sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB fillin
                         g
  Srbds:                 Not affected
  Tsx async abort:       Not affected
1 Like

I’ll quickly add, does Blender have any decent user data/survey response data on hardware across the user base?

I’ve found things like opendata.blender.org, and there’s also some data to be scraped from the Blender Benchmark entries on OpenBenchmarking.org, but this data will skew very hard towards higher-performance systems / the overclocking and gaming crowd, with the latter of the two being more Linux-focused.

The opendata.blender.org results would still have the problem above of skewing towards higher-performance systems, but by extracting the data and applying a regex you could match CPU models to approximate ISA coverage.

[Edit to respond to below comment from @brecht]

Enabling v2 would mean that all systems are assumed to have at least up to SSE4.2, but other code paths could still be emitted, available and used.

My instincts tell me something like shipping the binary builds with a minimum requirement of -march=x86-64-v2,

then setting up your compute kernel dependencies to be something along the lines of:
-march=x86-64-v2 -mtune=westmere (SSE4.2, AKA legacy)
-march=x86-64-v3 -mtune=skylake (mainstream, AVX2)
-march=x86-64-v4 -mtune=skylake-avx512 (basic AVX512; it includes AVX512F, VL, CD, BW, DQ)

The binary then becomes a case of supporting three major tuning targets.
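
A rough sketch of my own (the kernel and file names are made up) of how one kernel source could be compiled once per flag set listed above, using the macros GCC/Clang predefine at each level to document which path a given build is allowed to take:

// kernel_scale.cpp - hypothetically compiled three times, once per flag set
// above (x86-64-v2, -v3, -v4); in a real build each variant would live in its
// own namespace or get a suffixed symbol name so all three can coexist.
#include <cstddef>

void scale_kernel(float *x, std::size_t n, float s) {
#if defined(__AVX512F__)
  // v4 build: the compiler may use 512-bit ZMM registers for this loop.
#elif defined(__AVX2__)
  // v3 build: 256-bit YMM auto-vectorization is allowed.
#else
  // v2 build: SSE4.2-era 128-bit path.
#endif
  for (std::size_t i = 0; i < n; ++i) {
    x[i] *= s;
  }
}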

There’s no official timeline/expectation for if or when an eventual v5 will be released, but if I were to guess, it would be at the Alder Lake/Sapphire Rapids boundary. You’d get everything in Skylake-AVX512 plus:

PKU, AVX512VBMI, AVX512IFMA, SHA, AVX512VNNI, GFNI, VAES, AVX512VBMI2, VPCLMULQDQ, AVX512BITALG, RDPID, AVX512VPOPCNTDQ, PCONFIG, WBNOINVD, CLWB, MOVDIRI, MOVDIR64B, AVX512VP2INTERSECT, ENQCMD, CLDEMOTE, PTWRITE, WAITPKG, SERIALIZE, TSXLDTRK, UINTR, AVX-VNNI, AVX512FP16 and AVX512BF16.

The notable ones being FP16, BF16, VP2, BitALG, VPOP, VBMI 1 and 2, and integer fused multiply-add.

If you want to look at the intrinsics for these instructions, the Intel webpage does a decent job: