We don’t collect any information like that. Opendata indeed will be skewed towards people with new CPUs that want to benchmark them, I don’t think it can tell us anything useful.
In terms of what is acceptable to drop, it will depend on the performance benefit but I think it’s somewhere < 1% of users. If v3 for example means 10% of users according to those Steam numbers, that’s much too high even for Blender 4.0.
For Cycles, we can probably drop SSE3 and AVX kernels and only keep SSE2, SSE41 and AVX2. Might be worth trying an AVX512 kernel but not really expecting that to do much.
I’ve hinted at this previously, but I have yet to see solid convincing performance numbers to justify raising the requirements.
Do note that these numbers should exclude Cycles, as it already has specialized kernels for various architectures. If anything, we may have to look into adding specialized AMD/Intel kernels there if we are able to replicate the essentially free 2-9% perf bumps from -mtune that @FelixCLC is hinting at.
Until we have a solid grasp of which parts of Blender get significantly faster by twiddling the build flags and nothing else, any discussion of what lower bar we should pick is putting the cart before the horse a tiny bit, and is ultimately a waste of time for everyone involved.
For Blender overall, nothing is easily available. While the unit test coverage is ok…ish, we do not track any performance metrics currently. Some work has been done in this area by the geonodes team, but afaik this has been experimental in nature so far.
If, given your experience, you specifically want to tinker with something right now, I’d say Cycles is your best bet. You can find the instructions for the benchmarks over here; the compiler flags for Cycles can be found in intern\cycles\CMakeLists.txt
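When experimenting with build flags like this, it can help to confirm which instruction sets the compiler was actually allowed to assume. A minimal sketch, assuming the usual predefined macros (`__SSE4_1__` and `__SSE2__` on GCC/Clang; `__AVX__`, `__AVX2__` and `__AVX512F__` on GCC/Clang and on MSVC with the matching /arch option). The function name is hypothetical and not part of Blender:

```cpp
// Hypothetical helper, not Blender code: reports the highest SIMD level the
// compiler was told it may assume for this translation unit.
// It reflects build flags only, not what the CPU running the binary supports.
const char *compiled_simd_level()
{
#if defined(__AVX512F__)
  return "AVX512F";
#elif defined(__AVX2__)
  return "AVX2";
#elif defined(__AVX__)
  return "AVX";
#elif defined(__SSE4_1__)
  return "SSE4.1";
#elif defined(__SSE2__) || defined(_M_X64)
  // SSE2 is part of the x86-64 baseline, so 64-bit MSVC builds land here.
  return "SSE2";
#else
  return "none detected";
#endif
}
```

Printing this once from a test build is a cheap sanity check that a changed flag really reached the file being benchmarked.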
Generally agreed. Since one mentioned benefit of bumping the requirements is that we would be able to use SSE4 intrinsics in the code, I wanted to point out that there are probably preferable ways to use SIMD that don’t require that. Seems like after figuring out dynamic dispatch once, further use would be easier.
I don’t have high hopes for auto-vectorization, though it might help in some simple loops. That’s nice to get “for free”, but I think we ought to aim a bit higher than that.
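To make the dynamic dispatch idea concrete, here is a minimal sketch. It assumes GCC or Clang on x86, where `__builtin_cpu_supports` and the `target` function attribute are available, and falls back to the baseline elsewhere. All function names are made up for illustration; this is not how Cycles or any other part of Blender does it.

```cpp
#include <cstddef>

// Hypothetical names throughout; this is not Blender or Cycles code.
using SumFn = float (*)(const float *, std::size_t);

// Baseline implementation that every supported CPU can run.
static float sum_baseline(const float *data, std::size_t n)
{
  float total = 0.0f;
  for (std::size_t i = 0; i < n; i++) {
    total += data[i];
  }
  return total;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
// Same source, but the compiler may emit AVX2 instructions in this function
// only, so the rest of the binary still runs on older CPUs.
__attribute__((target("avx2"))) static float sum_avx2(const float *data, std::size_t n)
{
  float total = 0.0f;
  for (std::size_t i = 0; i < n; i++) {
    total += data[i];
  }
  return total;
}
#endif

// Pick an implementation based on what the running CPU reports.
static SumFn select_sum()
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) {
    return sum_avx2;
  }
#endif
  return sum_baseline;
}

// Public entry point: the choice is made once, on first call.
float sum(const float *data, std::size_t n)
{
  static const SumFn fn = select_sum();
  return fn(data, n);
}
```

Callers only ever see `sum()`, so once the selection plumbing exists, adding another specialized variant is a local change.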
So in terms of wanting data overall, just from people in this thread, what hardware do we have access to?
I can spool up an old Ivy Bridge system for an AVX but no AVX2 example, I can do Bulldozer-era APUs, Haswell (first gen AVX2), Alder Lake for one of the most recent AVX2 systems, and another specialized AVX512 system.
Typically in full application profiling, it is common to see a 2-9% uplift in total throughput (say, in CFD solvers, which scale roughly with SGEMM FLOPS) from per-uArch tuning.
This assumes that you are compute bound and not, say, memory bandwidth bound, I/O bound, etc.
That’s a deeper rabbit hole and gets very OT, but for the sake of pointing you towards resources if you’re curious: arXiv has a fair few interesting papers on the topic, or if you prefer talks/videos, look into the divergence of FLOPS vs IOPS in HPC to get a sense of it.
A practical example of this that consumers are seeing these days are massive amounts of cache in consumer CPUs to help “hide” how bandwidth starved modern CPUs and GPUs are.
It’s also known as “Feeding the beast”. It’s why Intel’s next gen “Max” series HPC data center CPUs will have on-package HBM memory, because HBM can keep up, unlike traditional DRAM (including “Next gen” DDR5, which still isn’t fast enough).
You can see the same sort of approach with AMD and the “V-Cache” they’ve been adding to some of their Epyc and now a single consumer Ryzen SKU.
Time spent waiting on data to come into the core is time not spent computing or doing valuable work. That is why we have 3 levels of cache on CPUs these days.
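One rough way to reason about the compute bound vs bandwidth bound distinction is a roofline-style estimate: attainable throughput is the lower of peak compute and memory bandwidth multiplied by the work done per byte moved. A minimal sketch follows. The function names are hypothetical, and any numbers fed in are for illustration only, not measurements of real hardware.

```cpp
#include <algorithm>

// Roofline-style estimate (illustrative only).
// peak_gflops:     peak compute throughput of the CPU
// bandwidth_gb_s:  sustained memory bandwidth
// flops_per_byte:  arithmetic intensity of the workload
double attainable_gflops(double peak_gflops, double bandwidth_gb_s, double flops_per_byte)
{
  return std::min(peak_gflops, bandwidth_gb_s * flops_per_byte);
}

// True when memory can deliver data at least as fast as the cores can
// consume it, which is the case where per-uArch tuning has room to help.
bool is_compute_bound(double peak_gflops, double bandwidth_gb_s, double flops_per_byte)
{
  return bandwidth_gb_s * flops_per_byte >= peak_gflops;
}
```

With a made-up machine of 1000 GFLOPS peak and 50 GB/s bandwidth, a workload doing 0.25 FLOPs per byte is capped at 12.5 GFLOPS no matter how well the inner loop is tuned.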
This is the important thing for legacy system users to remember. These blender versions aren’t going anywhere. Also, you aren’t going to be stuck on old hardware forever.
Those who are able to invest their time and effort so that they can afford better hardware should be… I guess “rewarded” is a way of putting it, but isn’t quite like that. Sacrificing even 1% performance, just to keep something from over a decade ago on the officially supported list, doesn’t make sense.
I don’t even want to mention it, but I suppose if somebody is willing to spend their time to include really old hardware in the latest and greatest releases, you could have 2 different blender versions.
The question is whether the old versions could still be used, given security and compatibility concerns. I think that some old chips could be better than some not so old ones in terms of security too.
Reading a little more I don’t even know if this is runtime performance, but I compile blender enough that I’ll take faster compile time. Sorry netburst, core2, or whoever this would potentially leave mired on some barbaric and uncivilized FOSS DCC made in the year ~2024.
Also, to make it clear, I started this thread in the hope of agreeing on new CPU requirements. That doesn’t mean we have to make the switch immediately. But having an agreement to move to something like x86-64-v2 can stimulate development and ideas, and once a patch is tangible and can be merged, it can just be done without further discussions. I think that would be a fair compromise.
The problem is that SSE4.1 is not a nice cutoff point… Something great like the X6 gets dropped and weaklings get in. But now it is pretty basic and not a cheap upgrade.
It was annoying when some game required 4.2 just for some DRM.
Choose whether to enable SIMD compiler flags or not, options are: None SSE42 AVX.
Although not required, it is strongly recommended to enable SIMD. AVX implies SSE42.
Open Image IO:
OIIO comes with SIMD support as well, their changelog comes with some mentions.
Many noise() varieties have been sped up significantly on architectures with SIMD vector instructions (such as SSE). Higher speedups when returning vectors rather than scalars; higher speedups when supplying derivatives as well, higher speedups the higher the dimensionality of the domain. We’re seeing 2x-3x improvement for noise(point), depending on the specific variety, 3x-4x speedup for 4D noise varieties. #415 #422 #423 (1.6.1, 1.6.2)
Further improvements are possible (SIMD batched shading mode), but these would require coding changes in Cycles first.
If you want, I could build you a (Windows) lib set with sse42 as the target. Given this is a good 12+ hours of work at the minimum, I’d like to see some commitments here first; “meh maybe if i have time I’ll take a look” isn’t gonna cut it.