I have access to a dual-socket server node with 96 CPU cores and 8 workstation GPUs. I tried to do Cycles rendering on it and discovered that it worked about as fast as my 6-core/1-GPU desktop.
The rendering itself is reasonably fast: it does 256 samples at 4K resolution in about 1.5 minutes. But that's only after it spends 5-7 minutes loading the objects and textures (it's a pretty big scene with ~250 textures in PNG format).
I know that Cycles initialization is normally slow, but I don't understand why it spends an order of magnitude more time loading objects and textures than Eevee takes to render the whole scene start to finish, or why it reloads them for every frame when they haven't changed. On my desktop, at least, it's 30 seconds of loading before 5-10 minutes of path tracing. Here the timing is reversed.
Throughout the 5-7 minute load, CPU usage is low (1-2 cores' worth) and disk traffic is practically nil.
I took a look under the hood with gdb. The bottleneck appears to be in blenkernel/intern/image.c, which has a single global mutex that serializes all image operations. Cycles needs to load and decompress 250 PNGs for every frame, and, because of the mutex, it can only decompress them one at a time, even on 96 cores: while one thread is decompressing, the other 95 are waiting on the mutex.
I tried the latest source. The problem is still present. But if I comment out some of the locks/unlocks in that file, initialization time drops from several minutes to about 20 seconds. I expect that changing the mutex from global to per-image would do the trick too, without giving up thread safety.
Are there any problems with this approach? I can try to implement it properly and submit a pull request if that's acceptable.