Slow initialization in Cycles (Linux, lots of textures)

Eugene-Kuznetsov · October 25, 2021, 6:35am

I have access to a dual-socket server node with 96 CPU cores and 8 workstation GPUs. I tried to do Cycles rendering on it and discovered that it worked about as fast as my 6-core/1-GPU desktop.
The rendering part itself is reasonably fine - it manages to do 256 samples in 4K resolution in about 1.5 minutes. But that comes only after it spends 5-7 minutes loading the objects and textures (it’s a pretty big scene with ~250 textures in PNG format).

I know that Cycles initialization is normally slow, and I don’t understand why it spends an order of magnitude more time loading objects and textures than it takes Eevee to render the whole scene start to finish, or why it needs to reload them for every frame when they are exactly the same. But at least on my desktop it takes 30 seconds to load before 5-10 minutes of path-tracing. Here the timing is reversed.

Throughout the 5-minute load, CPU usage is low (1-2 CPUs worth) and disk traffic is practically nil.

I took a look under the hood with gdb. It looks like the bottleneck is in blenkernel/intern/image.c. It has a single global mutex that serializes all image operations. So, Cycles needs to load 250 PNGs for every frame, and it needs to decompress these PNGs, and, because of the mutex, it can only decompress them one at a time, even on 96 cores (while one thread is decompressing, the other 95 are waiting on the mutex).
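To illustrate the serialization described above, here is a minimal, self-contained sketch. It is not Blender code: `peak_concurrent_decodes` is a hypothetical function, and a short sleep stands in for PNG decompression. Each worker takes either one shared lock or its own per-image lock, and the function reports how many "decodes" were ever in flight at once.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical illustration, not Blender's image.c.
// Starts one worker per image and returns the highest number of decodes
// that were in flight at the same time.
int peak_concurrent_decodes(bool use_global_lock, int num_images) {
  std::mutex global_lock;  // one lock shared by every image
  std::vector<std::mutex> per_image_locks(static_cast<std::size_t>(num_images));
  std::atomic<int> active{0};
  std::atomic<int> peak{0};

  std::vector<std::thread> workers;
  for (int i = 0; i < num_images; ++i) {
    workers.emplace_back([&, i] {
      std::mutex& lock = use_global_lock
                             ? global_lock
                             : per_image_locks[static_cast<std::size_t>(i)];
      std::lock_guard<std::mutex> guard(lock);
      const int now = ++active;
      // Record a new maximum if this thread observed one.
      int seen = peak.load();
      while (now > seen && !peak.compare_exchange_weak(seen, now)) {
      }
      // Stand-in for the actual decompression work.
      std::this_thread::sleep_for(std::chrono::milliseconds(20));
      --active;
    });
  }
  for (std::thread& worker : workers) {
    worker.join();
  }
  return peak.load();
}
```

With `use_global_lock` set to true the peak is always 1, however many threads are started, which matches the "one thread decompressing, the rest waiting" picture. With per-image locks the peak can rise up to `num_images`, depending on how the threads are scheduled.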

I tried the latest source. The problem is still present. But if I comment out some of the locks/unlocks in that file, initialization time goes down from several minutes to 20 seconds. I expect that changing the mutex from global to per-image would do the trick too.

Are there any problems with this? I can try to implement it properly and submit a pull request if that’s acceptable.

sergey · October 25, 2021, 10:11am

The threading synchronization in image.c is indeed not suitable for reading many images from many threads. A lot of things can be done there to improve scalability and bring the code to a more modern epoch where we know we have a lot of threads.

However, commenting out the locking is not a good solution for this. A more proper approach would be to move away from the global lock to a per-Image datablock lock, similar to how we have eval_mutex in the Mesh_Runtime. The goal should be to allow decoding to happen from many threads, without causing a race condition between image loading and cache management (BKE_image_free_buffers_ex, image_mem_size).
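As a rough sketch of the per-image locking idea (an illustration only, not the actual Blender change): each image carries its own mutex, so the same image is decoded once even when several threads request it, while different images can be decoded in parallel. `SketchImage`, `fake_decode`, `acquire_pixels` and `load_all_concurrently` are hypothetical names, and the decode step is a placeholder.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for an image datablock; not Blender's Image struct.
struct SketchImage {
  std::mutex mutex;        // per-image lock instead of one global lock
  bool loaded = false;     // guarded by `mutex`
  int decode_count = 0;    // guarded by `mutex`; counts real decodes
  std::vector<unsigned char> pixels;
};

// Placeholder for the expensive decompression step.
static std::vector<unsigned char> fake_decode(std::size_t size) {
  return std::vector<unsigned char>(size, 255);
}

// Decodes the image on first use. Only threads asking for the *same* image
// wait on each other. A real implementation would also have to coordinate
// with cache eviction before handing out the buffer; this sketch never frees.
const std::vector<unsigned char>& acquire_pixels(SketchImage& image) {
  std::lock_guard<std::mutex> guard(image.mutex);
  if (!image.loaded) {
    image.pixels = fake_decode(64);
    image.loaded = true;
    ++image.decode_count;
  }
  return image.pixels;
}

// Every thread requests every image; returns the total number of decodes.
int load_all_concurrently(std::vector<SketchImage>& images, int num_threads) {
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&images] {
      for (SketchImage& image : images) {
        acquire_pixels(image);
      }
    });
  }
  for (std::thread& worker : workers) {
    worker.join();
  }
  int total = 0;
  for (SketchImage& image : images) {
    total += image.decode_count;
  }
  return total;
}
```

For example, calling `load_all_concurrently` on 8 images with 4 threads performs 8 decodes in total, one per image. The part this sketch leaves out is the one named above: freeing buffers and measuring memory use while other threads may still be reading.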

If you can have a closer look into moving away from global to per-image locks that’d be a very welcome contribution!

P.S. If you want a more immediate solution for the rendering tasks you’re doing now (e.g. to unblock a production), you can work around the problem by unpacking image textures (Cycles only uses the image.c codepath for packed and generated images).

Eugene-Kuznetsov · October 25, 2021, 8:56pm

Right, I just commented out the locking because the accesses were obviously read-only in my particular case and it was much easier than doing the real coding. Unpacking works, but it’s a hassle because I have to either upload the blend file in the unpacked form, or figure out how to unpack it on the server with a script from the command line.

I’ve since discovered that the master branch no longer has OpenCL, and HIP does not work correctly for me, so I’ll need to look into that before I can get back to the mutex issue.

Alaska · October 26, 2021, 5:18am

HIP support is still being worked on. Official guides on how to set it up, along with a list of supported GPUs, GPU drivers, and operating systems, will be made available in the coming months. HIP support is expected to be enabled, or at least accessible, during the development of Blender 3.1. But timelines may change.

Eugene-Kuznetsov · October 27, 2021, 9:21pm

I work with HIP closely as part of my day job. I sometimes have to find and fix bugs in the HIP runtime.

In this case, it looked like it could be a straightforward CUDA port. GPU kernels are the same, the interface looks ported. But I’m running into some sort of memory access violation inside the kernel_gpu_integrator_shade_surface kernel. Not sure what’s causing this, but somehow both SVM node offsets (kernel/svm.h, svm_eval_nodes) and object indexes (kernel/integrator/integrator_shade_surface.h) occasionally have illegal values, and there’s no range checking, so any illegal value causes all sorts of troubles.

Managed to put in range checking, but that did not help - I’m getting a corrupted image instead. So, something is fundamentally broken. Checked with the folks doing the HIP port and they don’t recall seeing anything like this (but then, they are working with different hardware and different OS.)
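For readers unfamiliar with the kind of guard mentioned here, a minimal host-side sketch of a range-checked lookup follows. It is not the Cycles kernel code: `fetch_checked` and its parameters are hypothetical, and a `std::vector` stands in for the node or object tables.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns table[index] when the index is in range.
// Otherwise it sets `ok` to false and returns `fallback`.
int fetch_checked(const std::vector<int>& table, long index, int fallback,
                  bool& ok) {
  ok = index >= 0 && static_cast<std::size_t>(index) < table.size();
  return ok ? table[static_cast<std::size_t>(index)] : fallback;
}
```

A guard like this prevents the out-of-bounds access but substitutes a placeholder value, so the underlying corruption still shows up in the result, which is consistent with getting a corrupted image instead of a crash.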

Bummer.

sergey · October 29, 2021, 12:57pm

Meanwhile I’ve created D13032, “Localize image mutex lock into runtime field of Image datablock”, which moves away from the global lock. Give it a whirl!
