Multi Instance Multi GPU rendering is slower than expected

Hello,

I am running thousands of renders for synthetic data generation on 4 GPUs. Each image takes about 1.2 s to render if I use all 4 GPUs, and only about 1 s if I use a single GPU (due to read/write overhead across 4 GPUs, I guess). To speed things up, I had the idea of opening 4 Blender processes, assigning each one a separate GPU, and dividing the workload by 4 so that each GPU gets its own unique list of images to render.

To do this I spawn 4 Blender processes using subprocess.Popen, then call communicate() on each Popen object to wait for that process to finish rendering. Each Blender process gets a separate GPU to work with. I do that by setting the use flag of every device to 0, and then setting it to 1 for the GPU I want used.
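A minimal sketch of that setup, not the poster's actual script. The file names `scene.blend` and `worker.py`, the `--gpu` and `--jobs` arguments, and all helper names are placeholders, and it assumes `blender` is on the PATH and that the GPUs are listed as OptiX devices. `select_single_gpu` works on any objects with `type` and `use` attributes; inside Blender those would be the entries of `bpy.context.preferences.addons["cycles"].preferences.devices`.

```python
import subprocess


def select_single_gpu(devices, gpu_index, gpu_type="OPTIX"):
    """Disable every device, then enable only the gpu_index-th device of gpu_type."""
    gpus = [d for d in devices if d.type == gpu_type]
    if not 0 <= gpu_index < len(gpus):
        raise IndexError(f"no {gpu_type} device with index {gpu_index}")
    for d in devices:
        d.use = False
    gpus[gpu_index].use = True
    return gpus[gpu_index]


def build_command(gpu_index, job_file, blend_file="scene.blend", script="worker.py"):
    """Command line for one background Blender worker (placeholder file names)."""
    # -b runs without the UI, -P runs a Python script; arguments after "--"
    # are ignored by Blender and left for the script itself to parse.
    return ["blender", "-b", blend_file, "-P", script, "--",
            "--gpu", str(gpu_index), "--jobs", job_file]


def run_workers(job_files):
    """Start one Blender process per job file, then wait for all of them."""
    procs = [subprocess.Popen(build_command(i, jf)) for i, jf in enumerate(job_files)]
    return [p.wait() for p in procs]
```

`wait()` is used here instead of `communicate()` because the sketch does not capture the workers' output; if stdout or stderr were piped, `communicate()` would be the safer choice.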

Theoretically this should result in a 4x speedup since it’s running on separate gpus altogether. But in practice, I’m getting something like 2.7s to render each image. Overall it’s still faster to render with, but the speedup is only about 70% rather than 4x. Any ideas on how to get the full 4x increase?

Some of these numbers must be wrong. If it takes 1 second to render on 1 GPU, and 1.2 or 2.7 seconds to render on 4 GPUs, then you’re not seeing a performance uplift. Instead you’re seeing a performance decrease in every case.

Either way. Since you’re rendering on 4 GPUs, I have a suspicion that you’re using Cycles as your rendering engine. If that is the case, then you need to take into consideration this:
When using Cycles, the rendering of each frame in an animation consists of three steps:

  1. Scene initialization (converting the Blender scene into a Cycles scene) - This is done on the CPU.
  2. Rendering - This is done on the GPU.
  3. Saving of the image to disk - This is done on the CPU.

After those tasks are done, Blender/Cycles moves on to the next frame. Since two of those tasks are done on the CPU, they cannot be sped up by distributing work across multiple GPUs. And since your render times are so low, I suspect the fact that those two tasks run on the CPU is why you’re not seeing the performance uplift you want.
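To make that reasoning concrete, here is a toy model of the per-frame time when only the GPU step scales with the number of GPUs. The timings passed in at the bottom are made-up illustrative values, not measurements from this thread, and the model optimistically assumes the render step scales perfectly.

```python
def frame_time(init_s, render_s, save_s, num_gpus):
    """Per-frame wall time when only the render step is split across GPUs."""
    return init_s + render_s / num_gpus + save_s


def speedup(init_s, render_s, save_s, num_gpus):
    """Speedup over a single GPU under the same model."""
    one = frame_time(init_s, render_s, save_s, 1)
    many = frame_time(init_s, render_s, save_s, num_gpus)
    return one / many


# Hypothetical split of a 1 s frame: 0.5 s init, 0.4 s render, 0.1 s save.
print(f"{speedup(0.5, 0.4, 0.1, 4):.2f}x with 4 GPUs")
```

The larger the CPU-bound share of a frame, the further the result falls below 4x, however many GPUs are added.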

Note: Lukas Stockner was working on making that last part (saving the image) asynchronous, which would result in faster overall animation renders when the output frames are large and the render times are short. You can find the patch here (D7952 Render: Add option to write frames asynchronously when rendering animations). If you want to use it, you can apply it to the Blender source code and build Blender yourself. But I suspect it may require a bit of manual work to get working.


The performance uplift comes from the fact that it’s 4 frames being done every 2.7 seconds, because 4 gpus, instead of 1 frame, which turns into an average of 1 frame every 0.67s. But yeah, I’m technically seeing a downlift in performance.
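The arithmetic behind those figures, using only the timings quoted in this thread (1 s per frame on one GPU, 2.7 s per frame in each of 4 parallel processes):

```python
def seconds_per_frame(wall_seconds, frames_in_parallel):
    """Average time per finished frame when several frames render at once."""
    return wall_seconds / frames_in_parallel


single_gpu = seconds_per_frame(1.0, 1)       # one process, one GPU
four_procs = seconds_per_frame(2.7, 4)       # four processes, one GPU each
throughput_gain = single_gpu / four_procs

print(f"{four_procs:.3f} s/frame, {throughput_gain:.2f}x vs one GPU")
# prints "0.675 s/frame, 1.48x vs one GPU"
```

So the four-process setup delivers roughly 1.5x the throughput of a single GPU, against an ideal of 4x (0.25 s per frame).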

I’m using cycles with about 30 samples on a 1280x1024 image with optix denoising turned on. It’s running on a machine with 4 RTX A6000s and a threadripper 3960x. The output is a png and I believe there’s no other processing on the output which could increase amount of time it takes to write to disk. The disk itself is an nvme ssd with pretty high write speeds, so it shouldn’t be too big of a bottleneck either. When I run system monitor while rendering on 4 processes, I see that cpu usage is around 100% on every single thread. The scenes themselves are fairly light and only use about 2GB of vram on each gpu (out of 48 gb).

I didn’t know that scene initialization took a lot of CPU. Currently, the script I’m using creates and destroys a new blender scene for every frame. That’s done by parsing a json config file and using blender api to create the specified scene with the specified obj models. The models are linked instead of reloaded for every frame, which helps with the read overhead. I can try out the asynchronous write I guess, any other avenues of speedup that you can think of?

Would making a process for each frame until vram limit is hit help, or some other way to max out the available vram and gpu usage? What about ways to speed up scene initialization on Cycles?

Based on the fact your script is creating and destroying instances of Blender for each frame, you will probably see no improvement with the asynchronous write patch. The asynchronous write patch will only help if you create a Blender instance that then renders multiple frames of an animation one after another before being destroyed.

As for other avenues you could pursue to speed up rendering, I’m not exactly sure. The CPU is probably your limiting factor in this case, so upgrading your CPU, or moving the GPUs to different computers with their own CPUs, might help.
As for ways to speed up scene initialization, maybe you could create a script/program that asynchronously creates the next Cycles scene while the GPUs are rendering the current one. But that might lead to a performance decrease due to various factors?
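The overlap idea can be sketched generically as a two-stage pipeline. This is only an illustration of the pattern: `prepare` and `render` are hypothetical callables supplied by the caller, and nothing here talks to Blender. Blender's API documentation warns against driving `bpy` from Python threads, so in a real setup the preparation stage would more likely have to live in a separate process; and with the CPU already at 100%, the overlap may not gain anything.

```python
from concurrent.futures import ThreadPoolExecutor


def render_pipelined(jobs, prepare, render):
    """Prepare job i+1 in a background thread while job i is rendering."""
    results = []
    if not jobs:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(prepare, jobs[0])
        for next_job in list(jobs[1:]) + [None]:
            scene = pending.result()                      # wait for the prepared scene
            if next_job is not None:
                pending = pool.submit(prepare, next_job)  # start preparing the next one
            results.append(render(scene))                 # render the current one meanwhile
    return results
```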

Maybe other people will have better ideas on what to do?

My suggestion is as follows.

Load all models (or a subset) into a Blender scene.
Create a Python script to:

  • ensure only one unique GPU is used (and not one that’s already in use; you may need to pipe extra commands through for this)
  • enable / disable render visibility on different objects based on the JSON file
  • change the render output directory
  • render
  • loop back to the first part
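The loop in that list could look something like the sketch below. The job format (a list of dicts with a `visible` list of object names and an `output` path) is an assumption, since the actual JSON config isn't shown in this thread. The function takes the objects, scene and render call as arguments so it can be tried outside Blender; only mesh objects are toggled so that lights stay enabled.

```python
def render_jobs(jobs, objects, scene, render):
    """For each job: show only the listed meshes, set the output path, render."""
    for job in jobs:
        visible = set(job["visible"])
        for obj in objects.values():
            if obj.type == "MESH":
                obj.hide_render = obj.name not in visible
        scene.render.filepath = job["output"]
        render()


# Inside Blender this would be called roughly as (untested here):
#   import bpy, json
#   with open("jobs_0.json") as f:      # placeholder file name
#       jobs = json.load(f)
#   render_jobs(jobs, bpy.data.objects, bpy.context.scene,
#               lambda: bpy.ops.render.render(write_still=True))
```

Because the scene is loaded once and only visibility flags change between frames, this avoids rebuilding the Blender scene for every image.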

Then, set up 4 different instances, each rendering on ONE GPU and not the cpu. I would also split the json file up into 4 different files, to ensure there is no overlap.
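Splitting the job list could be done along these lines. This assumes the config file is a JSON list at the top level with one entry per image, which may not match the real file; the `jobs_{i}.json` naming is also just a placeholder.

```python
import json
from pathlib import Path


def split_jobs(jobs, num_parts):
    """Deal jobs round-robin into num_parts lists with no overlap."""
    return [jobs[i::num_parts] for i in range(num_parts)]


def write_job_files(config_path, num_parts, out_dir="."):
    """Write one jobs_<i>.json per worker and return the file paths."""
    jobs = json.loads(Path(config_path).read_text())
    paths = []
    for i, part in enumerate(split_jobs(jobs, num_parts)):
        path = Path(out_dir) / f"jobs_{i}.json"
        path.write_text(json.dumps(part))
        paths.append(str(path))
    return paths
```

Dealing the jobs out round-robin rather than in contiguous blocks keeps the four lists within one job of each other in length, so the workers should finish at roughly the same time if frames cost about the same.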

I would also suggest doing this all via commandline.

If it’s only taking up 2 GB of VRAM, you may be able to increase the number of instances as well, keeping everything utilised as much as possible.