2.92, Optix CPU + GPU is slower than GPU only

Hi,
A couple of months ago, when I first tried rendering with Optix, I was blown away by the difference in speed: on my current project scene I got a 4x speed-up in render times compared to CUDA.
I use an i9 and a 2080 Ti at work.
Recently I heard that 2.92 allows using both the CPU and the GPU with Optix, so I gave it a try, thinking my i9 was going to be of some use.
However, I noticed a slowdown when checking CPU as well as GPU in the Optix panel. Not by much, but the few tests I did were consistent.
I set up a quick test scene: a Nishita sky for the environment and a plane emitting 2,500 icospheres as hair, with 100 interpolated children and a standard Principled shader. So only geometry, no animation.
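For reference, here is roughly how such a scene can be recreated from a script. This is only a sketch: the counts (2,500 hairs, 100 children) come from my description above, while the plane size, sphere radius and the default Principled material are arbitrary choices.

```python
# Rough approximation of the test scene (Blender 2.92 Python API), not the exact file.
import bpy

# Nishita sky for the environment
world = bpy.context.scene.world
world.use_nodes = True
nodes, links = world.node_tree.nodes, world.node_tree.links
sky = nodes.new('ShaderNodeTexSky')
sky.sky_type = 'NISHITA'
links.new(sky.outputs['Color'], nodes['Background'].inputs['Color'])

# Icosphere that the hair system will instance (new material defaults to a Principled BSDF)
bpy.ops.mesh.primitive_ico_sphere_add(radius=0.05)
ico = bpy.context.object
mat = bpy.data.materials.new("Principled")
mat.use_nodes = True
ico.data.materials.append(mat)

# Emitter plane with 2,500 hair particles and 100 interpolated children per parent
bpy.ops.mesh.primitive_plane_add(size=10)
plane = bpy.context.object
plane.modifiers.new("Hair", type='PARTICLE_SYSTEM')
ps = plane.particle_systems[0].settings
ps.type = 'HAIR'
ps.count = 2500
ps.render_type = 'OBJECT'
ps.instance_object = ico
ps.child_type = 'INTERPOLATED'
ps.rendered_child_count = 100
```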
Here are my results:
2.91, Optix (GPU), tile size 64: 16.34, 15.48, 15.17
2.92, Optix (GPU+CPU), tile size 64: 18.14, 18.38, 18.18
2.92, Optix (GPU), tile size 64: 15.55, 15.69
2.91, Optix (GPU), tile size 256: 11.62, 11.74
2.92, Optix (GPU+CPU), tile size 256: 13.64, 13.97
2.92, Optix (GPU), tile size 256: 11.92, 11.95

Is there a reason for this? Are there circumstances where adding the CPU is beneficial and others where it is not? I realise my test scene is very limited (no skinning, no motion blur), but it is a fairly good representation of the current project whose render times I am trying to optimize.
Thanks!


Have you tested at 16x16 or 32x32?

16x16, Optix GPU only = 34 s
16x16, Optix GPU + CPU = 32.8 s
(128 samples, 1920x1080)
Do such small tile sizes have any advantage for a 1920x1080 render?

Yes, of course. It is not about the render size but about the device and the number of threads: the GPU alone benefits from bigger tiles, the CPU benefits from small tile sizes, and the same applies to CPU + GPU. However, now that the GPU can steal tiles from the CPU, slightly bigger tiles will be better, so try 32x32 and see what happens 🙂
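If you want to script the comparison instead of clicking through the UI, something along these lines toggles the device combination and tile size. This is a sketch using the 2.92 property names; the exact device list depends on your machine.

```python
import bpy

# Select OptiX and choose which devices participate (Blender 2.92).
prefs = bpy.context.preferences.addons['cycles'].preferences
prefs.compute_device_type = 'OPTIX'
prefs.get_devices()                      # refresh the device list
for dev in prefs.devices:
    # enable the RTX card; flip the CPU on/off for the hybrid vs GPU-only runs
    dev.use = dev.type in {'OPTIX', 'CPU'}

scene = bpy.context.scene
scene.cycles.device = 'GPU'              # "GPU Compute" = use the devices enabled above
scene.render.tile_x = 32                 # tile size being compared
scene.render.tile_y = 32
```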

However that was not the same scene as before, right?

32x32, Optix GPU only = 19 s
32x32, Optix GPU + CPU = 22 s

So only at 16x16 is the CPU+GPU combo faster than GPU alone (but it’s twice as slow as 256x256 GPU only…)

Yes, I am on the same computer with the exact same scene every time. For every configuration I launch the render a few times to make sure the averages are consistent, and the render times within a single configuration only differ by a few milliseconds.

What about other scenes?
BMW Demo:
https://download.blender.org/demo/test/BMW27_2.blend.zip

Classroom:
https://download.blender.org/demo/test/classroom.zip

BMW:
Optix GPU only, 1920x1080, 256 tile size = 1:27
Optix GPU + CPU = 1:26
Optix GPU only, 32 tile size = 2:08
Optix GPU + CPU = 2:02
Classroom:
Optix GPU only, 1920x1080, 256 tile size = 1:25
Optix GPU + CPU = 1:24
Optix GPU only, 32 tile size = 1:55
Optix GPU + CPU = 1:50
What’s interesting, though: during all these renders my GPU is reported as only 8% busy (both GPU only and the combo), while the CPU’s 16 threads are all at 100% with CPU + GPU, and at normal usage with GPU only.

Perhaps, with a configuration like mine where the GPU is much faster and better optimized than the CPU, the time spent building two BVHs (one per device) is pretty much the same as the time the GPU would otherwise spend rendering the CPU’s tiles?

If you are looking at Windows Task Manager, keep in mind that its GPU graphs show the 3D engine by default, not compute work; switch one of the graphs to CUDA/Compute to see the real load while rendering.

What is the exact model of the CPU?

Oh, that is nice to know, thanks!

My CPU is an Intel Core i9-9900K at 3.60 GHz.

What is the CPU only render time with BMW scene?

Sorry, I don’t have access to that specific computer over the weekend, I’ll try tomorrow.
However, I ran those tests on the BMW scene on my laptop (i7-10875H + RTX 2070 Max-Q).
At tile size 256 I get:
Optix GPU alone: 2:26
GPU + CPU: 2:16
And at tile size 64:
GPU only: 2:40
GPU + CPU: 2:25
On my laptop the gap between CPU and GPU power is probably smaller. I also noticed that at tile size 256 the GPU is done rendering all its tiles before the CPU has finished a single one, so maybe the “stealing process” loses time?

Hi.
It would be nice to know the CPU-only render time for the BMW scene; I think you have not provided that yet. Use a 64x64 tile size in all the tests. The larger the tiles, the less chance the CPU has to finish one before the GPU steals the job; also, small tile sizes are optimal for the CPU.

So, I rendered the BMW scene with a 64x64 tile size, Optix with CPU only, and I get 15:23 minutes.

So, from all this, here are the conclusions I can draw:

  • 256 px is optimal for my GPU.
  • 256 px is not optimal for the CPU.
  • 64 px is better for the CPU, and at that size the CPU + GPU combo gives better results than GPU only; however, the overall render time is still higher than with 256 px GPU only.

Perhaps an idea would be to split the user-defined tile size to benefit the CPU (see the sketch below). Let’s say I ask for a 256 px tile size because that is what works best for my GPU: could the CPU split those tiles by the number of available threads and render the pieces? In my case that would be 256/16 = 16 px. So while the GPU renders its big 256 px chunks, the CPU would work on the same chunk size, but use all its cores on one chunk instead of assigning big chunks to single cores. Worst case: the CPU is not done rendering its own 256 px chunk when the GPU has finished the rest, and I guess tile stealing would not be possible because the CPU’s tile has been split in 16; then either wait for the CPU to finish, or let the GPU start over on that particular tile if that is faster than waiting.
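Something like this hypothetical splitting step is what I have in mind. It is purely an illustration of the idea, not how Cycles actually schedules tiles; the 256 and 16 are the numbers from my example above.

```python
# Hypothetical helper, only to illustrate the idea above -- not Cycles code.
def split_tile_for_cpu(tile, threads):
    """Split one GPU-sized tile (x, y, w, h) into `threads` horizontal strips."""
    x, y, w, h = tile
    strip_h = max(1, h // threads)
    return [(x, y + i * strip_h, w, strip_h) for i in range(threads)]

# A 256x256 px tile on a 16-thread CPU -> 16 strips of 256x16 px, one per thread,
# while the GPU keeps taking whole 256 px tiles.
print(split_tile_for_cpu((0, 0, 256, 256), 16))
```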

If you choose CPU from the Device item in the Properties panel, it is always CPU only, no matter which devices you have selected in Preferences > System.

But are you sure those are the CPU models you actually have?

According to Blender Benchmark results, any of the CPUs you mentioned renders the default BMW scene (CPU, 35 square samples, 32x32 tile size) in an average time of around 4 minutes.

EDIT:
Just to make sure we are measuring the same thing: for the CPU test, simply open “bmw27_cpu.blend” and render the image without touching any settings.

It is the same CPU, but I had changed the resolution to 100% instead of 50% to be consistent with my earlier tests.
If I change it back to 50% I do get around 4 minutes at 64 px.

To sum up, with the default 50% resolution:
CPU, 64 px: 4:00
Optix GPU + CPU, 256 px: 0:23 (tiles too big for the CPU to contribute)
Optix GPU only, 256 px: 0:23
Optix GPU + CPU, 128 px: 0:22
Optix GPU only, 128 px: 0:23
Optix GPU + CPU, 64 px: 0:24
Optix GPU only, 64 px: 0:25
Optix GPU + CPU, 32 px: 0:28
Optix GPU only, 32 px: 0:31
Optix GPU + CPU, 16 px: 0:45
Optix GPU only, 16 px: 0:55


Interesting, thanks for sharing your observations.

I’m using progressive rendering, but I believe under the hood Cycles is still dividing the rendering into tiles, or is the screen turned into one big tile when using progressive rendering?

I don’t know how they split the processing tasks, but if multi-processor tasks are not well optimized, the overhead of the communication needed to coordinate which device does what can make the task take longer.

It works best if you can dedicate one core of a multi-core system to be the boss, delegating chores to the remaining cores. And, of course, the software should be written to take advantage of multiple cores and to optimize the workflow.
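Roughly, the boss/worker pattern I am describing looks like this in Python. It is a generic sketch, not how Cycles actually schedules its work: the main thread acts as the boss handing out "tiles", and each worker pulls the next one as soon as it is free.

```python
# Generic boss/worker sketch, not Cycles internals.
from concurrent.futures import ThreadPoolExecutor
import time

def render_tile(tile):
    time.sleep(0.01)          # stand-in for the actual per-tile work
    return tile

tiles = [(x, y) for x in range(8) for y in range(8)]
with ThreadPoolExecutor(max_workers=16) as pool:   # 16 "worker cores"
    results = list(pool.map(render_tile, tiles))   # boss distributes, workers render
print(len(results), "tiles done")
```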

OK, thanks.
From your tests it could be said that if the GPU alone is about 10 times faster than the CPU, then CPU+GPU rendering is pointless. It remains to be seen in practice how many times faster the GPU can be before CPU+GPU rendering stops making sense.
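As a rough way to put a number on that threshold (back-of-envelope arithmetic only, ignoring BVH builds, scheduling overhead and tile-stealing losses): if the GPU alone takes t_gpu and the CPU alone takes t_cpu, the best possible hybrid time comes from adding the two throughputs.

```python
# Ideal hybrid render time, assuming the work splits perfectly with zero overhead.
def ideal_hybrid_time(t_gpu, t_cpu):
    return 1.0 / (1.0 / t_gpu + 1.0 / t_cpu)

# With the 50% resolution BMW numbers above (GPU ~23 s, CPU ~240 s),
# the theoretical ceiling is only about 9% faster than GPU alone:
print(ideal_hybrid_time(23, 240))   # ~21 s, vs 23 s for GPU only
```

So with a roughly 10:1 speed ratio there is less than 10% of theoretical headroom before any overhead is paid, which matches the measurements above.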


I still can’t get tile stealing to work; that would be a start towards increasing the combined performance. The next step would be different tile sizes for CPU and GPU rendering, probably with the CPU buckets being some fraction of the GPU buckets to keep it simple.


Did some tests too.

Ryzen 1700 (16 threads), RTX 2060 Super, 24 GB RAM, Windows 10, Blender 2.92.0, 128 samples.