Problem: GPU + CPU slower than GPU Only, Suggestion: All CPU Threads as One Tile

I love Cycles, absolutely love it.

I really love that the Blender developers have, in the latest version of Blender, made it possible to render with both the CPU and GPU at the same time. It has the potential to edge out a bit more speed from a PC and take advantage of a CPU that would otherwise sit idle throughout the whole render process.

When you’re rendering lengthy scenes with Cycles, you’ll take every bit of extra speed you can get!

But so far I haven’t used this feature for anything other than performance tests… because rendering with GPU + CPU is actually slower than rendering with just my GPUs.

Yes really, rendering with both my GPUs + my CPU is slower than just rendering with my GPUs.

For context, I have a pair of GTX 1080 Ti’s, and an AMD Ryzen 7 1800X.

It doesn’t matter what type of scene I’m rendering. I’ve tested this numerous times with different scenes, and every time the result is the same: if I render with only my GPUs, I get a faster render time than if I render with GPU + CPU. Tile size and sample count appear to make little or no difference.

The reason why is pretty obvious when you watch the image rendering.

The GPUs absolutely blaze past the CPU threads in speed when rendering an individual tile. What takes the GPUs perhaps 2 seconds to render usually takes my CPU threads around 30 seconds to finish.

That’s pretty logical: one thread of an 8-core CPU is always going to be much slower than an entire GTX 1080 Ti when applied to a task of the same size.

When the image is almost finished, there are usually several tiles still rendering, all of them CPU tiles. The GPUs can no longer assist with the rendering, because there are no tiles left for them to render. Instead they have to sit idle, waiting for the CPU to finish its tiles. And sometimes many of the CPU’s cores are sitting idle as well, while waiting for 3 or 4 tiles to finish rendering.

It defies logic, but rendering with CPU and GPU combined is slower as a result. What should indisputably, universally, always be faster is actually slower, due to an imbalance in how the workload is distributed.

Suggestions for Fixing This

The problem is an imbalance in how the workload is distributed: a single CPU thread can’t compete with a GPU, so the CPU threads need smaller workloads.

An option to have all the CPU threads work together (acting almost like a GPU) on a single tile would allow the CPU to focus on one tile instead of a dozen, and finish that tile quickly instead of taking a long time to finish many tiles. This way, if the GPU(s) finish their tile(s) first, they spend less time idle waiting for the CPU to finish.

Or, as an alternative solution, an option for variable tile sizes per compute device would help, with the GPUs rendering large tiles, say 128px or 256px, and the CPU threads rendering much smaller tiles, say only 16px/32px.
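
To illustrate (a very rough sketch, not actual Cycles code, with every name and number made up): the frame could be cut into small cells once, and the only per-device difference would be how much of the queue a device grabs at a time. A GPU would take a big batch of cells (effectively a large tile) while a CPU thread takes a single cell, so the CPU is never left holding a large chunk of the image at the end:

```python
# Hypothetical scheduler sketch, not Cycles code: one queue of small 32 px
# cells; a GPU request takes many cells at once (a big effective tile),
# a CPU thread takes just one, so slow CPU workers never hold up the frame.
from collections import deque

def make_cells(width, height, cell=32):
    """Split the frame into small cells of cell x cell pixels."""
    return deque(
        (x, y, min(cell, width - x), min(cell, height - y))
        for y in range(0, height, cell)
        for x in range(0, width, cell)
    )

def grab_work(queue, device_type, gpu_batch=64):
    """GPUs take a batch of cells (64 cells ~ one 256 px tile),
    CPU threads take a single cell."""
    count = gpu_batch if device_type == 'GPU' else 1
    return [queue.popleft() for _ in range(min(count, len(queue)))]

# Example: 1920x1080 frame; each device calls grab_work() until the queue is empty.
cells = make_cells(1920, 1080)
print(len(grab_work(cells, 'GPU')), "cells for a GPU,",
      len(grab_work(cells, 'CPU')), "cell for a CPU thread,",
      len(cells), "cells left in the queue")
```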

I also experienced similar issues where GPU alone was faster than GPU + CPU. As the GPU buckets are usually faster than the CPU ones, it can happen that the last buckets remaining are CPU ones, while the GPU is idle and has to wait for the CPU to catch up. Depending on where these buckets are stuck, this can take longer than the GPU alone.

I always wanted to have the option of different bucket sizes for GPU and CPU. I’m not sure if it’s possible for the GPU to help finish the CPU buckets that are left behind at the end; that would be awesome too. Maybe some “intelligence” could be infused into the way buckets are distributed between GPU and CPU.

I’ve added a thread in Right-Click-Select about it. Maybe other users would want it too.

Okay, hacky idea: AFAIK you can combine two low-sample renders with different seeds into a new image with the noise level of double the sample amount of a single render. Maybe when the GPU is finished with its last tile, the next best CPU tile that is still rendering could save its progress in memory, then the GPU takes over and renders the remaining samples with a different seed, then both results are combined into the finished tile and the GPU jumps to the next still-rendering CPU tile.

It’s been discussed before on this forum. The plan is to let the GPU render many small tiles at once.

Whatever the solution, it’s not a design issue but a matter of a developer finding time to work on it.

Is it technically possible for the GPU to finish a tile that the CPU has already started?

Yes, but it would be messy and I wouldn’t try doing it; there are better solutions.

Have you tried rendering with 16x16 or 32x32 tiles? Cycles is now often faster on the GPU with tiny tiles than with the old huge tiles, and this also pretty much solves the issue of CPUs being stuck on large tiles at the end when you enable CPU + GPU.
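
For reference, setting small tiles from Python looks like this (assuming the Blender 2.8x API; the property paths differ slightly in 2.79 and changed again in later versions):

```python
import bpy

scene = bpy.context.scene
scene.render.tile_x = 32     # small tiles, as suggested above
scene.render.tile_y = 32
scene.cycles.device = 'GPU'  # with the CPU also ticked in the Cycles device
                             # preferences, this renders on GPU + CPU
```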

Sure, there’s more fancy stuff that can be done in the future, but the situation today seems pretty good.

It’s true that the performance of the GPU with smaller tiles got a lot better over time. There’s still a bit of a difference, but not as bad as it has been before. So the improvement might be small, but maybe still worth having that option.

The biggest problem is when the CPU buckets are stuck at the end rendering some very heavy materials (especially refractions); if you have 20 of them doing that, it can take many minutes for them to get the job done, while the faster GPU is already finished and just waiting for them. If the GPU could be used at that stage, it could massively improve the overall render time.

Idea for how it could be solved:
Since merging renders with different sample counts has already been developed, the same approach could potentially be used to merge half-baked tiles, giving a kind of “tile multithreading” feature:

When a device finishes the last available tile, it starts on a tile that another device is already rendering, and once the combined sample count of the two half-done tiles reaches the target, they are merged into one finished tile.

E.g. the last tile is being rendered by the CPU, which has computed 100 samples out of 1000. The GPU has just finished its previous tile and there are no free tiles left, so it starts computing the same tile. A few seconds later the CPU has computed 300 samples and the GPU 700. 300 + 700 = 1000, so the two half-done tiles are merged into one finished tile.

Would that work?
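
In principle it could: as long as the two partial results use different seeds, merging them is just a per-pixel average weighted by sample count. A minimal sketch with NumPy stand-ins for the real tile buffers (ignoring extra render passes, which would need their own handling):

```python
import numpy as np

def merge_partial_tiles(tile_a, samples_a, tile_b, samples_b):
    """Combine two partial renders of the same tile, weighted by how many
    samples each one accumulated (they must use different seeds)."""
    total = samples_a + samples_b
    return (tile_a * samples_a + tile_b * samples_b) / total

# Toy example matching the numbers above: the CPU did 300 of 1000 samples,
# the GPU re-rendered the same tile with a different seed for 700 samples.
cpu_tile = np.random.rand(32, 32, 4)   # stand-in for the CPU's partial result
gpu_tile = np.random.rand(32, 32, 4)   # stand-in for the GPU's partial result
finished = merge_partial_tiles(cpu_tile, 300, gpu_tile, 700)
```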

Let the BOSS decide guys! I trust him !!!
Go Brecht, let us dream!!!
:joy::joy::joy::vulcan_salute:

I discovered this issue too, and ended up turning off the combined CPU + GPU setting once I noticed. It wouldn’t be perfect, but it seems like a very simple algorithm could be applied to the code: once the GPU is left waiting, the CPU tiles abort their jobs entirely and the GPU takes over those tiles from scratch. It would then be incrementally more work to let the GPU take over the CPU tiles while utilizing their existing work, spending only the remaining time to finish what they started.
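
As a sketch of that first step (made-up numbers and names, not Cycles code), the scheduler would only need a simple check when the GPU goes idle: restart an in-flight CPU tile on the GPU only if redoing it from scratch is still faster than letting the CPU finish its remaining samples:

```python
# Rough sketch of the "abort and restart" idea. Per-sample times roughly match
# the 2 s vs 30 s per-tile figures mentioned earlier (1000 samples per tile).

def steal_tile(inflight_cpu_tiles, gpu_time_per_sample, cpu_time_per_sample,
               total_samples):
    """Return the in-flight CPU tile most worth restarting on the idle GPU,
    or None if every CPU tile will finish sooner than a GPU redo would."""
    best, best_saving = None, 0.0
    for tile, samples_done in inflight_cpu_tiles.items():
        cpu_finish = (total_samples - samples_done) * cpu_time_per_sample
        gpu_redo = total_samples * gpu_time_per_sample
        saving = cpu_finish - gpu_redo
        if saving > best_saving:
            best, best_saving = tile, saving
    return best

# Example: tile_42 has barely started on the CPU, tile_43 is almost done.
inflight = {"tile_42": 200, "tile_43": 850}   # samples already done on the CPU
print(steal_tile(inflight, gpu_time_per_sample=0.002,
                 cpu_time_per_sample=0.03, total_samples=1000))  # -> tile_42
```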

I’m not sure how right this is, as I’m just getting into this kind of code, but to me the most elegant way of merging the CPU cores’ work would be to assign them different “times” in the pseudo-random sampling, rather than having the GPU pick up where the CPU leaves off. Basically:

How it is now: CPU core 0 would handle samples 0-127 on tile 1, core 1 would handle samples 0-127 on tile 2, etc.

With the change, core 0 would handle samples 0-31, core 1 samples 32-63, core 2 samples 64-95, and core 3 samples 96-127, all on the same tile, and the partial results could then be slapped together/averaged very quickly by the GPU or the cores before the CPU moves on to the next tile.
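
A sketch of that scheme (NumPy stand-ins for the real kernels; the helper names are made up): each core renders the same tile over a different slice of the sample range, and because every slice holds the same number of samples, the merge is a plain average:

```python
import numpy as np

def sample_ranges(total_samples, num_cores):
    """Split e.g. 128 samples over 4 cores into ranges 0-31, 32-63, 64-95, 96-127."""
    per_core = total_samples // num_cores
    return [(c * per_core, (c + 1) * per_core) for c in range(num_cores)]

def render_partial(tile_shape, start, end, seed_base=0):
    """Stand-in for the real kernel: render only samples [start, end) of one tile."""
    rng = np.random.default_rng(seed_base + start)
    return rng.random(tile_shape)           # fake per-pixel result for this slice

ranges = sample_ranges(128, 4)              # [(0, 32), (32, 64), (64, 96), (96, 128)]
partials = [render_partial((32, 32, 4), s, e) for s, e in ranges]
finished_tile = np.mean(partials, axis=0)   # equal sample counts, so a plain average
```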

Notice, though, that with many cores the time savings are reduced, as more partial tiles need to be averaged to make a single one. The only cheap way I could think of doing that, for example with 8 cores, is to use 4 cores to combine 2 partial tiles each, then 2 cores to combine those partial tiles. Even cascading like this, you end up with a best-case scenario of exponentially fewer cores being used as the partial tiles are averaged.

They could also simply be cached somewhere, then brought back and combined by the GPU when it’s done with its own work, but that reintroduces some of the same problems that using GPU and CPU together already causes.

I have no clue how GPU architecture or segmenting/partitioning/reserving cores works, or if this is even feasible, but it could potentially eliminate the bottleneck. I’ve never gone into GPU stuff before, though, so I could be really wrong here.