CPU/CUDA split

Is anybody really using this to debug? To me it seems the CPU split kernel was intended to help devs see whether bugs were due to the driver, and the CUDA split kernel was an attempt to make it faster, but in the end, no devs I know use the CPU split kernel and no user uses the CUDA split kernel because it’s so slow. So wouldn’t it make sense to remove them (or at least add a compile flag to easily disable them)? If they haven’t been used in two years, I don’t see that changing any time soon, so I think it’s safe?

I am against removing them.

The CPU split kernel in its current state is already useful for debugging. When you encounter a problem in the OpenCL split kernel, you can easily try it out on the CPU to see if it’s something that’s broken in the split path, and if it is, step through it in a debugger.

The CUDA split kernel is IMHO still not finished - the shader sorting needs to be fixed, and that alone should give it quite a speed boost. Future support for hardware-accelerated ray tracing will most likely require a split kernel too.

In the long run, I’d even say that the split kernel should be the main kernel for both the CPU and GPU paths. Supporting out-of-core geometry and textures is most efficient with coherent access, and that is best done via batching and sorting in a split kernel (see Disney’s Hyperion renderer or Redshift).
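
As a rough illustration of that batching-and-sorting idea (nothing here is taken from the Cycles source; the types, stage functions, and batch size are made up for the sketch), a split/wavefront loop is essentially:

```
#include <algorithm>
#include <cstdint>
#include <vector>

// All types and stages here are hypothetical placeholders, not Cycles code.
struct Ray { float origin[3], dir[3]; uint32_t pixel; };
struct Hit { uint32_t shader_id; float t; uint32_t pixel; };

static std::vector<Ray> generate_camera_rays(int batch_size)
{
  // Stub: hand out one batch of placeholder rays, then stop.
  static bool done = false;
  if (done) return {};
  done = true;
  return std::vector<Ray>(batch_size);
}

static Hit intersect(const Ray &ray) { return Hit{ray.pixel % 4u, 1.0f, ray.pixel}; }
static void shade(const Hit &) { /* run the hit's shader */ }

static void render_wavefront(int batch_size)
{
  for (;;) {
    // Stage 1: generate a whole batch of rays instead of one ray per pixel.
    std::vector<Ray> rays = generate_camera_rays(batch_size);
    if (rays.empty())
      break;

    // Stage 2: intersect the batch; BVH and texture access stays more
    // coherent because every work item runs the same stage at the same time.
    std::vector<Hit> hits;
    hits.reserve(rays.size());
    for (const Ray &r : rays)
      hits.push_back(intersect(r));

    // Stage 3: sort hits by shader so shading runs the same shader code on
    // contiguous work items -- the "shader sorting" step.
    std::sort(hits.begin(), hits.end(),
              [](const Hit &a, const Hit &b) { return a.shader_id < b.shader_id; });

    // Stage 4: shade in shader order; bounce rays would be pushed into the
    // next batch instead of being traced recursively per pixel.
    for (const Hit &h : hits)
      shade(h);
  }
}

int main() { render_wavefront(256); }
```

The point is that each stage runs over a whole batch, so the sort in stage 3 can put hits that use the same shader next to each other before any shading happens.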

Ok, but then it would make sense to make the move once and for all. I’m also happy if everything uses the same code base. My main problem was code duplication and the fact that it just makes the code harder to read, longer to compile, etc. I don’t know how far Mai went with this unification. Is it like 90% done, 50%, 20%? As far as I heard, despite many trials, the CPU was like 10% slower and CUDA like 10x slower?

CUDA is not 10x slower, more like 2x at most. Not sure about CPU.

Getting the performance on par will likely take significant work. Not a lot of time has been invested in optimizations for the CPU and CUDA split kernels as far as I know. But we can’t move before someone has time to do it.

I’ve played a bit with the CPU part when testing my texture cache. Currently the CPU split kernel traces packets of a single ray (1x1), which prevents it from benefitting from coherent memory access. Pixar reports getting good results from tracing hundreds of rays at a time:

Our trace and shade groups typically consist of up to a few hundred rays and ray hit points, respectively. We do not collect entire wavefronts of rays and ray hits (millions of rays and ray hits) since that would use huge amounts of memory or require streaming the data to disk. Instead, we have found that keeping relatively small groups often give the benefits of coherent tracing and shading without a large memory overhead.

Changing the CPU kernel to an appropriate packet size and implementing shader sorting (if I’m not mistaken, only the OpenCL path currently performs a sort) might bring the split kernel to performance parity.
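
To make that concrete, here is a minimal sketch of the kind of change described: collect a packet of a few hundred hit points and bucket them by shader before shading. The HitPoint/shade_hit/shade_packet_sorted names and the packet size are hypothetical, not from the Cycles source.

```
#include <cstdint>
#include <vector>

// Hypothetical hit record; only the shader index matters for this sketch.
struct HitPoint {
  uint32_t shader_id;
  uint32_t pixel;
};

static void shade_hit(const HitPoint &) { /* evaluate the shader for this hit */ }

// Shade one packet (e.g. a few hundred hits, as in the quote above) grouped
// by shader, so each shader's code and textures are touched once per packet
// instead of once per individual ray.
static void shade_packet_sorted(const std::vector<HitPoint> &packet,
                                uint32_t num_shaders)
{
  std::vector<std::vector<HitPoint>> buckets(num_shaders);
  for (const HitPoint &h : packet)
    buckets[h.shader_id].push_back(h);

  for (const std::vector<HitPoint> &bucket : buckets)
    for (const HitPoint &h : bucket)
      shade_hit(h);  // consecutive hits now run the same shader
}

int main()
{
  // Example: a packet of 256 hit points spread over 8 hypothetical shaders.
  std::vector<HitPoint> packet(256);
  for (uint32_t i = 0; i < packet.size(); ++i)
    packet[i] = HitPoint{i % 8u, i};
  shade_packet_sorted(packet, 8);
}
```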

Is it fair to sum up this thread so far as:

  • Split kernel isn’t great on CPU/CUDA right now.
  • No sane user would use it.
  • We have a couple of ideas on how to fix this.
  • But we really lack the developers and/or the time to do this.

The Cycles kernels take up a significant chunk of the total build time of Blender (which really adds up when you use Clang on Windows, which takes 2x+ the time MSVC takes), and the split kernel doubles this.

If nobody is using it currently, I wouldn’t mind if we hid the split kernels behind a CMake flag, so all other devs don’t have to pay the price every time they build just because someone may someday work on it.
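
For illustration, this is roughly what hiding the split path behind an opt-in build flag would look like from the C++ side; WITH_CPU_SPLIT_KERNEL and the register_* functions are placeholder names, not the actual Cycles or CMake symbols.

```
// Hypothetical sketch: a CMake option() would define WITH_CPU_SPLIT_KERNEL
// only when the developer explicitly enables it, e.g.
//   cmake -DWITH_CPU_SPLIT_KERNEL=ON ..
// so the split kernel code is neither compiled nor registered by default.

static void register_megakernel_device() { /* always compiled */ }

#ifdef WITH_CPU_SPLIT_KERNEL
static void register_split_kernel_device() { /* only compiled when opted in */ }
#endif

void register_cpu_devices()
{
  register_megakernel_device();

#ifdef WITH_CPU_SPLIT_KERNEL
  register_split_kernel_device();
#endif
}
```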

I think disabling it by default would make sense - right now they’re only useful for debugging and eventually maybe for a developer looking into speeding them up, but in both cases developers can just enable them in their local builds.

Thanks for the link. I saw the mega kernel for OpenCL was removed, that’s great. I hope someone will have time to see if the split kernel is doable for all devices.
In the meantime, I still think it’s a good idea to hide those behind CMake flags, disabled by default. I already have enough features started to work on for a year, so if I take a look at it at some point, it will be 2020+.

I’m fine with adding an option for it, disabled by default; anyone feel free to commit that or contribute a patch.

My only concern is that if nobody compiles with it anymore, future patches may become incompatible with it.

Most of the split code is still getting daily exercise from the OpenCL code path, so I don’t think it’ll rot that quickly. Ideally we’d have a nightly run of the unit tests, which could build with it on, but I suppose that’s a subject for a different thread.

Slightly off topic, but I wasn’t aware there were unit tests for Blender. Where can I find out more about that?

https://wiki.blender.org/wiki/Tools/Tests/Setup

Allow me to revive this conversation a bit.

I’ve been trying a few things, and for the time being I got these results:

With the CPU, the speed improvement with the split kernel is around 2%. That 2% may be a lot if we are talking about animation, and it’s very much needed: when you work with a render farm, every second means less money spent on the farm.

With OptiX, the speed improvement with the split kernel is way bigger, around 10-11%, but I have to use 2 CUDA streams to fully leverage it, and a big bucket size (but that may be because of a simple modification I made).

So I don’t see a reason not to use the split kernel, unless some of you say that, as it is right now, it’s missing something important or something doesn’t work with it.

We need (and by “we” I mean ALL the users :slight_smile: ) all the speed we can get from Cycles, all of it, and if the split kernel gives us a 2% speed increase it is very, very welcome.

EDIT: BTW, if the scene is super simple, the split kernel on the CPU is slower; I see a benefit in medium-complexity scenes.

OK, and now I have to correct myself, because I’ve been doing something dumb.

With OptiX, what I was seeing was not the split kernel effect; it was just the effect of having more CUDA streams.

On the CPU, my results could be considered within the margin of error. I’ll do more tests in the near future, but I think I was wrong.

The big speed improvement I saw with OptiX is probably related to using 2 CUDA streams instead of 1, and I still have to confirm this with more scenes and more situations.
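
For readers unfamiliar with streams: a CUDA stream is an independent queue of GPU work, and commands enqueued on different streams may overlap on the device. A minimal host-side sketch of a two-stream setup (generic buffers only, not Cycles code; the render kernel launch is indicated only in a comment) looks like this:

```
#include <cuda_runtime.h>

int main()
{
  const size_t buffer_size = 1 << 20;

  // Pinned host buffers so the async copies can actually overlap.
  unsigned char *host_a = nullptr, *host_b = nullptr;
  cudaMallocHost((void **)&host_a, buffer_size);
  cudaMallocHost((void **)&host_b, buffer_size);

  unsigned char *dev_a = nullptr, *dev_b = nullptr;
  cudaMalloc((void **)&dev_a, buffer_size);
  cudaMalloc((void **)&dev_b, buffer_size);

  // Two independent streams: work enqueued on different streams may run
  // concurrently on the GPU, which is where the extra throughput comes from.
  cudaStream_t stream_a, stream_b;
  cudaStreamCreate(&stream_a);
  cudaStreamCreate(&stream_b);

  // Enqueue one tile/bucket per stream: upload, then (in a real renderer)
  // the render kernel launch, e.g. kernel<<<grid, block, 0, stream_a>>>(...).
  cudaMemcpyAsync(dev_a, host_a, buffer_size, cudaMemcpyHostToDevice, stream_a);
  cudaMemcpyAsync(dev_b, host_b, buffer_size, cudaMemcpyHostToDevice, stream_b);

  // Until these calls, the two streams proceed independently of each other.
  cudaStreamSynchronize(stream_a);
  cudaStreamSynchronize(stream_b);

  cudaStreamDestroy(stream_a);
  cudaStreamDestroy(stream_b);
  cudaFree(dev_a);
  cudaFree(dev_b);
  cudaFreeHost(host_a);
  cudaFreeHost(host_b);
  return 0;
}
```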