Long period spent in system calls when rendering on a multi-GPU NUMA system

I’m trying to do background Cycles rendering on my university’s DGX system, which is a NUMA machine. While everything works, there is a huge delay before rendering information is displayed. An example of console output can be found here on a Stack Exchange post I made. Each additional GPU added to the render increases the delay by ~300 ms. At 8 GPUs this takes up over 50% of rendering time.

I’m using BlenderProc, which is used to generate synthetic machine learning datasets, and need to render images pretty fast (< 5ms). Is this performance issue inherent, in that renders are expected to take much longer than this, or is this a bug or configuration issue on my end? From strace, the delay seems to be occurring in some identical ioctl calls; the same set of calls occurs 32 times per GPU added.

     0.000024 close(37)                 = 0
     0.000031 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2b, 0x20), 0x7ffc04307920) = 0
     0.000500 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc043078d0) = 0
     0.000130 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2b, 0x20), 0x7ffc04307890) = 0
     0.000138 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc04307840) = 0
     0.000123 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2b, 0x20), 0x7ffc04307890) = 0
     0.000131 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc04307840) = 0
     0.000122 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc043078d0) = 0
     0.000124 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc043078d0) = 0
     0.000210 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4a, 0xb0), 0x7ffc04307080) = 0
     0.007327 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc04306f50) = 0
     0.000161 openat(AT_FDCWD, "/proc/driver/nvidia/params", O_RDONLY) = 37
     0.000066 fstat(37, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
     0.000023 read(37, "Mobile: 4294967295\nResmanDebugLe"..., 1024) = 649
     0.000036 close(37)                 = 0
     0.000024 stat("/dev/nvidiactl", {st_mode=S_IFCHR|0666, st_rdev=makedev(195, 255), ...}) = 0
     0.000025 openat(AT_FDCWD, "/dev/nvidiactl", O_RDWR) = 37
     0.000028 fcntl(37, F_SETFD, FD_CLOEXEC) = 0
     0.000018 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4e, 0x38), 0x7ffc043070a0) = 0
     0.000213 mmap(0x202c00000, 4194304, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 37, 0) = 0x202c00000
     0.000130 close(37)                 = 0

The above block shows one set of the ioctl calls I’m referring to (i.e. that block appears 32 times per GPU added) with wall time (this used -r, but -T gave similar results). As you can see, there is a pretty large delay of about 7 ms.
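For anyone wanting to total these delays per call type, here is a minimal Python sketch that parses `strace -r` output. It relies on one assumption taken from the strace man page: with -r, the timestamp on a line is the time since the previous system call began, so the 0.007327 above is charged to the preceding ioctl (0x46, 0x4a, 0xb0) and also includes any user-space time between the two calls. The helper names (`call_key`, `charge_deltas`) are made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as "     0.000210 ioctl(7, _IOC(...), 0x7ffc...) = 0"
LINE_RE = re.compile(r"^\s*(\d+\.\d+)\s+(\w+)\((.*)$")
# Pulls the type/number/size fields out of a decoded _IOC(...) request
IOC_RE = re.compile(
    r"_IOC\([^,]*,\s*(0x[0-9a-f]+),\s*(0x[0-9a-f]+),\s*(0x[0-9a-f]+)\)"
)


def call_key(name, rest):
    """Label a call; ioctls are split by their request fields."""
    if name == "ioctl":
        m = IOC_RE.search(rest)
        if m:
            return "ioctl(%s, %s, %s)" % m.groups()
    return name


def charge_deltas(lines):
    """Sum each -r delta onto the call that started before it."""
    totals = defaultdict(float)
    prev_key = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip anything that is not a timestamped call
        delta = float(m.group(1))
        if prev_key is not None:
            totals[prev_key] += delta
        prev_key = call_key(m.group(2), m.group(3))
    return dict(totals)


# Three lines copied from the trace above
SAMPLE = """\
     0.000210 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4a, 0xb0), 0x7ffc04307080) = 0
     0.007327 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc04306f50) = 0
     0.000161 openat(AT_FDCWD, "/proc/driver/nvidia/params", O_RDONLY) = 37
""".splitlines()

totals = charge_deltas(SAMPLE)
for key, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{seconds:.6f}s  {key}")
```

Feeding the full trace file to `charge_deltas` instead of `SAMPLE` should show which request code accounts for most of the ~300 ms per GPU, though it cannot say what the driver is doing inside that call.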

I should also mention that the system uses Slurm and Singularity containers for handling jobs, but the problem still occurs when using neither of those. I’ve also tried changing the number of threads available in Blender, with no luck.

Any help would be appreciated. Thanks.

We use a CUDA API that does busy-waiting for the GPU to return results, which takes up one processor core. So if you don’t have as many CPU cores as you have GPUs, that might explain the problem.

We’d like to improve this, but it’s not something that I expect to happen in the next few months.

Other than that, I’m not sure what the cause would be; perhaps profiling can reveal where those ioctl calls are happening.

The system has many more CPU cores than it does GPUs, so that probably isn’t the problem.
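A quick way to double-check the core count from inside the actual job is sketched below. Slurm or a container can restrict a process to fewer cores than the machine has, so this reads the affinity mask, not just the hardware total. Counting `/dev/nvidiaN` device nodes is only an assumed stand-in for the number of GPUs Blender will use, and the function names are made up for illustration.

```python
import os
import re


def usable_cores():
    """CPU cores this process may run on (respects affinity masks on Linux)."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1  # fallback where affinity is unavailable


def count_gpu_nodes(names):
    """Count entries named nvidia0, nvidia1, ... (skips nvidiactl, nvidia-uvm)."""
    return sum(1 for n in names if re.fullmatch(r"nvidia\d+", n))


if __name__ == "__main__":
    cores = usable_cores()
    # Assumption: one /dev/nvidiaN node per GPU visible to this job
    names = os.listdir("/dev") if os.path.isdir("/dev") else []
    gpus = count_gpu_nodes(names)
    print(f"usable CPU cores: {cores}, NVIDIA device nodes: {gpus}")
    if gpus > cores:
        print("fewer usable cores than GPUs: busy-waiting threads would share cores")
```

If the usable core count comfortably exceeds the GPU count, as expected here, the busy-waiting explanation becomes less likely.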

From what I got from cachegrind, the ioctl calls are coming from CUDA, but since cachegrind can’t monitor activity inside syscalls (as far as I know), it would be difficult to figure out which ones are causing the long hang, as there are thousands of calls. I’ve also tried compiling Blender from source to do some debugging, but had issues getting it to compile. I might try again if I have the time.

Windows or Linux? LazyDodo is the one to ask for help on Windows.

I’m only using Linux systems. Do you mean for compiling or for the performance problem I’m seeing?

For building Blender. I build Blender on my Ubuntu system, so I might be of some help, but I’m no LazyDodo.