I’m trying to do background Cycles rendering on my university’s DGX system (a NUMA machine). While everything works, there is a huge delay before rendering information is displayed; an example of the console output can be found in a Stack Exchange post I made. Each additional GPU added to the render increases the delay by ~300 ms, so at 8 GPUs the delay takes up over 50% of total rendering time.
I’m using BlenderProc, which generates synthetic machine-learning datasets, and I need to render images pretty fast (< 5 ms). Is this performance issue inherent, in that renders are simply expected to take much longer than this, or is it a bug or configuration issue on my end? From strace, the delay seems to be occurring in a set of identical ioctl calls; the same set of calls occurs 32 times per GPU added.
--
0.000024 close(37) = 0
0.000031 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2b, 0x20), 0x7ffc04307920) = 0
0.000500 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc043078d0) = 0
0.000130 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2b, 0x20), 0x7ffc04307890) = 0
0.000138 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc04307840) = 0
0.000123 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2b, 0x20), 0x7ffc04307890) = 0
0.000131 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc04307840) = 0
0.000122 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc043078d0) = 0
0.000124 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc043078d0) = 0
0.000210 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4a, 0xb0), 0x7ffc04307080) = 0
0.007327 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc04306f50) = 0
0.000161 openat(AT_FDCWD, "/proc/driver/nvidia/params", O_RDONLY) = 37
0.000066 fstat(37, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
0.000023 read(37, "Mobile: 4294967295\nResmanDebugLe"..., 1024) = 649
0.000036 close(37) = 0
0.000024 stat("/dev/nvidiactl", {st_mode=S_IFCHR|0666, st_rdev=makedev(195, 255), ...}) = 0
0.000025 openat(AT_FDCWD, "/dev/nvidiactl", O_RDWR) = 37
0.000028 fcntl(37, F_SETFD, FD_CLOEXEC) = 0
0.000018 ioctl(7, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4e, 0x38), 0x7ffc043070a0) = 0
0.000213 mmap(0x202c00000, 4194304, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 37, 0) = 0x202c00000
0.000130 close(37) = 0
--
The block above is one instance of the repeating set of ioctl calls I’m referring to (i.e. that block appears 32 times per GPU added), with the relative wall time between syscalls in the first column (this trace used strace -r, but -T gave similar results). As you can see, there is a pretty large delay of about 7 ms on one of the ioctl calls.
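As a rough sanity check (my own arithmetic, not from the trace tooling): summing the relative timestamps of one block and multiplying by the 32 repetitions per GPU lands close to the observed ~300 ms per GPU. Note that with strace -r the first timestamp of a block also includes time spent before the block, so this is only an estimate:

```python
# Relative timestamps (first column of the `strace -r` output above)
# for one repeating syscall block, in seconds.
block_times = [
    0.000024, 0.000031, 0.000500, 0.000130, 0.000138, 0.000123,
    0.000131, 0.000122, 0.000124, 0.000210, 0.007327, 0.000161,
    0.000066, 0.000023, 0.000036, 0.000024, 0.000025, 0.000028,
    0.000018, 0.000213, 0.000130,
]

block_total = sum(block_times)   # time spent in one block
per_gpu = 32 * block_total       # the block repeats 32 times per GPU

print(f"per block: {block_total * 1e3:.1f} ms, per GPU: {per_gpu * 1e3:.0f} ms")
# prints: per block: 9.6 ms, per GPU: 307 ms
```

So the repeated blocks alone account for essentially all of the ~300 ms-per-GPU delay, dominated by that single ~7 ms ioctl.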
I should also mention that the system uses Slurm and Singularity containers for handling jobs, but the problem still occurs when using neither of them. I’ve also tried changing the number of threads available to Blender, with no luck.
Any help would be appreciated. Thanks.