OpenCL on Apple M1

After some fiddling I finally got OpenCL working on my new MacBook Air with the M1. Basically it was only a matter of finding the right settings by trial and error; the current Blender code is already up to the task.
I have not tested much yet, but the well-known BMW benchmark scene renders in about 3:40 min (GPU and CPU combined).

I fully understand the reasons behind the deactivation of OpenCL on macOS in Blender, so my question is:
Would it be possible to make some small additions to the deactivated macOS parts, or even turn OpenCL on by default for the M1, as long as there is no official support yet? It might be useful for other users too.

Best regards,
Christian

We won’t be re-enabling OpenCL support on macOS or doing any improvements to the code for it. It can work in simpler scenes but there are problems in others.

I understand, and I can confirm that not all demo files work.
The nasty part of these bugs is that the program does not fail gracefully. Instead, Apple's OpenCL/Metal wrapper slowly grows in memory in the background (>25 GB!!), Blender freezes, and it has to be force-quit.
To my limited understanding, Apple does an on-the-fly translation to Metal with MTLCompilerService, without a real OpenCL driver, and this does not work reliably for now.

A very similar problem is described here: https://discussions.apple.com/thread/252165752
So it seems possible that Metal itself is the problem.

Anyway, I was pleasantly surprised by how well Blender works on the M1 in limited scenarios, and I will do further tests. Basically I only had to switch off NANOVDB and turn the OpenCL driver on again; the rest would have been small UI adaptations.

If I find something interesting, I would like to post it here.

Thank you,
Christian

Some observations:

The ugly memory leak occurs only when any of the three volume materials is connected to the volume input of the material output node.

No surface shader causes a freeze, and only the hair material shows wrong results.

For denoising I have to use OpenImageDenoise; NLM gives artifacts.

The SSS performance of the GPU seems to be mediocre, but it works without noticeable errors (the monster_under_the_bed demo works nicely). The SSS kernel takes the longest to compile; each of the other kernels is built in under 5 seconds.

I only have the base 8 GB model, and of course big scenes bring the MacBook down, but it stays snappier while rendering than my 32 GB i7-4790K with an 8 GB Vega 56, which lags in OpenCL (Win10 and Big Sur ;).

To be continued …

… and now it works!

In an attempt to leave optimizations to the assumed OpenCL wrapper, I disabled inlining, and now volume shading works on the GPU.

Everything that does not work now, I would attribute to the lack of RAM in my machine, but I will continue testing.

Here is my first test (CT-Scan of a knee -> BVTK-Nodes -> no manual clean up, only basic lighting and a procedural bone material, muscles as volume):


Everything is native arm64, no Rosetta 2.

@brecht
I understand that you will not happily re-enable OpenCL on the Mac, especially as long as Apple tags it as deprecated, but would you reevaluate the situation in the not-so-distant future?

Small status update:

I continued with more testing and benchmarking. Not a single crash, but the speed gains or losses with the GPU vary from scene to scene.
Terminal output for bmw27 with OpenCL:
Fra:1 Mem:297.39M (Peak 315.39M) | Time:04:33.81 | Mem:638.78M, Peak:646.78M | Scene, RenderLayer | Finished

I0422 16:54:02.770961 301022528 blender_session.cpp:591] Total render time: 271.796

I0422 16:54:02.770993 301022528 blender_session.cpp:592] Render time (without synchronization): 268.257

This is GPU only, and if I read the Blender benchmark database correctly, that puts the M1 between an NVIDIA GTX 1050 and a GTX 1050 Ti running CUDA.
For me this leads to the assumption that the OpenCL wrapper from Apple works pretty well. AFAIK the raw GPU of the M1 should be in the ballpark of these two NVIDIA GPUs, and the real-world result reflects this closely.
Given the similarities between OpenCL kernels and Metal Performance Shaders, I doubt that a hypothetical native Metal device for Cycles would improve this performance by much.
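
For anyone who wants to compare such runs, here is a minimal sketch of how the frame time can be pulled out of the terminal output above and compared with the total render time reported by blender_session.cpp. It assumes the `Time:` field always has the MM:SS.ss form shown here; the function name `frame_time_seconds` is made up for this example.

```python
import re


def frame_time_seconds(line: str) -> float:
    """Extract the 'Time:MM:SS.ss' field from a render status line, in seconds."""
    match = re.search(r"Time:(\d+):(\d+(?:\.\d+)?)", line)
    if match is None:
        raise ValueError("no Time: field found")
    return int(match.group(1)) * 60 + float(match.group(2))


# Status line copied from the terminal output above.
line = (
    "Fra:1 Mem:297.39M (Peak 315.39M) | Time:04:33.81 | "
    "Mem:638.78M, Peak:646.78M | Scene, RenderLayer | Finished"
)

frame = frame_time_seconds(line)
total_render = 271.796  # "Total render time" from the same log
difference = frame - total_render

print(f"frame time: {frame:.2f} s")
print(f"total render time: {total_render:.2f} s")
print(f"difference: {difference:.2f} s")
```

The two figures differ by roughly two seconds here, so it matters which one is quoted when comparing against other people's numbers.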

I will add more information later, but I can already say that I found at least one scene that gives slight errors on the GPU, and that volume rendering performance seems to be worse than on the CPU.

After more tests my enthusiasm has sunk significantly.

The problem is not that OpenCL works unreliably on Apple Silicon. Only one of my personal blend files showed very minor artifacts with hybrid rendering (tiny dark squares from the GPU), and I did not have a single crash.

But the fun fact for me was that, after the initial test with the BMW scene (randomly chosen), I did not find a single other scene where CPU-only was not faster than both GPU-only and GPU+CPU.

I do not have the technical knowledge to draw a conclusion from this, i.e. whether it points to specific weak points of Apple's GPU or to driver deficits. But for the moment, there is not much to gain.

The good thing is, that overall performance of the arm64-build of Blender is really snappy and the CPU rendering performance of the M1 is significantly superior compared to my old Haswell 4790. It is very usable and stable, even with only 8GB for my relatively small scenes.

I will recheck the situation after the next Big Sur update.

You might have seen the announcement here:

Basically the current OpenCL kernel implementation is being removed entirely. Performance issues are part of the reason. The way forward will be a Metal backend on macOS. There’s nothing specific we can announce regarding that, but probably it is just a matter of time.

“The way forward will be a Metal backend on macOS.”

I love you man.

I am just curious how CPU rendering performs with the ARM version of Blender. Can you share the render times for the BMW benchmark with CPU only?

Here are some numbers:

bmw27_cpu (tilesize 32 px, sampling 35, image size 50 %)

CPU: 5:31 min

GPU only: 5:13 min

GPU + CPU: 3:21 min

With tile size 64:

GPU + CPU: 3:12 min

With tile size 256:

GPU only: 4:32 min
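
To make the numbers above easier to compare, here is a small sketch that converts them to seconds and expresses each run as a speedup over the CPU-only run. The times are copied from the list above; the helper name `to_seconds` and the dictionary labels are made up for this example, and the CPU-only baseline is the tile size 32 run, so the tile size 64 and 256 rows are not strictly like-for-like.

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'M:SS' render time to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Render times from the list above (tile size 32 unless noted).
times = {
    "CPU": "5:31",
    "GPU only": "5:13",
    "GPU + CPU": "3:21",
    "GPU + CPU (tile 64)": "3:12",
    "GPU only (tile 256)": "4:32",
}

baseline = to_seconds(times["CPU"])

# Speedup > 1 means the run finished faster than CPU-only.
speedups = {label: baseline / to_seconds(t) for label, t in times.items()}

for label, factor in speedups.items():
    print(f"{label:22s} {to_seconds(times[label]):4d} s  {factor:.2f}x vs CPU")
```

For this scene GPU-only comes out only slightly ahead of CPU-only at tile size 32, while the hybrid runs are clearly faster.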

I still have no idea why this blend file runs so well with OpenCL on the M1, whereas for every one of my other blend files CPU-only is the fastest option.

BTW, no changes after the update to macOS 11.3.

Looking forward to cycles-x!!
