OpenCL on Apple M1

After some fiddling I finally got OpenCL working on my new MacBook Air with the M1. It was basically only a matter of finding the right settings by trial and error; the current Blender code is already up to the task.
I have not tested much yet, but the famous BMW scene renders in about 3:40 min (GPU and CPU combined).

I fully understand the reasons behind the deactivation of OpenCL on macOS in Blender, so my question is:
Would it be possible to make some small additions to the deactivated macOS parts, or even turn OpenCL on for the M1 by default, as long as there is no official support yet? Maybe it would be useful for other users too.

Best regards,
Christian

We won’t be re-enabling OpenCL support on macOS or doing any improvements to the code for it. It can work in simpler scenes but there are problems in others.

I understand, and I can confirm that not all demo files work.
The nasty part of these bugs is that the program does not fail gracefully. Instead, Apple's OpenCL/Metal wrapper slowly grows in size in the background (>25 GB!!), and Blender freezes and must be force-quit.
As far as my limited understanding goes, Apple does an on-the-fly translation to Metal with MTLCompilerService, without a real OpenCL driver, and this does not work reliably for now.

A very similar problem is described here: https://discussions.apple.com/thread/252165752
So it seems possible that Metal itself is the problem.

Anyway, I was pleasantly surprised by how well Blender works on the M1 in limited scenarios, and I will do further tests. Basically I only had to switch off NANOVDB and turn the OpenCL driver on again; the rest would have been small UI adaptations.

If I find something interesting, I would like to post it here.

Thank you,
Christian

Some observations:

The ugly memory leak only occurs when one of the (3) volume materials is connected to the volume input of the material output node.

No surface shader causes a freeze, and only the hair material shows wrong results.

For denoising I have to use OpenImageDenoise; NLM gives artifacts.

The SSS performance of the GPU seems mediocre, but SSS works without noticeable errors (the monster_under_the_bed demo works nicely). The SSS kernel takes the longest to compile; each of the other kernels is built in under 5 seconds.

I only have the base 8 GB model, and of course big scenes bring the MacBook down, but it stays snappier while rendering than my 32 GB i7-4790K with an 8 GB Vega 56, which lags in OpenCL (Win10 and Big Sur ;).

To be continued …

… and now it works!

In an attempt to leave optimizations to the assumed OpenCL wrapper, I disabled inlining, and now volume shading works on the GPU.

Everything that still does not work I would attribute to the lack of RAM in my machine, but I will continue testing.

Here is my first test (CT-Scan of a knee -> BVTK-Nodes -> no manual clean up, only basic lighting and a procedural bone material, muscles as volume):


Everything is native arm64, no Rosetta 2.

@brecht
I understand that you will not happily re-enable OpenCL on the Mac, especially as long as Apple tags it as deprecated, but would you eventually reevaluate the situation in the not so distant future?

Small status update:

I continued with more testing and benchmarking. Not a single crash, but varying speed gains or losses with and without the GPU.
Terminal output for bmw27 with OpenCL:
Fra:1 Mem:297.39M (Peak 315.39M) | Time:04:33.81 | Mem:638.78M, Peak:646.78M | Scene, RenderLayer | Finished

I0422 16:54:02.770961 301022528 blender_session.cpp:591] Total render time: 271.796

I0422 16:54:02.770993 301022528 blender_session.cpp:592] Render time (without synchronization): 268.257

This is GPU only, and if I read the Blender benchmark database correctly, this places the M1 between an NVIDIA 1050 and a 1050 Ti running CUDA.
For me this suggests that Apple's OpenCL wrapper works pretty well. AFAIK the raw GPU of the M1 should be in the ballpark of these two NVIDIA GPUs, and the real-world result reflects this closely.
Given the similarities between OpenCL kernels and Metal Performance Shaders, I doubt that a hypothetical native Cycles Metal device would improve this performance by much.

I will add more information later, but I can already say that I found at least one scene that gives slight errors on the GPU, and volume rendering performance seems to be worse than on the CPU.

After more tests my enthusiasm has sunk significantly.

The problem is not that OpenCL works unreliably on Apple Silicon. Only one of my personal blend files showed very minor artifacts with hybrid rendering (tiny dark squares on the GPU), and I had not a single crash.

But the fun fact for me was that after the initial test with the BMW scene (randomly chosen), I did not find a single scene where CPU-only was NOT faster than GPU or GPU+CPU.

I do not have the technical knowledge to draw a conclusion from this, i.e. whether it points to specific weak points of Apple's GPU or to driver deficits. But for the moment, there is not really much to gain.

The good thing is that the overall performance of the arm64 build of Blender is really snappy, and the CPU rendering performance of the M1 is significantly better than that of my old Haswell i7-4790K. It is very usable and stable for my relatively small scenes, even with only 8 GB.

I will recheck the situation after the next Big Sur update.

You might have seen the announcement here:

Basically the current OpenCL kernel implementation is being removed entirely. Performance issues are part of the reason. The way forward will be a Metal backend on macOS. There’s nothing specific we can announce regarding that, but probably it is just a matter of time.

“The way forward will be a Metal backend on macOS.”

I love you man.

I am just curious how CPU rendering performs with the ARM version of Blender. Can you share the render times for the BMW benchmark with CPU only?

Here are some numbers:

bmw27_cpu (tilesize 32 px, sampling 35, image size 50 %)

CPU: 5:31 min

GPU only: 5:13 min

GPU + CPU: 3:21 min

With tile size 64:

GPU + CPU: 3:12 min

With tile size 256:

GPU only: 4:32 min

I still have no idea why this blend file runs so well with OpenCL on the M1, whereas for every single one of my other blend files CPU only is the fastest option.
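To put the bmw27 timings above into relative terms, here is a minimal C++ sketch that converts the "m:ss" values into seconds and computes a speedup factor. The helper names `to_seconds` and `speedup` are made up for this illustration and have nothing to do with Blender's code.

```cpp
#include <string>

// Hypothetical helper (not part of Blender): convert a render time
// written as "m:ss", e.g. "5:31", into whole seconds.
int to_seconds(const std::string& mmss) {
    const auto colon = mmss.find(':');
    const int minutes = std::stoi(mmss.substr(0, colon));
    const int seconds = std::stoi(mmss.substr(colon + 1));
    return minutes * 60 + seconds;
}

// Speedup of `candidate` relative to `baseline`.
// A value above 1 means the candidate configuration is faster.
double speedup(const std::string& baseline, const std::string& candidate) {
    return static_cast<double>(to_seconds(baseline)) / to_seconds(candidate);
}
```

With the numbers from this post, `speedup("5:31", "3:21")` gives about 1.65 for GPU + CPU versus CPU only, and `speedup("5:31", "5:13")` gives about 1.06 for GPU only. These ratios describe this one scene only.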

BTW, no changes after the update to macOS 11.3.

Looking forward to cycles-x!!

Amazing!! Hopefully the timeframe isn’t too long, trust the devs :wink:

Hi all, any more news on when GPU Cycles will be supported on M1 (or Intel!) Macs? I think there is a huge pent-up demand for this from Mac users, who will be very grateful if the Blender devs can make it happen.

Many Thanks !

I do not really have anything substantially new, but here is what I have tried in the meantime.

My goal was to find the point where I would hit a wall while trying to port the CUDA driver to Metal.

The first step was pretty straightforward: adding properties for a Metal driver and getting a Metal device from the OS.

The next step, building an empty Metal kernel inside the Blender build system and loading it successfully on render, was a bit harder, but seems to work now.

Next on the list would have been porting the CUDA kernel (now GPU kernel) and seeing whether it compiles. This is where my story ended for now. Blender already uses a lot of macros, and Metal would need even more (e.g. for address space qualifiers, extra atomic types, …), which would pollute the generic driver parts a bit more. But the real showstopper for me has been C++ lambdas, a feature that the newest Metal version (2.4) simply does not support. Porting cycles-x back to an older C++ standard is probably not an option, and duplicating code only for Metal does not sound much better.
There may be more obstacles further down the road, but with my serial approach this is all I know (or believe to know) for now.
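To illustrate what "macros for address space qualifiers" means, here is a minimal sketch. It assumes that the Metal compiler predefines `__METAL_VERSION__` and that Metal requires pointer parameters to carry an address space such as `device`. The macro name `KERNEL_GLOBAL` and the function `scale_buffer` are invented for this example and are not the names used in Cycles.

```cpp
// Hypothetical macro, for illustration only: it expands to Metal's `device`
// address space when compiled as Metal Shading Language and to nothing
// when compiled as ordinary C++ for the CPU.
#if defined(__METAL_VERSION__)
#  define KERNEL_GLOBAL device
#else
#  define KERNEL_GLOBAL
#endif

// Shared helper: the same source line serves both targets, but every
// pointer parameter now has to carry the macro.
inline void scale_buffer(KERNEL_GLOBAL float* data, int count, float factor) {
    for (int i = 0; i < count; ++i) {
        data[i] *= factor;
    }
}
```

On the CPU this compiles as a plain function. The cost is visible in the signature: each pointer in shared code needs such an annotation, which is the "pollution" of the generic parts mentioned above.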

Please remember that I was certainly far away from a working version, and maybe even on the wrong path. But I am optimistic that a person with more inside knowledge of Cycles (= not me) could be successful.

Not bad for putting this together in your spare time.

By the way, have you heard the news that Apple is officially on board the Blender Foundation’s Dev Fund now? Michael Jones is working on a Metal backend for Cycles: T92212 Cycles Metal device

He also had trouble with the lambdas, but he says it can be solved with function objects.
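As a generic illustration of that idea (not taken from the Metal patch), a capturing lambda can be rewritten in plain C++ as a small struct with `operator()`, where the captured variable becomes a member. The names `ScaleFn` and `apply_to_sum` are made up for this sketch.

```cpp
// Lambda version, as it could be written in regular C++:
//   auto scale_fn = [scale](float x) { return x * scale; };

// Function-object equivalent: the captured value is stored as a member
// and the lambda body moves into operator().
struct ScaleFn {
    float scale;
    float operator()(float x) const { return x * scale; }
};

// A caller templated on the callable type accepts lambdas and
// function objects alike.
template <typename Fn>
float apply_to_sum(const float* values, int count, Fn fn) {
    float sum = 0.0f;
    for (int i = 0; i < count; ++i) {
        sum += fn(values[i]);
    }
    return sum;
}
```

Calling `apply_to_sum(values, count, ScaleFn{2.0f})` behaves like the lambda version. Whether the same pattern carries over unchanged to Metal depends on the details of its language subset.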

Full announcement here: Apple Joins Blender

Yes, I noticed this, and it is very nice news.

BTW, I am aware of function objects in Metal, but they have a slightly different API compared to usual C++ (if I am not mistaken), and while cycles-x is still a fast-moving target, this was the moment for me to throw in the towel in order to keep a sane amount of my spare time for other things. :grinning:

I think we'll see some results in the not so distant future.

Well rejoice bois. Metal is coming to blender. And Apple is finally backing Blender. Took em a while, but I’m glad this has come around!

To bridge the waiting time until the coming Metal driver is complete, I decided to do a final test with a new M1 Max (32 GPU cores, 32 GB RAM), which was kindly handed to me by my boss.
I did the same simple OpenCL "port" as before, using the latest commit before cycles-x was merged into master and simply turning on OpenCL in a hard-coded way. I had to inline one kernel function by hand, because the OpenCL compiler later complained about incompatible pointer types. The rest was vanilla Blender code. The comparison value is today's master version.
So here are some numbers, which should not be taken out of context.

monster under bed:
master CPU 10:13.99

last_opencl GPU 23:31.98
last_opencl GPU + CPU 17:52.14

bmw:
master CPU 3:13.34

last_opencl GPU 1:07.19 (default)

last_opencl GPU + CPU 1:02.21 (96px) quite variable

classroom:
master CPU 7:31.87

last_opencl GPU 13:31.03 (256px)
last_opencl GPU + CPU 11:51.23 (256px)
last_opencl CPU 8:28.86 (pre-heated)

junk shop:
master CPU 1:07.63

last_opencl GPU opencl compile freeze, MTLCompilerService slowly growing
last_opencl GPU + CPU
last_opencl CPU 1:23.48 (pre-heated)

None of these measurements are intended to be compared to public benchmark databases. That would make no sense.
My few conclusions are:

  1. It was the right decision to turn off OpenCL for macOS. Without further optimizations the "old" Blender OpenCL code does not perform well in most of the cases that I could test.
  2. Enabling CPU+GPU gives a pretty good torture test for the cooling system, and the 14-inch MacBook is really loud after a few minutes. Even if it ran optimized code, such a MacBook is simply not built for heavy number crunching over extended periods of time. For my personal use, this will not become my main computer.
  3. General performance of the M1 Max in Blender is very good and snappy. As long as it is not used as a 24/7 rendering workstation, you will probably not be disappointed.
  4. The BMW benchmark, which again performs well, shows the potential, and I am really curious what we will see with Metal. At least I have some kind of baseline now.

Not that I’m comparing to public benchmarks or anything… :smirk:
but the (unoptimized) OpenCL code for bmw scene seems to bench between a Max-Q 2060 and 2080.

I can’t wait to see what the Metal backend brings to the table.

And now it is finally done!!!
I just tested the upcoming Metal patch and it works nicely with the BMW.
M1Max 32GB/24GPU:

48 seconds.

Classroom: 2:06 min

Monster under Bed: 3:35 min (MacBook completely silent and on battery, nice!!)