Cycles Apple Metal device feedback

Cycles in versions prior to Blender 2.79 could render a scene on the GPU, but only if the scene fit inside the VRAM of the GPU. If it was even 1 MB over, an out-of-memory error would occur.

In 2.79 or 2.80 (one of the two), “Out of Core” rendering was added to Cycles. Since then, a scene can be rendered on the GPU with Cycles using the VRAM of the GPU plus a portion of the CPU’s RAM (with a performance penalty).

“Out of Core” rendering had to be explicitly coded into the CUDA/OptiX, and more recently HIP, implementations of Cycles. I cannot see any sign of it being explicitly coded into the Metal implementation. Maybe I’m looking in the wrong place? Or maybe Metal supports “Out of Core” rendering without it having to be explicitly added?

1 Like

Thank you! Interesting to see CPU+GPU take longer than just GPU… Anyone know what’s up?

Cycles-X in its current form has a “sub-optimal” work distribution method when combining two devices of varying speed in complex scenes. As a side effect, this can lead to increased render times when using something like CPU+GPU.

Work is currently underway to improve or entirely fix this.
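
As a toy illustration (made-up numbers, not how Cycles actually schedules work), a bad split between a fast and a slow device can end up slower than the fast device on its own:

```swift
// Toy numbers only: assume the GPU renders tiles 10x faster than the CPU.
let totalTiles = 1000.0
let gpuTilesPerSecond = 100.0
let cpuTilesPerSecond = 10.0

// GPU alone finishes in 10 seconds.
let gpuOnlyTime = totalTiles / gpuTilesPerSecond

// A naive 50/50 split is limited by the slower device: 50 seconds.
let naiveSplitTime = max((totalTiles / 2) / gpuTilesPerSecond,
                         (totalTiles / 2) / cpuTilesPerSecond)

// A split proportional to device speed lets both finish together: ~9.1 seconds.
let gpuShare = gpuTilesPerSecond / (gpuTilesPerSecond + cpuTilesPerSecond)
let balancedTime = (totalTiles * gpuShare) / gpuTilesPerSecond

print(gpuOnlyTime, naiveSplitTime, balancedTime)  // 10.0 50.0 ~9.09
```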

1 Like

Oh, thanks for the explanation!

The “SoC” in Intel Macs is a bit different when compared against the Apple Silicon SoCs released to date.

For example, the Intel SoC includes the CPU and GPU on the same die, while the RAM is located somewhere else on the motherboard. The RAM is shared between the CPU and GPU, but I believe there might be some limitations.

If you have an AMD GPU in your Intel-based Mac, then the Intel SoC will only include the CPU and Intel GPU on the same die. RAM is separate. The AMD GPU is located somewhere else on the motherboard, and it has its own VRAM that it’s connected to.

This is true for both the laptops and desktops.

With the Apple Silicon SoCs, the CPU and GPU appear to be on the same die. And the RAM is on the same package as the CPU + GPU. If there were restrictions with how the Intel CPU + Intel GPU could work with RAM, they appear to have been mostly removed with the Apple Silicon SoCs.
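
For anyone who wants to check this on their own machine, here is a small Swift sketch against the plain Metal API (not Blender code) that lists each Metal device and whether it reports unified memory:

```swift
import Metal
import Foundation

// Print every visible Metal device with the properties relevant here:
// unified memory (Apple Silicon / Intel iGPU), removable (eGPU), and the
// recommended maximum working set size (roughly "usable VRAM").
for device in MTLCopyAllDevices() {
    let workingSetGB = Double(device.recommendedMaxWorkingSetSize) / 1_073_741_824
    print(device.name)
    print("  unified memory:   \(device.hasUnifiedMemory)")
    print("  removable (eGPU): \(device.isRemovable)")
    print("  max working set:  \(String(format: "%.1f", workingSetGB)) GB")
}
```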

1 Like

On further investigation, it seems Metal handles things like “out of core” a bit differently when compared to CUDA and HIP. As such, I was looking for the wrong things when looking at the Metal code in Blender, and my assumption about “Out of Core” functionality on Intel Macs with discrete AMD GPUs was ill-informed.

It’s best to wait for comments from people with more knowledge than I have.

Thanks for the excellent breakdown.

I thought the Vega Pro II was 14 TFLOPS (28 half precision) and the 3080 was about the same. I haven’t dug into this, but I mentally lumped them into the same category. That might have been an error on my part.

Interesting about the RT cores and 6x performance. That sort of explains it.

Thanks again.

It’s not clear to me what kind of communication you were expecting about this. But in general we are very careful not to recommend buying hardware based on future expectations of what might be in Blender. There are too many unknowns for us to predict what will work and at what performance.

At this point I don’t know if it will be possible to use CPU memory for GPU rendering on Intel Macs.

For Apple M1, the memory is fully shared between CPU and GPU, and unlike most (all?) other consumer hardware, all memory can be accessed with the same performance.

2 Likes

I was replying to Alaska’s & Skyline’s post, saying that the community wouldn’t be happy with this solution. Or is that not allowed here, since I’m part of it?
I was expecting a follow-up post sometime, with more information about it, because Alaska said he isn’t sure about it and that we should ask (@Michael-Jones-Apple & @jason-apple) for more info.

That’s what I did and expected!

Thanks in advance

Metal does inherently support out of core, and any GPU resource with the ‘shared’ storage type is placed in and accessed from main RAM, over the PCIe bus in AMD’s case. That aside, we would want more context-aware handling of this for AMD, to more finely control which assets are stored in and used from VRAM versus what is placed in system RAM, so that the significant performance impact it will have is minimised!
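
As a rough illustration of the two storage types mentioned above (a sketch against the public Metal API in Swift, not Blender’s actual code), a buffer created with the shared storage mode is backed by system RAM, while a private buffer lives in VRAM on a discrete GPU:

```swift
import Metal

guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("No Metal device available")
}

let length = 64 * 1024 * 1024  // 64 MB, arbitrary size for illustration

// Shared storage: backed by system RAM. On a discrete AMD GPU the GPU reads
// it across the PCIe bus, which is the "out of core" style of access.
let sharedBuffer = device.makeBuffer(length: length, options: .storageModeShared)

// Private storage: GPU-only memory, i.e. VRAM on a discrete card. Fastest
// for the GPU to access, but limited by the card's VRAM capacity.
let privateBuffer = device.makeBuffer(length: length, options: .storageModePrivate)
```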

We have a significant list of optimisations that we’re working on, and plan to land them in the repository as each is completed. We’ve just got started with Cycles, and while enabling it was our first priority, the team and I also want to make it as fast as possible.

9 Likes

Thanks @jason-apple, that update is much appreciated!

@jason-apple Thanks for all the work!

BTW: I’m assuming out-of-core would be a later addition for Intel/AMD (not working at the moment?)

Assuming one were to get a Vega card (HBM2) to use with any Intel Mac platform in an eGPU form, and given that I believe Metal supports using system memory as GPU memory, would it work OK with current drivers/Blender?

Or maybe I’m making the wrong assumptions. Anyway, in its current form it’s already very impressive and handy, and it looks like it can only get better, so I can only be grateful. No matter what, I’m already contemplating adding an eGPU.

Hi guys,

Please try to minimize the amount of user tagging, to avoid obtrusiveness.

Also keep in mind that when you reply to someone’s post, tagging the post author is unnecessary, as the author will already get a notification of the reply.

Thanks!

1 Like

Hi Jason, I was wondering if you could clarify something?

Looking at the Metal code for Cycles, it seems almost everything is set to MTLStorageModeManaged.

If I understand this correctly, that means Apple Silicon GPUs (released so far) can use as much RAM as is accessible to them.
And in the case of AMD GPUs, the GPU can only use the VRAM it has?

Is this correct? Or can the AMD GPU access some resources that it can’t fit into its VRAM from the CPU’s RAM? (Allowing a Cycles scene to exceed the size of the VRAM?)


What I want to know from this is: “Does MTLStorageModeManaged (what’s used in Cycles at the moment) allow out of core rendering on AMD GPUs? Or is MTLStorageModeShared required for out of core rendering, and is that optimization/feature planned to be added to Cycles in the future?”

The technical answer is maybe it works! The complexity of this is that Metal will make Managed resources resident on demand, and it will depend on how many resources land OOC as to whether Metal can make this viable. Essentially Cycles will be able to utilise a little bit of OOC for free, but not much. So relying on this for a full OOC solution isn’t the future. The Cycles device explicitly tracking what is in VRAM (through Managed resources), and what is to always be accessed from CPU (through Shared resources) is work that should be done next for this.
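
A very rough sketch of that tracking idea (an illustration in Swift against the public Metal API; the allocator and its budget heuristic are made up, this is not the Cycles device code): place resources in VRAM-backed Managed buffers until a budget derived from the device’s recommended working set size is used up, then fall back to Shared buffers in system RAM:

```swift
import Metal

// Hypothetical allocator, not from Cycles: fill VRAM with Managed buffers
// up to a budget, then fall back to Shared (system RAM) buffers for the rest.
final class OutOfCoreAllocator {
    private let device: MTLDevice
    private let vramBudget: UInt64
    private var vramBytesUsed: UInt64 = 0

    init(device: MTLDevice) {
        self.device = device
        // Made-up heuristic: leave 20% headroom below the recommended working set.
        self.vramBudget = device.recommendedMaxWorkingSetSize * 8 / 10
    }

    func makeBuffer(length: Int) -> MTLBuffer? {
        if vramBytesUsed + UInt64(length) <= vramBudget {
            vramBytesUsed += UInt64(length)
            // Managed: the GPU copy is kept resident in VRAM on a discrete GPU.
            return device.makeBuffer(length: length, options: .storageModeManaged)
        }
        // Over budget: keep the resource in system RAM and let the GPU access
        // it out of core, at a performance cost.
        return device.makeBuffer(length: length, options: .storageModeShared)
    }
}
```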

We’re embracing the Blender community’s collaborative, open source development approach and methodology. We reached a point where Cycles is able to render with AMD GPUs, and we made that functionality available as soon as we could. It enables you all to use, enjoy, test, give feedback on and even contribute to the efforts directly. So we don’t know where the limits of the current OOC functionality lie, but we would love to learn that. Basically this work is hot off the press!

We’ve a lot of work in Cycles and Metal still to do, including Intel GPU enablement, stability/bug-fixes, kernel optimisation, unified memory optimisations, out of core support expansion, algorithmic optimisations etc. While we want to work on all of these areas, logistically we need to focus our available engineering hours into a workable subset, based on what we feel we can make the most meaningful impact with.

I encourage anyone that would like to be more directly involved with the roadmap and engineering efforts to engage, jump into the code, and help us all move this work forward. We would love to work with fellow developers on expanding and improving the Metal features, and OOC improvements would be a great place for someone to look into.

22 Likes

My AMD Metal results:

MacBook Pro (2016)
Software: macOS 12.3 beta 3, Blender 3.2.0 (2022-02-22)
Hardware: 2,7 GHz Quad-Core Intel Core i7, 16 GB 2133 MHz LPDDR3, Radeon Pro 455 2 GB
eGPU: AMD Radeon RX 5700 XT 8GB

classroom
GPU (AMD Radeon RX 5700 XT only): 1:25.50 (85 seconds, Mem: 1367.73M, Peak: 1367.73M)

Thank you Blender!!!

Hey! Could someone run classroom with M1 Pro 8c/14c? Can it get below 3 minutes with that base model 14" MBP? Thanks in advance!

Looking at previous results posted in this thread, it seems the M1 Pro 8c/14c, when using GPU-only rendering in the classroom scene, will have render times somewhere between 3 and 4 minutes.

GPU + CPU may be less than 3 minutes.

These are just estimates based on what was posted previously.

Wow. Blown away right now with y’all’s hard work!!!

On my:
Mac mini (M1, 2020)
Memory: 16GB

I’m getting literally double the performance rendering on the GPU in 3.2.0 Alpha vs the CPU on 3.0.0!!!


Holy smokes that’s nuts. Thank you so much!!!

You are CRUSHING IT!!!

2 Likes

Here are some tests I did on my 14” double binned.

GPU only
Classroom 3.1 alpha Pre BVH2 = 221 sec
Classroom 3.1 alpha BVH2 = 201 sec
Classroom 3.1 beta = 216 sec
Classroom 3.2 alpha = 216-217 sec

Yeah, I don’t have the values with CPU and GPU, but the difference in early tests was fairly negligible, like 10-20 sec faster in some instances.

3 Likes