GSoC 2024: Improve Distributed Rendering & Task Execution

Hi everyone!

My name is David Zhang, and I’ll be contributing to Flamenco over the summer. The planned improvements include allowing jobs to be paused and introducing sample-based distributed rendering of single images, which should bring better resource management and more flexible job scheduling.

For implementation details, see my original proposal.

Synopsis

The objective of this project is to enhance the distributed rendering and task execution capabilities within Blender through several key improvements.

Firstly, we introduce the ability to pause jobs and submit them in a paused state, providing users with increased control over their rendering workflow and resource allocation. This feature will be particularly advantageous during peak usage periods or when prioritizing specific tasks.

Furthermore, we address the challenge of distributed rendering of single images by adopting a sample-based rendering approach. This method ensures more efficient utilization of computational resources across nodes, minimizing memory usage and avoiding artifacts caused by boundary dependencies.

Benefits

The benefits of these improvements to Blender and its community of artists are manifold. Artists will experience enhanced rendering efficiency and flexibility, enabling them to focus more on creativity and less on managing technical constraints. The introduction of job pausing and the ability to submit jobs in a paused state will allow for better resource management, reducing wait times and optimizing the use of available computational resources. The distributed rendering improvements will directly benefit artists working on complex scenes by reducing rendering times and improving image quality, without the need for extensive technical adjustments. These developments will also support future Blender enhancements by providing a more robust and flexible framework for distributed task execution and rendering.

Deliverables

The final deliverables of this project include:

  1. New buttons and options for pausing jobs and for submitting jobs in a paused state, in both the Manager web interface and the Flamenco Blender add-on. Pausing tasks is supported in the meantime.

  2. Improvements in distributed rendering that allow for distributed rendering of single images with minimized memory usage, encapsulated in custom JOB_TYPE definitions and a Python merge script for efficient image processing.

Week 1
May 27 - May 31

During the first week, I

  • Had a weekly meeting with my mentor, agreed on details (such as how the job transition logic should be modified after the introduction of a new state, why an intermediate state would be important, etc.), and learned about code contribution etiquette (the idea behind making smaller commits and how commits involving OpenAPI should be structured)
  • Created a pull request for my first deliverable, which is support for pausing jobs
  • Introduced paused state and implemented relevant status transition logic
  • Added basic test cases for unit testing

I had dental surgery during the week, so I wasn’t in the best shape for working.

In the following week, I will:

  • Collect more feedback from the community and polish the status transition logic implementation
  • Complete the frontend part to allow users to actually pause a job from the interface
  • Add more test cases and rigorously test everything implemented so far

Week 2
June 3 - June 7

During the second week, I made some significant progress on the project, including:

  • A working Deliverable #1. When the user clicks on the Pause Job button, Flamenco sends the job to pause-requested status and, depending on the specific situation, either waits for active tasks to complete, fails the entire job, or sets the job status to paused. See the following demo:


  • Had a weekly meeting and specifically talked about interactive rebase, which lets me keep building on top of my own changes while staying up to date with the existing codebase

  • Added test cases and made sure the original test cases work with the introduction of a new job status

  • Considered more minor edge cases and created a design issue
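
The pause-request handling described in the first bullet can be sketched roughly as below. This is only an illustration, not Flamenco’s actual implementation: the function name is hypothetical, the task status names other than pause-requested and paused are assumed, and the rule for when the whole job fails (here: any failed task) is a guess made for the sake of the example.

```python
from collections import Counter
from typing import Iterable


def resolve_pause_request(task_statuses: Iterable[str]) -> str:
    """Pick the next job status for a job that is in 'pause-requested'.

    Hypothetical sketch; the 'failed' rule below is an assumption.
    """
    counts = Counter(task_statuses)
    if counts["failed"] > 0:
        # Assumed rule: a failed task fails the entire job.
        return "failed"
    if counts["active"] > 0:
        # Tasks are still running: keep waiting for them to finish.
        return "pause-requested"
    # Nothing is running any more, so the job can settle into 'paused'.
    return "paused"
```

The intermediate pause-requested status is what lets running tasks finish cleanly instead of being interrupted halfway.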

In the following week, I have some personal stuff planned, but I will:

  • Build on top of what I’ve accomplished and brainstorm ways to implement submitting jobs with paused status (maybe by upgrading the job compiler? The add-on? A custom job type? API changes?)
  • Review the first PR and write more test cases

Week 3
June 10 - June 14

I spent a great amount of time traveling and moving to another city this week, so I wasn’t as productive as the previous two weeks. I mainly:

  • Added a few more minor fixes to my Deliverable #1, and fixed a few edge cases
  • Played around with the job_compiler and the addon component of Flamenco. They will be useful for implementing my Deliverable #2
  • Submitted my first PR to my mentor for review

I was on a plane during our regular weekly meeting time with my mentor, so we didn’t meet this week.

For the next week, I plan to spend a lot of time on the project, and here are some of my (ambitious) goals:

  • Address any feedback from the PR review
  • Add a few more test cases as I play around and as they come to my mind
  • Have a minimally usable product that is able to submit paused jobs

Week 4
June 17 - June 21

This was quite a productive week - I completed most of the tasks planned for this week, namely:

  • Fixed a few more minor issues and code quality problems in Deliverable #1 and submitted it for re-review.
  • Made original test cases work with the code changes and added a few more.
  • Implemented the backend for submitting jobs with an initial status, which involves OpenAPI and job_compiler changes. Previously, a job could not be given an initial status on submission; Flamenco would put everything into a boring queued status.
  • Added a little checkbox in the add-on frontend that says “Submit As Paused”. Sadly it’s not quite functioning for now, but it soon will be.

  • Had a weekly meeting that went over implementation details of my second deliverable

Most of the work for the first two deliverables should be completed in just a few more days. I might be out of town next Thursday as it is a national holiday. Perhaps it’s time to look into my next deliverable, distributed rendering! (Hooray! That’s the fun part)

Week 5
June 24 — June 28

I wasn’t as productive on Blender as last week, as I spent a tiring amount of time on personal tasks (luckily I had time to recharge during the weekend).

I landed my first PR (Yeah!), so users can now pause jobs in Flamenco, which provides great flexibility for job scheduling when rendering multiple jobs. I also had a lengthy discussion with Sybren about how to go about implementing distributed rendering of single images, which could involve diving deep into the Cycles codebase, so I expect next week to be a researchy week. I have just a few more things to tidy up before I can land my next PR, so stay tuuuuuned!

Week 6
July 1 — July 5

This week I submitted another two PRs for review, one for the frontend and one for the backend. With a new initial_status field in the job description, users are now able to submit jobs in paused status, where previously all jobs had to be queued.
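
As a rough illustration of what the new field means for a submission, here is a minimal sketch. Only the initial_status field and the queued and paused statuses come from the work described above; the helper name and the other payload keys are placeholders chosen for the example, not the real Flamenco API schema.

```python
from typing import Any


def build_job_submission(
    name: str, job_type: str, submit_as_paused: bool = False
) -> dict[str, Any]:
    """Build an illustrative job description (hypothetical helper)."""
    # Placeholder keys; the real job description has more fields.
    job: dict[str, Any] = {"name": name, "type": job_type}
    if submit_as_paused:
        # Without this field the job ends up queued, as it always did before.
        job["initial_status"] = "paused"
    return job
```

Leaving the field out when the checkbox is unticked keeps the old behaviour as the default.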

We had a weekly meeting with Sybren, who got more information from Sergey on how we should go about implementing distributed rendering of single images. I’ll first do some experiments with manually rendering and merging single images to verify whether the approach works. Some glitches with denoising are expected, as we don’t yet get sufficient information from Cycles; that should be completed in the future.

Week 7
July 8 — July 12

I had been experimenting, and by the end of this week I had a working pipeline that renders an image in three pieces and then merges them into a complete image. The whole pipeline looks like this:

  • Disable compositing
  • Figure out the borders (with 4 pixels of overscan) and do border rendering to get the three pieces
  • Find the RenderLayers node in the node tree
  • Construct a set of nodes to merge the three pieces into one
  • Reconstruct the original node tree by replacing the RenderLayers node with the newly created set of nodes
  • Final compositing
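
The border step in the list above can be sketched as follows. This is a minimal illustration rather than the actual script: it assumes the image is cut into vertical strips, and the function name strip_borders is hypothetical. Each strip is widened by the overscan, clamped to the image, and then expressed as a fraction of the width, which is the form Blender’s render border settings take.

```python
def strip_borders(
    width: int, pieces: int = 3, overscan: int = 4
) -> list[tuple[float, float]]:
    """Return (min_x, max_x) per vertical strip, as fractions of the width.

    Hypothetical sketch; the split direction is an assumption.
    """
    borders = []
    for i in range(pieces):
        # Integer split points so that the strips cover every pixel column.
        start = i * width // pieces
        end = (i + 1) * width // pieces
        # Widen by the overscan, but never beyond the image itself.
        start = max(0, start - overscan)
        end = min(width, end + overscan)
        borders.append((start / width, end / width))
    return borders
```

For a 1920-pixel-wide image, the middle strip covers columns 636 to 1284, so neighbouring strips overlap by 8 pixels at each seam.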

The result looks promising:

There are still quite a few questions I need to answer before this thing could be called “production-level”:

  • Why do we have discrepancies when the three pieces are rendered with compositing enabled?
  • What additional data do we need to make sure denoising works?
  • What are some Cycles features we are theoretically unable to support at the moment?

While figuring out the answers to these questions, I will also be landing my previous PR and opening a new one this week.

Week 8
July 15 — July 19

This week has been more of a researchy week. I looked into Cycles’ path guiding and adaptive sampling features, and confirmed through a bug report that compositing must be done after all pieces have been rendered and merged. I’ve started to translate the distributed rendering pipeline from a Python script into a custom JOB_TYPE and created a PR for that. I’ve also fixed a few other minor issues along the way.

For the next week, I plan to continue implementing the custom JOB_TYPE and have most of it complete. I could then start playing around and identify scenarios where the rendering does not work well.

Week 9
July 22 — July 26

It has been a few weeks since I started on distributed rendering of single images, and I’ve made quite some progress. I have a custom JOB_TYPE that takes in the render tile size and generates a bunch of border rendering tasks. After rendering, it automatically reads in all rendered pieces and generates a complex node structure that attempts to merge them into a complete image while preserving the user’s compositing setup.

I still have a few issues to tackle before the product can be called an MVP, some of which are:

  • Figure out correct settings for translation nodes
  • Add support for denoising, which involves merging for the other layers
  • Decouple Python scripts from the job type, as there is a length limit on command line arguments
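
One way to picture the last point: instead of putting the whole script text on the command line, write it to a file and pass only the path. The sketch below is a Python illustration of that idea, not the job type’s actual code; the helper name is hypothetical. It relies only on Blender’s documented --background and --python command line options.

```python
import tempfile
from pathlib import Path


def blender_command(blend_file: str, script_source: str) -> list[str]:
    """Build a Blender command that runs a script from a file (sketch)."""
    # Store the script on disk so its length no longer matters for argv.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".py", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(script_source)
        script_path = Path(handle.name)
    # Only the short file path ends up on the command line.
    return ["blender", "--background", blend_file, "--python", str(script_path)]
```

The command stays the same size no matter how long the generated script grows, at the cost of having to clean up the script file afterwards.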

I’ve also landed another PR this week: #104323, “Add-on: Add checkbox to submit jobs in paused status”. The submit-as-paused feature is therefore now fully usable. Yeah!

Week 10
July 29 — August 2

The majority of my time this week has been spent on trying to figure out the alternative node structure for merging images. Previously I was using a simple alpha_over node to merge two images, but a bunch of issues occurred.

Thanks to the help of @OmarEmaraDev, I eventually developed a working solution that translates the border-rendered images to their correct positions, merges them, and composites them into a complete final render (with some hacks, as there are still limitations in the Blender compositor).
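
To give an idea of the translation involved, here is a small sketch of the arithmetic. It assumes that a cropped tile is centered on the canvas by default, so the offset it needs is the distance between the tile’s center and the frame’s center; that assumption, the bottom-left pixel origin, and the function name are mine and not taken from the actual node setup.

```python
def tile_translation(
    x0: int, y0: int, x1: int, y1: int, width: int, height: int
) -> tuple[float, float]:
    """Offset that moves a centered tile back to where it was rendered.

    The tile covers columns x0..x1 and rows y0..y1 (end-exclusive) of a
    width x height frame. Hypothetical sketch under the stated assumptions.
    """
    dx = (x0 + x1) / 2 - width / 2
    dy = (y0 + y1) / 2 - height / 2
    return dx, dy
```

A 640-pixel-wide tile at the left edge of a 1920-pixel frame needs an offset of -640 pixels, while a 645-pixel-wide one would need -637.5, which is how half-pixel translations can show up.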

Here is a demo of my first successful image render:

I’ll continue addressing other issues mentioned before in the upcoming week:

  • Add support for denoising, which involves merging for the other layers
  • Decouple Python scripts from the job type, as there is a length limit on command line arguments
  • Replace os with pathlib

Week 11
August 5 — August 9

I mainly addressed the remaining issues this week. I added some safety checks and replaced the older os-based path handling with Python’s pathlib. I also spent time experimenting with the idea of using an external config file for Python script generation instead of embedding the Python script within the JavaScript file. My first try was to make the custom job type load from a JSON file and generate the Python script accordingly, but using the fs library does not seem to be allowed in custom job types. I need to look into other workarounds.
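
For readers unfamiliar with the difference, the kind of change involved looks roughly like this. The file naming scheme and function names are made up for the example and are not the ones used in the job type.

```python
import os.path
from pathlib import Path


def tile_path_os(output_dir: str, index: int) -> str:
    """Older style: build the path as a string with os.path.join."""
    return os.path.join(output_dir, "tiles", f"tile_{index:03d}.exr")


def tile_path_pathlib(output_dir: str, index: int) -> Path:
    """pathlib style: compose Path objects with the / operator."""
    return Path(output_dir) / "tiles" / f"tile_{index:03d}.exr"
```

Both functions point at the same file; the pathlib version returns an object with handy attributes such as name and suffix instead of a plain string.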

I need to spend time on other things for the next week, but I’ll continue working on the remaining issues and keep on track.

Week 12
August 12 — August 16

I’ve been busy with personal commitments this week and was not very productive. I mainly worked on making the job type support tile sizes that do not divide the resolution evenly. Now, instead of calculating the tile size by hand, users can fill in an arbitrary tile size and tune it to their liking.

There are a lot of features that I’d like to add but won’t have time for next week. I’ll continue contributing to this deliverable after the program as well, and make it more user-friendly.
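
A minimal sketch of how such a tile grid could be computed is shown below. It is not the job type’s actual code, which is written in JavaScript; the function name is hypothetical, and clamping the last row and column to the image edge is one possible way to handle the leftover pixels.

```python
from math import ceil


def tile_grid(
    width: int, height: int, tile_w: int, tile_h: int
) -> list[tuple[int, int, int, int]]:
    """Return (x0, y0, x1, y1) pixel boxes covering the whole image."""
    if tile_w <= 0 or tile_h <= 0:
        raise ValueError("tile size must be positive")
    tiles = []
    # Round up so a partial tile is added when the size does not divide evenly.
    for row in range(ceil(height / tile_h)):
        for col in range(ceil(width / tile_w)):
            x0, y0 = col * tile_w, row * tile_h
            # Clamp the far edges so edge tiles never extend past the image.
            tiles.append((x0, y0, min(x0 + tile_w, width), min(y0 + tile_h, height)))
    return tiles
```

With a 1920x1080 image and 800x600 tiles, this gives a 3 by 2 grid in which the last column is 320 pixels wide and the last row is 480 pixels tall.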

Final Report

As Google Summer of Code comes to an end after a 12-week period, it is a good time to look back on my journey and reflect on what I’ve accomplished. It has been one of my most memorable summers in years, and I still can’t believe that I’m now well-equipped to contribute to Blender, my favorite open-source software, which I’ve been using since high school and with which I made some astonishingly vivid renderings for my high school graduation ceremony.

A Bit of Context

The main objective of this project is to enhance the distributed rendering and task execution capabilities within Flamenco through several key improvements.

Firstly, we introduce the ability to pause jobs and submit them in a paused state, providing users with increased control over their rendering workflow and resource allocation. This feature will be particularly advantageous during peak usage periods or when prioritizing specific tasks.

Furthermore, we address the challenge of distributed rendering of single images by adopting a tile-based rendering approach. This method ensures more efficient utilization of computational resources across nodes, minimizing memory usage and avoiding artifacts caused by boundary dependencies.

The full original proposal can be found here.

Current State of Deliverables

Support Pausing Jobs (PR)

I have implemented the ability for users to pause jobs in Flamenco, introducing a new paused state with relevant status transition logic. The frontend has been updated to allow users to pause a job, and comprehensive unit tests have been added to ensure the new functionality works correctly with existing systems. This feature has been fully implemented, reviewed, and the pull request has been merged into the main codebase.


Allow Jobs to be Submitted in paused Status (PR 1 and PR 2)

I’ve made necessary changes to the backend, including modifications to OpenAPI and the job compiler. On the frontend, I’ve added a “Submit As Paused” checkbox to the add-on interface. A new initial_status field has been introduced in the job description to support this functionality. Like the first deliverable, this feature has been fully implemented and integrated into the main codebase.

Distributed Rendering of Single Images (PR)

I’ve developed a working pipeline for rendering an image in multiple pieces and merging them together. This involves a custom JOB_TYPE that generates border rendering tasks based on the specified render tile size. The solution supports user-defined tile size on each axis and adaptive sampling by doing a 16-pixel overscan for each individual tile. I’ve also implemented a complex node structure to merge rendered pieces while preserving the user’s compositing setup.

Here is a demo showing the difference between the distributed rendered image and the image rendered as a whole.

Dreams and Future Plans

While the core functionality is in place, I’m still addressing some challenges. These include adding support for denoising across merged layers and optimizing the integration of Python scripts within the job type. Currently my solution does not support denoising after the individual tiles have been merged, and it passes a long Python script as a command line parameter, which will ultimately hit a length limit. The ideal solution to these two challenges has a much greater scope. Changes will need to be made in Cycles to expose API endpoints for outputting the raw data for denoising, including denoising albedo, denoising normal, and denoising depth. The job compiler in Flamenco would also need some refactoring so that custom job types can use the fs JavaScript library and generate Python scripts based on a config file. I also need to fix the limitation that tile sizes cannot lead to non-integer translations. I’ll learn how the compositor does rounding optimization and support that in this job type.

A few minor things are also worth paying attention to. As I ran more tests on the single image job type, I realized that a lot of Blender features require manual support before this job type can be fully put into production, which needs some community effort. I’ll be around for a long time, at least to fix relevant issues and add features in high demand.

Final Words

There are more takeaways than I can describe in words, and here are some of them. First, I’d like to express my greatest gratitude to my mentor, Dr. Sybren. We’ve been meeting weekly despite the time difference, and Sybren has always been supportive, unblocking me from challenges and helping me stay on the right track.

This project has also made open-source contribution one of my habits. The experience demystified open-source development. It is the hundreds of millions of developers and users, contributing tirelessly and with passion, who make up the vibrant open-source culture and produce countless pieces of production-level software. I’m honored to start this journey with Blender, software that has had a huge impact on me, and I aspire to continue this habit beyond Flamenco, beyond Blender.
