Greetings, First a little background on me. I’m a DevOps guy who works on Linux systems all day and programs in Python. I get to work with some pretty high end hardware at very large scale. I am used to using systems like puppet and salt stack to configure and coordinate 1000+ of servers. I’m not a huge Blender user, but I really believe in it and in the Open Source community. It is from this perspective that I wanted to approach how to scale Netrender up and open the door for new possibilities.
I wanted to get an idea of feedback on a feature I have been thinking about working on. Looking at the current Netrender code, it looks like most of the message processing is done via HTTP GET / POST type requests. I have been doing a lot of work with Kafka for my day job and moving the coordination from a push / pull method over to a pub / sub method seemed like it could be a good fit for Netrender. I am looking at how you solve coordination where there is the potential for a lot of information going between the master and slaves where it all needs to be coordinated in near real time. This would put a hard dependency on requiring a Kafka cluster to be running in your environment, but I think it has the potential to solve scaling problems while also opening the door for new functionality.
The rough idea right now, would be to add in Kafka producers and consumers to the slaves and masters. Each subscribes to one or more message topics depending on what makes the most sense and development moves along. A client submits a job to the master. The master can then post the job information to the ‘jobs’ topic. The message would use a JSON format and post all the job details. Things like file locations and they key value of current frame / total frames. All of your slaves would be subscribed to this topic in a consumer group. Doing this ensures that only one slave in the group will read each job message post at any one time. When the slave is done with its job, it posts the completed details back to the jobs topic. The master reads the completed job message and knows it can post the next frame in line as a new task to be completed. With the level of message processing that kafka allows for, you could also have all the slaves pushing realtime job and health metrics into the cluster. Each second, they could easily post a new message with Frame-Completion %, Time-Elapsed, hostname, CPU %, Mem %, Disk %. This can then all be read by the master and displayed on a dashboard.
Using Kafka, all the consumers in a consumer group are coordinated by the Kafka brokers. If a node fails out, the brokers will be aware of this and rebalance all the slaves connected. If a new host is added in, mid-job, the brokers will rebalance again.
** As I think about this, some logic would be necessary to ensure that a job that was processing on a node, if failed, would be resubmitted.
With all the job messages going into Kafka, you are ensured a certain level of redundancy of those messages. Easy to implement options are things like replication factor for partitions where each message can be written to at least 3 nodes in the cluster. Kafka even has rack-awareness if you wanted to go that far.
Message throughput: If you are not familiar with Kafka, you can push a LOT of messages through it. Kafka people will tell you millions of messages per second. On very modest clusters, we saw 40,000 messages per second and were CPU bound because it was only on 4 core nodes. I’ve pushed more than 350,000 messages per second from a single message producer. I realize these values might seem really high in volume, but I think this is where Blender could take the torch and lead. With this level of message throughput and coordination between all of your slave nodes, what rending options open up?
I am open to all questions about this and welcome any feedback and glaring problems I might be overlooking. Hopefully this is a good conversation starter for what might be possible.
I am not entirely sure that netrender is an ideal target for such a proposal. That said, it’s been a long time since I have looked at it. I wonder though if perhaps your Kafka idea would be better suited to something like the Flamenco render manager that @fsiddi works on for rendering at the studio?
My main focus is currently libtorrent which is probably the easier way, for sharing data and workload , so in my case is not just P2P rendering but also asset management/sharing , CPU and GPU processing tasks and overall online community from inside Blender. Basically a Blender supercomputer made out of every day desktop computers instead of one centralized service.
Currently a long term goal but I do think its doable and useful for my overall plans. I am no web dev so its going to be a challenge but I love a good challenge . I will be sticking to what I am familiar with so mainly Python with some C for extra performance.
The main driving force is of course bittorrent techology which excels in sharing large data and security , similar techniques are used by Blockchain which is what IPFS is based on. Hash key identification which both these technologies are using can be used also for version control of 3d assets , another area that interests me, similar technology is used by Git although Git is focused on source code aka text files, in my case its binary data like 3d models , materials and textures.
So that is why I say its very vague I am still at the the research stage to make sure I have a solid understanding of those technologies and the actual coding is planned for the start of next year, cause right now I have other priorities on my plate that need to get done.
FYI, I’ve been working on network rendering support in Cycles for the past ~4 months and hope I’ll be able to publish the patch soon.
The problem with implementing simple task-based approaches is that the rendering process is highly stateful, and writing code that would convert the current Cycles approach into such a system would be extremely fragile and unmaintainable in the long run.
Getting a basic network rendering implementation going isn’t that hard, you can do it in an afternoon. However, building a system that:
Maximizes utilization of the workers
Minimizes startup time (nobody wants to wait an extra 10sec before the render starts)
Hides network latency
Support joining/leaving of workers mid-render
Doesn’t require a central manager/broker or any other infrastructure
Scales to many workers (remember, you might have to send gigabytes of data for a single render and users will (try to) run this over WiFi)
Supports a wide range of networking setups (workers being somewhere on the internet behind a HTTP proxy, “normal” TCP/IP networking, multicasting between workers, InfiniBand between workers…)
Integrates well with the current code to make maintenance and supporting new features easy
Doesn’t add 100.000 lines of code
Is easy to use for the average user who doesn’t even really know what an IP address is
is a lot harder.
I’ve also been investigating several approaches listed here with people who have a background in large-scale infrastructure and cloud computing, and unfortunately the typical approaches to solving distributed computing problems like that don’t really work well in practice…
By the time you are done, your python add-on will be it’s own OS
It’s all a tricky subject. Worse when you need to consider multiple operating systems, and long term maintenance and support.
I tend to agree with Lukas that simplicity needs to be a factor, at least for something that ships with Blender. Nobody wants to setup a dedicated broker or something fancy; they just want to start up Blender on a second or third PC and press “Render”, and auto detect the other nodes and go. People that need more would likely know enough to automate Blender via CLI or use something like Flamenco.