First a little background on me. I’m a DevOps guy who works on Linux systems all day and programs in Python. I get to work with some pretty high-end hardware at very large scale. I am used to using systems like Puppet and SaltStack to configure and coordinate 1000+ servers. I’m not a huge Blender user, but I really believe in it and in the Open Source community. It is from this perspective that I wanted to approach how to scale Netrender up and open the door for new possibilities.
I wanted to get some feedback on a feature I have been thinking about working on. Looking at the current Netrender code, it looks like most of the message processing is done via HTTP GET / POST style requests. I have been doing a lot of work with Kafka for my day job, and moving the coordination from a push / pull model to a pub / sub model seemed like it could be a good fit for Netrender. The problem I am looking at is coordination where a lot of information potentially flows between the master and slaves and all of it needs to stay in sync in near real time. This would put a hard dependency on having a Kafka cluster running in your environment, but I think it has the potential to solve scaling problems while also opening the door for new functionality.
The rough idea right now would be to add Kafka producers and consumers to the slaves and masters. Each subscribes to one or more message topics, depending on what makes the most sense as development moves along. A client submits a job to the master. The master can then post the job information to the ‘jobs’ topic. The message would use a JSON format and carry all the job details: things like file locations and the key values of current frame / total frames. All of your slaves would be subscribed to this topic in a consumer group. Doing this ensures that only one slave in the group will read each job message at any one time. When a slave is done with its job, it posts the completion details back to the jobs topic. The master reads the completed-job message and knows it can post the next frame in line as a new task. With the level of message processing that Kafka allows for, you could also have all the slaves pushing real-time job and health metrics into the cluster. Each second, they could easily post a new message with Frame-Completion %, Time-Elapsed, hostname, CPU %, Mem %, Disk %. This can then all be read by the master and displayed on a dashboard.
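To make that concrete, here is a minimal sketch of what the job and metrics messages might look like. The field names and the ‘jobs’ / ‘metrics’ topic split are just illustrative assumptions on my part, not an existing Netrender schema; the commented-out kafka-python calls show where the actual publish would happen.

```python
import json

# Hypothetical job message the master would post to the 'jobs' topic.
# Field names here are illustrative, not an existing Netrender schema.
def make_job_message(job_id, blend_file, frame, total_frames):
    return json.dumps({
        "type": "render_frame",
        "job_id": job_id,
        "file": blend_file,          # location of the .blend on shared storage
        "frame": frame,              # current frame to render
        "total_frames": total_frames,
    }).encode("utf-8")

# Hypothetical per-second health/progress message a slave would post.
def make_metrics_message(hostname, frame, completion_pct, elapsed_s,
                         cpu_pct, mem_pct, disk_pct):
    return json.dumps({
        "type": "metrics",
        "hostname": hostname,
        "frame": frame,
        "completion_pct": completion_pct,
        "elapsed_s": elapsed_s,
        "cpu_pct": cpu_pct,
        "mem_pct": mem_pct,
        "disk_pct": disk_pct,
    }).encode("utf-8")

# With kafka-python, the master could publish a job roughly like this
# (commented out since it needs a running broker):
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("jobs", make_job_message("job-42", "/shared/scene.blend", 1, 250))

decoded = json.loads(make_job_message("job-42", "/shared/scene.blend", 1, 250))
stats = json.loads(make_metrics_message("slave-01", 1, 42.5, 12.0, 85.0, 40.0, 10.0))
```

The nice part is that the payloads are plain JSON, so the same messages could also be tailed by a dashboard or any other tool subscribed to the topics.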
I was looking through this page https://wiki.blender.org/wiki/Source/Render/Cycles/Network_Render and, from my experience, message streaming at the throughput Kafka is capable of could really help tile-level rendering shine.
Some of the challenges that can be solved:
- Using Kafka, all the consumers in a consumer group are coordinated by the Kafka brokers. If a node fails out, the brokers will be aware of this and rebalance all the slaves connected. If a new host is added in, mid-job, the brokers will rebalance again.
** As I think about this, some logic would be necessary to ensure that a job that was in progress on a node gets resubmitted if that node fails.
- With all the job messages going into Kafka, you are ensured a certain level of redundancy for those messages. Easy-to-implement options include setting the replication factor for partitions so that each message is written to at least 3 nodes in the cluster. Kafka even has rack awareness if you wanted to go that far.
- Message throughput: If you are not familiar with Kafka, you can push a LOT of messages through it. Kafka people will tell you millions of messages per second. On very modest clusters we saw 40,000 messages per second and were CPU bound because they were only 4-core nodes. I’ve pushed more than 350,000 messages per second from a single message producer. I realize these volumes might seem really high, but I think this is where Blender could take the torch and lead. With this level of message throughput and coordination between all of your slave nodes, what rendering options open up?
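On the failure/resubmission point above, here is one way the master-side logic could work: track each dispatched frame along with the timestamp of the last heartbeat (those per-second metrics messages double as heartbeats), and resubmit any frame that goes silent past a timeout. This is just a sketch under my own assumptions; the class name and timeout value are made up for illustration.

```python
import time

class InFlightTracker:
    """Sketch of master-side bookkeeping: remember which frames are out
    with slaves and flag frames whose slave has gone silent, so the
    master can resubmit them to the 'jobs' topic."""

    def __init__(self, timeout=30.0):
        self.timeout = timeout
        self.in_flight = {}  # frame number -> timestamp of last sign of life

    def dispatched(self, frame, now=None):
        # Called when the master posts a frame to the jobs topic.
        self.in_flight[frame] = time.time() if now is None else now

    def heartbeat(self, frame, now=None):
        # Called whenever a metrics message for this frame arrives.
        if frame in self.in_flight:
            self.in_flight[frame] = time.time() if now is None else now

    def completed(self, frame):
        # Called when a completed-job message for this frame arrives.
        self.in_flight.pop(frame, None)

    def stale_frames(self, now=None):
        # Frames with no heartbeat within the timeout: candidates to resubmit.
        now = time.time() if now is None else now
        return [f for f, t in self.in_flight.items() if now - t > self.timeout]

# Example: frame 1 goes silent while frame 2 keeps heartbeating.
tracker = InFlightTracker(timeout=30.0)
tracker.dispatched(1, now=0.0)
tracker.dispatched(2, now=0.0)
tracker.heartbeat(2, now=25.0)
stale = tracker.stale_frames(now=31.0)  # only frame 1 has timed out
```

Since the Kafka brokers already handle consumer-group rebalancing, the master only has to worry about this frame-level bookkeeping, not about tracking slave membership itself.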
I am open to all questions about this and welcome any feedback, including glaring problems I might be overlooking. Hopefully this is a good conversation starter for what might be possible.