Speed up VSE with multithreading


#1

When I told my online friends that I was reading the Blender source code, they were very excited. They immediately listed many known issues and hoped I could solve them. One of them: when Blender is used for video editing, the final export is single-threaded and cannot take advantage of multiple cores.

They hope I can make sequence rendering multithreaded, but I didn’t start working right away. I considered several different solutions, assuming a 4-core machine:

  1. Pre-process 4 frames of the movie with 4 threads and cache the results. When the main thread processes each frame, it takes the movie part directly from the cache. This solution does not pre-process the 3D scene, only the movie, which sounds more conservative. I finally gave up on it because I think it would increase future maintenance costs.
  2. Render the next 4 frames of the timeline with 4 threads, and let the main thread put the rendering results in order. The advantage of this strategy is that if the user stops rendering in the middle, all frames already ordered by the main thread are usable. However, it may not be cache friendly, because the rendering result of the previous frame is not available while the current frame is rendering.
  3. Divide the timeline into 4 slices, let each thread process one slice, and let the main thread put the results in order. If some slices take much longer than others, the main thread can detect this and trigger resharding. Obviously, if the user stops rendering in the middle, the rendered result is useless to the user, just some scattered frames. But it is cache friendly, because the frames within each time slice are processed sequentially.
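
To make the third strategy concrete, here is a minimal Python sketch (not Blender code). The names `split_timeline`, `render_slice`, `render_timeline` and the stand-in `render_frame` are all hypothetical, and resharding is left out. It only illustrates the structure: contiguous slices, sequential work inside each slice, and ordering done by the main thread.

```python
from concurrent.futures import ThreadPoolExecutor


def split_timeline(start, end, workers):
    """Split the inclusive frame range [start, end] into contiguous slices."""
    total = end - start + 1
    if total <= 0 or workers <= 0:
        return []
    workers = min(workers, total)  # never create empty slices
    base, extra = divmod(total, workers)
    slices = []
    first = start
    for i in range(workers):
        # The first `extra` slices get one additional frame.
        length = base + (1 if i < extra else 0)
        slices.append((first, first + length - 1))
        first += length
    return slices


def render_frame(frame):
    # Hypothetical stand-in for the real per-frame work.
    return f"frame-{frame:04d}"


def render_slice(first, last):
    # Frames inside a slice are processed in order, so each frame can
    # reuse whatever the previous frame left in the cache.
    return [render_frame(f) for f in range(first, last + 1)]


def render_timeline(start, end, workers):
    slices = split_timeline(start, end, workers)
    if not slices:
        return []
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        # map() returns results in slice order, so the main thread
        # receives the frames already ordered.
        parts = list(pool.map(lambda s: render_slice(*s), slices))
    return [frame for part in parts for frame in part]
```

For example, `split_timeline(1, 10, 4)` gives the slices `(1, 3)`, `(4, 6)`, `(7, 8)` and `(9, 10)`. Python threads are used here only to show the shape of the approach; they say nothing about the speedup a C implementation would get.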

I saw that Blender has a library for multithreading, which I am still learning. I am currently preparing to implement the third strategy above.

I hope you will join the discussion.


#2

When improving performance, I suggest always getting real-world example files and profiling them to figure out where the most time is spent. It may be in unexpected places.

What is often the slowest part of export is not the evaluation of sequencer strips, but the video encoding. Blender could be improved to take advantage of ffmpeg multithreaded encoding.

The sequencer does use multithreading for various effects, with each thread handling a slice. However, this may not be happening for all of the most expensive operations, or there may be avoidable threading overhead.

Processing multiple frames in parallel is a possibility. In that case it is likely a good idea to build further on the prefetching that is being worked on.
https://developer.blender.org/D3934


#3

I also think that the time-consuming part is video encoding, so I quickly gave up on my first solution. Since you mentioned ffmpeg multithreaded encoding, I will try to optimize that part and test whether the optimization works.

Regarding effects processing, I have already read that code, and it is indeed parallel.

I also looked at D3934 a few days ago, and I read all of ISS's posts in the forum. D3934 is a really useful optimization.

I think my third solution is still useful, because I can detect the number of worker threads and decide whether to use sliced rendering, avoiding excessive thread overhead. The value of the third solution is that it does not need to care much about the internal rendering details of a single frame, yet it can still give some speedup.
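
The decision described above could look roughly like the following Python sketch. The function name `choose_render_plan` and the `min_frames_per_slice` threshold are assumptions made for illustration; the post only says that the number of worker threads is checked before slicing.

```python
import os


def choose_render_plan(frame_count, worker_threads=None, min_frames_per_slice=8):
    """Return ("sliced", n) or ("sequential", 1) for a render job.

    min_frames_per_slice is a hypothetical threshold: slices shorter
    than this are assumed not to be worth the threading overhead.
    """
    if worker_threads is None:
        # Fall back to the number of CPUs reported by the OS.
        worker_threads = os.cpu_count() or 1
    # Do not use more threads than there are reasonably sized slices.
    usable = min(worker_threads, frame_count // min_frames_per_slice)
    if usable < 2:
        return ("sequential", 1)
    return ("sliced", usable)
```

With these assumed defaults, a 100-frame job on 4 worker threads is sliced four ways, while a 10-frame job stays sequential because a single slice would already cover it.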


#4

FYI, a related patch was submitted here:
https://developer.blender.org/D4031


#5

I am not focusing on ffmpeg at the moment, because my testing with tools showed that optimizing ffmpeg does not bring much benefit.

I have been writing the logic for parallel rendering of multiple frames and completed it two days ago. Once I have fully tested it and added some locks, it should be usable.

But that code was produced by ad hoc refactoring and follows no clear rules. If it stays that way, any future change to the rendering module will require thinking too much about parallelism.

So I want to redesign the multi-core implementation so that the code executed by the child threads contains nothing about parallelism. All parallelism-related code would live in the parent thread, with a library written to make the parent-thread code easier to write. In addition, I need to either enhance the existing context or discard it and make a new thread-safe context.

I am designing this solution now, without coroutines (N:M threads) or channels (thread-safe queues), only the context and some other utility functions.
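
As a rough illustration of that separation (not the actual design, which is not shown in this thread), here is a Python sketch. `RenderContext`, `render_one` and `run_parallel` are hypothetical names, and the standard-library thread pool merely stands in for whatever helper library the parent thread would use. The worker function is plain sequential code, the context is immutable so it can be shared safely, and every thread-related call sits in the parent-side helper.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderContext:
    """Hypothetical read-only context; frozen so threads cannot mutate it."""
    width: int
    height: int
    scene_name: str


def render_one(ctx, frame):
    # Worker code: no locks, no thread APIs, only its inputs.
    return (frame, f"{ctx.scene_name}:{ctx.width}x{ctx.height}:{frame}")


def run_parallel(ctx, frames, workers):
    # Parent-side helper: the only place that knows about threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(render_one, ctx, f) for f in frames]
        # Collect in submission order so results come back ordered by frame.
        return [fut.result() for fut in futures]
```

A call such as `run_parallel(RenderContext(1920, 1080, "Scene"), range(1, 4), 2)` returns one `(frame, result)` pair per frame, in frame order. Changing `render_one` never requires touching threading code, which is the property the post is asking for.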


#6

I’m surprised single-threaded ffmpeg isn’t actually a bottleneck. What is, then?


#7

In fact, what we need is to improve our own code so that it is fully parallelized, rather than relying on magic from third-party libraries.