Sequencer cache improvement proposal

iss · September 16, 2018, 1:51pm

I saw this topic mentioned in the code quest list. I think that the sequencer cache can be optimized quite a bit. If you are skeptical, I can (and will have to anyway) do a quick hack™ to estimate the difference in memory usage, as I haven’t actually researched this in detail.
Also, I don’t have a very deep understanding of the cache implementation, so please correct me if I am wrong.

I am able to implement the points below up to (but not including) prefetching, as I don’t yet have a clear picture of what a correct implementation would look like (updating animdata + thread safety).

Trivia:
We use cache to ensure smooth playback of timeline.
This may seem obvious, but when implementing a cache, we should ask: will caching this item actually help?

Overview of current cache:

Currently the sequencer uses 2 types of cache:
- preprocess cache - source data, before scaling and applying modifiers
- sequencer cache - sequence output after scaling and applying modifiers

When rendering a strip stack, we store:
- each frame of sequence before preprocessing stage in preprocess cache
- each frame of sequence after preprocessing stage in sequencer cache
- result of stack rendering in sequencer cache
- still frames - not really interesting
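To make the list above concrete, here is a minimal sketch of how cache entries could be keyed by strip, frame and stage, so that one table can hold all four kinds of data. The names `SeqCacheType`, `SeqCacheKey`, `seq_cache_key_equal` and `seq_cache_key_hash` are hypothetical and are not Blender’s actual cache API; the strip is treated as an opaque pointer and the frame as an integer for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical entry types mirroring the list above. */
typedef enum {
  SEQ_CACHE_SOURCE,       /* frame before preprocessing (preprocess cache) */
  SEQ_CACHE_PREPROCESSED, /* frame after scaling and modifiers */
  SEQ_CACHE_COMPOSITE,    /* result of rendering the whole strip stack */
  SEQ_CACHE_STILL         /* still frame */
} SeqCacheType;

/* Hypothetical key: which strip, which timeline frame, which stage. */
typedef struct {
  const void *seq; /* strip the frame belongs to (opaque here) */
  int cfra;        /* timeline frame */
  SeqCacheType type;
} SeqCacheKey;

static bool seq_cache_key_equal(const SeqCacheKey *a, const SeqCacheKey *b)
{
  return a->seq == b->seq && a->cfra == b->cfra && a->type == b->type;
}

/* Mix the three fields into one value for a hash table lookup. */
static uint32_t seq_cache_key_hash(const SeqCacheKey *k)
{
  uint32_t h = (uint32_t)(uintptr_t)k->seq;
  h = h * 31u + (uint32_t)k->cfra;
  h = h * 31u + (uint32_t)k->type;
  return h;
}
```

Keys for the same strip and frame but a different stage (source vs. composite) never compare equal, which is what would let each stage be stored and evicted independently.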

Remove preprocess cache:

Having a preprocess cache does make sense in the current implementation, but there are good reasons not to use it:
- added code
- increased usage of memory (a LOT) even if there is no preprocessing going on

Getting rid of this cache is quite simple:
- remove modifiers, make them effects instead
- use prescaled proxies / scale uncached data

If the sequence input is a movie/image, chances are that we already have proxies at the desired resolution, so building the cache is easier.
With patch D3597, if you choose 50% proxy resolution, you will always make proxies at 50% of the project resolution, so no matter what the resolution of the input is, if you have a proxy, you can load it directly into the RAM cache.
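A minimal sketch of the sizing rule described above, assuming proxy dimensions are derived only from the project resolution and the chosen percentage. `Size2D` and `proxy_dimensions` are hypothetical names for illustration, not code from D3597.

```c
#include <assert.h>

typedef struct {
  int width, height;
} Size2D;

/* Proxy size depends only on the project (render) resolution and the
 * proxy percentage, never on the resolution of the input file. */
static Size2D proxy_dimensions(int render_w, int render_h, int percent)
{
  assert(percent > 0 && percent <= 100);
  Size2D s = {render_w * percent / 100, render_h * percent / 100};
  return s;
}
```

With a 1920×1080 project, a 4K source and a 720p source both yield a 960×540 proxy at 50%, so either can be copied into the RAM cache without rescaling.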

If the sequence is an effect, we render it at the desired resolution directly, so no preprocess cache is needed.

Prevent unnecessary caching:

Suppose I have a timeline with a movie and a bunch of effects with color correction / transformation applied. The user can build a proxy on the topmost sequence, which outputs the final image. On playback, if the proxy is found, the sources are not rendered, so there is no need to cache them. This mechanism is already in place; the user just cannot make proxies on effects. With patch D3597 this is possible.

Another case is static images. Their properties can be animated, and we don’t really know whether they are. If we know that an image is static, we can cache only one frame and use it for the whole duration of the sequence.
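If a strip is known to be static, every timeline frame it covers can share one cache entry. A minimal sketch, assuming a hypothetical `is_static` flag that is only set when no property of the strip is animated (determining that is the hard part, as noted above). `StripInfo` and `cache_frame_for` are also hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
  int start, end; /* frame range covered by the strip, inclusive */
  bool is_static; /* hypothetical: true only if nothing on the strip is animated */
} StripInfo;

/* Map a timeline frame to the frame number used as the cache key. */
static int cache_frame_for(const StripInfo *strip, int cfra)
{
  assert(cfra >= strip->start && cfra <= strip->end);
  /* Static strip: every frame reuses the entry stored for its first frame. */
  return strip->is_static ? strip->start : cfra;
}
```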

Prefetching / selective caching:

It may be good to be smart about prefetching. For example:
- if the current strip stack plays at the desired framerate, don’t waste time trying to prefetch frames of this stack
- if rendering part of the timeline is time consuming, focus all available resources on it to prevent FPS drops
- a proxy is a cache, so make an efficient way to load it into RAM
- use a file stream as cache for a low RAM, but fast file storage scenario
- be aware of the playhead position / time until a frame is played vs. load speed
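Two of the points above (skip frames that already render in real time; weigh the time left before the playhead arrives against render speed) can be sketched as a single decision function. This is an illustration under stated assumptions: `worth_prefetching` is hypothetical, and it assumes an estimate of the frame’s render time and of the work already queued is available, which is hard to predict reliably.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical prefetch decision for one future frame.
 * est_render_s: estimated render time of the frame, in seconds.
 * queued_s:     render time already queued ahead of it, in seconds. */
static bool worth_prefetching(int frame, int playhead, double fps,
                              double est_render_s, double queued_s)
{
  assert(fps > 0.0 && frame > playhead);
  const double frame_budget_s = 1.0 / fps;
  /* Frames that render within the budget play in real time anyway. */
  if (est_render_s <= frame_budget_s) {
    return false;
  }
  /* Only worth it if the frame can be ready before the playhead gets there. */
  const double time_left_s = (frame - playhead) / fps;
  return queued_s + est_render_s <= time_left_s;
}
```

At 25 fps, a frame estimated at 0.1 s, 30 frames ahead, is worth prefetching; the same frame only 2 frames ahead is not, because it cannot be ready in time.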

A requirement for prefetching is a thread-safe rendering process. Right now we evaluate animdata for the whole scene on each frame. This process writes to the scene object, so it is not thread safe.

The prefetch thread would have to operate on a disposable copy of the scene. I tried this method and it seems to work well, except that the copied scene was visible in the UI.
The other option is to use a mutex, but a “re-entrant” approach like the one above would be more likely to prevent bugs such as UI glitches.
I would have to consult on this, or look at similar code or the mechanisms already in place, so this is just a wild guess.
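A minimal sketch of the disposable-copy approach using POSIX threads: the prefetch thread holds a lock only long enough to copy the scene state, then evaluates animation on its private copy, so the shared scene is never written. `SceneState`, `evaluate_animation` and `PrefetchJob` are hypothetical stand-ins; this does not model Blender’s own scene data or the 2.8 copy-on-write system.

```c
#include <assert.h>
#include <pthread.h>

/* Stand-in for the scene: evaluating animation writes into it. */
typedef struct {
  int cfra;
  float animated_value; /* written by animation evaluation */
} SceneState;

static SceneState g_scene = {1, 0.0f};
static pthread_mutex_t g_scene_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical animation evaluation: mutates the scene it is given. */
static void evaluate_animation(SceneState *scene, int cfra)
{
  scene->cfra = cfra;
  scene->animated_value = cfra * 0.5f;
}

typedef struct {
  int first_frame, count;
  float results[16]; /* stand-in for rendered frames */
} PrefetchJob;

static void *prefetch_thread(void *arg)
{
  PrefetchJob *job = arg;
  assert(job->count <= 16);
  /* Hold the lock only long enough to take a private copy. */
  pthread_mutex_lock(&g_scene_lock);
  SceneState local = g_scene;
  pthread_mutex_unlock(&g_scene_lock);

  /* All writes go to the disposable copy; g_scene is never modified. */
  for (int i = 0; i < job->count; i++) {
    evaluate_animation(&local, job->first_frame + i);
    job->results[i] = local.animated_value;
  }
  return NULL;
}
```

The main thread never waits for a render here, only for the brief copy; the cost of this design is the memory and time needed to copy the scene.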

Cache invalidation:

If the user makes a change that would alter the output of a sequence, the cache of this sequence has to be discarded / marked to be overwritten.
There is no mechanism that does this automatically.

Currently, a lot of RNA handlers call a function to invalidate the cache. I was thinking about using checksums of some memory arrays (effect data, sequence, curves, …). Doing invalidation manually doesn’t seem to be a bad thing, though. I am open minded here.

Invalidation should however also be applied to proxies, as a proxy is a cache.
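A minimal sketch of the checksum idea, using the 32-bit FNV-1a hash (a swapped-in choice; the post does not name a hash function). `StripCacheState` and `strip_needs_invalidation` are hypothetical. Hashing raw bytes only works for plain data without pointers or padding, since pointer values change when data is copied and padding bytes are not guaranteed to be stable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FNV_OFFSET 2166136261u
#define FNV_PRIME 16777619u

/* 32-bit FNV-1a hash over a block of memory. */
static uint32_t fnv1a(const void *data, size_t len, uint32_t h)
{
  const unsigned char *p = data;
  for (size_t i = 0; i < len; i++) {
    h ^= p[i];
    h *= FNV_PRIME;
  }
  return h;
}

/* Hypothetical per-strip record of the hash the cache was built from. */
typedef struct {
  uint32_t last_hash;
} StripCacheState;

/* Returns true (and stores the new hash) if the settings changed,
 * meaning cached frames of this strip must be discarded. */
static bool strip_needs_invalidation(StripCacheState *state,
                                     const void *settings, size_t len)
{
  uint32_t h = fnv1a(settings, len, FNV_OFFSET);
  if (h == state->last_hash) {
    return false;
  }
  state->last_hash = h;
  return true;
}
```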

Please leave your opinion on proposed change.

brecht · September 17, 2018, 9:56am

This isn’t really my area, but I’m not sure if we should remove the preprocess cache under the assumption that users will have proxies. Proxies are a bit of an advanced workflow still, I think many users will not go through the trouble of setting them up?

Maybe we can better prioritize what stays in the cache, and keep sources if there is enough space? For example, I can imagine that caching the final output is more important than caching the sources or some intermediate result. Although that is for playback; for interactive editing it may be better to cache the sources instead of the final output.
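A minimal sketch of this prioritized eviction, combined with a memory limit: entries are dropped lowest priority first, and least recently used among equal priorities. The three priority levels, `CacheEntry`, `pick_victim` and `evict_to_limit` are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical priorities: higher survives longer. */
enum { PRIO_SOURCE = 0, PRIO_INTERMEDIATE = 1, PRIO_FINAL = 2 };

typedef struct {
  int priority;
  unsigned last_used; /* logical timestamp of the last access */
  size_t bytes;
  int in_use;         /* 0 = slot empty or evicted */
} CacheEntry;

/* Lowest priority first, least recently used among equals; -1 if empty. */
static int pick_victim(const CacheEntry *e, int n)
{
  int victim = -1;
  for (int i = 0; i < n; i++) {
    if (!e[i].in_use) {
      continue;
    }
    if (victim < 0 || e[i].priority < e[victim].priority ||
        (e[i].priority == e[victim].priority &&
         e[i].last_used < e[victim].last_used)) {
      victim = i;
    }
  }
  return victim;
}

/* Evict until the total size fits under `limit`; returns bytes still in use. */
static size_t evict_to_limit(CacheEntry *e, int n, size_t limit)
{
  size_t total = 0;
  for (int i = 0; i < n; i++) {
    if (e[i].in_use) {
      total += e[i].bytes;
    }
  }
  while (total > limit) {
    int v = pick_victim(e, n);
    if (v < 0) {
      break;
    }
    e[v].in_use = 0;
    total -= e[v].bytes;
  }
  return total;
}
```

For interactive editing, the same code would work with the priorities swapped so that sources outrank the final output.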

The first thing I would try is to analyze where exactly the bottlenecks are in the current system with typical setups. There may be bugs or unexpected issues that are holding back performance. Then from there you can make an informed decision about which optimizations would help most.

Predicting how fast future frames will render is difficult; you can’t really tell. For example, you might start with a black screen and fade in: the first few frames will load really fast because they compress to almost nothing on disk, and if you base your decisions on that, you might get stuttering when the actual content starts playing. So best not to try to be smart about that, I think.

Some of the other ideas could be good, but it really depends on the problem that needs to be solved.

In 2.8 this seems like the right solution; the copy-on-write system was added for this purpose.

iss · September 17, 2018, 11:29am

Unless the user has input that consists only of I-frames (which is one of the characteristics of a proxy), the sequencer is unusable. You have to have proxies. At least that is the situation now, and users tolerate it.

We can improve this:

proxy is cache, make efficient way to load it to RAM

I will create a mechanism for this, as you have to load frames in bulk.

As I said, I will collect some numbers and present possible improvements.

Thanks, I will try to consult the details of this with you when I have something ready.

brecht · September 17, 2018, 2:51pm

I think there may be a big gap between casual users who just want to edit together a few simple videos or an image sequence, and more serious users who will use proxies and heavier effects.

My main concern is ensuring both cases still work. If the preprocess cache is not effective for either then it seems fine to remove, but I don’t have much insight into that.

GDquest · September 17, 2018, 10:47pm

If the preprocess cache is not effective for either then it seems fine to remove, but I don’t have much insight into that.

I’ve edited hundreds of videos with the VSE, and it’s unusable with HD footage without proxies. Even screencasts with a low bitrate. With the current playback engine, even with 100% proxies playback isn’t 100% smooth, although it’s fine for color grading. Just to second Richard here: you can’t use the VSE without proxies.

brecht · September 18, 2018, 7:32am

But many users do use it without proxies though, not everyone’s use case is the same. If you’re editing long videos with a bunch of cutting then sure, but if you’re just chaining together a few shots or trimming the start/end of a video you may not go through the trouble of setting up proxies.

Even with proxies, maybe the source cache is useful because they have a slow hard disk or network drive, maybe they’re just editing a short video or part of a longer video that fully fits in the cache, maybe it makes updates more interactive if you are tweaking strip settings, etc.

We just have to be careful not to make design decisions purely from the point of view of an advanced user.

troy_s · September 21, 2018, 9:11pm

Isn’t it more prudent here, Brecht, to target the BI’s tested-in-production workflow as a baseline? If we begin to design for the lowest common denominator, I worry we have already lost the struggle.

I understand your care for “both cases”, but in this particular instance there is a clear line between workable and not, depending on which side we choose to design for.

Can’t design for “everyone”.

The proxy path is essentially an offline / online division, which we certainly can clean up and make a little more workable. A toggle, for example, with automatic generation of onlines attached with LTC timecode, etc.

tintwotin · September 22, 2018, 8:52am

As Blender is now, with no checks on e.g. source ratio, fps or vfr (variable frame rate), and no warnings or suggestions to re-encode to proxies (or to change the project settings) when the source is incompatible with the current project settings, many first-time casual users will have a bad experience caused by slow playback or audio out of sync (vfr).

So casual users may need help and guidance like warnings, suggestions (to change the project ratio or fps, or to generate proxies of incompatible source material), or even forced automatic proxy generation (in the case of vfr), to have a successful experience with the VSE. This means they would have to walk the same path as more serious users. Of course, these things should be optional, but I see nothing wrong in trying to nudge users towards what is considered best practice, with the purpose of giving them the best possible experience.

brecht · September 22, 2018, 2:55pm

Certainly a better UI for proxies would help.

We seem to be going off track here. All I suggested is an alternative solution where, instead of removing the preprocess cache, we put the data in the main cache at low priority. If there is a simple way to avoid the risk of breaking other use cases, then why not do it?

iss · September 25, 2018, 9:37pm

Some early news:
I looked into the preprocess cache and found that it is unfinished or something, which means that it’s not working. Here’s a snippet:

```c
void BKE_sequencer_preprocessed_cache_put(const SeqRenderData *context, Sequence *seq, float cfra, eSeqStripElemIBuf type, ImBuf *ibuf)
{
	...
		/* Storing any frame other than the one currently cached wipes the whole cache. */
		if (preprocess_cache->cfra != cfra)
			BKE_sequencer_preprocessed_cache_cleanup();
	...
}
```

This causes every new frame to erase the cache, so it never holds more than one frame and there are no wasted resources.
To be fair, I didn’t realize how much memory you need for just one frame. Now I am curious whether other similar software uses some kind of compression algorithm.

I added a frame cost to the seq cache: cost = time_spent_rendering_frame / (1/set_FPS). So if cost < 1, the frame renders fast enough.
I implemented a cache viewer similar to the one used in movieclip. Frame cost is shown on a color scale: blue is best, red is worst. This reveals some interesting patterns…
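The cost metric can be sketched as below, taking cost as render time divided by the per-frame budget 1/fps, which matches the rule that cost < 1 means the frame renders fast enough. `frame_cost` and `elapsed_s` are hypothetical helpers; measuring wall time with a monotonic clock is an assumption, not necessarily how the patch measures it.

```c
#include <assert.h>
#include <time.h>

/* Render time relative to the playback budget of one frame.
 * cost < 1: the frame renders faster than real time. */
static double frame_cost(double time_spent_s, double fps)
{
  assert(fps > 0.0);
  const double frame_budget_s = 1.0 / fps;
  return time_spent_s / frame_budget_s;
}

/* Seconds elapsed between two timestamps. */
static double elapsed_s(const struct timespec *begin, const struct timespec *end)
{
  return (double)(end->tv_sec - begin->tv_sec) +
         (double)(end->tv_nsec - begin->tv_nsec) / 1e9;
}
```

In use, one would call `clock_gettime(CLOCK_MONOTONIC, &begin)` before rendering a frame and again into `end` afterwards, then pass `elapsed_s(&begin, &end)` to `frame_cost`.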

After this I tested a simplistic prefetcher. It works, but I will have to implement freeing of “used” cache frames to be able to prefetch indefinitely.

Here is a clip of playback with the cache view (cache for strips is disabled):

Interesting are the strong blue strips, which are always in the same spot and evenly distributed. Red presumably indicates overhead from opening a new file stream.

My plan is to finish this prefetcher with this strategy: prefetch n future frames and, if needed, free the “used” cache frames with the lowest cost. This is a good strategy for playing back long parts of the timeline.
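A minimal sketch of that strategy on a fixed-size cache: prefetch the next n frames, fill empty slots first, and otherwise overwrite the “used” frame (already behind the playhead) with the lowest cost. `Slot`, `MAX_SLOTS`, `cache_has`, `find_free_slot`, `prefetch` and especially `render_frame_cost` (a fake renderer that makes every 10th frame expensive) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SLOTS 8

typedef struct {
  int frame;  /* timeline frame held by this slot, -1 if empty */
  float cost; /* render cost measured when the frame was rendered */
} Slot;

/* Hypothetical stand-in for rendering: every 10th frame is expensive,
 * e.g. because a new file stream has to be opened there. */
static float render_frame_cost(int frame)
{
  return (frame % 10 == 0) ? 3.0f : 0.5f;
}

static bool cache_has(const Slot *slots, int frame)
{
  for (int i = 0; i < MAX_SLOTS; i++) {
    if (slots[i].frame == frame) {
      return true;
    }
  }
  return false;
}

/* An empty slot if there is one, else the used frame with the lowest cost.
 * Returns -1 if every slot holds a frame that has not been played yet. */
static int find_free_slot(const Slot *slots, int playhead)
{
  int best = -1;
  for (int i = 0; i < MAX_SLOTS; i++) {
    if (slots[i].frame < 0) {
      return i;
    }
    if (slots[i].frame < playhead &&
        (best < 0 || slots[i].cost < slots[best].cost)) {
      best = i;
    }
  }
  return best;
}

/* Prefetch the n frames after the playhead; returns how many were added. */
static int prefetch(Slot *slots, int playhead, int n)
{
  assert(n >= 0);
  int added = 0;
  for (int f = playhead + 1; f <= playhead + n; f++) {
    if (cache_has(slots, f)) {
      continue;
    }
    int i = find_free_slot(slots, playhead);
    if (i < 0) {
      break; /* cache is full of frames still to be played */
    }
    slots[i].frame = f;
    slots[i].cost = render_frame_cost(f);
    added++;
  }
  return added;
}
```

Keeping expensive frames longer means that scrubbing back over a slow section is more likely to hit the cache, while cheap frames can simply be re-rendered.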

Then I will look at how movie files are loaded and try to optimize this a bit. The general idea is that if a file plays smoothly outside of Blender, it should play smoothly in Blender. Without proxies, of course.

After that we can implement prefetchers with more strategies, such as:
- far lookahead for parts of the timeline that render so slowly that the “play” prefetcher would not be able to render frames fast enough
- an editing prefetcher, triggered by the user making edits: remember edits, keep sources in cache, possibly start prefetching the result

These changes would turn the main thread mainly into a cache viewer, with prefetch threads maintaining the cache depending on the chosen strategy.
This process has to be automated, and I think that we have to recognize workflow patterns.
Resources are scarce: with 16GB of cache you can store only about half a minute of full HD 60fps 8bit footage (about a minute at 30fps).
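The arithmetic behind this estimate can be checked with a small sketch, assuming uncompressed RGBA frames at one byte per channel (4 bytes per pixel) and a decimal 16 GB cache. `frame_bytes` and `cache_seconds` are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes for one uncompressed frame. */
static uint64_t frame_bytes(uint64_t w, uint64_t h, uint64_t channels,
                            uint64_t bytes_per_channel)
{
  return w * h * channels * bytes_per_channel;
}

/* Seconds of footage that fit into a cache of `cache_bytes`. */
static double cache_seconds(uint64_t cache_bytes, uint64_t frame_size, double fps)
{
  assert(frame_size > 0 && fps > 0.0);
  uint64_t frames = cache_bytes / frame_size; /* whole frames only */
  return (double)frames / fps;
}
```

1920 × 1080 × 4 bytes is 8,294,400 bytes per frame, so 16 GB holds 1929 frames: about 32 s at 60 fps or 64 s at 30 fps. A buffer with 2 bytes per channel doubles the frame size and halves these durations.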

Speaking of the cache viewer, which the sequencer already is in the context of proxies: the user should be able (and encouraged) to make a proxy of a time-consuming part once satisfied with the edits. Then this limited cache should be enough for a lot of tasks.

In my case I have only 1GB of cache (4GB RAM, don’t laugh), and with proxies on effects I can at least preview the timeline in real time, so even low-end users can be satisfied.

troy_s · September 27, 2018, 12:30am

And 8 bit isn’t sufficient for colour transforms / composites even for offlines, so you can factor in double the capacity.

iss · September 27, 2018, 1:28pm

It’s easy to say, a bit harder to code. But instead of compressing cached frames, I would feed frames to an ffmpeg encoder, store the encoded stream in RAM, and when the data is no longer needed, dump it to the HDD as a proxy.

This would have to be managed by some scheduler, because file loading would need absolute priority. Normally you load quite small chunks, but you would also suddenly have to dump a few gigs of data.

Basically the same goes for prefetchers: they should use as much CPU as possible, but the playback thread has absolute priority.
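A minimal sketch of such a strict-priority scheduler: the next job always comes from the most important class that has pending work. The four job classes and their order are assumptions drawn from the two posts above (in particular, placing playback above file loading is a guess); `JobClass` and `next_job_class` are hypothetical.

```c
#include <assert.h>

/* Hypothetical job classes, highest priority first. */
typedef enum {
  JOB_PLAYBACK = 0,  /* frame needed by the playing thread right now */
  JOB_FILE_LOAD = 1, /* reading source chunks from disk */
  JOB_PREFETCH = 2,  /* rendering future frames */
  JOB_DUMP = 3       /* writing the encoded cache to disk as a proxy */
} JobClass;

#define JOB_CLASSES 4

/* Returns the class to run next, or -1 if nothing is pending. */
static int next_job_class(const int pending[JOB_CLASSES])
{
  for (int c = 0; c < JOB_CLASSES; c++) {
    if (pending[c] > 0) {
      return c;
    }
  }
  return -1;
}
```

Strict priority is the simplest reading of “absolute priority”, but it can starve the dump to disk indefinitely while playback and prefetching stay busy, so a real scheduler might need some aging or a minimum share for low-priority jobs.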

Anyway, my progress is extremely slow, as I have to attend to my day job, so this is still quite far off.

tintwotin · October 23, 2018, 11:21am

Are you making any progress?

iss · October 24, 2018, 9:30am

Hi, thanks for asking.

I have made some progress since my last post, though not much, as I’m currently occupied by my main job.

I tested rendering cache frames in a scene copy. It works OK, but creating a scene copy, even outside of the running bmain struct, messes with the UI (the scene selector is red, keyframes stop working).

I need some storage for the prefetch thread to avoid using globals. The same goes for the cache itself. The scene would be a good place, I guess.

I tested a cache limiter; it also works OK, but it still needs polishing.

I started testing stability without using a mutex; this is where I ended up. Using a mutex can block the main thread for the time it takes to render 2 frames instead of one.

I am taking a few days off this week, so I will try to make a POC in a week or 2.

From my tests this makes a huge impact on preview smoothness.

tintwotin · October 24, 2018, 10:09am

Wow, sounds great. I’ll post about this on the VSE FB Group.

tintwotin · November 12, 2018, 3:24pm

Congratulations on your new patch. Would love to test it, if you can share a build?

iss · November 13, 2018, 12:00am

Thanks

I guess I will update this thread also - early patch is here:
https://developer.blender.org/D3934

with win32 build in the comments:
https://developer.blender.org/D3934#89251

looch · November 24, 2018, 9:49am

@iss have you continued with the development of this?

iss · November 26, 2018, 6:18am

Yes, I will update the patch in a moment. If you want a binary, I can build it in a few hours when I get home.

looch · November 26, 2018, 10:10am

No rush, I just wanted to know if it kept going. I’ve tried the current versions, and for me it crashes, and it doesn’t seem to let me set more than 2 GB of RAM.
