Better Audio Integration

Hello all,

I’m posting here in the hope of getting some ideas on how to solve current issues in the audio system of Blender. The following list (copied from here: https://developer.blender.org/T59540#588156) details some design decisions I had to make when implementing the audio system, together with their implications - what can be done and what becomes really hard because of these decisions.

Let’s start with animation: Blender allows animating everything, which is a really nice concept. There are extensive possibilities for animation: f-curves, for example, with modifiers on top of those. Video runs at 24 to 240 fps, so evaluating the animation system that often is not a big deal. With audio, however, the sample rate is usually 48 kHz and above. Calling the animation system evaluation functions for many different properties at this rate is simply too slow for real-time audio processing. This is the reason why I had to introduce an animation caching system in the audio library.

The cache of course comes with disadvantages. The properties that can be animated in the audio system are volume, panning, pitch, and 3D location and orientation (of audio sources and the “camera”/listener), and the cache is updated on every frame change, which can be too late during playback. Thus it can easily run out of sync with what the animation system knows - slightly moving a single handle of an f-curve has consequences for multiple frames, but the cache is only updated for the current frame, since always updating the whole timeline is too slow. To work around that, the user is provided with an “Update Animation Cache” button, which is not only an ugly hack but also horrible UI design, since users don’t want to and shouldn’t have to care about this at all.

Another disadvantage of the cache is that, since it stores values, it has to sample the animation data at some sampling rate and later reconstruct it with linear interpolation. This sampling rate is currently the frame rate (fps) of the scene. I would have preferred to base it on an absolute time value, but when the frame rate is changed in Blender, all animation stays frame based and is thus moved to different time points. One more detail: during audio mixing the animation cache is evaluated once per mixing buffer; you can set the size of this buffer for playback in the Blender user settings.
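
To make the caching idea more concrete, here is a minimal sketch (made-up names, not the actual audaspace code) of a per-property cache that is written once per scene frame and read back with linear interpolation from the mixing thread:

```cpp
// Minimal sketch of a per-property animation cache, sampled once per
// scene frame and read back with linear interpolation at audio time.
// Names and layout are illustrative, not the actual audaspace code.
#include <algorithm>
#include <cmath>
#include <vector>

class AnimationCache {
    std::vector<float> samples;  // one value per scene frame
    float fps;                   // sampling rate of the cache = scene frame rate

public:
    explicit AnimationCache(float fps) : fps(fps) {}

    // Called on frame change: store the value the animation system evaluated
    // for this frame. Frames that are never visited keep stale data - this is
    // exactly the out-of-sync problem described above.
    void write(int frame, float value)
    {
        if (frame >= (int)samples.size())
            samples.resize(frame + 1, value);
        samples[frame] = value;
    }

    // Called from the audio mixing thread, once per mixing buffer.
    float read(double time) const
    {
        if (samples.empty())
            return 0.0f;
        double pos = time * fps;
        int i0 = std::clamp((int)std::floor(pos), 0, (int)samples.size() - 1);
        int i1 = std::min(i0 + 1, (int)samples.size() - 1);
        float t = std::clamp((float)(pos - i0), 0.0f, 1.0f);
        return samples[i0] * (1.0f - t) + samples[i1] * t;  // linear interpolation
    }
};
```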

A related problem is pitch animation. Animating the pitch amounts to animating the playback speed of the audio. The audio system supports arbitrary pitch changes just fine - it has a very high quality resampler that easily deals with this, since that is also required for the Doppler effect. The problem with pitch animation arises with seeking. To seek properly in pitch-animated audio, you would basically need to integrate the playback speed from the start of the audio file. Furthermore, to be really precise, this needs to be done with exactly the same sampling that the playback will later use. Since this is a huge effort, it’s currently simply not done. Seeking within a pitch-animated sequence simply assumes the current pitch is constant, which of course ends up at the wrong position. Users only hear the correct audio if they start playback at the beginning of the strip. Similar problems, by the way, also arise when rendering the waveform of a pitch-animated sequence.
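
The core of the problem, as a rough sketch (the function name is a placeholder, not an existing Blender/audaspace function): the source position is the integral of the playback speed, so an exact seek has to accumulate the pitch over the whole strip up to the target time.

```cpp
// Sketch of the seeking problem for pitch-animated strips.
// "samplePitch" stands in for evaluating the pitch animation at a given
// scene time; it is a placeholder, not an existing function.
double seekPositionSeconds(double targetTime, double dt,
                           double (*samplePitch)(double))
{
    double sourceTime = 0.0;
    // Naive numeric integration; a really exact result additionally requires
    // stepping with exactly the same sampling the playback itself uses.
    for (double t = 0.0; t < targetTime; t += dt)
        sourceTime += samplePitch(t) * dt;
    return sourceTime;
}
```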

Talking about precision naturally leads to the next problem: sample-exact cuts. During playback it is not enough for the animation system to simply tell the audio system: “Hey, this strip has to start playing back now.” This works for rendering, since the code runs synchronously with the rendering. But audio has to run asynchronously, or you get either a lot of stuttering, clicks and pops, or wrong timing that is different every time you play back. The consequence was that I implemented a duplicate of the sequencing inside the audio system, which needs to be kept up to date with what the VSE knows, such as strip starts, offsets and lengths. With this data structure the playback of a strip can then start precisely at the position it has to.
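
To put a number on this (my own example values): a strip starting at frame 100 in a 24 fps scene with 48 kHz audio has to start playing at sample 100 / 24 × 48000 = 200,000; starting it one mixing buffer late (say 2048 samples) is an error of roughly 43 ms, which is clearly audible.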

Actually, there is another reason for duplicating the sequencing system within the audio system: scene strips, and simultaneous playback and mixdown support. The audio system manages the sequencing data in two separate kinds of data structures. The first kind holds the actual data that is also available in the VSE, such as strip starts, offsets and lengths. The second kind stores the data required for playback, such as handles to audio files and the current playback position within those files, together with a link back to the first kind. The latter can exist multiple times in order to support playback at the same time as mixdown, and scene strips. I actually wonder how this is done for video files, since those have similar problems - is there just one video file handle that is always seeked to the position where the current process/strip is located? If you don’t understand this problem, think of this example: you have a simple scene with just one audio or video strip. In another scene you add a scene strip of this scene twice, overlapping and with a small offset. Now during playback, you have to read this one audio/video strip for both scene strips at different time points. That works badly if you have to seek within the file constantly.
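
Schematically, the split looks roughly like this (illustrative names only, not the actual audaspace structures):

```cpp
// One structure mirrors what the VSE knows about a strip, the other holds
// per-consumer playback state, so playback, mixdown and scene strips can
// each read the same strip independently.
struct SequenceEntry {        // shared, mirrors VSE data
    double start;             // strip start in scene time
    double offsetIn, offsetOut;
    const char *filepath;
};

struct PlaybackHandle {       // one per playback/mixdown/scene-strip instance
    SequenceEntry *entry;     // link to the shared data
    void *fileReader;         // decoder/file handle owned by this instance
    long long position;       // current read position in samples
};
```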

Currently, with the new dependency graph, the first issue in particular got worse: you basically have to hit the “Update Animation Cache” button every time something changes. Clearly, these issues should be fixed. Here are some ideas:

  • One option I entertained before is to render the audio into a buffer (which has to be on disk for long animations) in a background thread, which would solve some issues. But keeping it up to date is not straightforward, and of course it needs quite a bit of memory - raw audio is far from small.
  • For animation I’m a bit out of ideas. Maybe it would already help to move the animation cache from audaspace to Blender, where the animation system could be responsible for keeping it up-to-date?
  • For pitch-animated strips we could store/compute a mapping between frame number and seeking position within the audio file (see the sketch after this list).
  • I wonder whether the dependency graph could be used for the signal processing graph used for audio, covering also the case of scene strips and duplicating the graph for rendering/mixdown? Unfortunately, so far I don’t really understand how the dependency graph works.
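
For the frame-to-seek-position mapping mentioned above, a minimal sketch (made-up names, not existing Blender API) could look like this: build the table once, or incrementally whenever the pitch animation changes, by integrating the pitch per frame; seeking then becomes a simple lookup.

```cpp
#include <vector>

// Integrate the animated pitch once per frame into a lookup table.
// "pitchAtFrame" is a placeholder for evaluating the pitch animation.
std::vector<long long> buildSeekTable(int frameCount, double fps, double sampleRate,
                                      double (*pitchAtFrame)(int))
{
    std::vector<long long> table(frameCount);
    double sourceSamples = 0.0;
    for (int f = 0; f < frameCount; f++) {
        table[f] = (long long)sourceSamples;
        // advance by one frame of output, scaled by the animated pitch
        sourceSamples += pitchAtFrame(f) * sampleRate / fps;
    }
    return table;  // seek to frame f  =>  jump to table[f] in the source
}
```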

With this information, I’d like to know what other developers think, especially @iss, @sybren and sergey (whom I cannot mention since apparently I’m a new user here xD).

Cheers,
Jörg


@neXyon Thank you for all your work on Blender audio and for this important topic!

I think that Blender and the VSE should focus more on audio, aiming for basic but precise audio editing and good playback performance (without asking it to become a DAW, of course).

In fact, we should not forget that in film and audio-visual productions the audio is a fundamental part of the experience and of communicating emotion, and working with Blender audio is not easy because of the design limitations you described so well.

I hope, for example, we can find a solution for subframe audio editing (the “sample exact cuts”), as it is very clear that audio and video are different (kHz vs fps).

The buffer solution is interesting for those who can work with enough RAM (I am using 32 GB of RAM, but for my next system I plan to use more).

About the dependency graph: it could also be an occasion to think about a solution for the scene/sequence design problem (“sequences inside scenes” is a limitation).

Then, since Blender is 3D, it would be interesting to talk about 3D audio improvements and binaural audio, but maybe on another occasion/in another topic :slight_smile:

Sorry if I am missing some important detail; I have never really understood what problem the animation cache solves. How can it help when I open a file and press play? Doesn’t it imply that the cache has to collect data points during playback in order to smooth the animation later? Or does it lag slightly, so that I effectively see the animation one frame later than I should?

I don’t think there is a way around using the animation system’s evaluation functions. Sure, these could build a cache, but they have to be used.
Since we are talking real-time, things are not that simple, as sound playback should also react to manual tweaks such as adjusting the volume.

If I understand the current system correctly, during playback Blender updates changes in scene sound at quite a high frequency. During rendering this could happen more often than once per frame, but it does not need to happen for each sample if smoothing is used.
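
To illustrate what “smoothing” can mean here (a purely illustrative sketch, not existing Blender code): instead of evaluating the animation per sample, evaluate the target volume once per mixing buffer and ramp towards it across the buffer, which avoids clicks.

```cpp
// Apply a per-buffer volume change as a linear ramp instead of a jump.
void mixBuffer(float *buffer, int length, float &currentVolume, float targetVolume)
{
    float step = (targetVolume - currentVolume) / length;
    for (int i = 0; i < length; i++) {
        currentVolume += step;   // smoothly approach the new target volume
        buffer[i] *= currentVolume;
    }
}
```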

This is basically what I described in T59540 in the “Playback speed animation” section. I wouldn’t say that getting the f-curve data and integrating it is too complicated. I wrote that in the context of the VSE, where you wouldn’t expect anything more complex than simple keyframe animation.
I am not sure how more complex cases could be handled. Possibly, as you suggested, the sound animation cache could provide a mapping between timeline frame and audio handle seek position (with relatively smooth animation data).

The same goes for presenting animated data (like waveforms) on the timeline, and making edits based on the visualized data. The difference is that playback is the dependency graph’s domain, while editing/presentation is the editors’ domain.

In any case, this would have to be resolved in the Blender codebase.

There are two instances of the anim structure that reference one file. The file is seeked (read and decoded) twice. With FFmpeg, each anim also has its own decoder, so when you advance the position by one frame, both decoders are fed only one more packet, which is quite an efficient configuration.
In some cases the VSE cache would be utilized as well, depending on the setup, but that’s beside the point, I think.

I don’t think seeking in files is a big problem though? Especially for audio.

This would have the limitation that you can’t change the properties of a sound during playback, so I’m not sure about that, but audio scrubbing would be top quality :slight_smile:

I think this could work. It could be completely invisible to the user and would definitely improve the current state.

I would say this is more of an issue with data organization and ownership. Report T69444 is quite a good example.
If I remember correctly, I thought it would be better if the scene wasn’t the owner of the sound “output node”, but rather the camera or the sequencer (or the compositor?), as these are technically used by rendering as “output nodes”. This way the relationship would be clearer.
This is again looking at it from the VSE perspective. On the other hand, you would expect to be able to link a scene with a speaker, set it as background, and have it just work. There may be other workflows I am not aware of that would contradict my idea, so I am not able to evaluate this situation currently.
There are other problems with scene strips, like T70569, so these problems will need design evaluation.


Animation

I think this issue is related to the depsgraph not being able to evaluate data at two points in time simultaneously. For regular evaluation you of course want to just evaluate the current frame, as already happens in Blender. For audio evaluation it is indeed necessary to evaluate one (or more) frames ahead of time. For example, that way you know which volume will be needed in the next frame, and can do proper interpolation to reach it. Evaluation is not cheap though, so this ahead-of-time evaluation should really be limited to the audio system and its dependencies.

Pitch Animation & Sample Exact Cuts

How much memory would it take to store a sample number for each scene frame (obtained by integrating the pitch, as you said), so that you can do exact lookups? This could then maybe also be used for the exact cuts. Placing this buffer on disk is IMO fine if it gets too big.
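
As a rough back-of-the-envelope estimate (my own numbers): a 10-minute scene at 24 fps has 600 × 24 = 14,400 frames, so one 64-bit sample index per frame is about 115 KB per strip, and even a 2-hour edit stays around 1.4 MB, so memory seems unlikely to be the limiting factor.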

Basically, the dependency graph is a directed acyclic graph that contains two kinds of information:

  • “to evaluate X, I need Y₁, Y₂, and Y₃, so evaluate those first”, and
  • “to evaluate things of type X, call this function”.

The depsgraph keeps track of what has and hasn’t changed since the last time it was fully evaluated, and thus knows what to re-evaluate when some data is requested.
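
Since the original question was how the depsgraph works, here is a tiny, purely illustrative sketch of that idea (not the actual depsgraph code, which is considerably more involved): nodes know their dependencies and an evaluation callback, and only tagged (changed) nodes and their dependents are re-evaluated on request.

```cpp
#include <functional>
#include <vector>

struct Node {
    std::vector<Node *> dependencies;  // "to evaluate X, I need Y1, Y2 and Y3"
    std::function<void()> evaluate;    // "to evaluate things of type X, call this"
    bool dirty = true;                 // changed since the last full evaluation?

    // Returns true if this node was re-evaluated. Dependencies are visited
    // first; a node re-evaluates if it was tagged dirty or if any dependency
    // changed. The graph is assumed to be acyclic.
    bool evaluateIfNeeded()
    {
        bool depChanged = false;
        for (Node *dep : dependencies)
            depChanged |= dep->evaluateIfNeeded();
        if (!dirty && !depChanged)
            return false;
        evaluate();
        dirty = false;
        return true;
    }
};
```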
