Hello all,
I’m posting here in the hope of getting some ideas on how to solve the current issues in Blender’s audio system. The following list (copied from here: https://developer.blender.org/T59540#588156) details some design decisions I had to make when implementing the audio system, along with their implications - what works, and what becomes really hard because of these decisions.
Let’s start with animation: Blender allows you to animate everything, which is a really nice concept. There are extensive possibilities for animation, for example f-curves with modifiers on top of them. Video runs at 24 to 240 fps, so evaluating the animation system that often is not a big deal. Audio, however, usually runs at a sample rate of 48 kHz or above, and calling the animation system’s evaluation functions for many different properties at that rate is simply too slow for real-time audio processing. This is why I had to introduce an animation caching system in the audio library.

The cache of course has disadvantages. The properties that can be animated in the audio system are volume, panning, pitch, and the 3D location and orientation of audio sources and the “camera”/listener. The cache is updated on every frame change, which can be too late during playback. Thus, it can easily run out of sync with what the animation system knows: slightly moving a single handle of an f-curve has consequences for multiple frames, but the cache is only updated for the current frame, since always updating the whole timeline is too slow. To force a full update, the user is provided with an “Update Animation Cache” button, which is not only an ugly hack but also horrible UI design, since users don’t want to and shouldn’t have to care about this at all.

Another disadvantage of the cache is that, since it stores values, it has to sample the animation data at some sampling rate and later reconstruct it with linear interpolation. This sampling rate is currently the frame rate (fps) of the scene. I would have preferred to base it on an absolute time value, but when the frame rate is changed in Blender, all animation stays frame based and is thus moved to different points in time. Another detail: during audio mixing, the animation cache is evaluated once per mixing buffer; the buffer size for playback can be set in the Blender user settings.
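To make the caching scheme concrete, here is a minimal sketch of the idea described above: animated values are sampled once per frame at the scene fps and reconstructed with linear interpolation when the mixer asks for a value at an arbitrary time. The `AnimationCache` class and its names are purely illustrative, not Blender or audaspace API.

```python
class AnimationCache:
    """Illustrative per-frame cache of an animated property (e.g. volume)."""

    def __init__(self, fps):
        self.fps = fps
        self.samples = []  # one cached value per frame, starting at frame 0

    def update(self, evaluate, start_frame, end_frame):
        # evaluate(frame) stands in for the (slow) animation system call.
        # It runs once per frame, not per audio sample - and only for the
        # frames passed in, which is exactly why the cache can go stale.
        while len(self.samples) < end_frame + 1:
            self.samples.append(0.0)
        for frame in range(start_frame, end_frame + 1):
            self.samples[frame] = evaluate(frame)

    def read(self, time_seconds):
        # Reconstruct the value at an arbitrary time (e.g. once per mixing
        # buffer) with linear interpolation between cached frame values.
        pos = time_seconds * self.fps
        i = int(pos)
        frac = pos - i
        a = self.samples[min(i, len(self.samples) - 1)]
        b = self.samples[min(i + 1, len(self.samples) - 1)]
        return a + (b - a) * frac

# Example: a volume ramp from 0 to 1 over one second at 24 fps.
cache = AnimationCache(fps=24)
cache.update(lambda f: f / 24.0, 0, 24)
print(cache.read(0.5))  # -> 0.5
```

The stale-cache problem is visible here: if the f-curve changes, only the frames explicitly passed to `update()` are refreshed, while `read()` happily interpolates between old values everywhere else.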
A related problem is pitch animation. Animating the pitch equals animating the playback speed of the audio. The audio system supports arbitrary pitch changes just fine - it has a very high quality resampler that easily deals with this, since that is also required for the Doppler effect. The problem with pitch animation arises with seeking. To seek correctly within pitch-animated audio, you basically need to integrate the playback speed from the start of the audio file. Furthermore, to be really precise, this needs to be done with exactly the same sampling that playback will later use. Since this is a huge effort, it’s currently simply not done: seeking within a pitch-animated sequence assumes the current pitch is constant, which of course ends up at the wrong position. Users only hear the correct audio if they start playback at the beginning of the strip. Similar problems arise, by the way, if you try to render the waveform of a pitch-animated sequence.
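Why seeking needs an integration can be shown with a small sketch. Assuming the pitch is cached per frame (as the animation cache does), the source-file position for a given output time is the piecewise sum of pitch × frame duration from the strip start. This is illustrative code, not what audaspace does today - the point of the paragraph above is precisely that it does not do this.

```python
def seek_position(pitch_per_frame, fps, seek_time):
    """Integrate playback speed over output time to get the position
    (in seconds) within the source audio file.

    pitch_per_frame: cached pitch values, one per frame from the strip start.
    """
    frame_len = 1.0 / fps
    source_pos = 0.0
    t = 0.0
    for pitch in pitch_per_frame:
        if t + frame_len >= seek_time:
            # Partial last frame up to the seek point.
            return source_pos + pitch * (seek_time - t)
        source_pos += pitch * frame_len
        t += frame_len
    return source_pos

# Pitch ramping from 1.0 towards 2.0 over one second at 24 fps: after
# one second of output we are noticeably more than one second into the
# source file, which a constant-pitch assumption would get wrong.
pitches = [1.0 + f / 24.0 for f in range(24)]
correct = seek_position(pitches, 24, 1.0)
naive = pitches[-1] * 1.0  # "current pitch is constant" assumption
print(correct, naive)
```

Note that even this is only an approximation: to match playback exactly, the integration would have to use the very same per-sample evaluation that the resampler uses, which is the "huge effort" mentioned above.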
Talking about being precise naturally leads to the next problem: sample-exact cuts. During playback it is not enough for the animation system to simply tell the audio system: “Hey, this strip has to start playing back now.” This works for rendering, since that code runs synchronously with the rendering. Audio, however, has to run asynchronously, or you get either a lot of stuttering, clicks and pops, or wrong timing that differs every time you play back. The consequence was that I implemented a duplicate of the sequencing inside the audio system, which needs to be kept up to date with what the VSE knows, such as strip starts, offsets and lengths. With this data structure, the playback of a strip can start at exactly the position it has to.
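A small sketch of what sample-exact means here, with made-up numbers: the mixer works in buffers of N samples, so a strip starting mid-buffer has to begin at the exact sample offset within that buffer, not at the nearest frame or buffer boundary. The function and its parameters are hypothetical.

```python
def strip_start_in_buffer(strip_start_frame, fps, sample_rate,
                          buffer_start_sample, buffer_size):
    """Return the sample offset within the current mixing buffer at which
    the strip begins, or None if the strip does not start in this buffer."""
    start_sample = round(strip_start_frame / fps * sample_rate)
    offset = start_sample - buffer_start_sample
    if 0 <= offset < buffer_size:
        return offset
    return None

# A strip starting at frame 25 (24 fps, 48 kHz) begins at sample 50000,
# i.e. 976 samples into a 1024-sample buffer that starts at sample 49024.
print(strip_start_in_buffer(25, 24, 48000, 49024, 1024))  # -> 976
```

This only works if the audio thread has its own copy of the strip boundaries, which is exactly why the sequencing data is duplicated inside the audio system.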
Actually, there is another reason for duplicating the sequencing system within the audio system: scene strips and simultaneous playback and mixdown support. The audio system manages the sequencing data separated into two types of data structures. One type is responsible for the data that is also available in the VSE, such as strip starts, offsets and lengths. The other stores data required for playback, such as handles to audio files and the current playback position within those files, together with a link to the first type. The latter can exist multiple times in order to support playback at the same time as mixdown, as well as scene strips. I actually wonder how this is done for video files, since those have similar problems - is there just one video file handle that is always seeked to the position of the current process/strip? If you don’t understand the problem, consider this example: you have a simple scene with just one audio or video strip. In another scene you add a scene strip of this scene twice, overlapping and with a small offset. During playback you then have to read this one audio/video strip for both scene strips at different time points, which works badly if you constantly have to seek within the file.
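The two kinds of data structures can be sketched like this: `StripData` mirrors what the VSE knows (start, offset, length) and exists once per strip, while `PlaybackHandle` holds per-consumer state (file handle, read position) and can exist several times for the same strip - once for live playback and once for mixdown, or twice when a scene strip is used twice with an offset. The class names are illustrative, not the actual audaspace types.

```python
class StripData:
    """Shared, VSE-side data: exists once per strip."""

    def __init__(self, start, offset, length):
        self.start = start    # strip start on the timeline
        self.offset = offset  # offset into the source file
        self.length = length


class PlaybackHandle:
    """Per-consumer playback state: may exist several times per strip."""

    def __init__(self, strip, open_file):
        self.strip = strip       # link back to the shared strip data
        self.file = open_file()  # each consumer gets its own file handle,
        self.position = 0        # so two readers never fight over seeks


strip = StripData(start=100, offset=10, length=200)

# The same strip read by two consumers at once: live playback and mixdown
# (or two overlapping scene strips). Each has its own handle and position.
playback = PlaybackHandle(strip, lambda: object())
mixdown = PlaybackHandle(strip, lambda: object())
print(playback.strip is mixdown.strip)  # -> True: the strip data is shared
```

The overlapping-scene-strip example above is exactly the case where a single shared file handle would break down: the two consumers would constantly seek the file back and forth under each other.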
Currently, with the dependency graph, the first issue in particular has gotten worse: you basically have to hit the “Update Animation Cache” button every time something changes. Clearly, these issues should be fixed. Here are some ideas:
- One option that I entertained before is to render audio into a buffer (which has to be on disk for long animations) in a background thread, which would solve some issues. But keeping it up to date is not straightforward, and it of course needs quite a bit of memory, since raw audio is far from small.
- For animation I’m a bit out of ideas. Maybe it would already help to move the animation cache from audaspace to Blender, where the animation system could be responsible for keeping it up-to-date?
- For pitch animated strips we could store/compute a mapping between frame number and seeking position within the audio file.
- I wonder if the dependency graph can be used for the signal processing graph used for audio, covering also the case of scene strips and duplicating the graph for rendering/mixdown? Unfortunately, so far I don’t really understand how the dependency graph works.
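The frame-to-seek-position mapping idea could look roughly like this: whenever the pitch f-curve changes, precompute the integrated source position once per frame, so that seeking becomes a table lookup instead of an integration from the strip start on every seek. Purely a sketch of the idea, not existing code.

```python
def build_seek_table(pitch_per_frame, fps):
    """Precompute the source-file position (seconds) at each frame boundary
    by integrating the per-frame pitch values. table[f] is the position in
    the source file when output frame f starts playing."""
    table = [0.0]
    for pitch in pitch_per_frame:
        table.append(table[-1] + pitch / fps)
    return table

# Constant double speed for 48 frames at 24 fps: after one second of
# output (frame 24) we are two seconds into the source file.
pitches = [2.0] * 48
table = build_seek_table(pitches, 24)
print(table[24])  # -> 2.0
```

The table would have to be rebuilt when the pitch curve changes, but that is one pass per edit rather than one integration per seek, and it would also give waveform drawing the mapping it needs.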
With this information, I’d like to know what other developers think, especially @iss, @sybren and sergey (whom I cannot mention since apparently I’m a new user here xD).
Cheers,
Jörg