Blend file load->save entropy

Hi, I’m Jonas and I develop BitWrk, a peer-to-peer rendering service for Blender.

For BitWrk to be able to dispatch its work to a swarm of computers on the internet, it needs to push scene data around as efficiently as possible. One strategy it employs is delta compression, i.e. sending only data that doesn’t yet exist on a peer. BitWrk finds common sequences of bytes using a mechanism called content-based chunking.
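To illustrate what I mean by content-based chunking (a simplified sketch only, not BitWrk’s actual implementation): chunk boundaries are derived from a rolling hash over the data itself, so identical byte runs split into identical chunks regardless of where they sit in a file, and only chunks with unknown hashes need to be transferred to a peer.

    import hashlib
    import random

    AVG_MASK = (1 << 13) - 1      # 13 low bits -> roughly 8192-byte average chunks
    GEAR = [random.Random(i).getrandbits(64) for i in range(256)]   # fixed per-byte table

    def chunk_boundaries(data):
        """Yield (start, end) offsets of content-defined chunks.

        A gear-style rolling hash is updated byte by byte; a chunk ends
        whenever the low bits of the hash are all zero, so boundaries
        depend only on the local content, not on absolute file offsets.
        """
        h = 0
        start = 0
        for i, b in enumerate(data):
            h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
            if (h & AVG_MASK) == 0:
                yield start, i + 1
                start, h = i + 1, 0
        if start < len(data):
            yield start, len(data)

    def chunk_digests(path):
        """Hash every chunk of a file; only chunks whose digest a peer
        does not already know would have to be transferred."""
        data = open(path, "rb").read()
        return [hashlib.sha1(data[s:e]).hexdigest() for s, e in chunk_boundaries(data)]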

One thing I noticed is that this doesn’t work as well as expected, and the reason seems to be Blender’s file format. Loading a .blend file into Blender and saving again would ideally result in an identical file. Surprisingly, the two files differ quite a lot.

Here is an example using a Blender demo file:

  1. Load barbershop_interior_cpu.blend (287574804 bytes)
  2. Save as barbershop_interior_cpu2.blend (287568012 bytes)
  3. Save again as barbershop_interior_cpu3.blend (again, 287568012 bytes)

The two latter files are mostly identical, with some minor differences. The difference to the original file is substantial, though: BitWrk’s content-based chunking finds that about a third of the file’s chunks (8192 bytes each, on average) differ. On BitWrk this leads to substantial retransmission of data in situations where one wouldn’t expect it.

A while ago I tried to find out what causes this effect. I noticed that, in order to build a graph structure, scene objects in the .blend file reference each other using memory addresses. Those memory addresses are apparently re-assigned at load time, resulting in a new address for every object in the .blend file. A single differing byte is enough for a chunk of (say) 8000 bytes to be detected as changed.

Do you think anything can be done to achieve more consistency across load->save cycles? Maybe assigning persistent IDs to objects (instead of memory addresses) is feasible? Or is there another strategy? Has this topic come up before?

It’s not only BitWrk users who would benefit from a solution, but also people who keep backups of their .blend files or who put them on services like Dropbox.

Thanks,
Jonas

Even between minor versions, the file format can undergo quite extensive changes in the form of new datatypes and small changes to existing structs. Therefore, opening and resaving a file that was created by a different Blender version can result in quite a different file. I don’t see the current format changing in that respect any time soon.

Memory addresses are indeed reassigned on load. This is essential, since the format is quite close to being a memory dump, albeit a somewhat glorified one in TLV form (Type-Length-Value). Additionally, a .blend file carries the information describing all structures for that version at the end of the file, so that part no doubt differs greatly between different Blender versions.
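To make that concrete, here is a minimal reading sketch based on the publicly documented layout (not code taken from Blender’s sources): a 12-byte file header, followed by a sequence of blocks, each prefixed with a small header carrying a 4-character code, the payload length, the block’s old memory address, an SDNA index and a count.

    import struct

    def read_bheads(path):
        """Iterate over the block headers of an uncompressed .blend file,
        yielding (code, length, old_address, sdna_index, count) tuples."""
        with open(path, "rb") as f:
            header = f.read(12)                      # b"BLENDER" + pointer size + endianness + version
            assert header[:7] == b"BLENDER", "not an uncompressed .blend file"
            psize = 8 if header[7:8] == b"-" else 4  # '-' = 8-byte pointers, '_' = 4-byte
            endian = "<" if header[8:9] == b"v" else ">"
            fmt = endian + "4si" + ("Q" if psize == 8 else "I") + "ii"
            size = struct.calcsize(fmt)
            while True:
                raw = f.read(size)
                if len(raw) < size:
                    break
                code, length, old_addr, sdna, count = struct.unpack(fmt, raw)
                yield code, length, old_addr, sdna, count
                if code == b"ENDB":
                    break
                f.seek(length, 1)                    # skip the block's payload

    # The old memory address stored in every block header is one of the
    # values that change between load->save cycles:
    # for code, length, old_addr, sdna, count in read_bheads("some.blend"):
    #     print(code, length, hex(old_addr))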

For your chunking to work properly you could teach it more about the format.

An interesting tool to help you investigate is kaitai.io; they have their own description of the Blender file format. I think that this, together with the document written by @JeroenBakker, may be a good place to get started.

You’re absolutely right in pointing out that there are very valid reasons for a .blend file undergoing major changes, such as when conversion from an older version is necessary. I should have mentioned that I’m not talking about this case.

But even if I open a .blend file that I saved just before (and which therefore uses the most current file format) and save it again, I observe these large differences.

Back when I analyzed this issue, Kaitai didn’t have good support for .blend and I wrote my own small utility blendinfo.py (GitHub). I used Jeroen’s format description for reference.

I also found the format quite elegant and really wouldn’t propose any change to it. But I thought it might be worth discussing a change in the way those pointers are treated: instead of reassigning them on every load, they could be kept as persistent IDs to be reused the next time a .blend file is saved.

This would mean a small increase in memory use, as every struct would have to carry its own ID. Instead of containing memory addresses, references in the file would contain IDs. There are lots of details to discuss, but the file format could remain completely compatible in both directions.

As a workaround for BitWrk’s chunking problem, I could overwrite all memory addresses in the .blend file with NULLs and keep the original addresses in a separate file. After transfer, the original .blend file could be restored. I’m hesitant to do this, as it incurs considerable overhead and might be flaky with regard to future format changes. So I’d rather see a change in how Blender handles things, as there is some benefit for others too and the performance hit is probably negligible.


In my previous post I said I knew a workaround: Zeroing all bytes in the .blend file which belong to memory addresses and saving those bytes in an external file. Sounds insane? I did just that!

Here is a (horrible) python script for y’all to enjoy: strippointers.py.

Yes, it does work. Most differences between two .blend files disappear after memory addresses are zeroed. The barbershop file contains about 6.7 MB in memory addresses appearing all over the place.
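For illustration, here is a stripped-down sketch of the principle (it is not the actual strippointers.py): zero the old-address field in every block header of a copy of the file and keep the original values in a side file so the file can be reconstructed after transfer. The real script additionally has to locate and zero the pointer fields inside the block payloads via the DNA1 catalogue, which is where most of the complexity lies.

    import json
    import struct

    def strip_bhead_addresses(src, dst, side_file):
        """Zero the old-address field in every block header of a .blend
        file copy and record the original values so they can be restored.
        Sketch only: pointer fields *inside* the block payloads also hold
        addresses and can only be found via the DNA1 catalogue."""
        saved = []                                   # original addresses, in block order
        with open(src, "rb") as fi, open(dst, "wb") as fo:
            header = fi.read(12)
            fo.write(header)
            psize = 8 if header[7:8] == b"-" else 4
            endian = "<" if header[8:9] == b"v" else ">"
            fmt = endian + "4si" + ("Q" if psize == 8 else "I") + "ii"
            size = struct.calcsize(fmt)
            while True:
                raw = fi.read(size)
                if len(raw) < size:
                    fo.write(raw)
                    break
                code, length, old_addr, sdna, count = struct.unpack(fmt, raw)
                saved.append(old_addr)
                fo.write(struct.pack(fmt, code, length, 0, sdna, count))
                fo.write(fi.read(length))            # payload copied unchanged here
                if code == b"ENDB":
                    break
        with open(side_file, "w") as f:
            json.dump(saved, f)                      # needed to reconstruct the original file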

Not all differences are gone. I didn’t trace the root causes and there are probably bugs in my script, but it is clear that the main culprit is eliminated.

The method I chose would actually work in my use case, but it is very hacky and comes with a significant performance overhead. Processing the barbershop file takes about five seconds.

Let me express my opinion by misquoting someone famous: This is all wrong! I shouldn’t be doing this! :anguished:

Well of course, if you use Python. Why not use a more performant programming language?

Here is a thought: replace the addresses with essentially a running number (the same address gets the same number). The addresses change because a .blend file is read into a different memory location each time, but they really are just numbers for connecting the dots, so to say.

Instead of writing just zeroes with your script, keep a running number that starts at 1 and is incremented for each new address you encounter, and write that number in the address’s stead. I think this should even still work when reading such a file back in.
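A minimal sketch of that renumbering table, assuming the old addresses have already been extracted by something like the header reader sketched earlier in this thread (the same table would then also have to be applied to the pointer fields inside the payloads):

    def renumber(addresses):
        """Map every distinct old address to a running number starting at 1,
        so equal addresses stay equal while the values no longer depend on
        where Blender happened to allocate the data; NULL stays NULL."""
        table = {0: 0}
        for addr in addresses:
            if addr not in table:
                table[addr] = len(table)   # 1, 2, 3, ... in order of first appearance
        return table

    # e.g. table = renumber(bh[2] for bh in read_bheads("some.blend"))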

Would be interesting to hear what happens if you did that…


Right, quite sure this will work.


Nathan, Campbell, thanks for the suggestion of re-assigning addresses sequentially. I also thought about this, as it would save me from having to store (and transfer) the pointer data separately.

The problem I see is that a single inserted or removed datablock in the file is enough for all subsequent datablocks to be assigned different IDs. Before I try it: is there an obvious reason why a save->load->save cycle would do something like this? I.e., is it order-preserving?

Where this would, unfortunately, surely break is when the user actually did some editing. I never mentioned this as a use case, but in a distributed rendering scenario it actually is one: render, make a small adjustment, render again. Not having to transfer dozens of MB each time is great for the user experience.

Since the datablocks essentially have a unique ID already (type ID + name in the form of a string), you could calculate a hash and use that for the addresses instead of the sequential numbering. That way the order doesn’t really have a (big) impact, since the address will be tightly coupled to the datablock ID. Would that work?
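As a sketch of that idea, assuming the (type, name) pair of a datablock is available to the tool doing the rewriting, the pseudo-address could be derived from a hash of exactly that identity:

    import hashlib
    import struct

    def stable_address(type_code, name, pointer_size=8):
        """Derive a pseudo-address from a datablock's identity (type + name)
        instead of its memory location, so the value is the same in every
        save regardless of block order or allocation."""
        digest = hashlib.sha1(type_code + b"\x00" + name.encode("utf-8")).digest()
        fmt = "<Q" if pointer_size == 8 else "<I"
        return struct.unpack(fmt, digest[:pointer_size])[0] or 1   # keep 0 reserved for NULL

    # stable_address(b"OB", "Suzanne") yields the same value in every save.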


Similar thing for Chrome software updates: Courgette.


Courgette looks interesting indeed. Long ago (probably around 10 years?) we did some tests with bsdiff to figure out incremental updates for Blender binaries, but in the end that never went anywhere.

Would that work?

I really liked the idea of using ID information to have a quick way of generating a unique hash (mostly unique probably, but that’s enough for my purpose). Unfortunately, if my statistics are correct, it appears that this will not have the desired impact:

> blendinfo.py --id barbershop_interior_cpu.blend
5722 of 254794 datablocks containing 5722 of 10492836 objects totalling 7432040 of 281459712 bytes are ID

I.e., only a small percentage of datablocks correspond to ID objects (those that have a struct ID as their first field, correct?). I hope I didn’t make a mistake in my analysis.

This leaves me with three options:

  1. Extract pointers into a separate file as proposed above, exploiting the fact that this helps to achieve better efficiency under blockwise deduplication/compression. This is indeed similar to what Courgette does (interesting read, by the way!).
  2. Rewrite pointers based on a hash of the full or partial content of the data (excluding embedded pointers, of course); see the sketch after this list.
  3. Hack Blender to keep a current_pointer->old_pointer map around while the file is being edited. When saving, map pointers back to their old values as far as possible.

Opinions?
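As a sketch of option 2), assuming a DNA-aware parser has already located and masked out the embedded pointer bytes of each block’s payload, the remapping table could be built from a hash of the masked content:

    import hashlib

    def remap_by_content(blocks, pointer_size=8):
        """Build an old_address -> new_value table in which the new value is
        a hash of the block's payload with its embedded pointer bytes masked
        out.  Identical content then keeps identical 'addresses' across
        saves, even if unrelated parts of the file changed.  `blocks` is an
        iterable of (old_address, masked_payload) pairs."""
        table = {0: 0}                               # NULL stays NULL
        for old_addr, masked_payload in blocks:
            digest = hashlib.blake2b(masked_payload, digest_size=pointer_size).digest()
            table[old_addr] = int.from_bytes(digest, "little") or 1
        return table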

Options 1) and 2) sound like the most future-proof ways going forward, since 3) would mean you’d have to keep that patch current and ensure your users always use an adapted Blender build; 1) and 2) work with vanilla Blender.

I would probably go with 1) first, since the potential savings in bandwidth here appear to be the largest in the long run.