Hi, I’m Jonas and I develop BitWrk, a peer-to-peer rendering service for Blender.
For BitWrk to dispatch its work to a swarm of computers on the internet, it needs to push scene data as efficiently as possible. One strategy it employs is delta compression, i.e. sending only data that doesn’t yet exist on a peer. BitWrk looks for common sequences of bytes using a mechanism called content-based chunking.
One thing I noticed is that this doesn’t work as well as expected, and the reason seems to be Blender’s file format. Loading a .blend file into Blender and saving again would ideally result in an identical file. Surprisingly, the two files differ quite a lot.
Save as barbershop_interior_cpu2.blend (287568012 bytes)
Save again as barbershop_interior_cpu3.blend (again, 287568012 bytes)
The latter two files are mostly identical, with some minor differences. The difference to the original file is substantial, though: BitWrk’s content-based chunking finds that about a third of the file’s chunks differ. Chunks are, on average, 8192 bytes in size. On BitWrk this leads to substantial retransmission of data in situations where one wouldn’t suspect it.
A while ago I tried to find out what causes this effect. I noticed that, in order to build a graph structure, scene objects in the .blend file reference each other using memory addresses. Apparently, those memory addresses are re-assigned at load time, resulting in a new address for every object in the file. A single differing byte is enough for a whole chunk of (say) 8000 bytes to be detected as changed.
Do you think there is anything that can be done to achieve more coherence between load->save cycles? Maybe assigning persistent IDs to objects (instead of memory addresses) is feasible? Or any other strategy? Has this topic come up before?
It’s not only BitWrk users who would benefit from a solution, but also people who keep backups of their .blend files or who put them on services like Dropbox.
Even between minor versions, the file format can undergo quite extensive changes in the form of new datatypes and small changes to existing structs. Therefore, opening and resaving a file that was created by a different Blender version can produce a file that looks quite different. I don’t see the current format changing in that respect any time soon.
Memory addresses are indeed reassigned on load. This is essential, since the format is quite close to being a memory dump, albeit a somewhat glorified one organized in TLV form (Type-Length-Value). Additionally, a .blend file carries the info describing all structures for that version at the end of the file, so that part no doubt differs greatly between Blender versions.
For your chunking to work properly you could teach it more about the format.
You’re absolutely right in pointing out that there are very valid reasons for a .blend file undergoing major changes, such as when conversion from an older version is necessary. I should have mentioned that I’m not talking about this case.
But even if I open a .blend file I saved right before (and which therefore has the most current file format) and save it again, I observe the large change.
Back when I analyzed this issue, Kaitai didn’t have good support for .blend and I wrote my own small utility blendinfo.py (GitHub). I used Jeroen’s format description for reference.
I also found the format quite elegant and really wouldn’t propose any change to it. But I thought maybe it is worth discussing changing the way those pointers are treated. Instead of reassigning them on every load, they could be kept as persistent IDs to be reused the next time a .blend file is saved.
This would mean a small increase of memory use, as every struct would have to carry its own ID. Instead of containing memory addresses, references in the file would contain IDs. There are lots of details to discuss, but the file format could remain completely compatible both ways.
As a workaround for BitWrk’s chunking problem, I could overwrite all memory addresses in the .blend file with NULLs and keep the original addresses in a separate file. After transfer, the original .blend file could be restored. I’m hesitant to do this as it incurs considerable overhead and might be flaky with regard to future format changes. So I’d rather see a change in how Blender handles things, as there is some benefit for others too and the performance hit is probably negligible.
Here is a thought: replace the addresses with what is essentially a running number (the same address always gets the same number). The addresses only change because a .blend file is read into a different location in memory each time; they really are just numbers for connecting up the dots, so to speak.
Instead of writing just zeroes with your script, keep a running number: for each new address you encounter, increment a counter starting at 1, and write that number in the address’ stead. I think this should even still work when reading such a file back in.
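The renumbering itself is a small mapping step; a sketch, assuming the addresses have already been extracted from the file in block order:

```python
def renumber(addresses):
    """Replace each distinct memory address with a running number in
    first-seen order; NULL (0) is preserved so broken references stay NULL."""
    mapping = {0: 0}
    result = []
    for addr in addresses:
        if addr not in mapping:
            mapping[addr] = len(mapping)   # slot 0 is taken by NULL, so counting starts at 1
        result.append(mapping[addr])
    return result
```

Two saves of the same scene loaded at different base addresses then renumber to identical sequences, as long as the block order is preserved.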
Would be interesting to hear what happens if you did that…
Nathan, Campbell, thanks for the suggestion of re-assigning addresses sequentially. I also thought about this, as it would save me from having to store (and transfer) the pointer data separately.
The problem I see is that a single inserted or removed data block in the file is enough for all subsequent datablocks to be assigned different IDs. Before I try: is there an obvious reason why a save->load->save cycle would do something like this? I.e., is it order-preserving?
Where this would, unfortunately, surely break is when the user actually does some editing. I never mentioned this as a use case, but in a distributed rendering scenario it actually is: render, make some small adjustment, render again. Not having to transfer dozens of MB each time is great for the user experience.
Since the datablocks have essentially a unique ID already (type ID + name in the form of string) you could calculate a hash and use that for the addresses instead of the sequential numbering. That way order doesn’t really have a (big) impact, since the address will be tightly coupled to the datablock ID. Would that work?
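A sketch of that hashing scheme, deriving a stable pseudo-address from the datablock’s identity (struct type plus ID name, e.g. 'OBCube' for an object); the function name and parameters are illustrative:

```python
import hashlib

def id_pseudo_address(type_name, id_name, ptr_size=8):
    """Stable pseudo-address derived from a datablock's identity instead of
    its memory location, so it survives load/save cycles and reordering."""
    digest = hashlib.blake2b((type_name + '\x00' + id_name).encode('utf-8'),
                             digest_size=ptr_size).digest()
    # never return 0, which would collide with the NULL pointer
    return int.from_bytes(digest, 'little') or 1
```

Since the value is only used for matching up references, an occasional hash collision merely costs a retransmitted chunk rather than correctness.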
I really liked the idea of using ID information to have a quick way of generating a unique hash (mostly unique probably, but that’s enough for my purpose). Unfortunately, if my statistics are correct, it appears that this will not have the desired impact:
> blendinfo.py --id barbershop_interior_cpu.blend
5722 of 254794 datablocks containing 5722 of 10492836 objects totalling 7432040 of 281459712 bytes are ID
I.e., only a small percentage of datablocks corresponds to ID objects (those that have a struct ID as first field, correct?). I hope I didn’t make any mistake in my analysis.
This leaves me with three options:
1) Extract pointers into a separate file, as proposed above, exploiting the fact that this helps achieve better efficiency under blockwise deduplication/compression. This is indeed similar to what Courgette does (interesting read, by the way!).
2) Rewrite pointers based on a hash of the full or partial content of the data (excluding embedded pointers, of course).
3) Hack Blender to keep a current_pointer->old_pointer map around while the file is being edited. When saving, map pointers back to their old values as far as possible.
Opinions?
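Option 2) could look roughly like this: hash a block’s payload with the embedded pointer fields masked out, so the value only changes when actual content changes. The pointer offsets would come from the DNA struct info; the names here are illustrative:

```python
import hashlib

def content_pseudo_address(payload, pointer_offsets, ptr_size=8):
    """Pseudo-address derived from a datablock's payload with its embedded
    pointer fields zeroed, so the value only changes when actual content
    changes, not when the block is merely relocated in memory."""
    masked = bytearray(payload)
    for off in pointer_offsets:
        masked[off:off + ptr_size] = b'\x00' * ptr_size
    digest = hashlib.blake2b(bytes(masked), digest_size=ptr_size).digest()
    return int.from_bytes(digest, 'little') or 1   # keep clear of NULL
```

One caveat: two blocks with byte-identical content would collide, so some disambiguation (e.g. an occurrence counter) would still be needed.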
Options 1) and 2) sound like the most future-proof going forward, since 3) would mean you’d have to keep that patch current and ensure your users always run an adapted version. 1) and 2) work with vanilla Blender.
I would probably go with 1) first, since the potential savings in bandwidth here appear to be the largest in the long run.