Clarifications for the Fast IO project, and the proposal

Before submitting a proposal for the Fast IO project this SoC, I would like some clarification regarding the use of memory mapping.

The current issue with the PLY, STL and other importers is that Python introduces a significant memory and time overhead. Profiling the import of a 23 MB file shows that lines like

  • ans = [mapper(x) for x in stream[:count]]
  • return [x.load(format, stream) for x in self.properties]
  • dict([(i.name, [i.load(format, stream) for j in range(i.count)]) for i in self.specs])

are the biggest offenders in terms of wall-clock time (profile obtained via py-spy: https://developer.blender.org/F8412052). Also, when importing even bigger files, copying the whole file into memory is an expensive operation.

Instead of working with an inherently slow language (interpreted vs compiled) while increasing the dependencies on libraries like pandas or scipy, the better option seems to be porting the for loops to C/C++. This solves the time problem, but not the memory one: if the current Python algorithm is ported to C/C++ as is, copying the whole file into memory would still be there.

Which brings <sys/mman.h> to the table. My idea is to add an API layer between the reading code and the disk. It solves the memory overhead by making the file appear to be in memory without actually copying it there. But there are some grey areas:

  • Complexity: would it be worth it? Since it would be a generic layer that doesn’t care about the contents of the files, it doesn’t matter which format is being transferred, so other import/export code can use it too.
  • Windows: Windows provides a whole different set of functions and does not ship the Unix header mman.h, so cross-platform development adds some cost. CreateFileMappingW and MapViewOfFile are relevant here.
  • 32-bit systems: The theoretical limit of addressable memory here is ~4 GiB, and the usable address space is even smaller because the OS reserves part of it for its own use. So a check on the file size and the architecture might be needed to determine the largest chunk to map at a time or, before that, whether to use mapping at all or fall back to read and copy.
  • Network drives, external drives, and drives going to sleep: there have been bug reports against software using memory mapping where the software crashes with such drives.
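To make these trade-offs concrete, here is a minimal sketch of such a layer in Python, using the standard mmap module, which works on both Unix and Windows. The helper name open_bytes and the max_map_bytes parameter are hypothetical, and the 1 GiB cap for 32-bit processes is an assumed placeholder rather than a measured limit. It only illustrates the "map if possible, otherwise read and copy" idea.

```python
import mmap
import os
import sys
import tempfile


def open_bytes(path, max_map_bytes=None):
    """Return a bytes-like view of the file at `path` (hypothetical helper).

    Tries a read-only memory map first and falls back to a plain
    read-and-copy when mapping is not possible or not advisable.
    """
    size = os.path.getsize(path)
    if max_map_bytes is None:
        # Assumed placeholder: cap mappings at 1 GiB in 32-bit processes.
        is_32bit = sys.maxsize <= 2**32
        max_map_bytes = 2**30 if is_32bit else sys.maxsize
    if 0 < size <= max_map_bytes:
        try:
            with open(path, "rb") as f:
                # The mapping stays valid after the file object is closed.
                return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        except (OSError, ValueError):
            pass  # e.g. a file system that does not support mapping
    with open(path, "rb") as f:
        return f.read()


# Usage: both paths expose the same slicing interface to the parser.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"ply\nformat ascii 1.0\n")
data = open_bytes(tmp.name)
first_line = bytes(data[:4])
forced_copy = open_bytes(tmp.name, max_map_bytes=0)  # force the fallback
if isinstance(data, mmap.mmap):
    data.close()
os.remove(tmp.name)
```

Note that this sketch does nothing about the network-drive and sleeping-drive failures mentioned above. Those show up later, when the mapped pages are actually touched.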

Are there more elegant solutions to this, or does it make sense at all?

I’ll post the doc containing the formal proposal in this thread itself, after I get an idea of what to write. Please find the proposal below.

Thanks.

I like the concept of memory-mapped files, but I don’t have much practical experience with them. I recently ran into Use mmap With Care by Benjamin Schaaf, and I think he makes some good points about the nasty errors you could get. Fortunately, in the scenario of importing a file, it is only mmapped while the import is running, so certain issues are much less likely to arise.

Are you suggesting to port those to C/C++ but to keep the rest of the import code in Python?

Yes, I’ve given that blog a read too! I also went to their bug tracker to find the issue with NFS (the software crashing if the PC goes to sleep). If I’m understanding the fourth picture correctly, it won’t happen while importing, but it can very well happen for the entire duration the imported model is in use, before it gets written somewhere.

Are you suggesting to port those to C/C++ but to keep the rest of the import code in Python?

Python doesn’t seem to add anything more than quick prototyping and ease of modification. If mmap is to be accepted, I’d suggest moving all the functional code out of Python, leaving only the UI buttons’ hooks. If not, then yes, let’s keep some part here, some there.

Memory mapped files can have significant performance issues when using network file systems, which is pretty common in production. Particularly on Windows it can be 10-100x slower when not using the right API functions and reading in random order. But also on Linux with NFS it can be slower.

This type of optimization should be done after the rest of the importer is working well and optimized; it’s not worthwhile to spend a lot of time on this upfront.

Latency is often the biggest issue when reading from disk. If your code needs to read the next part of a file and it’s not cached in memory, it will have to wait a long time while the CPU sits idle. Ideally the OS will be reading ahead on the disk while the CPU is busy, so that by the time you need the part it’s already cached in memory.

With memory mapping you rely on the OS to guess what you are going to read, and it doesn’t have a lot of information to make good guesses there. For optimal performance you want to make sure you only need to read every part of the file once, and read some amount of bytes ahead to hide the latency. There are ways to do this with async IO, multithreading, hints to the OS, … but really I would just start by reading the whole file into memory and not worrying about this much.

It is useful to abstract away some of the file reading, so you’re not directly using fopen, std::ifstream, etc. That makes it easier to experiment with different implementations later on, or read from memory for packed files, etc.
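As a rough illustration of that kind of abstraction (sketched in Python for brevity, not the C++ the importer would use), the parsing code below only ever sees a binary stream. The names open_source and count_lines are hypothetical. Swapping a file on disk for bytes already in memory, such as a packed file, then needs no change in the parser.

```python
import io
from typing import BinaryIO, Optional


def open_source(path: Optional[str] = None,
                packed: Optional[bytes] = None) -> BinaryIO:
    """Hypothetical factory: a file on disk or bytes already in memory."""
    if packed is not None:
        return io.BytesIO(packed)
    if path is None:
        raise ValueError("either path or packed must be given")
    return open(path, "rb")


def count_lines(source: BinaryIO) -> int:
    """Stand-in for a parser: it never learns how the stream was opened."""
    return sum(1 for _ in source)


# Usage with in-memory data; a path on disk would go through the same parser.
with open_source(packed=b"ply\nformat ascii 1.0\nend_header\n") as src:
    line_count = count_lines(src)
```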

I agree with this advice, given that among the things mentioned, so far I only know about madvise.

Might I ask for suggestions about what else to include, besides these:

  • Correctly working PLY & STL importers and exporters.
  • Recording the performance of both the old and the new code at multiple points in time.
  • Moving as much as possible out of Python, where that improves timings without adding unnecessary complexity to the C++ code.
  • Refactoring to separate the format-specific and generic areas.

@brecht Did your comment mean to reorder the proposed plan to put mmap later, or to put something else in its place?

As part of this project you would profile performance, including disk I/O. I expect you will not need mmap to optimize performance or memory usage; a regular file stream will probably work fine.

The main thing is just to reserve a chunk of time in the schedule for optimization, and you can write ideas about the kind of optimizations you might do to show you’ve thought about the problem. But in the end it’s not something we expect to be very specific.

This is a slightly redacted version of my proposal that has been shared with the Blender Foundation on SoC site. Will submit on 27th. Feedback welcome till then (:


Name

Ankit Meel

Contact:

Synopsis

Among the available 3D formats, some are simple in theory yet effective for many use cases and supported by a multitude of software in the industry. The challenge they pose is the sheer number of elements to iterate over. A Stanford PLY file, for example, quickly exceeds a million vertices. STL, being a lossy format, has to be stored with extra detail, which makes its files enormous. Importing such models faster, and doing so within memory limits, is the aim of this project.

Benefits

It will cut import time severalfold, improving the user experience. It also enables baseline machines with 4 GB RAM (requirements page) to process huge models without running out of memory. For Blender, it provides a basic structure that makes it easier to implement other file formats in the future, instead of add-ons being written from scratch in Python again.

Deliverables

  • Working importers and exporters for PLY & STL, for both ASCII and binary formats.

  • A cross-platform way to assess the performance of both the C++ and the Python code.

  • Documentation of the performance at various iterations, in the logs on the wiki.

  • Since there is no change on the user interface side, no additional user documentation is needed. However, sufficient external documentation and internal comments for the code are expected.

  • If possible, and if decided after discussion with the UI team, a progress bar or an entry in the logger in Blender.

  • As much OBJ support as possible using the new framework, porting last year’s branch.

Project Details

Please find examples of all the file types in the appendix.

I applied a Catmull-Clark subdivision modifier nine times (6+3) to the default cube, with factory settings, and exported it to PLY, STL binary and STL ASCII. Here are the stats.

Format      Size    Export time (s)  Import time (s)  Export peak memory (GB)  Import peak memory (GB)
PLY ASCII   524 MB  111              146              5.09                     4.31
STL ASCII   553 MB  27               59               0.827                    1.73
STL Binary  157 MB  14               31.5             0.833                    1.7
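As a sanity check on the STL binary size, the arithmetic can be written out in a few lines of Python. This assumes the standard binary STL layout (80-byte header, 4-byte triangle count, 50 bytes per triangle), that every quad is exported as two triangles, and that MB means 10^6 bytes. None of these assumptions is stated in the measurements themselves.

```python
HEADER_BYTES = 80             # fixed-size header
COUNT_BYTES = 4               # uint32 triangle count
TRIANGLE_BYTES = 12 * 4 + 2   # normal + 3 vertices as float32, plus a 2-byte attribute

quads = 6 * 4 ** 9            # default cube, every face split in four, nine times
triangles = 2 * quads         # assumption: each quad becomes two triangles
size_bytes = HEADER_BYTES + COUNT_BYTES + TRIANGLE_BYTES * triangles

print(quads, triangles, size_bytes)  # 1572864 3145728 157286484
```

The result, about 157.3 million bytes, agrees with the 157 MB reported for the STL binary export.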

Also find some graphs here.

The biggest time penalty in the process comes from the loops, which become very expensive when there are 6,291,456 vertices and 1,572,864 faces. I used py-spy for profiling.

  • return dict([(i.name, [i.load(format, stream) for j in range(i.count)]) for i in self.specs])

  • [x.load(format, stream) for x in self.properties]

  • for i, v in enumerate(ply_verts): # write vertices

  • for pf in ply_faces: # write faces
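For illustration only, the per-element pattern in the lines above can be contrasted with decoding fixed-size records in bulk. This sketch uses Python’s standard struct module and assumes a binary little-endian vertex of eight floats (x, y, z, nx, ny, nz, s, t), like the plane in the appendix. The function names are hypothetical; this is neither the add-on’s code nor the planned C++ implementation.

```python
import struct

VERTEX = struct.Struct("<8f")  # assumed layout: x y z nx ny nz s t, little-endian


def load_vertices_one_by_one(buf, count):
    """Mimics the slow pattern: one Python-level call per property."""
    out = []
    offset = 0
    for _ in range(count):
        vertex = []
        for _ in range(8):
            (value,) = struct.unpack_from("<f", buf, offset)
            vertex.append(value)
            offset += 4
        out.append(tuple(vertex))
    return out


def load_vertices_bulk(buf, count):
    """Decodes whole records with a single precompiled format."""
    view = memoryview(buf)[: count * VERTEX.size]
    return list(VERTEX.iter_unpack(view))


# The four vertices of the appendix plane, packed as binary records.
plane = [
    (-1.0, -1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0),
    (1.0, -1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0),
    (1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0),
    (-1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0),
]
buf = b"".join(VERTEX.pack(*v) for v in plane)
slow = load_vertices_one_by_one(buf, len(plane))
fast = load_vertices_bulk(buf, len(plane))
```

Both functions return the same tuples. The difference is how many interpreter-level calls are made per vertex, which is the overhead the profile points at.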

Following the precedent of multiple scientific libraries being written in C/C++ and using Cython to link the Python wrapper, writing all the I/O operations in C++ is feasible. blender/source/blender/io already contains the Alembic, AVI, Collada, and USD code, so the new importers and exporters will be put there as well. Also, blender/source/blender/editors/io will keep the operators’ linkage and handle the per-file-format preferences that are shown in the file browser.

I am reading how the files mentioned above currently iterate over meshes, textures, colors, etc., so I expect to keep things uniform and thus maintainable. Endianness in binary files would be handled similarly to avi_endian.c. In week 7, during refactoring, the Python add-on is to be removed, keeping everything in one language and thus easing debugging, further improvements, etc.
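A minimal sketch of what handling endianness means for a binary PLY file, written in Python with the standard struct module. It does not reproduce how avi_endian.c works; it only shows decoding with the byte order declared in the PLY header (binary_little_endian or binary_big_endian) rather than the host’s native order. The helper name unpack_floats is hypothetical.

```python
import struct

# Maps the PLY header's format keyword to a struct byte-order prefix.
BYTE_ORDER = {"binary_little_endian": "<", "binary_big_endian": ">"}


def unpack_floats(data, count, ply_format):
    """Decode `count` float32 values using the byte order named in the header."""
    prefix = BYTE_ORDER[ply_format]
    return struct.unpack_from(f"{prefix}{count}f", data)


# The same three values, stored in both byte orders.
little = struct.pack("<3f", 1.0, -1.0, 0.0)
big = struct.pack(">3f", 1.0, -1.0, 0.0)

from_little = unpack_floats(little, 3, "binary_little_endian")
from_big = unpack_floats(big, 3, "binary_big_endian")
```

The two byte strings differ, but both decode to the same values because the byte order is taken from the file and not from the machine.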

Since Valgrind won’t work on macOS 10.14, I’ll be using Instruments.app. If necessary, a high-performance C++ profiler will be used for finer details.

Possible optimizations & plans:

  • Reading the file in chunks instead of all at once, using streams. Loop over all the lines only once.

  • Minimising copying of variables & using pointers to pass them around.

  • Using knowledge of the format to read the data directly into its final form, instead of reading it in a different form and converting it later.

  • Minimising flush operations to the disk from the stream.

  • Separating lower level file reading operations in a separate layer for easy experimentation.

  • Using a minimal, bare-bones data structure to store one vertex, face, or any other property, so that it doesn’t add up to a much bigger number later.
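The first point, reading in chunks and visiting every line exactly once, could look roughly like the following Python sketch. The function name iter_lines and the default chunk size are assumptions. Python’s file objects already buffer internally, so this only illustrates the technique a C++ reader would need to implement itself.

```python
import io


def iter_lines(stream, chunk_size=1 << 16):
    """Yield complete lines from a binary stream read in fixed-size chunks."""
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        lines = (tail + chunk).split(b"\n")
        tail = lines.pop()  # the last piece may be an incomplete line
        yield from lines
    if tail:
        yield tail  # final line without a trailing newline


# A tiny chunk size forces lines to straddle chunk boundaries.
sample = io.BytesIO(b"vertex 1 2 3\nvertex 4 5 6\nendloop")
parsed = list(iter_lines(sample, chunk_size=5))
```

Only one chunk plus one partial line is held at a time, so peak memory stays independent of the file size.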

As for memory mapping, it isn’t a magic pill that improves performance in all cases. Many modern SSDs and networks provide read speeds high enough that the disk is no longer the bottleneck in parsing. Whether to use it has to be decided only after actual profiling, rather than applying a memory map up front and complicating the bare minimum task that has to be done. If the bottleneck turns out to be mesh processing, not the disk, I will look into distributing the file/line reading across multiple cores.

Project Schedule

The best time for me to work is right now, which I am using to read the existing code for modifiers, mesh iteration, and the previous attempt. The college is closed and will likely remain so for at least 4-5 weeks. If it opens, that will overlap with the community bonding period, which I’ve already done (-: So it will not interfere with the rest of the timeline. The week-by-week order of tasks is expected to be:

  • Weeks 1-3: PLY import/export for both binary and ASCII (initially, linking the files and setting up the build might take time).

  • Weeks 4-5: STL import/export for both binary and ASCII.

  • Week 6: Initial benchmarking and documentation for profiling.

  • Week 7: Major refactoring to separate low-level APIs from functional code & UI.

  • Week 8: Regression testing, documenting results for multiple files using both exporters for both implementations.

  • Weeks 9-10: Optimisations, as discussed above, and benchmarking; removing the older Python code and using C++-based operators for wm events, as in blender/source/blender/editors/io.

  • Week 11: Deciding on and adding the progress bar. After discussing with the mentors, porting as much as I can from the fast IO 19 branch to add OBJ support.

  • Week 12: Code documentation, comments, review, merging, and buffer.

I will further improve OBJ support after GSoC is over. After that, I intend to stay to help with bug triaging and fixing, and to learn new things in Blender.

Bio

I am Ankit Meel. I was introduced to C and C++ in the second semester, three years ago and have been using them since. Using Python, I’ve completed multiple assignments in machine learning, signal processing; I also completed a facial expression classification task on images, as a summer project. Other than that, I have done front end development, a server setup in NodeJS, socket programming using Python, and numerical analysis methods in Octave.

I’ve also gotten some exposure to Objective-C while doing a paper cut for alias redirection on macOS, D6679. In D6512, I made a partially working solution for icon theming. Having been active for over eight months now, I’m also a member of the moderators group, and I coordinate with the bug triaging team and almost all other developers while triaging posts.

Appendix
  • PLY ASCII : Plane.
ply
format ascii 1.0
comment Created by Blender 2.83 (sub 10) - www.blender.org, source file: ''
element vertex 4
property float x
property float y
property float z
property float nx
property float ny
property float nz
property float s
property float t
element face 1
property list uchar uint vertex_indices
end_header
-1.000000 -1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000
1.000000 -1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 0.000000
1.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000
-1.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 1.000000
4 0 1 2 3
  • STL ASCII : Plane (brackets added)
solid Exported from Blender-2.83 (sub 10)
facet normal 0.000000 0.000000 1.000000
outer loop
vertex -1.000000 -1.000000 0.000000 (a)
vertex 1.000000 -1.000000 0.000000  (b)
vertex 1.000000 1.000000 0.000000   (c)
endloop
endfacet
facet normal 0.000000 0.000000 1.000000
outer loop
vertex -1.000000 -1.000000 0.000000  (a)
vertex 1.000000 1.000000 0.000000    (b)
vertex -1.000000 1.000000 0.000000   (d)
endloop
endfacet
endsolid Exported from Blender-2.83 (sub 10)
  • STL Binary : Plane
4578 706f 7274 6564 2066 726f 6d20 426c
656e 6465 722d 322e 3833 2028 7375 6220
3130 2900 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0200 0000 0000 0000 0000 0000 0000 803f
0000 80bf 0000 80bf 0000 0000 0000 803f
0000 80bf 0000 0000 0000 803f 0000 803f
0000 0000 0000 0000 0000 0000 0000 0000
803f 0000 80bf 0000 80bf 0000 0000 0000
803f 0000 803f 0000 0000 0000 80bf 0000
803f 0000 0000 0000
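To show how the STL binary dump above maps back to the ASCII plane, here is a small Python sketch that decodes it with the standard struct module. It assumes the usual binary STL layout: an 80-byte header, a little-endian uint32 triangle count, then 50 bytes per triangle (a normal and three vertices as float32, plus a 2-byte attribute field). The variable names are only for illustration.

```python
import struct

HEX_DUMP = """
4578 706f 7274 6564 2066 726f 6d20 426c
656e 6465 722d 322e 3833 2028 7375 6220
3130 2900 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0200 0000 0000 0000 0000 0000 0000 803f
0000 80bf 0000 80bf 0000 0000 0000 803f
0000 80bf 0000 0000 0000 803f 0000 803f
0000 0000 0000 0000 0000 0000 0000 0000
803f 0000 80bf 0000 80bf 0000 0000 0000
803f 0000 803f 0000 0000 0000 80bf 0000
803f 0000 0000 0000
"""

data = bytes.fromhex("".join(HEX_DUMP.split()))

# 80-byte header, padded with zero bytes.
header = data[:80].rstrip(b"\x00").decode("ascii")

# Little-endian uint32 triangle count right after the header.
(count,) = struct.unpack_from("<I", data, 80)

# Per triangle: 12 float32 values (normal + 3 vertices) and a uint16 attribute.
TRIANGLE = struct.Struct("<12fH")
triangles = []
for i in range(count):
    values = TRIANGLE.unpack_from(data, 84 + i * TRIANGLE.size)
    normal = values[0:3]
    vertices = [values[3:6], values[6:9], values[9:12]]
    triangles.append((normal, vertices))

for normal, vertices in triangles:
    print(normal, vertices)
```

The decoded header reads "Exported from Blender-2.83 (sub 10)", and the two triangles carry the same normals and vertices (a, b, c) and (a, b, d) as the STL ASCII listing.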