Speed Up C++ Compilation

Shorter build times improve developer productivity, so it is beneficial to reduce them where possible. This document explains why compilation can be slow and discusses different approaches to speed it up.

This is mostly meant to be a resource for other developers who are investigating compile time issues. However, it would also be nice to hear other people's thoughts on how we can or should improve build times.

Build Process

When analysing compile times, three parts of the build process are particularly interesting. Those are briefly described below.

Source Code Parsing

This is commonly referred to as the compiler front-end. First, it runs the preprocessor on each translation unit. This inlines all included header files, usually resulting in a fairly large file (often >100,000 lines).

The preprocessor itself is very fast; parsing that many lines of code, however, is slow. Parsing C++ code is slower than parsing C code because C++ is a more complex language. At the end of parsing, the compiler has an internal representation (e.g. an AST) of the current translation unit and all included headers.

Code Generation

Now the compiler checks which functions have to be generated, because they may be called by other translation units. Then it generates machine code for those functions, which also involves processing all the other functions in the same translation unit that are called indirectly. This is often referred to as the compiler back-end.

The compile time here mainly depends on how much machine code is generated (and not on the size of the preprocessed file). In C++ this often means that many (templated) inline functions have to be generated. In C code, many of those functions would not be inline, which makes C code generation generally faster for the same amount of code. In C++ they often have to be inline because of the templating mechanism.

Linking

The linker reads the symbol table of every translation unit, i.e. the list of functions and variables that a translation unit uses and provides. Then it copies every used function into the final executable and patches up the references between them.

The linking time mainly depends on how fast the linker can read and deduplicate all the symbol tables and how fast it can copy and link all the functions in the final executable.

The more translation units there are, and the larger their corresponding machine code and symbol tables, the slower the linking process.

Redundant Work

The main reason why building big C++ projects is slow is that the compiler and linker do a huge amount of redundant work. The same headers are parsed over and over again, and the same (templated) inline functions are generated many times. This leads to a lot of duplicate functions in the generated object files, and it is the linker’s job to reduce that redundancy again. The redundant work is often much more than the work that is unique per translation unit.

Build Time Reduction Approaches

There are different techniques to reduce build times. Generally, they work by reducing redundant work, reducing unique work, or doing the same work faster. Reducing unique work often involves a run-time performance trade-off.

Faster Linker

Linking is the last step of the build process and has to be done even if only a single source file changed. Usually everything has to be relinked from scratch after every change, unless something like incremental linking is used (which I don’t have experience with in Blender).

Many linkers are mostly single threaded and are therefore a huge serial bottleneck at the end of the build process.

If you can, use the mold linker. I’ve used it for a long time now and it’s significantly faster than everything else I’ve tried. People have reported issues with it doing wrong things, but I’ve never experienced those. Unfortunately, it’s not available on Windows yet. The new linker in Xcode 15 on macOS also seems to be significantly faster than in previous versions.

Ccache

Build systems usually decide to rebuild files based on their last modification time on the file system. This becomes a problem when one often updates the last modification time without actually changing the source code. Ccache wraps another compiler like gcc and adds a caching layer on top of it. It first checks if the source code to be compiled actually changed. If it did, the underlying compiler is invoked. Otherwise, it just outputs the result of a previous compiler invocation.

This caching layer helps in a few situations. Obviously, it avoids recompilation when just saving over an existing file without any changes. It also helps when just changing comments, because Ccache can check if the preprocessor output has changed and the preprocessor removes comments. Most importantly, it helps when switching back and forth between branches. Recompilation after switching to a branch that has been compiled before uses the cache and is thus much faster.

Note that ccache only caches the output of the code generation step, but not the linker result. Therefore, the linker still has to run again even if source files haven’t changed.

When benchmarking compilation time, Ccache can get in the way because it makes the results unreliable. Ccache can be temporarily disabled by setting the CCACHE_DISABLE environment variable to 1 (export CCACHE_DISABLE=1).

Multiple Git Checkouts

Having multiple git checkouts can reduce compilation times by avoiding frequent branch switching, which leads to many code changes. One can simply copy the source code folder or use git worktrees. Each checkout should have its own build folder.

Personally, I have a separate checkout that I don’t use for compilation but only to look at other branches while working on something, or to check out very old commits, e.g. when figuring out when a particular function was added.

When working with multiple checkouts, it’s easy to get confused about which checkout you are currently working in. I solved this locally by using vscode’s ability to configure a different theme per folder. Essentially, when I’m not in my main checkout, vscode looks different. I have never accidentally changed the wrong checkout since I set this up.

Multiple Build Folders

Having multiple build folders for different configurations reduces the compilation overhead when switching between them. For example, one should have separate build folders for release and debug builds. This is partially set up automatically when using our make utilities. Visual Studio also creates separate build folders automatically afaik.

It can be beneficial to have even more independent build folders though. As mentioned before, it can help to have separate build folders for each git checkout. Personally, I currently have build folders for the following configurations: release, debug, reldebug, asan, compile commands (only used for better autocompletion in vscode), clang release (used to check for other warnings and performance differences), clang release optimized (enables extra compiler optimizations to see if they have an impact), and release reference (used when comparing the performance of two branches).

Forward Declarations of Types

Forward declarations allow using types from a header without actually including it. For example, putting struct bNodeTree; in a header makes the bNodeTree type available without having to include DNA_node_types.h. The hope is that at least some translation units that use the forward declaration don’t end up including DNA_node_types.h through other headers anyway. This leads to less code in the preprocessed translation units, which reduces parsing overhead.
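As a minimal sketch of the pattern (the function and file names besides bNodeTree and DNA_node_types.h are hypothetical):

```cpp
/* some_module.hh (hypothetical header) */

/* Forward declaration instead of #include "DNA_node_types.h". */
struct bNodeTree;

/* Pointers and references to the type are fine without the full definition. */
int count_group_nodes(const bNodeTree &tree);

/* some_module.cc (hypothetical source file) */
#include "some_module.hh"

/* The full definition is only pulled in where the members are actually used. */
#include "DNA_node_types.h"

int count_group_nodes(const bNodeTree &tree)
{
  /* Members of bNodeTree can be accessed here because the full type is known. */
  /* ... */
  return 0;
}
```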

Forward declaring types whose headers end up being included in most translation units anyway does not really benefit compile times. E.g. forward declaring common containers like Vector rarely makes sense.

Headers that only work with a forward declared type are restricted in what they can do with it. They can’t access any of its members. That includes indirect accesses, e.g. taking or returning the type by value. Generally, only pointers and references to the type can be used. Smart pointers work as well, as long as they don’t have to access methods of the type (like the destructor). If even a single function in a header needs access to the members, the include ends up being necessary anyway.

Forward declaring some types is harder than others. C structs are generally easy to forward declare, while nested C++ structs are impossible to forward declare. Namespaces and templates also make forward declarations harder. For example, one can’t reliably forward declare std::string outside of the standard library, because it’s actually a typedef of another type. Some standard types have forward declarations in #include <iosfwd>, but not all. The existence of that header also indicates that it is non-trivial to forward declare these types. Also see this.

In some headers, we currently access data members of structs even though they are only forward declared. That works because the data members are accessed in preprocessor macros, for example IDP_Int. This becomes a problem when we want to replace macros with inline functions, which require the include.
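A hypothetical illustration of the difference (this is not the real IDP_Int definition, just an analogous example):

```cpp
struct IDProperty; /* Forward declaration only. */

/* A macro like this is fine in a header that only forward declares IDProperty,
 * because the member access only happens after expansion, inside translation
 * units that include the full definition anyway. */
#define MY_IDP_INT(prop) ((prop)->int_value)

/* The equivalent inline function needs the complete type right here, so the
 * header would be forced to include the definition of IDProperty. */
/* inline int my_idp_int(const IDProperty *prop) { return prop->int_value; } */
```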

Seeing benefits from forward declarations is often difficult in C++ code, because individual code changes usually have a negligible effect on compile times, making it hard to measure the overall effectiveness. This is also because the translation units often end up being so large that a single header of our own does not make a big difference. Another problem is that it is very hard to use forward declarations consistently, especially in C++, because many things defined in headers need the full type. It’s difficult to use forward declarations everywhere, but very easy to introduce an include later on when it becomes necessary, rendering the previous work useless.

Smaller Headers

The idea is to split up large headers into multiple smaller headers. The hope is that some translation units that previously included the large header will now only include a subset of the smaller headers. This can reduce parsing time.

This can be especially beneficial when part of a header requires another header that is very large. For example, BKE_volume_openvdb.hh includes OpenVDB, which is not needed by all code that deals with volume data-blocks.

Smaller headers can also help make code easier to understand. Many large headers already have sections anyway, and splitting those sections into separate headers might be a nice cleanup. This only improves compile times if the individual headers are not grouped into one bigger header again, like in ED_asset.h.

Type Erasure

Templates generally have to be defined in headers so that each translation unit can instantiate the required versions of them. Removing template parameters from a function allows it to be defined outside of a header, reducing parsing and code generation overhead.

Obviously, many functions are templates for good reasons, but in some cases type erasure can be used without sacrificing performance. For example, a function that currently takes a callable as a template parameter could take a FunctionRef instead. Functions taking a Span<T> could take a GSpan.
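A sketch of the callable case, assuming a FunctionRef type like the one in blenlib (the function names here are made up):

```cpp
#include "BLI_function_ref.hh" /* Assumed to provide blender::FunctionRef. */

/* Template version: has to live in the header and is parsed, instantiated and
 * compiled again in every translation unit that passes a new callable type. */
template<typename Fn> void foreach_visible_index_templated(const int size, Fn &&fn)
{
  for (int i = 0; i < size; i++) {
    fn(i);
  }
}

/* Type erased version: only this declaration has to be in the header; the same
 * loop is compiled once, in a single source file. */
void foreach_visible_index(const int size, blender::FunctionRef<void(int)> fn);
```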

Sometimes the combination of a normal and a type erased code path can make sense too. E.g. parallel_for has an inlined fast path, but a type erased, more complex path that is not in the header. Under some circumstances, type erasure can also reduce the final binary size significantly.

Explicit Instantiation

Sometimes it’s not possible or desirable to remove template parameters from a function. However, if the function does not really benefit from inlining because it is too large, there is no point in compiling it in every translation unit that uses it. Compiling it only once reduces code generation time.

Explicit instantiation can be used to make sure that a templated function is only instantiated once for a specific type. See e.g. the normalized_to_eul2 function.

This approach can also be used to move the definition of a template out of a header, in which case it also reduces parsing overhead. The templated function can then only be used with the types that have been instantiated explicitly.
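A hypothetical sketch of the pattern (the names are placeholders, not the actual math API):

```cpp
/* math_rotation.hh (hypothetical header) */
template<typename T> T normalize_angle(T angle);

/* Tell every including translation unit that these instantiations already
 * exist in some source file, so they are not compiled again. */
extern template float normalize_angle(float angle);
extern template double normalize_angle(double angle);

/* math_rotation.cc (hypothetical source file) */
template<typename T> T normalize_angle(T angle)
{
  /* ...actual implementation, parsed and compiled only here... */
  return angle;
}

/* Explicit instantiation: generate machine code for exactly these types, once. */
template float normalize_angle(float angle);
template double normalize_angle(double angle);
```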

While explicit instantiation can definitely help, it’s not entirely trivial to find good candidates that have a measurable impact. Maybe some better tooling could help here.

Reuse Common Functions

When a header already provides functionality that one would otherwise implement in a source file, it’s better to just use the shared functionality. This may seem obvious, but it’s mentioned here because it’s not necessarily simple and requires familiarizing yourself with the available utilities. For example, many geometry algorithms have a gather step.
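For example, a shared gather utility could look roughly like this (a simplified sketch, not the exact blenlib interface):

```cpp
#include "BLI_span.hh" /* Assumed to provide blender::Span and blender::MutableSpan. */

namespace blender {

/* Copy the selected elements of `src` into `dst`. Implementing this once and
 * reusing it avoids every geometry file compiling its own copy of the same loop. */
template<typename T>
void gather(const Span<T> src, const Span<int> indices, MutableSpan<T> dst)
{
  for (const int i : indices.index_range()) {
    dst[i] = src[indices[i]];
  }
}

}  // namespace blender
```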

Having more common functions that are not recompiled for every use reduces code generation redundancy. Having too many such utilities that are only rarely used can increase parsing overhead though.

Use Less Forced Inlining

The compiler is fairly good at deciding which functions should be inlined. It uses various heuristics to achieve a good balance of run-time performance and binary size. Generally speaking, the compiler considers all functions that are defined in a translation unit for inlining. Using the inline keyword is mostly just information for the linker to tell it that there may be multiple definitions of a function in different translation units and that they should be deduplicated. That said, depending on the compiler, the inline keyword may change the heuristics to make inlining more likely.

Compilers also support forced inlining, e.g. using __forceinline. With it, the compiler generally just skips the heuristics and inlines the function anyway. That’s usually the desired behavior, but it can have bad consequences when used too liberally. It can result in huge functions that are slow to compile, since compiling a function often has worse-than-linear time complexity in its size. Also see this.

It’s better to just use normal inline functions and to only use forced inlining when you notice that the compiler is not inlining something that would lead to better performance when inlined.
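A small sketch of the difference (the FORCE_INLINE macro here is hypothetical; compilers spell forced inlining differently):

```cpp
/* `inline` mainly tells the linker to deduplicate definitions that appear in
 * multiple translation units; the compiler still decides whether to inline. */
inline int clamp_positive(const int value)
{
  return value < 0 ? 0 : value;
}

/* Forced inlining skips the heuristics entirely. */
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#else
#  define FORCE_INLINE inline __attribute__((always_inline))
#endif

/* Reserve this for small, hot functions where the compiler demonstrably makes
 * the wrong call; applying it broadly can create huge, slow-to-compile functions. */
FORCE_INLINE int clamp_positive_forced(const int value)
{
  return value < 0 ? 0 : value;
}
```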

Precompiled Headers

Most of the time spent parsing code comes from parsing the most commonly used header files. Precompiled headers allow the compiler to do some preprocessing on a set of header files so that they don’t have to be parsed for every translation unit. Using precompiled headers can mostly be automated with cmake.

While precompiled headers can reduce a lot of redundant parsing time, they don’t reduce the redundancy in the code generation step.

When using precompiled headers, one can accidentally create source files that can’t be compiled on their own anymore. That’s because when a precompiled header is used in a module, all translation units include it. Fixing related issues usually just means adding more include statements. These issues are easy to find automatically by simply disabling precompiled headers temporarily.

When only a few translation units use a precompiled header, it can also hurt performance. That’s because precompiling the header is single threaded, and compilation of the translation units only starts once the header compilation has finished. This effect is most noticeable when using unity builds with a relatively large unit size.

Unity Builds

The redundancy during the parsing and code generation steps is bad because it takes a long time relative to the non-redundant work that is unique to every translation unit. Unity builds concatenate multiple source files into a single translation unit. Now all the headers only have to be parsed, and code for inline functions only has to be generated, once per e.g. 10 source files. So the time spent doing redundant work relative to unique work is greatly reduced, by a factor of 10 in this case.
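Conceptually, a generated unity source file is just a list of includes of the original sources (file names are made up for illustration):

```cpp
/* unity_source_0.cc -- generated by the build system, not written by hand. */
#include "node_geo_curve_trim.cc"
#include "node_geo_mesh_to_curve.cc"
#include "node_geo_points_to_volume.cc"
/* ...more source files, up to the configured unit size... */
```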

Unity builds can be used together with precompiled headers. This can help in modules with a very large number of source files to get rid of the remaining redundancy when parsing headers. Overall, precompiled headers become less effective with unity builds though, because the parsing overhead is already reduced so much.

Similar to precompiled headers, using unity builds can result in accidentally having files that can’t be compiled on their own anymore. This is also typically fixed by adding missing includes.

Unity builds generally require some work to function correctly and safely. The fundamental problem is that symbols that were previously local to a single source file are now also visible in the other source files in the same unit. This can lead to name collisions which break compilation, so one has to be careful to make all source files compatible with each other. Whether the files have been prepared successfully can be tested by putting all source files into a single unit.

Just hoping that there are no name collisions is generally not enough; at least it doesn’t give peace of mind when any local symbol can suddenly clash with local symbols in another file. A solution that gives more peace of mind is to put each source file into its own namespace. Only symbols that are explicitly exposed are defined outside that namespace. We already use this approach in most node implementations, e.g. see the node_geo_mesh_to_curve_cc namespace.
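A simplified sketch of that pattern (the details differ from the real node files; NodeDeclarationBuilder and GeoNodeExecParams are just stand-ins here):

```cpp
/* node_geo_mesh_to_curve.cc (simplified sketch) */

namespace blender::nodes::node_geo_mesh_to_curve_cc {

/* Everything in this namespace is local to this file. Even if another file in
 * the same unity unit also defines a `node_geo_exec`, there is no collision,
 * because each file uses its own namespace. */
static void node_declare(NodeDeclarationBuilder &b)
{
  /* ... */
}

static void node_geo_exec(GeoNodeExecParams params)
{
  /* ... */
}

}  // namespace blender::nodes::node_geo_mesh_to_curve_cc

/* Only the registration function is exposed outside the file namespace. */
void register_node_type_geo_mesh_to_curve()
{
  namespace file_ns = blender::nodes::node_geo_mesh_to_curve_cc;
  /* ...register file_ns::node_declare and file_ns::node_geo_exec... */
}
```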

This approach of file-specific namespaces works particularly well for nodes, because generally there is only a single function that is exposed (e.g. register_node_type_geo_mesh_to_curve). Most other source files expose multiple functions, and often the exposed functions are interleaved with local functions. In such cases, it is less convenient because the file-specific namespace has to be opened and closed many times, as you can see in this test.

A potential solution to make this approach more convenient could be to automate inserting the namespace scopes. Alternatively, one could organize source files so that local functions are all grouped together, so that only a single namespace scope is necessary. Yet another approach could be to split up source files. For example, similar to how we have one node per file, we could also have one operator per file. Without unity builds or precompiled headers, having many small source files leads to a lot of parsing overhead, but with those it’s less of an issue.

When using unity builds, one has to decide how many files should be concatenated per unit. This is a trade-off. Small units allow more threads to compile units at the same time, at the cost of more parsing and code generation redundancy. Large units reduce the redundant work, but fewer threads can be used overall. Large units are beneficial when the system has only a few cores, or when the total number of files to be compiled is so large that all cores will be busy anyway. Actually creating the units can be automated with cmake.

Often, one only works in a single source file. When changing only that one file, unity builds lead to longer recompilation times, because all other files in the same unit are recompiled as well. Often that is not an issue, but it can be when one of the other files happens to take very long, e.g. because it uses OpenVDB. This can be solved by using SKIP_UNITY_BUILD_INCLUSION in cmake. It allows either excluding files that are known to be slow to compile from units, or it can be added temporarily for the file that is currently being edited.

Using SKIP_UNITY_BUILD_INCLUSION, it should also be possible to introduce unity builds in a module gradually, file by file. It also makes it possible to benefit from unity builds in modules that have some files that are very hard to prepare for unity builds.

Sometimes it’s harder for IDEs to provide good autocompletion when unity builds are used. For IDEs that use the compile commands generated by cmake (CMAKE_EXPORT_COMPILE_COMMANDS), it can be beneficial to have a separate build folder just to generate the compile commands. That build folder can have unity builds and precompiled headers disabled.

Using unity builds can also make the linker’s job easier, because it has to parse fewer symbol tables. Within each unit, symbols are already deduplicated as part of the compilation process. This also reduces the total size of the generated machine code, which reduces the size of the build folder.

Distributed Builds

When one has more compute resources available outside of the work PC, one can consider using distributed builds. Those don’t reduce redundancy (they might actually increase it), but they can still improve compile times by making use of more parallelism. I don’t have much experience with this myself, but tools like distcc shouldn’t be too hard to get working with Blender.

Distributed builds generally only help when building lots of files in parallel, so they are less effective when working on individual source files and only a few files need to be recompiled after every change.

C++20 Modules

The compilation speedup that can be expected from C++20 modules is probably similar to what we could get from precompiled headers. The parsing overhead can be reduced, but the code generation redundancy likely remains.

Modules have the benefit over precompiled headers that source files stay independent and are not all forced to include the same precompiled header. Thus, the problem of ending up with source files that can’t be compiled on their own does not exist.
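For reference, a minimal sketch of what a module interface unit looks like (the module name and function are purely illustrative; Blender has no such module):

```cpp
/* blender_math.cppm -- hypothetical module interface unit. */
module; /* Global module fragment: plain includes still go here. */
#include <cmath>

export module blender.math; /* Hypothetical module name. */

/* Exported declarations are parsed once when the module is built, instead of
 * once per translation unit that imports it. */
export float safe_sqrt(const float x)
{
  return x >= 0.0f ? std::sqrt(x) : 0.0f;
}

/* A consumer writes `import blender.math;` instead of including a header. */
```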

The downside is of course that we are not using C++20 yet. Even if we were, it’s unlikely that our toolchains fully support modules yet. And even if they did, we’d likely still have to make fairly significant changes to our code base to make use of them.

That said, switching to C++20 can also make compilation slower overall. It’s unclear whether using modules can offset that slowdown, or whether the slowdown goes away as C++20 implementations become more mature.

Tools

Just like there are tools for profiling the run-time performance of a program, there are also tools that measure build time. Those can be used to make educated guesses about which changes might impact build times the most. The tools help most with identifying headers that one should avoid including, or functions whose code generation takes up a significant portion of the overall build.

Feel free to suggest more tools that I should add here.

Summary

Always use the fastest linker that works for you. This is likely the most significant change you can make to improve your daily work in Blender. Forward declarations can sometimes be useful, but they require significant discipline to get meaningful compile time improvements, and this is much harder in C++ than it is in C. Splitting up headers can improve code quality, but likely doesn’t have a big enough compile time benefit to justify spending too much time on it.

Precompiled headers make sense in modules with many similar files. Their effectiveness is greatly reduced when unity builds are used as well. Overall, unity builds are the most effective tool to reduce redundant work and therefore improve compile times. Some code changes are necessary but their effectiveness is easily measurable.

Distributed builds are a nice thing to try for people who have the resources. C++20 modules could help like precompiled headers but are out of reach for the foreseeable future.


This is a great resource. Could be moved into the wiki.

A few points:

  • One thing that wasn’t mentioned is ccache. I find that it helps a lot when switching between main and some random branch or pull request that may be quite far from it. Going back to main is much quicker then.
  • In addition I also have multiple git worktrees and build folders, which helps reduce the amount of recompilation. One that tracks main, one that tracks the upcoming release during bcon3, and one to try out random pull requests. It’s easy to get yourself confused about which branch you are editing or which build you are running though.
  • A little trick when I need to update a branch that I’m not currently on: I do e.g. git fetch origin main:main. This avoids checking out an older version of the branch that may require more time recompiling.
  • For mold, the package versions in various Linux distributions are too old. Newer versions have important bugfixes. It’s relatively straightforward to build it yourself and set the path to it in the CMake configuration. mold on macOS was also announced to become free now, though Xcode 15 includes a new linker that is quite fast too.
  • Maybe somewhat surprisingly, just enabling the build flags for a new C++ version can increase compile times. So while C++20 modules may help, it’s not obvious it’s better overall.

I admit this is purely personal preference, but I’d link the Build Insights SDK [1] and code examples [2]. While the Build Insights GUI tools on their own are very usable, them being GUI tools means a lot of clicking around, and they are not very friendly towards automation.

The code examples make it a lot easier to tweak and filter things to your liking and to automate a thing or two.

[1] C++ Build Insights SDK | Microsoft Learn

[2] GitHub - microsoft/cpp-build-insights-samples: Code samples for the C++ Build Insights SDK

Thanks for the feedback! I integrated it in the original post. We could move it to the wiki at some point, but for now it seems easiest to have it here and gather some more feedback.


A brilliant article… thank you so much for posting it.

I have used Google’s Include-What-You-Use tool to great advantage. It assists in getting headers right, i.e. which #includes go in the header, which in the implementation, which should be forward declarations, etc.

Thanks, I’ve added it and also made some other minor changes.


Maybe this tip is too obvious, but for me it is very helpful to only build what I am working on by deactivating as many cmake compile flags as possible. Rebuild time using make lite is about 3-4x faster than when using make.

The downside is dependencies are not always clear, and even if the code compiles and runs, you might get unexpected results that are hard to debug.


lite used to build in under 5 minutes years ago; it was perfect for troubleshooting build/linker issues. These days lite, while still faster than a full build, is sadly nowhere near as lean anymore.

Some great suggestions here! Hoping to offer some feedback to the conversation:

On Unity Builds, my personal recommendation is to “just say no” if at all possible. The gain you perceive when timing a single build will be lost in the face of its many disadvantages:

  • Time chasing phantom compile errors that appear in a unity build and not a typical build, or vice versa.
  • Time lost due to worse incremental compile/link performance (addressable partially by adaptively globbing TUs, but then this runs into the next drawback)
  • Time lost because you have now lost build determinism and can no longer easily cache intermediate object files over the network, locally, or through any other generalized build caching mechanism

I have worked in four distinct AAA game engines, all of which ultimately resorted to unity builds to at least partially alleviate build time performance woes. While unity builds are an unquestionable bandaid in the short term, they result in a long tail of problems that will require constant attention and maintenance. To say that I hate unity builds is putting it mildly, but I don’t believe I’m alone in this reaction among many developers that have dealt with them on a daily basis.

If at all possible, I would encourage proper header separation, avoiding huge template functions, actively using forward declarations where possible, and many other points already mentioned. Once you go unity, you do not come back and it is unlikely that many build-time issues will actually be resolved later.

On the subject of C++20 modules, note that there is currently a bad interaction between C++20 modules and MSVC DLLs. I wrote about this in this SO answer for reference: What is the expected relation of C++ modules and dynamic linkage? - Stack Overflow.


Thanks for the feedback! I’m aware that unity builds are discouraged by some developers. I think I understand their disadvantages, but I believe those can be addressed using the approaches I described in the post. Sometimes it’s easier and sometimes it’s harder. If they can’t be addressed for some particular project, I would not use unity builds either.

Time chasing phantom compile errors that appear in a unity build and not a typical build, or vice versa.

We have been using unity builds for almost two years now, and I think they have saved us significantly more time than it cost to fix compile errors. I’m only guessing, but I would say it’s on the order of weeks saved vs. hours spent. We don’t really get any build errors aside from the occasional missing include, and those are usually obvious and easy to fix. I actually just counted, and found <15 commits that fixed issues with unity builds since they were introduced, and all of them were quite obvious (example).

That is only possible because we avoid symbol collisions by design using separate namespaces for each file. If that is not possible for some reason, then unity builds are indeed a bad option.

Time lost due to worser incremental compile/link performance (addressable partially by adaptively globbing TUs, but then this runs into the next drawback)

Highly depends on what kind of work you are doing of course, since that affects how many files you typically have to recompile after a change. When changing headers that affect many translation units, unity builds are generally faster to rebuild. When working in a single source file for an extended period of time, it can be annoying to also have to compile other files all the time. For me personally, that never was a problem, but I know that others ran into this. I mentioned that cmake’s SKIP_UNITY_BUILD_INCLUSION can be used to work around this issue.

Time lost because you now have lost built determinism and can no longer easily cache intermediate object files over network, locally, or through any other generalized build caching mechanism

The only caching mechanism I use is Ccache, and it is able to cache the intermediate object files perfectly well. If you have the luxury of using distributed builds, then the situation might be a little bit worse, but it could just as well be better depending on what you are doing, because less data has to be transferred. Unity builds are always optional of course.

they result in a long tail of problems that will require constant attention and maintenance

Can’t speak of the long tail of problems yet, maybe that will come when I get wiser. At least for our setup in Blender, I think we don’t have to give it constant attention and the maintenance is minimal. Forward declarations and header separation don’t have a long tail of problems usually but they definitely require constant attention and maintenance.


Maybe I’m missing something and it is indeed a bit sad, but afaik unity builds are the only solution that actually removes the redundancy in the parsing and code generation stages. The 2-4x improvement we measured when we first introduced unity builds is just not something I can rationally say no to, given how few problems it’s causing for us.

Just to make it clear again: I think taking a project as is and just enabling unity builds is a very bad idea. However, with a bit of upfront time investment it can be a very worthwhile approach to reducing build times. And by upfront time investment I don’t mean just enabling unity builds and fixing the issues that come up, but actually structuring the code in a way that works well with unity builds. One might say that code shouldn’t be structured differently just for unity builds, but I think this is actually very similar to e.g. proper header separation, avoiding huge templates and using forward declarations, all of which can and generally should still be done of course.

I’m curious, did the game engines you worked on just enable unity builds and fix name collisions when they came up, or did they actually structure the code to work well with unity builds?


The only caching mechanism I use is Ccache, and it is able to cache the intermediate object files perfectly well.

I was definitely referring to a distributed caching solution which would hash TU contents across developers. Unity builds tend to break this, since changing the file structure (adding, removing, moving files) will result in different TUs being globbed. There are “solutions” to this of course, which result in additional complexity in how the unity TUs are assembled. I agree that if there is no interest in distributed caching mechanisms, this drawback is less pronounced.

Maybe I’m missing something and it is indeed a bit sad, but afaik unity builds are the only solution that actually remove the redundancy in the parsing and code generation stage.

Arguably, I think the best thing to do is to have as little code in a header as possible. You mentioned type erasure early on which I think is the best approach. Arguably, libraries like the STL would be much much faster to compile if they performed type erasure in the template, before dropping down to a forward declared type-erased implementation, housed in exactly one translation unit.

I’m curious, did the game engines you worked on just enable unity builds and fixed name collisions when they came up, or did they actually structure the code to work well with unity builds?

Name collisions were indeed fixed as they arose, although there were some soft conventions in an attempt to avoid this sort of thing. The flavor of the solution was different in each engine though, and it was particularly tricky in some engines due to the presence of a custom preprocessor as well (notably, Unreal Engine’s custom preprocessor doesn’t support namespaces, and they use full adaptive-unity builds).

In any case, if it’s working for you, it’s working for you! I have been personally afflicted by years of pain due to Unity builds, but I understood fully that my experience is in no way universal, and the full extent of that experience doesn’t apply in every situation. :slight_smile:


I can’t contribute to solutions but compiling in MSVS on Windows without any cache is especially slow and painful when switching between branches. I remember the days of SCons when I had access to Linux. That was so quick for small changes.

For VS devs: VSColorOutput64 - Visual Studio Marketplace

This has a very useful option to stop compilation on errors.

Someone brought up the linker being stuck for “quite a while” on Windows while doing a debug build. I did some testing today and the following things popped out, which were interesting enough to share:

First let’s get some definitions out of the way:

  • Full link: All of Blender is built, but there is no existing blender.exe/pdb yet; a full link has to occur to generate blender.exe.

  • Incremental: Some file changed (in this case buildinfo.obj) and it has to be rebuilt and relinked into an existing blender.exe. Incremental links are faster than a full link since the linker only has to apply the changes.

The incremental workflow is what you’d normally experience while doing development: a small change has been made and needs to be incorporated into the existing binary. The full link is the price you have to pay at least once when you have a new build folder or have run a clean operation.

  • Hot: Since linking is inherently an IO-bound operation, the OS tends to cache files you often use, speeding things up. The hot numbers are taken by redoing the same operation a few times in a row, so whatever files could be cached are cached at this point.

  • Cold: All file caches have been flushed (using Sysinternals RAMMap to flush the standby list) and every IO has to be satisfied from disk; this is as slow as it’s going to get.

The build in question is a full debug build without the unit tests enabled, done with VS 2022, using ninja as the build system. (make full 2022 ninja debug)

On to the baseline numbers: (time in seconds of linking blender.exe, best of 5 runs)

Disk   Full Hot   Full Cold   Incremental Hot   Incremental Cold
HDD    34.566     228.163     10.686            198.615
SSD    28.459     60.808      11.980            24.439

Given the definitions above, the results aren’t surprising, but they are interesting nonetheless. For cold, even my mid-range SSD (SATA EVO 850) happily runs circles around my spinner (WD Red/CMR); for hot, given that most IO requests are served from RAM anyhow, they are much closer together.

Improving things

stripped pdb

To support stack traces on end user systems, we ship a smaller, stripped version of the debugging symbols to end users. This takes a little time to generate, and especially when you are working locally it’s a bit wasteful, as you don’t need the stripped PDB at all. To test if there’s any real cost, we turn off WITH_WINDOWS_STRIPPED_PDB in cmake.

Disk   Full Hot   Full Cold   Incremental Hot   Incremental Cold
HDD    27.375     218.187     7.688             164.324
SSD    24.381     56.774      7.627             11.887

It’s not much, but hey, free performance is always nice, and I’m certainly not going to complain about the incremental cold improvements!

Any further tests below are done with WITH_WINDOWS_STRIPPED_PDB=Off

/debug:fastlink

MS had a blog post a while ago about /debug:fastlink which sounds great. The actual documentation on this switch has this wisdom to share:

In Visual Studio 2015 and earlier versions, when you specify /DEBUG with no extra arguments, the linker defaults to /DEBUG:FULL for command line and makefile builds, for release builds in the Visual Studio IDE, and for both debug and release builds. Beginning in Visual Studio 2017, the build system in the IDE defaults to /DEBUG:FASTLINK when you specify the /DEBUG option for debug builds. Other defaults are unchanged to maintain backward compatibility.

I honestly don’t know what that means. It sounds like if you build with just /debug (like we are) from the IDE, fastlink will be on for VS 2017+, and if you build from the command line it’s off. However, since there is a 600MB difference in PDB size it’s really easy to tell: when you supply /debug, both the IDE and the command line use /debug:full, at least on my VS 2022. So let’s change the linker flag to /debug:fastlink instead.

Disk   Full Hot   Full Cold   Incremental Hot   Incremental Cold
HDD    23.698     218.775     3.573             153.130
SSD    22.670     56.417      4.648             9.393

Not too shabby! But I honestly cannot explain why the SSD is slower than the spinner in this test; it’s not a measurement error, as it reliably reproduced.

Summary

Disk   Test                            Full Hot   Full Cold   Incremental Hot   Incremental Cold
HDD    baseline                        34.566     228.163     10.686            198.615
HDD    WITH_WINDOWS_STRIPPED_PDB=OFF   27.375     218.187     7.688             164.324
HDD    /debug:fastlink                 23.698     218.775     3.573             153.130

SSD    baseline                        28.459     60.808      11.980            24.439
SSD    WITH_WINDOWS_STRIPPED_PDB=OFF   24.381     56.774      7.627             11.887
SSD    /debug:fastlink                 22.670     56.417      4.648             9.393

For the most common scenario (incremental hot), the time for a link has been reduced to about 1/3 of what it was. It may be just 7 seconds, but it does add up over a single day of development.

None of these changes have landed yet (as of now), but I’m planning on making /debug:fastlink the new default for debug builds, and renaming WITH_WINDOWS_STRIPPED_PDB to WITH_WINDOWS_STRIPPED_RELEASE_PDB and making it only apply to release builds, as for debug builds it’s just not needed and there is a real cost associated with it.

Changes landed in:

0df3aedfa83c: CMake/MSVC: Only generate/install stripped PDB for release builds
d5e50460e7a6: CMake/MSVC: Use /debug:fastlink for debug builds
