Jemalloc on Windows


I recently ran some experiments with jemalloc in Blender on Windows (Windows 10 Pro, 64-bit).

Since I failed to build it from the source code (see here), @bzztploink gave me a precompiled release build for Windows. That is the one I used for the tests.

  • First tests: simple program

It simply loops, allocating and freeing memory, so that je_malloc can be compared with the native malloc.

The loop runs iterations times, allocating and freeing size ints on each pass. It is a very simple loop (same principle for je_malloc):

void test_malloc(int size, int iterations) {
	for (int i = 0; i != iterations; i++) {
		int *mem = (int *)malloc(size * sizeof(int));
		free(mem);  /* released right away, on every pass */
	}
}

Results (release build), given as iterations / size (number of ints) / je_malloc time / malloc time, with times in seconds:
100 * 1000 * 1000 / 50 / 2.705 / 5.371
100 * 1000 * 1000 / 500 / 2.699 / 5.339
10 * 1000 * 1000 / 10 000 / 2.125 / 1.435
10 * 1000 * 1000 / 100 000 / 2.055 / 1.755
10 * 1000 / 5 000 000 / 0.003 / 2.361
1000 / 500 000 000 / 0.001 / 22.450

So there are size ranges where malloc is faster (10 000 and 100 000 here) and others where je_malloc is faster (the small sizes and, by a wide margin, the very large ones).
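For reference, a timing loop of this kind can be sketched as below. This is a minimal sketch, not the program that produced the numbers above: time_alloc_loop, alloc_fn and free_fn are hypothetical names, and the allocator is passed in as a pair of function pointers so the same loop can be run with malloc/free and with je_malloc/je_free. Unlike the loop shown earlier, it writes one int into each block, so that a compiler cannot treat the allocation as dead code and so that at least one page of every block is really touched.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Function-pointer types so that malloc/free or je_malloc/je_free can be plugged in. */
typedef void *(*alloc_fn)(size_t);
typedef void (*free_fn)(void *);

/* Hypothetical helper: runs `iterations` allocate/free pairs of `size` ints
 * and returns the time reported by clock(), in seconds. */
double time_alloc_loop(alloc_fn alloc, free_fn release, size_t size, int iterations)
{
	assert(alloc != NULL && release != NULL);
	clock_t start = clock();
	for (int i = 0; i != iterations; i++) {
		/* volatile keeps the compiler from dropping the otherwise unused pointer */
		int *volatile mem = (int *)alloc(size * sizeof(int));
		if (mem != NULL) {
			mem[0] = i;  /* touch the block so it is actually used */
		}
		release(mem);
	}
	return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

A call such as time_alloc_loop(malloc, free, 50, 100 * 1000 * 1000) would correspond to the first row of the results; passing je_malloc and je_free instead (assuming the jemalloc header is included and the library is linked) gives the other column. Without the write, the rows with very large sizes probably measure mostly how fast address space is reserved and released, not how fast usable memory is delivered.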

But I was overconfident and decided to port it into Blender.

  • In Blender: how it is done

I did it the same way the guarded allocator can be selected: I added a --jemalloc option that replaces the MEM_xxx calls with MEM_je_xxx calls.
Then I implemented the MEM_je_xxx calls by copying/porting the MEM_lockfree_xxx functions (I did not know at that point that it could have been done ‘underneath’ MEM_lockfree).
I did it for all calls except mmap.
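The idea of switching a whole family of allocation calls with one option can be sketched with function pointers. This is only an illustration under assumptions: every name below (my_mallocN, my_freeN, alt_malloc, alt_free, my_use_alt_allocator) is hypothetical and stands in for the MEM_xxx / MEM_je_xxx pair, and the alternative allocator simply wraps malloc and counts live blocks where the real one would call je_malloc.

```c
#include <assert.h>
#include <stdlib.h>

/* All names here are hypothetical; they only mirror the MEM_xxx / MEM_je_xxx idea. */
static size_t alt_live_blocks = 0;

/* Stand-in for the alternative allocator (a real port would call je_malloc). */
static void *alt_malloc(size_t len)
{
	void *ptr = malloc(len);
	if (ptr != NULL) {
		alt_live_blocks++;
	}
	return ptr;
}

static void alt_free(void *ptr)
{
	if (ptr != NULL) {
		/* Fires if a block from the other allocator is freed through this one. */
		assert(alt_live_blocks > 0);
		alt_live_blocks--;
	}
	free(ptr);
}

/* Callers only ever go through these two pointers. */
void *(*my_mallocN)(size_t len) = malloc;
void (*my_freeN)(void *ptr) = free;

/* Called once at startup (for example when a command-line option is seen),
 * before any block has been handed out through the pointers. */
void my_use_alt_allocator(void)
{
	my_mallocN = alt_malloc;
	my_freeN = alt_free;
}
```

The switch has to happen before the first allocation: a block obtained from one implementation must never be released through the other, which is why the option is handled at startup in this sketch.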

  • In Blender: the results

First impression when running Blender: it seems faster at launch, say ‘more responsive’.
So I played a bit in the 3D view (edit mode, object mode, grabbing, duplicating) and all seemed fine.
Then I wanted to push it using an array modifier… it was fast (I don’t know how it compares to standard malloc)… but I quickly saw that it does NOT free the memory it has allocated… :(

  • In Blender: some (bad) attempts

I tried:
Adding synchronization (a mutex) around each call: same result (memory is not released).
Making sure only one thread allocates memory (it should be much slower, but it was worth a try): same result.

  • Back to simple program:

I rewrote the test functions to:

void test_malloc_keep(int size, int iterations) {
	int **array = (int **)je_malloc(iterations * sizeof(int *));
	for (int i = 0; i != iterations; i++) {
		array[i] = (int *)je_malloc(size * sizeof(int));  /* keep every block alive */
	}
	for (int i = 0; i != iterations; i++) {
		je_free(array[i]);  /* then free them all in a second pass */
	}
	je_free(array);
}

And indeed, the memory is not released.

I was hoping the mallctl options/features would be conveniently set up by default. I still don’t know whether they are, since some options can (it seems) only be set at compile time.

But this page indicates some typical setups (see the example section).

So I tried this kind of thing, without really knowing whether the given strings are correct (the documentation is a bit unclear to me):

void setup() {
	bool background_thread = true;
	int opt1 = je_mallctl("background_thread", NULL, 0, &background_thread, sizeof(background_thread));
	int dirty_decay_ms = 0;
	//int opt2 = je_mallctl("arena." STRINGIFY(MALLCTL_ARENAS_ALL) ".dirty_decay_ms", NULL, 0, &dirty_decay_ms, sizeof(size_t));
	int opt2 = je_mallctl("arenas.dirty_decay_ms", NULL, 0, &dirty_decay_ms, sizeof(dirty_decay_ms));

	int muzzy_decay_ms = 0;
	//int opt3 = je_mallctl("arena." STRINGIFY(MALLCTL_ARENAS_ALL) ".muzzy_decay_ms", NULL, 0, &muzzy_decay_ms, sizeof(size_t));
	int opt3 = je_mallctl("arenas.muzzy_decay_ms", NULL, 0, &muzzy_decay_ms, sizeof(muzzy_decay_ms));

	int narenas = 1;
	int opt4 = je_mallctl("narenas", NULL, 0, &narenas, sizeof(narenas));

	bool tcache = false;
	int opt5 = je_mallctl("tcache", NULL, 0, &tcache, sizeof(tcache));

	// One %d per return value, then the errno constants for comparison.
	printf("%d %d %d %d %d (EINVAL %d, ENOENT %d, EPERM %d, EAGAIN %d, EFAULT %d)\n", opt1, opt2, opt3, opt4, opt5, EINVAL, ENOENT, EPERM, EAGAIN, EFAULT);

	// Option string from the example section: narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0
}

But I get EINVAL or other error codes (including an unknown value of 22) for all of them.
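For reference, the return values can be decoded with a small helper like the one below. It is a hypothetical function, not part of jemalloc; the descriptions paraphrase the error list in the jemalloc manual for mallctl. With the MSVC and glibc headers EINVAL has the value 22, so the unknown 22 is most likely EINVAL. As far as the manual goes, and without having checked it against this particular Windows build: the decay controls are documented as ssize_t, so a 4-byte int in a 64-bit build gives a size mismatch; "narenas" and "tcache" do not appear as control names (the documented ones are opt.narenas, arenas.narenas, opt.tcache and thread.tcache.enabled); and the opt.* entries are read-only at run time, being set through the malloc_conf string or the MALLOC_CONF environment variable, which is the format of the string in the comment above.

```c
#include <errno.h>

/* Hypothetical helper, not part of jemalloc: maps a mallctl() return value
 * to its errno name and the meaning the jemalloc manual gives it. */
const char *describe_mallctl_error(int err)
{
	switch (err) {
	case 0:
		return "0: success";
	case EINVAL:
		return "EINVAL: newlen (or *oldlenp) does not match the size of the control's type";
	case ENOENT:
		return "ENOENT: unknown or invalid control name";
	case EPERM:
		return "EPERM: write to a read-only control, or read/write of a void one";
	case EAGAIN:
		return "EAGAIN: a memory allocation failed";
	case EFAULT:
		return "EFAULT: an interface with side effects failed";
	default:
		return "unrecognised return value";
	}
}
```

Printing describe_mallctl_error(opt1) and so on, one line per call, makes it easier to see which call fails for which reason than a single line of numbers.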

  • Conclusion (?)

What am I doing wrong?
How do I set up jemalloc correctly on Windows?

Thanks for your feedback.

Before getting into too many details, it would be interesting to find some complex scenes and time them (FPS, render time, modifier stack calculation, etc.) to see whether there are real-world, user-noticeable performance gains.
