bpy.data.images perf issues

I’m generating some textures that I want to stuff into bpy.data.images; however, I’m running into severe performance issues when setting the actual content.

I’ve narrowed it down to setting image.pixels being the biggest time hog. I did some googling and found that setting img.pixels[:] would be faster, and indeed it is, but it’s still far, far too slow:

| texture size | pixels = data | pixels[:] = data | pixels = data.tolist() | pixels[:] = data.tolist() |
| --- | --- | --- | --- | --- |
| 512px | 0.101513s | 0.069509s | 0.063008s | 0.041005s |
| 1024px | 0.406052s | 0.280535s | 0.23653s | 0.168522s |
| 2048px | 1.560198s | 1.159647s | 0.94512s | 0.656084s |
| 4096px | 6.460321s | 4.418561s | 3.999507s | 2.809357s |

Repro case:

import bpy
import numpy as np
from datetime import datetime

def benchmark(size):

    img = bpy.data.images.get("output")
    if img is None:
        img = bpy.data.images.new("output", width=size, height=size, alpha=False, float_buffer=True)
    if (size != img.generated_width):
        img.generated_width = size
    if (size != img.generated_height):
        img.generated_height = size 

    pixel_data = np.ones(size*size*4)

    t0 = datetime.now()
    img.pixels = pixel_data
    t1 = datetime.now()
    meth1 = (t1-t0).total_seconds()

    t0 = datetime.now()
    img.pixels[:] = pixel_data
    t1 = datetime.now()
    meth2 = (t1-t0).total_seconds()

    t0 = datetime.now()
    img.pixels = pixel_data.tolist()
    t1 = datetime.now()
    meth3 = (t1-t0).total_seconds()

    t0 = datetime.now()
    img.pixels[:] = pixel_data.tolist()
    t1 = datetime.now()
    meth4 = (t1-t0).total_seconds()

    
    print("|%spx|%ss|%ss|%ss|%ss|" % ( size , meth1, meth2,meth3,meth4))

benchmark(512)
benchmark(1024)
benchmark(2048)
benchmark(4096)

Any ideas here? I have found some previous work in this area, but that seems to deal with the reading part, and I don’t think it ever went anywhere.

edit1:
Updated the benchmark code to include .tolist(), as @kaio suggested.


Looks like one performance issue is reading from the numpy array.

Try changing:

pixel_data = np.ones(size*size*4)

to:

pixel_data = np.ones(size*size*4).tolist()

About 5 times faster using direct assignment.
About 8 times faster using slice assignment.
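
For reference, here is roughly what that comparison looks like in isolation, as a sketch; it assumes a float-buffer image named "output" of the matching size already exists:

import bpy
import numpy as np
from datetime import datetime

size = 2048
img = bpy.data.images["output"]  # assumed: an existing size x size float-buffer image

# convert once, outside the timed region
pixel_data = np.ones(size * size * 4).tolist()

t0 = datetime.now()
img.pixels = pixel_data      # direct assignment
print("direct:", (datetime.now() - t0).total_seconds())

t0 = datetime.now()
img.pixels[:] = pixel_data   # slice assignment
print("slice:", (datetime.now() - t0).total_seconds())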

Edit: On further testing with size = 8192, I noticed something else:

  • Initializing a numpy array of ones: 3.8 secs
  • Initializing a python list of ones: 0.6 secs (pixel_data = [1] * size ** 2 * 4)

  • Numpy array with slice assignment: 5.49 secs
  • Python list total with slice assignment: 3.89 secs

  • Mem peak during test with numpy array: 12GB
  • Mem peak during test with python list: 4GB

A list converted from a numpy array does seem to get its data assigned to img.pixels about twice as fast as a list that originates from Python. I’m taking a wild guess that this has to do with the array’s data being contiguous, which would also explain the extra memory allocation time, but I could be wrong.
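
If anyone wants to reproduce the comparison between the two list origins, something like this should do (a sketch; the numbers above were measured at size = 8192, and it assumes the same "output" image as before):

import bpy
import numpy as np
from datetime import datetime

size = 2048
img = bpy.data.images["output"]  # assumed: an existing size x size float-buffer image

list_from_numpy = np.ones(size * size * 4, dtype=np.float32).tolist()
list_from_python = [1.0] * (size * size * 4)

for label, data in (("numpy -> list", list_from_numpy), ("python list", list_from_python)):
    t0 = datetime.now()
    img.pixels[:] = data
    print(label, (datetime.now() - t0).total_seconds())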


Updated the top post to include tolist(), and there’s no arguing it is faster, but I’d still like this to be in the milliseconds, not seconds. I can’t lock up the UI for 10-60 seconds when I update a few 4K-8K textures while Blender does whatever it does for what should essentially be a glorified memcpy.


All right, try this:

img.pixels[:] = mathutils.Vector.Fill(4, 1)[:] * size * size

This finishes in 2.19 secs on 8K for me.

Not exactly in the milliseconds as you wanted, though.
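
Spelled out as a full snippet, the experiment was roughly this (a sketch, assuming the same "output" image as in the benchmark):

import bpy
import mathutils
from datetime import datetime

size = 8192
img = bpy.data.images["output"]  # assumed: an existing size x size float-buffer image

t0 = datetime.now()
# Vector.Fill(4, 1)[:] expands to a flat Python sequence of four 1.0 values;
# repeating it builds the whole pixel sequence without going through numpy
img.pixels[:] = mathutils.Vector.Fill(4, 1)[:] * size * size
print((datetime.now() - t0).total_seconds())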


The buffer that holds my data is compatible with the Python buffer protocol. I just used numpy as an example in the benchmark since it’s compatible too, without complicating the example any further than I have to.

So changing much on the buffer side of things isn’t going to help me all that much; the generating part of the problem has already been taken care of efficiently (around 150 ms for a 4K texture). It’s really the getting-it-into-Blender part that is agonizingly slow.
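
Just to illustrate, numpy isn’t special here; any object exposing the buffer protocol holds the data the same way, e.g. the standard library’s array module (purely illustrative, not my actual data source):

import array
import bpy

size = 512
img = bpy.data.images["output"]  # assumed: an existing size x size float-buffer image

# array.array exposes the Python buffer protocol just like a numpy float32 array,
# so in principle its contents could be copied straight into the image
pixel_data = array.array("f", [1.0]) * (size * size * 4)
img.pixels = pixel_data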

Aha. Looks like I misunderstood the issue.

Nah, you understood the issue; you just started optimizing the whole thing, including the part of the repro case I don’t have any issues with.

Can’t blame you, optimizing is fun :wink:


I took a quick stab at using the buffer protocol. It’s working and looking rather hopeful, but it’s going to need some spit and polish and testing before I can submit it.

| size | time in seconds |
| --- | --- |
| 512px | 0.004001s |
| 1024px | 0.015502s |
| 2048px | 0.056007s |
| 4096px | 0.208527s |
| 8192px | 0.86661s |

Just in case anyone wants to play:

 source/blender/python/intern/bpy_rna_array.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/source/blender/python/intern/bpy_rna_array.c b/source/blender/python/intern/bpy_rna_array.c
index 4cc3c4c0fae..4a6943fd41f 100644
--- a/source/blender/python/intern/bpy_rna_array.c
+++ b/source/blender/python/intern/bpy_rna_array.c
@@ -303,6 +303,15 @@ static int validate_array(PyObject *rvalue, PointerRNA *ptr, PropertyRNA *prop,
 
 	/* validate type first because length validation may modify property array length */
 
+	/* TODO:  This needs some actual checking of things. */
+	if (PyObject_CheckBuffer(rvalue)) {
+		Py_buffer buf;
+		if (PyObject_GetBuffer(rvalue, &buf, PyBUF_C_CONTIGUOUS) == 0) {
+			*totitem = buf.len / buf.itemsize;
+			PyBuffer_Release(&buf);
+			return 0;
+		}
+	}
 
 #ifdef USE_MATHUTILS
 	if (lvalue_dim == 0) { /* only valid for first level array */
@@ -394,6 +403,15 @@ static char *copy_values(
 		return NULL;
 	}
 
+	if (PyObject_CheckBuffer(seq))
+	{
+		Py_buffer buf;
+		if (PyObject_GetBuffer(seq, &buf, PyBUF_C_CONTIGUOUS) == 0) {
+			memcpy(data, buf.buf, min(buf.len, seq_size * sizeof(float))); 
+			PyBuffer_Release(&buf);
+			return data;
+		}
+	}
 
 #ifdef USE_MATHUTILS
 	if (dim == 0) {

Also updated the benchmark code, because the one in the opening post was accidentally producing doubles which needed conversion, so that took time/memory as well:

import bpy
import numpy as np
from datetime import datetime

def benchmark(size):

    img = bpy.data.images.get("output")
    if img is None:
        img = bpy.data.images.new("output", width=size, height=size, alpha=False, float_buffer=True)
    if (size != img.generated_width):
        img.generated_width = size
    if (size != img.generated_height):
        img.generated_height = size 

    pixel_data = np.ones(size*size*4, dtype=np.float32)

    t0 = datetime.now()
    img.pixels = pixel_data
    t1 = datetime.now()
    meth1 = (t1-t0).total_seconds()

    print("|%spx|%ss" % ( size , meth1 ))

benchmark(512)
benchmark(1024)
benchmark(2048)
benchmark(4096)
benchmark(8192)

Is there any way to get the ball rolling on this? Trying this out in current Blender 2.80, moving around 8K textures consumes all of the memory on my 16GB system, while your version barely increases memory usage in any noticeable way and is 50x faster.

It would be a huge improvement to be able to move image data this way. I also tried creating the reading counterpart of the Python buffer protocol implementation (pixel_data = np.array(img.pixels) in addition to img.pixels = pixel_data), but so far I couldn’t figure it out.

glTF export currently relies on this, and is quite painful. https://developer.blender.org/T68822


This is a really interesting topic!

D7053 (available in 2.83+) has brought some massive improvements in this area by adding a foreach_set method to the pixels.

| size | pixels = data | pixels.foreach_set(data) |
| --- | --- | --- |
| 512px | 0.227485s | 0.001956s |
| 1024px | 0.92182s | 0.006835s |
| 2048px | 3.464349s | 0.026355s |
| 4096px | 14.39048s | 0.092298s |
| 8192px | 56.567422s | 0.346667s |
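
For anyone landing here, usage is roughly this (a sketch, assuming an existing float-buffer image named "output"; foreach_get is the matching call for reading pixels back):

import bpy
import numpy as np

img = bpy.data.images["output"]  # assumed: an existing float-buffer image

# write: copy a flat float32 array into the image in one call
pixel_data = np.ones(img.size[0] * img.size[1] * 4, dtype=np.float32)
img.pixels.foreach_set(pixel_data)
img.update()

# read: fill a pre-allocated array with the current pixel values
readback = np.empty(len(img.pixels), dtype=np.float32)
img.pixels.foreach_get(readback)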

Great! What if we’re sending the array to a GL buffer? It seems to suffer from the same issues.

@LazyDodo
Hello my beautiful friend, I have a question about a very specific part of your bench code:

if (size != img.generated_width):
    img.generated_width = size
if (size != img.generated_height):
    img.generated_height = size 

This strikes me as odd. Is this some edge case? What are you guarding against? I’ve also noticed that my 2048x2048 DDS textures have a generated height and width of 1024 each, and this has puzzled me ever since. The Blender docs are, as usual, absolutely useless.

On the subject of this thread, switching from direct buffer assignment to slice assignment saved me 0.5s as well.

Now, the problem though: while .foreach_set is indeed fast, why are both numpy performance and buffer assignment way slower under 2.9x than under 2.8x? I get 4s total instead of 2s total (numpy perf differs by 0.3-0.4s) between these versions. What gives?

I have a question about a very specific part of your bench code:

It uses the same image for benchmarking all sizes. If you want to benchmark a 1024x1024 image and you just ran the 512 benchmark, you’d want to update the image size so it is actually the size you want to test on.
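
In other words, the guard just skips touching the properties when the size already matches. The resize itself looks like this (a sketch; it assumes Blender reallocates the generated image’s pixel buffer when generated_width/generated_height change, which is what the benchmark relies on):

import bpy

img = bpy.data.images.new("output", width=512, height=512, alpha=False, float_buffer=True)

# changing the generated size reallocates the image's pixel buffer
img.generated_width = 1024
img.generated_height = 1024
print(img.size[:], len(img.pixels))  # expect (1024, 1024) and 1024 * 1024 * 4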
