Recursive vs. iterative functions: performance considerations

Hi,

I’ve tried 3 variants of a function that calculates the loose parts of a BMesh structure.

The benchmark is on a cube arrayed ×400 twice (so 160,000 loose parts).

The first one is iterative and uses a std::queue. Result: about 4 seconds.

The second is recursive. Result: 0.35 seconds.

The third is iterative and uses a simple array (sized to the number of vertices) in place of the std::queue. Result: 0.5 seconds.

So the recursive version seems to be the fastest. But I do not know its limits (stack size) in Blender’s compilation context, and I do not know either whether the compiler is able to convert the recursion to iteration by itself.

Do you have any information about it?

The recursive function code:

Main call:

	/* Returns an array of ints, one per vertex (indexed by vertex index), giving the loose part number of that vertex (from 0 to *part_count - 1). */
	int *bm_get_loose_parts(BMesh *bmesh, int *part_count)
	{
		clock_t t = clock();

		int num_verts = bmesh->totvert;

		int *loose_part_map = (int *)MEM_malloc_arrayN(num_verts, sizeof(int), __func__);
		memset(loose_part_map, -1, num_verts * sizeof(int));

		BMVert *vert;
		BMIter vert_iter;
		BM_ITER_MESH(vert, &vert_iter, bmesh, BM_VERTS_OF_MESH)
		{
			if (loose_part_map[BM_elem_index_get(vert)] == -1)
			{
				walk_linked_edges(bmesh, vert, (*part_count), loose_part_map);
				(*part_count)++;
			}
		}

		t = clock() - t;
		float seconds = (float)t / CLOCKS_PER_SEC;

		return loose_part_map;
	}

Recursive part:

	void walk_linked_edges(BMesh *bmesh, BMVert *vert, int loose_part, int *loose_part_map)
	{
		loose_part_map[BM_elem_index_get(vert)] = loose_part;

		BMEdge *edge;
		BMIter edge_iter;
		BM_ITER_ELEM(edge, &edge_iter, vert, BM_EDGES_OF_VERT)
		{
			BMVert *other = BM_edge_other_vert(edge, vert);

			if (loose_part_map[BM_elem_index_get(other)] == -1)
			{
				walk_linked_edges(bmesh, other, loose_part, loose_part_map);
			}
		}
	}
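
For reference, a hypothetical call site (not taken from the actual code); it assumes the caller zero-initializes the part counter and frees the returned map with MEM_freeN:

	/* Hypothetical usage sketch, not part of the original code. */
	int part_count = 0;
	int *loose_part_map = bm_get_loose_parts(bmesh, &part_count);

	/* loose_part_map[v] is the loose part index (0 .. part_count - 1) of the
	 * vertex whose index is v. */

	MEM_freeN(loose_part_map);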

That does not look like the kind of recursion that a compiler can optimize away (since there are multiple recursive calls inside a loop). Your recursion can reach a stack depth that is O(#vertices) in the worst case, so I would personally stay away from such an algorithm; for example, there may be environments where you only get about 1 MB of stack per thread. I am surprised that using a simple array made it slower than the recursive solution. What does that code look like?
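
To put rough numbers on the stack-depth concern, here is a small stand-alone sketch of the worst case (a single long edge chain); the frame-size figures in the comments are ballpark assumptions, not measurements:

	/* Stand-alone sketch (no BMesh): a "mesh" that is one long edge chain.
	 * The flood fill recurses once per unvisited vertex, so the recursion
	 * depth equals the number of vertices in the part. */
	#include <stdio.h>

	#define NUM_VERTS 50000 /* one chain: 0-1-2-...-49999 */

	static char visited[NUM_VERTS];

	/* Returns the number of vertices in the connected part of 'vert'. The
	 * result is used after each recursive call, so this is not a simple
	 * tail call that the compiler could trivially turn into a loop. */
	static int walk(int vert)
	{
		int count = 1;
		visited[vert] = 1;
		for (int d = -1; d <= 1; d += 2) {
			int other = vert + d;
			if (other >= 0 && other < NUM_VERTS && !visited[other]) {
				count += walk(other); /* one stack frame per unvisited vertex */
			}
		}
		return count;
	}

	int main(void)
	{
		/* In an unoptimized build the depth reaches NUM_VERTS here. With
		 * frames of a few dozen bytes each, 50000 frames add up to a few MB,
		 * so on a 1 MB thread stack this is expected to crash. */
		printf("part size: %d\n", walk(0));
		return 0;
	}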

@Howard_Trickey, thanks for this information.

I’ve finally used this, which is close in performance to the recursive version (but a little slower):

	int *bm_get_loose_parts(BMesh *bmesh, int *part_count)
	{
		int num_verts = bmesh->totvert;

		BMVert **vert_stack = (BMVert **)MEM_malloc_arrayN(num_verts, sizeof(BMVert *), __func__);
		STACK_DECLARE(vert_stack);
		STACK_INIT(vert_stack, num_verts);

		int *loose_part_map = (int *)MEM_malloc_arrayN(num_verts, sizeof(int), __func__);
		memset(loose_part_map, -1, num_verts * sizeof(int));

		BMVert *vert, *walk_vert;
		BMIter vert_iter;
		BM_ITER_MESH(vert, &vert_iter, bmesh, BM_VERTS_OF_MESH)
		{
			if (loose_part_map[BM_elem_index_get(vert)] == -1)
			{
				/* Mark vertices when they are pushed rather than when they are
				 * popped, so the same vertex can never be pushed twice; this
				 * keeps the stack within its num_verts allocation. */
				loose_part_map[BM_elem_index_get(vert)] = (*part_count);
				STACK_PUSH(vert_stack, vert);

				while (STACK_SIZE(vert_stack))
				{
					walk_vert = STACK_POP(vert_stack);

					BMEdge *edge;
					BMIter edge_iter;
					BM_ITER_ELEM(edge, &edge_iter, walk_vert, BM_EDGES_OF_VERT)
					{
						BMVert *other = BM_edge_other_vert(edge, walk_vert);

						if (loose_part_map[BM_elem_index_get(other)] == -1)
						{
							loose_part_map[BM_elem_index_get(other)] = (*part_count);
							STACK_PUSH(vert_stack, other);
						}
					}

				}

				(*part_count)++;
			}
		}

		MEM_freeN(vert_stack);

		return loose_part_map;
	}

But this one is really bad (about 10× slower):

	int *bm_get_loose_parts(BMesh *bmesh, int *part_count)
	{
		int num_verts = bmesh->totvert;

		std::queue<BMVert *> vert_queue;

		int *loose_part_map = (int *)MEM_malloc_arrayN(num_verts, sizeof(int), __func__);
		memset(loose_part_map, -1, num_verts * sizeof(int));

		BMVert *vert;
		BMIter vert_iter;
		BM_ITER_MESH(vert, &vert_iter, bmesh, BM_VERTS_OF_MESH)
		{
			if (loose_part_map[BM_elem_index_get(vert)] == -1)
			{
				walk_linked_edges(bmesh, vert, (*part_count), loose_part_map, vert_queue);
				(*part_count)++;
			}
		}

		return loose_part_map;
	}

	void walk_linked_edges(BMesh *bmesh, BMVert *base_vert, int loose_part, int *loose_part_map, std::queue<BMVert *> &vert_queue)
	{
		vert_queue.push(base_vert);

		while (!vert_queue.empty())
		{
			BMVert *vert = vert_queue.front();
			vert_queue.pop();

			loose_part_map[BM_elem_index_get(vert)] = loose_part;

			BMEdge *edge;
			BMIter edge_iter;
			BM_ITER_ELEM(edge, &edge_iter, vert, BM_EDGES_OF_VERT)
			{
				BMVert *other = BM_edge_other_vert(edge, vert);

				if (loose_part_map[BM_elem_index_get(other)] == -1)
				{
					vert_queue.push(other);
				}
			}

		}
	}

OK, that makes more sense (that the version using the STACK macros is close to the performance of the recursive version). I assume it is close enough that you don’t care about the slight performance difference relative to the recursive version now. I’m not sure it would make much difference, but the STACK_POP macro also tests the stack index that STACK_SIZE just tested; you could avoid the duplicate check by adding a STACK_POP_NO_CHECK macro that doesn’t do the check. I wouldn’t bother, myself. Also, did you do these timings in Optimized (not Debug) builds? That could make a difference too.
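
For illustration only, such a hypothetical macro could look like the following, assuming the index counter declared by STACK_DECLARE is spelled `_##stack##_index` (check the real STACK macro definitions for the actual naming):

	/* Hypothetical variant of STACK_POP that skips the emptiness test; only
	 * safe when the caller has just checked STACK_SIZE(), as the while loop
	 * above does. The _##stack##_index spelling is an assumption about the
	 * counter declared by STACK_DECLARE. */
	#define STACK_POP_NO_CHECK(stack) ((stack)[--_##stack##_index])

Inside the `while (STACK_SIZE(vert_stack))` loop, `STACK_POP(vert_stack)` could then be swapped for `STACK_POP_NO_CHECK(vert_stack)`.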

Your “really bad” method using std::queue is (I’m guessing) doing lots of allocations and frees.
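
To illustrate the guess (a generic sketch, not your code): std::queue sits on a std::deque by default, which allocates and frees blocks as the queue grows and shrinks, whereas a std::vector that is reserved once and reused as a LIFO work list stays off the heap in the hot loop; since the visit order does not matter when labeling connected parts, the LIFO order is fine.

	// Generic sketch with plain ints instead of BMVert pointers: the caller
	// reserves 'work' once, so the loop below does no heap allocation at all.
	#include <vector>

	void label_part(const std::vector<std::vector<int>> &adjacency,
	                std::vector<int> &part_map, int start, int part,
	                std::vector<int> &work)
	{
		work.clear(); /* keeps the capacity reserved by the caller */
		part_map[start] = part;
		work.push_back(start);

		while (!work.empty()) {
			int vert = work.back();
			work.pop_back();

			for (int other : adjacency[vert]) {
				if (part_map[other] == -1) {
					part_map[other] = part; /* mark on push, so no duplicates */
					work.push_back(other);
				}
			}
		}
	}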

@Howard_Trickey, indeed, everything was tested in Debug builds.

Concerning std::queue, that was surprising to me. I usually work with C#, where this kind of structure seems to be much better optimized. It has been a very long time since I last used the std libs (more than 15 years now), and I did not remember that. But never mind; apart from the allocation size of the STACK, all is good!

Thanks