Although the documentation is weird, I would not compare CPUs with GPUs. GPUs aren’t standardized and fall behind faster than CPUs, which follow a standardized instruction set with selected extensions.
I also agree with LazyDodo: we can discuss it, but I would wait for a major version bump to actually implement it. Until then it would be good to have some figures on the benefits.
Have you tested how much difference it makes? I can imagine the new mesh code would benefit from the wider registers, but most code wouldn’t. libc functions should already use wider registers when available.
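To illustrate the "when available" part: a library can check the CPU at runtime and pick a code path, without raising the minimum requirement for everyone. glibc, for example, selects optimized variants of functions like `memcpy` at load time. Below is a minimal sketch of the same idea, assuming GCC or Clang on x86; `best_simd_level` is a hypothetical helper name, not something from our code base.

```c
/* Returns the name of the widest SIMD level this sketch knows how to detect.
   Assumption: GCC/Clang on x86, where the __builtin_cpu_* builtins exist.
   On any other compiler or architecture it simply reports "baseline". */
static const char *best_simd_level(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init(); /* make sure the CPU feature data is initialized */
    if (__builtin_cpu_supports("avx2")) {
        return "avx2";
    }
    if (__builtin_cpu_supports("sse4.1")) {
        return "sse4.1";
    }
#endif
    return "baseline";
}
```

A caller would branch on the result once at startup and store a function pointer to the matching implementation, so hot loops pay no per-call detection cost.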
My experience with coding algorithms for SSE4/AVX is that it can help, but the code needs restructuring to get better results.
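A typical example of the kind of restructuring I mean is going from an array of structures to separate arrays per component, so that each loop iteration reads contiguous floats the compiler can load into one wide register. This is only a sketch with made-up names (`PointAoS`, `length_sq_aos`, `length_sq_soa`), not code from the mesh module, and whether it pays off would still need measuring.

```c
#include <stddef.h>

/* Array of structures: x, y and z of one point are interleaved in memory,
   so filling a wide register with "8 x values" needs strided loads. */
typedef struct {
    float x, y, z;
} PointAoS;

void length_sq_aos(const PointAoS *pts, size_t n, float *out)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = pts[i].x * pts[i].x + pts[i].y * pts[i].y + pts[i].z * pts[i].z;
    }
}

/* Structure of arrays: each component is contiguous, so consecutive
   iterations touch consecutive floats. The restrict qualifiers tell the
   compiler the arrays do not overlap, which helps auto-vectorization. */
void length_sq_soa(const float *restrict x,
                   const float *restrict y,
                   const float *restrict z,
                   size_t n,
                   float *restrict out)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = x[i] * x[i] + y[i] * y[i] + z[i] * z[i];
    }
}
```

Both functions compute the same result; only the memory layout differs. Just enabling SSE4/AVX in the compiler flags does not change the layout, which is why the gain from a higher minimum requirement alone may be small.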
So I'm not against it; it's just a matter of timing, and of having some figures to actually see the impact.