Loading data for GCC's vector extensions
GCC's vector extensions offer a nice, reasonably portable way of accessing some SIMD instructions on different hardware architectures without resorting to hardware-specific intrinsics (or auto-vectorization).
A real use case is calculating a simple additive checksum. The one thing that isn't clear is how to safely load data into a vector.
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef char v16qi __attribute__ ((vector_size(16)));

    static uint8_t checksum(uint8_t *buf, size_t size)
    {
        assert(size%16 == 0);
        uint8_t sum = 0;
        v16qi vec = {0};
        for (size_t i=0; i<(size/16); i++)
        {
            // XXX: Yuck! Is there a better way?
            // Cast first, then index, so block i starts at byte offset i*16.
            vec += ((v16qi*) buf)[i];
        }
        // Sum up the vector
        sum = vec[0] + vec[1] + vec[2] + vec[3] + vec[4] + vec[5] + vec[6] + vec[7] + vec[8] + vec[9] + vec[10] + vec[11] + vec[12] + vec[13] + vec[14] + vec[15];
        return sum;
    }
Casting a pointer to the vector type appears to work, but I'm worried this might explode in a horrible fashion if SIMD hardware expects the vector types to be correctly aligned.
The only other option I've thought of is to use a temporary vector and explicitly load the values (via either a memcpy or element-wise assignment), but in testing this counteracted most of the speedup gained from using SIMD instructions. Ideally I'd imagine this would be something like a generic __builtin_load() function, but none seems to exist.
What's a safer way of loading data into a vector without risking alignment issues?
Comments (2)
Edit (thanks Peter Cordes): You can cast pointers:
This compiles to vmovdqa to load and vmovups to store. If the data isn't known to be aligned, set aligned (1) to generate vmovdqu. (godbolt)
Note that there are also several special-purpose builtins for loading and storing these registers (Edit 2):
It seems to be necessary to use -flax-vector-conversions to go from chars to v16qi with this function.
See also: C - How to access elements of vector using GCC SSE vector extension
See also: SSE loading ints into __m128
(Tip: The best phrase to google is something like "gcc loading __m128i".)
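A minimal sketch of the pointer-cast approach described in this answer, written for the case where the buffer's alignment is unknown. The names v16u8, v16u8_u and checksum_cast are illustrative and do not come from the answer. The sketch assumes unsigned elements (so wraparound is well defined) and adds may_alias to the access type, which is an assumption of this example meant to keep the cast compatible with strict aliasing. Which load instruction is actually emitted depends on the target and compiler flags, so the disassembly is still worth checking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value type: sixteen unsigned bytes with the default vector alignment. */
typedef uint8_t v16u8 __attribute__ ((vector_size (16)));

/* Access type (illustrative): aligned (1) means the compiler may not assume
   any alignment for the pointee; may_alias is an assumption of this sketch
   that permits reading uint8_t data through the vector type. */
typedef uint8_t v16u8_u __attribute__ ((vector_size (16), aligned (1), may_alias));

/* Hypothetical helper: additive checksum over size bytes, size % 16 == 0. */
static uint8_t checksum_cast(const uint8_t *buf, size_t size)
{
    assert(size % 16 == 0);

    v16u8 vec = {0};
    for (size_t i = 0; i < size; i += 16) {
        /* Load 16 bytes from a possibly unaligned address; the explicit
           cast converts the access type back to the plain value type. */
        v16u8 chunk = (v16u8) *(const v16u8_u *)(buf + i);
        vec += chunk;
    }

    /* Horizontal sum of the sixteen lanes, modulo 256. */
    uint8_t sum = 0;
    for (int j = 0; j < 16; j++)
        sum += vec[j];
    return sum;
}
```

If the buffer is known to be 16-byte aligned, replacing aligned (1) with aligned (16) in the access type gives the compiler permission to use an aligned load instead.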
You could use an initializer to load the values, i.e. do
and hope that GCC turns this into an SSE load instruction. I'd verify that with a disassembler, though ;-). Also, for better performance, try to make buf 16-byte aligned, and inform the compiler via an aligned attribute. If you can't guarantee that the input buffer will be aligned, process it bytewise until you've reached a 16-byte boundary.
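The two suggestions in this answer can be sketched together as follows. All names here (v16u8, load16_init, checksum_peeled) are hypothetical, the element type is assumed to be unsigned so that wraparound is well defined, and __builtin_assume_aligned is used as one possible way to pass the alignment guarantee to the compiler rather than the aligned attribute the answer mentions. Whether the initializer becomes a single load instruction is up to the optimizer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t v16u8 __attribute__ ((vector_size (16)));

/* Build a vector from 16 consecutive bytes with an element-wise initializer.
   The compiler is free to turn this into one load, but is not required to. */
static v16u8 load16_init(const uint8_t *p)
{
    v16u8 v = { p[0], p[1], p[2],  p[3],  p[4],  p[5],  p[6],  p[7],
                p[8], p[9], p[10], p[11], p[12], p[13], p[14], p[15] };
    return v;
}

/* Hypothetical checksum: bytewise head up to the first 16-byte boundary,
   vector body on aligned blocks, bytewise tail for the remainder. */
static uint8_t checksum_peeled(const uint8_t *buf, size_t size)
{
    assert(buf != NULL || size == 0);

    uint8_t sum = 0;
    size_t i = 0;

    /* Head: advance one byte at a time until buf + i is 16-byte aligned. */
    while (i < size && ((uintptr_t)(buf + i) % 16) != 0)
        sum += buf[i++];

    /* Body: every block starts on a 16-byte boundary from here on. */
    v16u8 vec = {0};
    for (; size - i >= 16; i += 16) {
        const uint8_t *p = __builtin_assume_aligned(buf + i, 16);
        vec += load16_init(p);
    }

    /* Tail: fewer than 16 bytes remain. */
    for (; i < size; i++)
        sum += buf[i];

    /* Fold the sixteen lanes into the scalar sum, modulo 256. */
    for (int j = 0; j < 16; j++)
        sum += vec[j];
    return sum;
}
```

Unlike the checksum in the question, this version accepts any size, since the head and tail loops absorb whatever does not fit into whole aligned blocks.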