Loading data for GCC's vector extensions

Posted 2025-01-06 13:09:15

GCC's vector extensions offer a nice, reasonably portable way of accessing some SIMD instructions on different hardware architectures without resorting to hardware specific intrinsics (or auto-vectorization).

A real use case is calculating a simple additive checksum. The one thing that isn't clear is how to safely load data into a vector.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef char v16qi __attribute__ ((vector_size(16)));

static uint8_t checksum(uint8_t *buf, size_t size)
{
    assert(size%16 == 0);
    uint8_t sum = 0;

    v16qi vec = {0};
    for (size_t i=0; i<(size/16); i++)
    {
        // XXX: Yuck! Is there a better way?
        // The cast binds tighter than +, so the byte offset must be parenthesized.
        vec += *((v16qi*) (buf+i*16));
    }

    // Sum up the vector
    sum = vec[0] + vec[1] + vec[2] + vec[3] + vec[4] + vec[5] + vec[6] + vec[7] + vec[8] + vec[9] + vec[10] + vec[11] + vec[12] + vec[13] + vec[14] + vec[15];

    return sum;
}

Casting a pointer to the vector type appears to work, but I'm worried this might explode in a horrible fashion if SIMD hardware expects the vector types to be correctly aligned.

The only other option I've thought of is to use a temporary vector and explicitly load the values (via either memcpy or element-wise assignment), but in testing this counteracts most of the speedup gained from using SIMD instructions. Ideally I'd imagine this would be something like a generic __builtin_load() function, but none seems to exist.
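For concreteness, the memcpy variant mentioned above could be sketched as follows. The names v16u8, load_v16u8 and checksum_memcpy are hypothetical, and uint8_t lanes are used instead of char so that lane sums wrap modulo 256. Whether the compiler folds the memcpy into a single unaligned load depends on the compiler version and optimization level, so the generated assembly is worth checking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical vector type: 16 lanes of uint8_t. */
typedef uint8_t v16u8 __attribute__ ((vector_size(16)));

/* Hypothetical helper: copy 16 bytes from any address into a vector.
   memcpy makes no alignment or aliasing assumptions about p. */
static inline v16u8 load_v16u8(const uint8_t *p)
{
    v16u8 v;
    memcpy(&v, p, sizeof v);
    return v;
}

static uint8_t checksum_memcpy(const uint8_t *buf, size_t size)
{
    assert(size % 16 == 0);   /* same precondition as the question's code */

    v16u8 acc = {0};
    for (size_t i = 0; i < size; i += 16)
        acc += load_v16u8(buf + i);

    /* Horizontal sum of the 16 lanes, wrapping modulo 256. */
    uint8_t sum = 0;
    for (int k = 0; k < 16; k++)
        sum += acc[k];
    return sum;
}
```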

What's a safer way of loading data into a vector without risking alignment issues?

双手揣兜 2025-01-13 13:09:15

Edit (thanks Peter Cordes): You can cast pointers:

typedef char v16qi __attribute__ ((vector_size (16), aligned (16)));

v16qi vec = *(v16qi*)&buf[i]; // load
*(v16qi*)(buf + i) = vec; // store whole vector

This compiles to vmovdqa to load and vmovups to store. If the data isn't known to be aligned, set aligned (1) to generate vmovdqu. (godbolt)
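As a hedged sketch of the aligned (1) variant described above, the checksum could be written with two typedefs: one for values held in registers and one, with alignment 1, used only for dereferencing. The names v16u8, v16u8_u and checksum_cast are assumptions made for illustration, as is the added may_alias attribute, which is there to keep the cast safe under strict aliasing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical value type: 16 lanes of uint8_t with natural alignment. */
typedef uint8_t v16u8 __attribute__ ((vector_size(16)));

/* Hypothetical access type: same layout, but alignment 1 and may_alias,
   so dereferencing it promises nothing about the address. */
typedef uint8_t v16u8_u __attribute__ ((vector_size(16), aligned(1), may_alias));

static uint8_t checksum_cast(const uint8_t *buf, size_t size)
{
    assert(size % 16 == 0);

    v16u8 acc = {0};
    for (size_t i = 0; i < size; i += 16) {
        /* Load through the alignment-1 type; i is a byte offset. */
        v16u8 chunk = *(const v16u8_u *)(buf + i);
        acc += chunk;
    }

    /* Horizontal sum of the 16 lanes, wrapping modulo 256. */
    uint8_t sum = 0;
    for (int k = 0; k < 16; k++)
        sum += acc[k];
    return sum;
}
```

If the buffer is known to be 16-byte aligned, the alignment-1 typedef can be swapped for one with aligned (16) without changing the rest of the function.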

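Note that there are also several special-purpose intrinsics (declared in <emmintrin.h>) for loading and storing these registers (Edit 2):

v16qi vec = _mm_loadu_si128((__m128i*)&buf[i]); // _mm_load_si128 for aligned
_mm_storeu_si128((__m128i*)&buf[i], vec); // _mm_store_si128 for aligned

It seems to be necessary to compile with -flax-vector-conversions for the conversion to v16qi to be accepted with these functions.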
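One way to avoid depending on -flax-vector-conversions is to cast explicitly between __m128i and the vector type, which GCC permits for vector types of the same size. The sketch below is x86-only (SSE2); the names v16u8 and checksum_intrin are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_storeu_si128 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vector type: 16 lanes of uint8_t. */
typedef uint8_t v16u8 __attribute__ ((vector_size(16)));

static uint8_t checksum_intrin(const uint8_t *buf, size_t size)
{
    assert(size % 16 == 0);

    v16u8 acc = {0};
    for (size_t i = 0; i < size; i += 16) {
        /* Unaligned load; the explicit cast reinterprets the 128 bits
           as 16 uint8_t lanes. */
        acc += (v16u8)_mm_loadu_si128((const __m128i *)(buf + i));
    }

    /* Unaligned store of the accumulator into a plain byte array. */
    uint8_t lanes[16];
    _mm_storeu_si128((__m128i *)lanes, (__m128i)acc);

    uint8_t sum = 0;
    for (int k = 0; k < 16; k++)
        sum += lanes[k];
    return sum;
}
```

This ties the code to x86, which gives up the portability that motivated the vector extensions in the first place.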

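See also: C - How to access elements of vector using GCC SSE vector extension
See also: SSE loading ints into __m128

(Tip: The best phrase to google is something like "gcc loading __m128i".)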

失而复得 2025-01-13 13:09:15

You could use an initializer to load the values, i.e. do

const v16qi e = { buf[0], buf[1], ... , buf[15] };

and hope that GCC turns this into an SSE load instruction. I'd verify that with a disassembler, though ;-). Also, for better performance, try to make buf 16-byte aligned, and inform the compiler via an aligned attribute. If you can't guarantee that the input buffer will be aligned, process it bytewise until you've reached a 16-byte boundary.
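The "process it bytewise until you've reached a 16-byte boundary" idea could look roughly like the sketch below. The names v16u8, v16u8_a and checksum_aligned_body are hypothetical, and the trailing scalar loop for leftover bytes is an addition not mentioned in the answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical value type: 16 lanes of uint8_t. */
typedef uint8_t v16u8 __attribute__ ((vector_size(16)));

/* Hypothetical access type: tells the compiler the address is 16-byte aligned. */
typedef uint8_t v16u8_a __attribute__ ((vector_size(16), aligned(16), may_alias));

static uint8_t checksum_aligned_body(const uint8_t *buf, size_t size)
{
    uint8_t sum = 0;
    size_t i = 0;

    /* Head: scalar bytes until buf + i sits on a 16-byte boundary. */
    while (i < size && (uintptr_t)(buf + i) % 16 != 0)
        sum += buf[i++];
    assert(i == size || (uintptr_t)(buf + i) % 16 == 0);

    /* Body: aligned 16-byte vector loads. */
    v16u8 acc = {0};
    for (; i + 16 <= size; i += 16) {
        v16u8 chunk = *(const v16u8_a *)(buf + i);
        acc += chunk;
    }
    for (int k = 0; k < 16; k++)
        sum += acc[k];

    /* Tail: remaining bytes that do not fill a whole vector. */
    for (; i < size; i++)
        sum += buf[i];

    return sum;
}
```

Because every load in the body is aligned, this version does not rely on the hardware tolerating misaligned vector accesses.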
