How do I store values in non-contiguous memory locations with SSE intrinsics?

Posted on 2024-09-27 19:00:22

I'm very new to SSE and have optimized a section of code using intrinsics. I'm pleased with the operation itself, but I'm looking for a better way to write the result. The results end up in three __m128i variables.

What I'm trying to do is store specific bytes from the result values to non-contiguous memory locations. I'm currently doing this:

__m128i values0,values1,values2;

/*Do stuff and store the results in values0, values1, and values2*/

y[0]        = (BYTE)_mm_extract_epi16(values0,0);
cb[2]=cb[3] = (BYTE)_mm_extract_epi16(values0,2);
y[3]        = (BYTE)_mm_extract_epi16(values0,4);
cr[4]=cr[5] = (BYTE)_mm_extract_epi16(values0,6);

cb[0]=cb[1] = (BYTE)_mm_extract_epi16(values1,0);
y[1]        = (BYTE)_mm_extract_epi16(values1,2);
cr[2]=cr[3] = (BYTE)_mm_extract_epi16(values1,4);
y[4]        = (BYTE)_mm_extract_epi16(values1,6);

cr[0]=cr[1] = (BYTE)_mm_extract_epi16(values2,0);
y[2]        = (BYTE)_mm_extract_epi16(values2,2);
cb[4]=cb[5] = (BYTE)_mm_extract_epi16(values2,4);
y[5]        = (BYTE)_mm_extract_epi16(values2,6);

Where y, cb, and cr are byte (unsigned char) arrays. This seems wrong to me for reasons I can't define. Does anyone have any suggestions for a better way?

Thanks!

Comments (4)

柳若烟 2024-10-04 19:00:22

You basically can't -- SSE doesn't have a scatter store, and it's sort of all designed around the idea of doing vectorized work on contiguous data streams. Really, most of the work involved in making something SIMD is rearranging your data so that it is contiguous and vectorizable. So the best thing to do is rearrange your data structures so that you can write to them 16 bytes at a time. Don't forget that you can reorder the components inside your SIMD vector before you commit them to memory.
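
As a rough illustration of reordering inside the registers before committing anything to memory, the sketch below narrows the three result vectors to bytes and writes them with a single 16-byte store. It assumes each result sits in the low byte of a 32-bit lane (which is what the extract indices 0, 2, 4, 6 in the question suggest) and uses only SSE2. The helper name pack_results and the interleaved output layout are made up for the example, not something from the question.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: narrow the low byte of each 32-bit lane of v0, v1
   and v2 into one vector and write it with a single unaligned 16-byte store.
   out[0..3] come from v0, out[4..7] from v1, out[8..11] from v2;
   out[12..15] repeat the v2 bytes and can be ignored. */
static void pack_results(uint8_t out[16], __m128i v0, __m128i v1, __m128i v2)
{
    const __m128i low_byte = _mm_set1_epi32(0xFF);

    /* Keep only the low byte of each lane so the saturating packs below
       cannot clamp anything. */
    v0 = _mm_and_si128(v0, low_byte);
    v1 = _mm_and_si128(v1, low_byte);
    v2 = _mm_and_si128(v2, low_byte);

    __m128i w01   = _mm_packs_epi32(v0, v1);     /* 8 x 16-bit: v0 lanes, then v1 lanes */
    __m128i w22   = _mm_packs_epi32(v2, v2);     /* 8 x 16-bit: v2 lanes, twice */
    __m128i bytes = _mm_packus_epi16(w01, w22);  /* 16 x 8-bit */

    _mm_storeu_si128((__m128i *)out, bytes);
}
```

If the consumer of y, cb and cr can be changed to read this packed layout, the twelve scalar extracts collapse into three ANDs, three packs and one store. If the planar layout has to stay, the scatter still has to happen somewhere.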

Failing that, the PEXTRW op (_mm_extract_epi16 intrinsic) is pretty much the only way to pull a 16-bit value out of an SSE register into an integer register. The other approach available to you is to use the unpack and shuffle ops (_mm_shuffle_ps etc.) to rotate data into the low 32-bit element of the register and then use MOVSS/_mm_store_ss() to store that element to memory, one at a time.
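
The two options can be sketched side by side. This is a minimal SSE2-only illustration, again assuming each result occupies the low byte of a 32-bit lane as in the question. Note that the second function swaps in the integer counterparts _mm_srli_si128 and _mm_cvtsi128_si32 (MOVD) rather than _mm_shuffle_ps and _mm_store_ss, since the data here is integer. Both helper names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Option 1: PEXTRW. The index must be a compile-time constant and counts
   16-bit words, so 32-bit lane k starts at word 2*k. */
static void lanes_via_extract(uint8_t dst[4], __m128i v)
{
    dst[0] = (uint8_t)_mm_extract_epi16(v, 0);
    dst[1] = (uint8_t)_mm_extract_epi16(v, 2);
    dst[2] = (uint8_t)_mm_extract_epi16(v, 4);
    dst[3] = (uint8_t)_mm_extract_epi16(v, 6);
}

/* Option 2: shift the wanted lane down to the bottom of the register
   (byte shift by 4*k) and read the low 32 bits with _mm_cvtsi128_si32. */
static void lanes_via_shift(uint8_t dst[4], __m128i v)
{
    dst[0] = (uint8_t)_mm_cvtsi128_si32(v);
    dst[1] = (uint8_t)_mm_cvtsi128_si32(_mm_srli_si128(v, 4));
    dst[2] = (uint8_t)_mm_cvtsi128_si32(_mm_srli_si128(v, 8));
    dst[3] = (uint8_t)_mm_cvtsi128_si32(_mm_srli_si128(v, 12));
}
```

The destinations are written to a small array only to keep the sketch short; each assignment could just as well target y[0], cb[2], and so on.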

You may find that going through a union performs poorly due to a subtle CPU implementation detail called a load-hit-store stall. A union pushes the data through memory: the processor first writes the SSE data out and then reads it back in smaller pieces, and in some cases the load has to stall and wait until the store clears before any dependent instructions can run. The same applies on architectures that have no direct path between the vector and general purpose registers, where every such move goes through memory. On x86, PEXTRW and MOVD do transfer directly between the XMM and general purpose registers, so the concern there is mainly the union route.

豆芽 2024-10-04 19:00:22

I don't know about SSE specifically, but generally the whole point of vectorised units is that they can operate very fast provided the data obeys particular alignment and formatting. So it's up to you to provide and extract the data in the correct format and alignment.

娇女薄笑 2024-10-04 19:00:22

SSE does not have the scatter/gather functionality that you need, although this is probably coming in future SIMD architectures.

As has already been suggested, you can use a union, e.g.:

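#include <emmintrin.h>  /* __m128i */
#include <stdint.h>     /* uint8_t, uint16_t, uint32_t */

/* Overlay a 128-bit SSE register with byte, word and dword views. */
typedef union
{
    __m128i v;
    uint8_t a8[16];
    uint16_t a16[8];
    uint32_t a32[4];
} U128;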

Ideally this kind of manipulation only happens outside any critical loops, as it's very inefficient compared to straightforward SIMD operations on contiguous data elements.
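
As a sketch of how the union might be used for the first of the question's three vectors: the function name scatter_values0 is hypothetical, the mapping is copied from the question, and the byte offsets assume little-endian x86 with each result in the low byte of a 32-bit lane. Reading a union member other than the one last written is fine in C, but is not guaranteed by the C++ standard.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

typedef union
{
    __m128i v;
    uint8_t a8[16];
    uint16_t a16[8];
    uint32_t a32[4];
} U128;

/* Scatter the low byte of each 32-bit lane of values0 following the
   question's mapping. On little-endian x86, lane k starts at a8[4*k]. */
static void scatter_values0(__m128i values0,
                            uint8_t *y, uint8_t *cb, uint8_t *cr)
{
    U128 u;
    u.v = values0;              /* one 16-byte store, typically to the stack */

    y[0]          = u.a8[0];
    cb[2] = cb[3] = u.a8[4];
    y[3]          = u.a8[8];
    cr[4] = cr[5] = u.a8[12];
}
```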

猛虎独行 2024-10-04 19:00:22

You could try using a union to extract the bytes.

union
{
    __m128i value;          /* the 16-byte SSE result */
    unsigned char ch[16];   /* the same 16 bytes, addressable one at a time */
};

and then assign the bytes as needed.
Play around with the union idea; maybe replace the unsigned char array with an anonymous struct?
Maybe you can get some more ideas from here
