SSE2 intrinsics: accessing memory directly
Many SSE instructions allow the source operand to be a 16-byte-aligned memory address, for example the various (un)pack instructions. PUNPCKLBW has the following signature:
PUNPCKLBW xmm1, xmm2/m128
Now this doesn't seem to be possible at all with intrinsics. It looks like it's mandatory to use _mm_load* intrinsics to read anything in memory. This is the intrinsic for PUNPCKLBW:
__m128i _mm_unpacklo_epi8 (__m128i a, __m128i b);
(As far as I know, the __m128i type always refers to an XMM register.)
Now, why is this? It's rather sad since I see some optimization potential by addressing memory directly...
Comments (2)
The intrinsics correspond relatively directly to actual instructions, but compilers are not obligated to emit the corresponding instructions. Folding a load followed by an operation (even when written with intrinsics) into the memory-operand form of that operation is a common optimization, performed by all respectable compilers when it is advantageous to do so.
TLDR: write the load and the operation in intrinsics, and let the compiler optimize it.
Edit: trivial example:
Compiling with
gcc -Os -fomit-frame-pointer
gives:
See? The optimizer will sort it out.
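As an illustration of the "write the load and the operation" advice, here is a minimal sketch. It is not the example from the answer itself, and the helper name `unpack_lo_from_memory` is hypothetical. It uses only `_mm_load_si128` and `_mm_unpacklo_epi8` from `<emmintrin.h>`, and assumes the pointer is 16-byte aligned.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the original answer): interleave the low
// 8 bytes of `a` with the low 8 bytes of the 16-byte block at `mem`.
// Assumption: `mem` is 16-byte aligned, as _mm_load_si128 requires.
__m128i unpack_lo_from_memory(__m128i a, const std::uint8_t* mem) {
    assert(reinterpret_cast<std::uintptr_t>(mem) % 16 == 0);

    // Written as an explicit load followed by the operation. The compiler
    // is free to fold the load into a memory operand of PUNPCKLBW, or to
    // keep a separate load, whichever it considers better.
    __m128i b = _mm_load_si128(reinterpret_cast<const __m128i*>(mem));
    return _mm_unpacklo_epi8(a, b);
}
```

To see what a particular compiler actually does with this, you can inspect the assembly it generates (for example with `g++ -O2 -S`) and check whether the unpack instruction takes a memory operand. The result may vary with the compiler version and options.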
You can just use your memory values directly. For example:
The interesting part of the result:
So the compiler is doing a somewhat poor job (or perhaps this way is faster, or playing with the options would fix it), but it generates code that works, and the C++ code states what it wants fairly directly.
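One way to read "use your memory values directly" is to dereference a pointer to `__m128i` and pass the result straight to the intrinsic, with no explicit `_mm_load*` call. The sketch below is a guess at that approach, not the code from the answer. The name `unpack_lo_direct` is hypothetical, and the pointed-to data is assumed to be 16-byte aligned.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the original answer): the second operand
// is read by dereferencing the pointer, without an explicit load intrinsic.
// Assumption: `mem` points to 16-byte-aligned data.
__m128i unpack_lo_direct(__m128i a, const __m128i* mem) {
    assert(reinterpret_cast<std::uintptr_t>(mem) % 16 == 0);

    // Dereferencing a __m128i pointer is an aligned 16-byte read; whether
    // it becomes a memory operand or a separate load is up to the compiler.
    return _mm_unpacklo_epi8(a, *mem);
}
```

Semantically this matches the explicit-load form, so the choice between the two is mostly a matter of which reads more clearly in the surrounding code.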