How do I fill an x86 XMM register with four copies of one float taken from another XMM register?
I'm trying to implement some inline assembly (in C/C++ code) to take advantage of SSE. I'd like to copy a value (from an XMM register or from memory) and duplicate it across all four lanes of another XMM register. For example, suppose I have the values {1, 2, 3, 4} in memory. I'd like xmm1 to be populated with {1, 1, 1, 1}, xmm2 with {2, 2, 2, 2}, and so on.
Looking through the Intel reference manuals, I couldn't find an instruction that does this. Do I just need to combine repeated MOVSS with rotates (via PSHUFD?)?
Comments (3)
There are two ways:
Use shufps exclusively, with the same register as both source operands.
Let the compiler choose the best way using _mm_set1_ps and _mm_cvtss_f32.
Note that the second method produces horrible code on MSVC, as discussed here, and it only yields the 'xxxx' result, whereas the first can broadcast any lane by changing the shuffle mask.
As for inline assembly: it is highly unportable. Use intrinsics.
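The two approaches described above might look roughly like the sketch below. The helper names (`splat_x`, `splat_y`, `splat_z`, `splat_w`, `splat_x_set1`) are made up for illustration and are not from the original answer; only the intrinsics themselves come from `<xmmintrin.h>`.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_shuffle_ps, _mm_set1_ps, _mm_cvtss_f32 */

/* Method 1: shufps with the same vector as both sources.
   _MM_SHUFFLE(i, i, i, i) selects lane i for every output lane. */
static inline __m128 splat_x(__m128 v) { return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0)); }
static inline __m128 splat_y(__m128 v) { return _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)); }
static inline __m128 splat_z(__m128 v) { return _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2)); }
static inline __m128 splat_w(__m128 v) { return _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3)); }

/* Method 2: extract the lowest lane as a scalar and let the compiler
   decide how to broadcast it. This only reaches lane 0 ("xxxx"). */
static inline __m128 splat_x_set1(__m128 v)
{
    return _mm_set1_ps(_mm_cvtss_f32(v));
}
```

With `v = _mm_setr_ps(1, 2, 3, 4)`, `splat_y(v)` holds 2 in every lane, while `splat_x_set1(v)` and `splat_x(v)` both hold 1 in every lane.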
Move the source into the destination register. Then use 'shufps' with the new destination register as both operands and select the appropriate mask.
The following example broadcasts the value of XMM2.x to XMM0.xyzw:
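The answer's original listing did not survive on this page, so what follows is only a sketch of the move-then-shuffle idea, not the author's code. It assumes GCC or Clang targeting x86-64 with the default AT&T operand order (source first, destination last); the function name `broadcast_x_asm` is hypothetical, and the compiler picks the actual registers rather than XMM2 and XMM0.

```c
#include <xmmintrin.h>  /* __m128 */

/* Intel-syntax equivalent with the registers named in the text:
       movaps xmm0, xmm2      ; copy source to destination
       shufps xmm0, xmm0, 0   ; every lane takes lane 0 (x)
   Assumption: GCC/Clang extended asm, AT&T syntax, x86-64 target. */
static inline __m128 broadcast_x_asm(__m128 src)
{
    __m128 dst;
    __asm__ ("movaps %1, %0\n\t"      /* dst = src */
             "shufps $0x00, %0, %0"   /* dst = {x, x, x, x} */
             : "=x" (dst)             /* any XMM register for the output */
             : "x" (src));            /* any XMM register for the input */
    return dst;
}
```

Changing the immediate to 0x55, 0xAA, or 0xFF would broadcast the y, z, or w lane instead.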
If your values are 16-byte aligned in memory:
If not, you can do an unaligned load or four scalar loads. On newer platforms, the unaligned load should be faster; on older platforms, the scalar loads may win.
As others have noted, you can also use shufps.
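The three loading strategies mentioned above could be sketched with intrinsics as follows. The `splat4` struct and all function names are hypothetical and exist only for illustration; the answer's own listing is not available here, so this is one possible rendering rather than the original code.

```c
#include <xmmintrin.h>  /* _mm_load_ps, _mm_loadu_ps, _mm_load1_ps, _mm_shuffle_ps */

/* Hypothetical result type: one broadcast vector per input element. */
typedef struct { __m128 x, y, z, w; } splat4;

/* Broadcast each lane of v with one shuffle per lane. */
static inline splat4 splat4_from_vector(__m128 v)
{
    splat4 r;
    r.x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
    r.y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
    r.z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
    r.w = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));
    return r;
}

/* p must be 16-byte aligned; _mm_load_ps may fault otherwise. */
static inline splat4 splat4_aligned(const float *p)
{
    return splat4_from_vector(_mm_load_ps(p));
}

/* No alignment requirement on p. */
static inline splat4 splat4_unaligned(const float *p)
{
    return splat4_from_vector(_mm_loadu_ps(p));
}

/* Four scalar loads, each broadcast to all lanes; no alignment requirement. */
static inline splat4 splat4_scalar(const float *p)
{
    splat4 r;
    r.x = _mm_load1_ps(p + 0);
    r.y = _mm_load1_ps(p + 1);
    r.z = _mm_load1_ps(p + 2);
    r.w = _mm_load1_ps(p + 3);
    return r;
}
```

All three return the same four vectors for the same input; which one is fastest depends on the target CPU, as the answer notes.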