Confusion about the XMM register bitmap

Posted on 2024-12-18 07:57:15

Sorry I don't have a good title...

I was reading this thread: Vector Matrix Multiplication In SSE

The original poster had the following code:

// xmm0 = (v0,v1,v2,v3)
movups xmm0, [eax]

// xmm0 = (v0,v0,v0,v0)
// xmm1 = (v1,v1,v1,v1)
// xmm2 = (v2,v2,v2,v2)
// xmm3 = (v3,v3,v3,v3)
shufps xmm3, xmm0, 255
shufps xmm2, xmm0, 170
shufps xmm1, xmm0, 85
shufps xmm0, xmm0, 0

Someone replied with the following:

But what really happens according to the manual: (a, b, c, d) means a are bits 0 to 31, b are bits 32 to 63 and so on

// xmm0 = (v0,v1,v2,v3)
movups xmm0, [eax]

// xmm0 = (v0, v0, v0, v0)
shufps xmm0, xmm0, 0

This makes sense to me, since in a linear array model [elt0, elt1, elt2, ...], elt0 is Array[0].

What confuses me is that, according to the manual, the bitmap of an xmm register is drawn as [127...0] (see the picture below).

Like the original poster, I looked at that bitmap and assumed that the leftmost element of [elt0, elt1, elt2, elt3] corresponds to the selector bits "11".

So if I wanted xmm0 to contain only v0, I would write:

shufps xmm0, xmm0, 0xFF  // 11 11 11 11  === 0xFF

Which explanation is correct?

[Image: the manual's diagram of the xmm register bitmap, numbered 127 down to 0]

Comments (1)

思念绕指尖 2024-12-25 07:57:15

The confusion arises because bits in xmm registers (and in all other registers, by the way) are numbered right-to-left: the lowest bit is written on the right and the highest bit on the left:

xmm0 = [bit 127, bit 126, ..., bit 1, bit 0]

If you view the contents of an xmm register as four 32-bit dwords, they are also written right-to-left:

xmm0 = [dword 3, dword 2, dword 1, dword 0]

The source of this confusion is that if you have an array in memory

float A[4] = { 0.0f, 1.0f, 2.0f, 3.0f };

and you load this array into an xmm register, the elements appear in reversed order when the register is written in this notation, even though A[0] still occupies the lowest dword (bits 0 to 31):

; xmm0 = (A3 = 3.0f, A2 = 2.0f, A1 = 1.0f, A0 = 0.0f) after the load
movups xmm0, [A]
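The layout can be sketched in portable C. This is only an illustrative model, not real register access: `xmm_model`, `load_unaligned`, and `format_high_to_low` are hypothetical names, and the struct simply assumes that lane i stands for bits 32*i to 32*i+31.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of an xmm register: lane[i] stands for bits 32*i .. 32*i+31. */
typedef struct { float lane[4]; } xmm_model;

/* Models "movups xmm, [mem]": memory element i lands in lane i. */
static xmm_model load_unaligned(const float *mem) {
    xmm_model r;
    assert(mem != NULL);
    memcpy(r.lane, mem, sizeof r.lane);
    return r;
}

/* Writes the register the way the manual draws it: highest lane first. */
static void format_high_to_low(const xmm_model *r, char *buf, size_t n) {
    snprintf(buf, n, "(%.1f, %.1f, %.1f, %.1f)",
             r->lane[3], r->lane[2], r->lane[1], r->lane[0]);
}
```

Loading `{ 0.0f, 1.0f, 2.0f, 3.0f }` with this model leaves `lane[0] == 0.0f`, while `format_high_to_low` produces `(3.0, 2.0, 1.0, 0.0)`. Only the written order is reversed; the element positions are not.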

Therefore, the right way to copy the first element (dword 0) into all dwords of an xmm register is to use selector 0 in every field of the immediate:

shufps xmm0, xmm0, 0
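To see why an immediate of 0 selects v0 while 0xFF selects v3, here is a small portable C model of how SHUFPS decodes its 8-bit immediate. `shufps_model` is a hypothetical helper written for illustration; it works on plain float arrays indexed by dword number rather than on real registers.

```c
#include <assert.h>

/* Illustrative model of "shufps dst, src, imm8".
 * Arrays are indexed by dword number: index 0 is bits 0..31.
 * Each 2-bit field of imm8 is a dword index; the two low result
 * dwords are picked from dst and the two high ones from src. */
static void shufps_model(float dst[4], const float src[4], unsigned imm8) {
    float d[4], s[4];
    assert(imm8 <= 0xFF);
    for (int i = 0; i < 4; i++) {   /* copy first: dst and src may be the same array */
        d[i] = dst[i];
        s[i] = src[i];
    }
    dst[0] = d[(imm8 >> 0) & 3];
    dst[1] = d[(imm8 >> 2) & 3];
    dst[2] = s[(imm8 >> 4) & 3];
    dst[3] = s[(imm8 >> 6) & 3];
}
```

With `dst` and `src` both holding (v0, v1, v2, v3) in dword order, an immediate of 0 yields four copies of v0 and 0xFF yields four copies of v3. Because the two low result dwords are taken from the destination operand, a full broadcast only comes out when both operands hold the same vector.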

Also, if you want to load a single float and broadcast it into all elements of an xmm register, for performance reasons it is better to use

; MOVSS can be much faster than MOVUPS, and is never slower
; Load A[0] into low dword of xmm0
movss xmm0, [A]
; Copy low dword of xmm0 to all dwords of xmm0
shufps xmm0, xmm0, 0
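The same load-and-broadcast pattern can be written with SSE intrinsics in C. This is a sketch that assumes an x86 target with SSE enabled and the standard `<xmmintrin.h>` header; `broadcast_first` is a hypothetical wrapper name, and the compiler is free to emit different instructions than the ones named in the comments.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target with SSE */

/* Loads *p into the low dword and copies it to all four dwords of out. */
static void broadcast_first(const float *p, float out[4]) {
    assert(p != NULL);
    __m128 v = _mm_load_ss(p);     /* like movss: dword 0 = *p, dwords 1..3 = 0 */
    v = _mm_shuffle_ps(v, v, 0);   /* like shufps v, v, 0: dword 0 to all dwords */
    _mm_storeu_ps(out, v);         /* like movups: write the four dwords to memory */
}
```

Calling `broadcast_first(A, out)` fills `out` with four copies of `A[0]`. `_mm_set1_ps(*p)` expresses the same broadcast and leaves the choice of instructions to the compiler.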

The AVX instruction set (supported starting with Intel Sandy Bridge and AMD Bulldozer CPUs) has a dedicated instruction, vbroadcastss, which performs the load and the broadcast in one step:

; xmm0 = (A[0], A[0], A[0], A[0]) after execution of vbroadcastss
vbroadcastss xmm0, [A]

The SSE3 instruction set includes a similar instruction, MOVDDUP, which, however, works only for doubles:

const double B = 2.718281828459045;

; xmm0 = ( 2.718281828459045, 2.718281828459045 ) after execution of movddup
movddup xmm0, [B]