
PUNPCKLWD gives "invalid instruction operands" when used with an MMWORD PTR 64-bit memory operand

Posted on 2025-02-02 06:05:17


I'm currently working on some old assembly code, and MASM errors out on this line:

punpcklwd MM3, MMWORD PTR [8+EBP+ECX*2]

Gives me: error A2070: invalid instruction operands

But, this should be valid, right? The disassembled code from a compiled copy is basically identical to this.

Also, according to this PDF, this is how it's supposed to be written... https://www.intel.com/content/dam/develop/external/us/en/documents/mmx-app-mpeg1-audio-kernels-140701.pdf


无所的.畏惧 2025-02-09 06:05:17


The memory source operand is 32-bit DWORD, not MMWORD or QWORD.
See Intel's asm manual entry:

PUNPCKLWD mm, mm/m32                MMX
PUNPCKLWD xmm1, xmm2/m128           SSE2

Unfortunately, the same is not true for the XMM version: it does count as a 128-bit load, faulting if it extends into an unmapped page or is misaligned.

The Description section backs this up:

When the source data comes from a 128-bit memory operand, an implementation may fetch only the appropriate 64 bits; however, alignment to a 16-byte boundary and normal segment checking will still be enforced.

Legacy SSE versions 64-bit operand: The source operand can be an MMX technology register or a 32-bit memory location. The destination operand is an MMX technology register.

The 128-bit behaviour is one of many dumb design decisions in SSE1/SSE2. I wonder if Pentium 4 had limitations on store-forwarding or something that would have somehow made it less efficient in that first-gen implementation to be like a movq load. There is movhps xmm3, qword ptr [ecx] to load into the upper half to replace punpcklqdq, but you just need a separate movq for narrower interleaves.

The MMX behaviour of only taking an operand of the width it uses is the sensible one. I don't know why the Intel doc you linked uses MMWORD with it; maybe some assemblers accepted that at the time. It does make sense that current MASM rejects it, but that could have gone either way.
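To make the "operand of the width it uses" point concrete, here is a rough scalar model in C of what punpcklwd mm, m32 computes. The helper name model_punpcklwd_m32 is made up for illustration (it is not an intrinsic or any library function), and the lane layout follows the word interleave described in Intel's manual entry. Treat it as a sketch of the semantics, not of how hardware does it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of PUNPCKLWD mm, m32 (not a real intrinsic).
 * dst is the 64-bit MMX register value, src points at the memory operand.
 * Only 4 bytes are read from src, which is why the operand is m32. */
static uint64_t model_punpcklwd_m32(uint64_t dst, const void *src)
{
    assert(src != NULL);
    const unsigned char *p = (const unsigned char *)src;

    /* Little-endian 32-bit load, built from bytes so the model does not
     * depend on the host's byte order. */
    uint32_t s = (uint32_t)p[0]
               | ((uint32_t)p[1] << 8)
               | ((uint32_t)p[2] << 16)
               | ((uint32_t)p[3] << 24);

    uint64_t d0 = dst & 0xFFFF;          /* destination word 0 */
    uint64_t d1 = (dst >> 16) & 0xFFFF;  /* destination word 1 */
    uint64_t s0 = s & 0xFFFF;            /* source word 0 */
    uint64_t s1 = (s >> 16) & 0xFFFF;    /* source word 1 */

    /* Interleave: result words are d0, s0, d1, s1 from low to high. */
    return d0 | (s0 << 16) | (d1 << 32) | (s1 << 48);
}
```

Called with dst = 0x4444333322221111 and a 4-byte buffer holding the words 0xAAAA, 0xBBBB, this returns 0xBBBB2222AAAA1111. The upper two words of the destination (0x3333, 0x4444) and anything in memory past the first 4 bytes never contribute, which matches the m32 operand size.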

Do note that punpckHwd and so on want a register-width memory operand, I guess so it more closely matches the register source version, e.g. punpckhwd mm3, mm0 could be replaced with movq [esi], mm0 / punpckhwd mm3, [esi] and run the same, rather than needing [esi+4].

That also let them build HW that just feeds a 64-bit load to the shuffle unit, without needing a broadcast or shifted load to get the data at the right place for input to the ALU. Modern Intel load ports can do broadcast loads (e.g. movddup or vbroadcastss with a memory source run as a single uop for a load port, no ALU involved), but that's something much more recent than P5 Pentium.

Intel's manual entry for the high-half unpacks (PUNPCKHBW/PUNPCKHWD/PUNPCKHDQ) says: When the source data comes from a 64-bit memory operand, the full 64-bit operand is accessed from memory, but the instruction uses only the high-order 32 bits. When the source data comes from a 128-bit memory operand, an implementation may fetch only the appropriate 64 bits; however, alignment to a 16-byte boundary and normal segment checking will still be enforced.
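For contrast with the low-half form, the high-half behaviour can be modelled the same way in C. As before, model_punpckhwd_m64 and load_le64 are hypothetical helper names invented for this sketch. The model reads all 8 bytes but only uses the upper two words of the source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Little-endian 64-bit load built from bytes (host byte order independent). */
static uint64_t load_le64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; --i)
        v = (v << 8) | p[i];
    return v;
}

/* Hypothetical scalar model of PUNPCKHWD mm, m64 (not a real intrinsic).
 * The full 64-bit operand is loaded, but only its high 32 bits are used. */
static uint64_t model_punpckhwd_m64(uint64_t dst, const void *src)
{
    assert(src != NULL);
    uint64_t s = load_le64((const unsigned char *)src);

    uint64_t d2 = (dst >> 32) & 0xFFFF;  /* destination word 2 */
    uint64_t d3 = (dst >> 48) & 0xFFFF;  /* destination word 3 */
    uint64_t s2 = (s >> 32) & 0xFFFF;    /* source word 2 */
    uint64_t s3 = (s >> 48) & 0xFFFF;    /* source word 3 */

    /* Interleave: result words are d2, s2, d3, s3 from low to high. */
    return d2 | (s2 << 16) | (d3 << 32) | (s3 << 48);
}
```

With dst = 0x4444333322221111 and an 8-byte buffer holding the words 0xAAAA, 0xBBBB, 0xCCCC, 0xDDDD, this returns 0xDDDD4444CCCC3333. Because the model takes the address of the whole 64-bit value, a register stored with movq and then used as the memory source at the same address gives the same result as the register-source form, with no +4 offset needed.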

Omit the DWORD / MMWORD PTR entirely

And BTW, punpcklwd MM3, [8+EBP+ECX*2] should assemble just fine with most Intel-syntax assemblers, including MASM as well as NASM and GAS with .intel_syntax noprefix. The register destination (along with the mnemonic) implies the size of the memory operand.

GNU Binutils objdump -drwC -Mintel agrees with Intel's manual that it's a 32-bit memory operand. I assume MASM would want the same syntax.

 8049000:       0f 61 5c 4d 08          punpcklwd mm3,DWORD PTR [ebp+ecx*2+0x8]
 8049005:       66 0f 61 5c 4d 08       punpcklwd xmm3,XMMWORD PTR [ebp+ecx*2+0x8]
