add vs. mul (IA32 assembly)

Posted on 2024-09-18 20:55:04

I know that add is faster than mul.

I want to know how to use add instead of mul in the following code to make it more efficient.

Sample code:

            mov eax, [ebp + 8]              #eax = x1
            mov ecx, [ebp + 12]             #ecx = x2
            mov edx, [ebp + 16]             #edx = y1
            mov ebx, [ebp + 20]             #ebx = y2

            sub eax,ecx                     #eax = x1-x2
            sub edx,ebx                     #edx = y1-y2

            mul edx                         #eax = (x1-x2)*(y1-y2)

Comments (5)

如何视而不见 2024-09-25 20:55:04

add is faster than mul, but if you want to multiply two general values, mul is far faster than any loop iterating add operations.

You can't seriously use add to make that code go faster than it will with mul. If you needed to multiply by some small constant value (such as 2), then maybe you could use add to speed things up. But for the general case - no.
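
To make the cost of "a loop iterating add operations" concrete, here is a minimal sketch in C (used instead of assembly so the arithmetic can be checked easily). The function name and the `adds_used` counter are hypothetical, added only for illustration. The loop performs one addition per unit of the multiplier, so the work grows with the multiplier's value, while a single mul has a fixed cost.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply by repeated addition: y additions of x.
 * adds_used (optional, may be NULL) reports how many adds were executed. */
uint32_t mul_by_repeated_add(uint32_t x, uint32_t y, uint32_t *adds_used) {
    uint32_t product = 0;
    uint32_t count = 0;
    for (uint32_t i = 0; i < y; i++) {
        product += x;   /* wraps modulo 2^32, like the low half of mul */
        count++;
    }
    if (adds_used != NULL) {
        *adds_used = count;
    }
    return product;
}
```

Multiplying by 5 costs 5 adds here; multiplying by 1,000,000 costs a million, which is why this approach only makes sense for tiny constant multipliers.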

小嗷兮 2024-09-25 20:55:04

If you are multiplying two values that you don't know in advance, it is effectively impossible to beat the multiply instruction in x86 assembler.

If you know the value of one of the operands in advance, you may be able to beat the multiply instruction by using a small number of adds. This works particularly well when the known operand is small and has only a few bits set in its binary representation. To multiply an unknown value x by a known value of the form 2^p + 2^q + ... + 2^r (bits p, q, ... and r set, with p > q > ... > r), you simply add x*2^p + x*2^q + ... + x*2^r. This is easily accomplished in assembler by left shifting and adding:

;  x in EDX
;  product to EAX
xor  eax,eax
shl  edx,r ; x*2^r
add  eax,edx
shl  edx,q-r ; x*2^q
add  eax,edx
shl  edx,p-q ; x*2^p
add  eax,edx

The key problem is that this takes at least 4 clocks, assuming
a superscalar CPU constrained by register dependencies. Multiply typically takes
10 or fewer clocks on modern CPUs, so if this sequence takes longer than that
you might as well do a multiply.
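
As a rough illustration of the shift-and-add pattern above, here is a sketch in C (chosen instead of assembly so the results can be checked). The constants 10 and 100 and the function names are assumptions picked for the example; each statement mirrors one instruction of the template.

```c
#include <stdint.h>

/* 10 = 2^3 + 2^1, so r = 1 and q = 3 in the template above. */
uint32_t mul10(uint32_t x) {
    uint32_t product = 0;  /* xor eax,eax           */
    x <<= 1;               /* shl edx,1     ; x*2^1 */
    product += x;          /* add eax,edx           */
    x <<= 2;               /* shl edx,3-1   ; x*2^3 */
    product += x;          /* add eax,edx           */
    return product;
}

/* 100 = 2^6 + 2^5 + 2^2, so r = 2, q = 5, p = 6. */
uint32_t mul100(uint32_t x) {
    uint32_t product = 0;  /* xor eax,eax           */
    x <<= 2;               /* shl edx,2     ; x*2^2 */
    product += x;
    x <<= 3;               /* shl edx,5-2   ; x*2^5 */
    product += x;
    x <<= 1;               /* shl edx,6-5   ; x*2^6 */
    product += x;
    return product;
}
```

Every shift depends on the previous one, which is where the dependency-chain cost mentioned above comes from: more set bits in the constant means a longer chain.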

To multiply by 9:

mov  eax,edx ; same effect as xor eax,eax / shl edx,0 / add eax,edx
shl  edx,3 ; x*2^3
add  eax,edx

This beats multiply; should only take 2 clocks.

What is less well known is the use of the LEA (load effective address) instruction
to accomplish fast multiply-by-small-constant.
LEA takes only a single clock in the worst case, and its execution time can often
be overlapped with other instructions by superscalar CPUs.

LEA is essentially "add two values with small constant multipliers".
It computes t=2^k*x+y for k=0,1,2,3 (see the Intel reference manual), with t, x and y
being any register. If x==y (or y is omitted), you can get 1,2,3,4,5,8,9 times x,
but using x and y as separate registers allows intermediate results to be combined
and moved to other registers (e.g., to t), and this turns out to be remarkably handy.
Using it, you can accomplish a multiply by 9 using a single instruction:

lea  eax,[edx*8+edx]  ; takes 1 clock

Using LEA carefully, you can multiply by a variety of peculiar constants in a small number of cycles:

lea  eax,[edx*4+edx] ; 5 * edx
lea  eax,[eax*2+edx] ; 11 * edx
lea  eax,[eax*4] ; 44 * edx

To do this, you have to decompose your constant multiplier into various factors/sums involving
1,2,3,4,5,8 and 9. It is remarkable how many small constants you can do this for, and still
only use 3-4 instructions.
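
The LEA decompositions above can be modeled in C to check the arithmetic. This is only a sketch: the `lea` helper is a hypothetical stand-in for the instruction's address calculation (base + index * 2^k, with k from 0 to 3), not a real intrinsic.

```c
#include <stdint.h>

/* Models lea t,[index*2^k + base]; k must be 0..3 (scale 1, 2, 4 or 8).
 * Pass base = 0 for the form without a base register. */
static uint32_t lea(uint32_t base, uint32_t index, unsigned k) {
    return base + (index << k);
}

/* lea eax,[edx*8+edx] */
uint32_t mul9(uint32_t x) {
    return lea(x, x, 3);
}

/* The 44*x sequence: 44 = (5*2 + 1) * 4 */
uint32_t mul44(uint32_t x) {
    uint32_t t = lea(x, x, 2);   /* lea eax,[edx*4+edx] ; 5x  */
    t = lea(x, t, 1);            /* lea eax,[eax*2+edx] ; 11x */
    return lea(0, t, 2);         /* lea eax,[eax*4]     ; 44x */
}
```

Writing each step as one `lea` call makes it easy to see that the instruction count equals the number of factor or sum steps in the decomposition of the constant.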

If you allow the use of other typically single-clock instructions (e.g., SHL/SUB/NEG/MOV),
you can multiply by some constant values that pure LEA can't
handle as efficiently by itself. To multiply by 31:

lea  eax,[4*edx]
lea  eax,[8*eax]  ; 32*edx
sub  eax,edx; 31*edx ; 3 clocks

The corresponding LEA sequence is longer:

lea  eax,[edx*4+edx] ; 5*edx
lea  eax,[edx*2+eax] ; 7*edx
lea  eax,[eax*2+edx] ; 15*edx
lea  eax,[eax*2+edx] ; 31*edx ; 4 clocks
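
Both multiply-by-31 sequences can be checked against each other with a small C sketch (function names are hypothetical; each line corresponds to one instruction of the sequences above):

```c
#include <stdint.h>

/* LEA, LEA, SUB: 31x = 32x - x (3 instructions) */
uint32_t mul31_lea_sub(uint32_t x) {
    uint32_t t = x << 2;   /* lea eax,[4*edx]  ; 4x  */
    t <<= 3;               /* lea eax,[8*eax]  ; 32x */
    return t - x;          /* sub eax,edx      ; 31x */
}

/* Pure LEA: 5x -> 7x -> 15x -> 31x (4 instructions) */
uint32_t mul31_lea_only(uint32_t x) {
    uint32_t t = (x << 2) + x;   /* lea eax,[edx*4+edx] ; 5x  */
    t = (x << 1) + t;            /* lea eax,[edx*2+eax] ; 7x  */
    t = (t << 1) + x;            /* lea eax,[eax*2+edx] ; 15x */
    return (t << 1) + x;         /* lea eax,[eax*2+edx] ; 31x */
}
```

The two agree even when the product wraps around 2^32, because unsigned shifts, adds and subtracts all work modulo 2^32, just like the low half of a mul result.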

Figuring out these sequences is a bit tricky, but you can set up an organized attack.

Since LEA, SHL, SUB, NEG and MOV are all single-clock instructions in the worst
case, and zero clocks if they have no dependences on other instructions, you can compute the execution cost of any such sequence. This means you can implement a dynamic programming algorithm to generate the best possible sequence of such instructions.
This is only useful if the clock count is smaller than that of the integer multiply on your particular CPU
(I use 5 clocks as a rule of thumb), and it doesn't use up all the registers, or
at least it doesn't use up registers that are already busy (avoiding any spills).
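
To give a feel for how such a search might look, here is a heavily simplified sketch in C. It is not the PARLANSE algorithm: it swaps in a plain breadth-first search, counts instructions rather than clocks, ignores register pressure, keeps only one running value plus the original x, and caps intermediate multipliers at an arbitrary `LIMIT`. All names are hypothetical.

```c
#define LIMIT 4096   /* largest multiplier explored (arbitrary assumption) */

static int steps[LIMIT + 1];   /* steps[n] = instructions to form n*x, -1 = not reached */
static int queue[LIMIT + 1];   /* each multiplier is enqueued at most once */

/* Record n at the given depth if it is in range and not seen before. */
static void visit(int n, int depth, int *tail) {
    if (n >= 1 && n <= LIMIT && steps[n] < 0) {
        steps[n] = depth;
        queue[(*tail)++] = n;
    }
}

/* Breadth-first search from multiplier 1 (eax = edx = x). */
static void build_table(void) {
    static const int scale[] = {2, 4, 8};
    int head = 0, tail = 0;
    for (int i = 0; i <= LIMIT; i++) steps[i] = -1;
    steps[1] = 0;
    queue[tail++] = 1;
    while (head < tail) {
        int v = queue[head++];      /* eax currently holds v*x */
        int d = steps[v] + 1;
        visit(v + 1, d, &tail);     /* add eax,edx */
        visit(v - 1, d, &tail);     /* sub eax,edx */
        for (int i = 0; i < 3; i++) {
            int s = scale[i];
            visit(v * s, d, &tail);       /* lea eax,[eax*s]     */
            visit(v * s + 1, d, &tail);   /* lea eax,[eax*s+edx] */
            visit(v * s + v, d, &tail);   /* lea eax,[eax*s+eax] */
            visit(v + s, d, &tail);       /* lea eax,[edx*s+eax] */
        }
    }
}

/* Minimum instruction count for n*x under this model, or -1 if out of range. */
int min_steps(int n) {
    static int ready = 0;
    if (!ready) { build_table(); ready = 1; }
    return (n >= 1 && n <= LIMIT) ? steps[n] : -1;
}
```

Under this model `min_steps(9)` is 1, and `min_steps(44)` and `min_steps(31)` are both 3, matching the hand-written sequences above. A real generator would also have to record which instruction produced each value, weigh clocks instead of instruction counts, and compare the result against the CPU's multiply latency before choosing the sequence.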

I've actually built this into our PARLANSE compiler, and it is very effective for computing offsets into arrays of structures A[i], where the size of the structure element in A is the known constant. A clever person would possibly cache the answer so it doesn't
have to be recomputed each time a multiply by the same constant occurs; I didn't actually do that because
the time to generate such sequences is less than you'd expect.

It is mildly interesting to print out the sequences of instructions needed to multiply by all constants
from 1 to 10000. Most of them can be done in 5-6 instructions worst case.
As a consequence, the PARLANSE compiler hardly ever uses an actual multiply when indexing even the nastiest
arrays of nested structures.

鹿! 2024-09-25 20:55:04

Unless your multiplications are fairly simple, add most likely won't outperform mul. Having said that, you can use add to do multiplications:

Multiply by 2:
    add eax,eax          ; x2
Multiply by 4:
    add eax,eax          ; x2
    add eax,eax          ; x4
Multiply by 8:
    add eax,eax          ; x2
    add eax,eax          ; x4
    add eax,eax          ; x8

They work nicely for powers of two. I'm not saying they're faster. They were certainly necessary in the days before fancy multiplication instructions. That's from someone whose soul was forged in the hell-fires that were the MOS Technology 6502, Zilog Z80 and RCA 1802 :-)

You can even multiply by non-powers by simply storing interim results:

Multiply by 9:
    push ebx              ; preserve
    push eax              ; save for later
    add  eax,eax          ; x2
    add  eax,eax          ; x4
    add  eax,eax          ; x8
    pop  ebx              ; get original eax into ebx
    add  eax,ebx          ; x9
    pop  ebx              ; recover original ebx

I generally suggest that you write your code primarily for readability and only worry about performance when you need it. However, if you're working in assembler, you may well already be at that point. But I'm not sure my "solution" is really applicable to your situation since you have an arbitrary multiplicand.

You should, however, always profile your code in the target environment to ensure that what you're doing is actually faster. Assembler doesn't change that aspect of optimisation at all.


If you really want to see some more general purpose assembler for using add to do multiplication, here's a routine that will take two unsigned values in ax and bx and return the product in ax. It will not handle overflow elegantly.

START:  MOV    AX, 0007    ; Load up registers
        MOV    BX, 0005
        CALL   MULT        ; Call multiply function.
        HLT                ; Stop.

MULT:   PUSH   BX          ; Preserve BX, CX, DX.
        PUSH   CX
        PUSH   DX

        XOR    CX,CX       ; CX is the accumulator.

        CMP    BX, 0       ; If multiplying by zero, just stop.
        JZ     FIN

MORE:   PUSH   BX          ; Xfer BX to DX for bit check.
        POP    DX

        AND    DX, 0001    ; Is lowest bit 1?
        JZ     NOADD       ; No, do not add.
        ADD    CX,AX

NOADD:  SHL    AX,1        ; Shift AX left (double).
        SHR    BX,1        ; Shift BX right (integer halve, next bit).
        JNZ    MORE        ; Keep going until no more bits in BX.

FIN:    PUSH   CX          ; Xfer product from CX to AX.
        POP    AX

        POP    DX          ; Restore registers and return.
        POP    CX
        POP    BX
        RET

It relies on the fact that 123 multiplied by 456 is identical to:

    123 x 6
+  1230 x 5
+ 12300 x 4

which is the same way you were taught multiplication back in grade/primary school. It's easier with binary since you're only ever multiplying by zero or one (in other words, either adding or not adding).
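
For readers more comfortable with C, the same binary long multiplication can be sketched as follows. This is an illustrative translation of the MULT routine above, not a replacement for it; the function name is made up, and 16-bit types are used to match AX and BX, so the product wraps modulo 65536 just as the assembly version silently overflows.

```c
#include <stdint.h>

/* Shift-and-add multiply of two unsigned 16-bit values. */
uint16_t mult(uint16_t a, uint16_t b) {
    uint16_t acc = 0;                    /* XOR CX,CX: accumulator        */
    while (b != 0) {                     /* until no more bits in BX      */
        if (b & 1u) {                    /* AND DX,0001: lowest bit set?  */
            acc = (uint16_t)(acc + a);   /* ADD CX,AX                     */
        }
        a = (uint16_t)(a << 1);          /* SHL AX,1: double              */
        b >>= 1;                         /* SHR BX,1: next bit            */
    }
    return acc;
}
```

The loop runs once per significant bit of the second operand (at most 16 times), which is far better than adding in a loop once per unit, but still many instructions compared with a single mul.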

It's pretty old-school x86 (8086, from a DEBUG session - I can't believe they still actually include that thing in XP) since that was about the last time I coded directly in assembler. There's something to be said for high level languages :-)

作死小能手 2024-09-25 20:55:04

When it comes to assembly instructions, the speed of executing any instruction is measured in clock cycles. The mul instruction always takes more clock cycles than an add operation, but if you execute the same add instruction in a loop, the overall clock cycle count for multiplying with add will be far higher than for a single mul instruction. You can have a look at the following URLs, which list the clock cycles of the individual add and mul instructions, so you can do the math on which one will be faster.

http://home.comcast.net/~fbui/intel_a.html#add

http://home.comcast.net/~fbui/intel_m.html#mul

My recommendation is to use the mul instruction rather than putting add in a loop; the latter is a very inefficient solution.

仙女山的月亮 2024-09-25 20:55:04

I'd have to echo the responses you already have: for a general multiply you're best off using MUL. After all, that's what it's there for!

In some specific cases, where you know you'll be wanting to multiply by a specific fixed value each time (for example, in working out a pixel index in a bitmap) then you can consider breaking the multiply down into a (small) handful of SHLs and ADDs - e.g.:

1280 x 1024 display - each line on the display is 1280 pixels.

1280 = 1024 + 256 = 2^10 + 2^8

y * 1280 = y * (2 ^ 10) + y * (2 ^ 8)
= ADD (SHL y, 10), (SHL y, 8)

...given that graphics processing is likely to need to be speedy, such an approach may save you precious clock cycles.
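
A minimal C sketch of this decomposition, assuming a row-major layout with one index unit per pixel (the function names and the layout are illustrative assumptions, not part of any real graphics API):

```c
#include <stdint.h>

/* y * 1280 computed as y*2^10 + y*2^8, i.e. ADD (SHL y,10), (SHL y,8). */
uint32_t row_offset_1280(uint32_t y) {
    return (y << 10) + (y << 8);
}

/* Index of pixel (x, y) on a display that is 1280 pixels wide. */
uint32_t pixel_index(uint32_t x, uint32_t y) {
    return row_offset_1280(y) + x;
}
```

An optimizing compiler will usually pick a sequence like this on its own for `y * 1280`, so in C the plain multiplication is normally the better thing to write; the explicit shifts mainly matter when hand-coding the assembly.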
