Is multiplication and division using shift operators actually faster in C?

Multiplication and division can be achieved using bit operators, for example

i*2 = i<<1
i*3 = (i<<1) + i;
i*10 = (i<<3) + (i<<1)

and so on.

Is it actually faster to use say (i<<3)+(i<<1) to multiply with 10 than using i*10 directly? Is there any sort of input that can't be multiplied or divided in this way?

Comments (19)

不必在意 2024-11-22 16:07:28

Short answer: Not likely.

Long answer:
Your compiler has an optimizer in it that knows how to multiply as quickly as your target processor architecture is capable. Your best bet is to tell the compiler your intent clearly (i.e. i*2 rather than i << 1) and let it decide what the fastest assembly/machine code sequence is. It's even possible that the processor itself has implemented the multiply instruction as a sequence of shifts & adds in microcode.

Bottom line--don't spend a lot of time worrying about this. If you mean to shift, shift. If you mean to multiply, multiply. Do what is semantically clearest--your coworkers will thank you later. Or, more likely, curse you later if you do otherwise.
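
As a quick illustration (my own hedged sketch, not part of the original answer): with optimisation enabled, a modern compiler typically emits identical machine code for both of the following functions, which you can check for yourself with gcc -O2 -S or a tool like Compiler Explorer.

int double_mul(int i)   { return i * 2;  }   /* states the intent */
int double_shift(int i) { return i << 1; }   /* same value, less obvious */

If the two listings match on your target, the shift version buys you nothing except harder-to-read code.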

你在我安 2024-11-22 16:07:28

Just a concrete point of measure: many years back, I benchmarked two
versions of my hashing algorithm:

unsigned
hash( char const* s )
{
    unsigned h = 0;
    while ( *s != '\0' ) {
        h = 127 * h + (unsigned char)*s;
        ++ s;
    }
    return h;
}

and

unsigned
hash( char const* s )
{
    unsigned h = 0;
    while ( *s != '\0' ) {
        h = (h << 7) - h + (unsigned char)*s;
        ++ s;
    }
    return h;
}

On every machine I benchmarked it on, the first was at least as fast as
the second. Somewhat surprisingly, it was sometimes faster (e.g. on a
Sun Sparc). When the hardware didn't support fast multiplication (and
most didn't back then), the compiler would convert the multiplication
into the appropriate combinations of shifts and add/sub. And because it
knew the final goal, it could sometimes do so in fewer instructions than
when you explicitly wrote the shifts and the add/subs.

Note that this was something like 15 years ago. Hopefully, compilers
have only gotten better since then, so you can pretty much count on the
compiler doing the right thing, probably better than you could. (Also,
the reason the code looks so C'ish is because it was over 15 years ago.
I'd obviously use std::string and iterators today.)

深居我梦 2024-11-22 16:07:28

In addition to all the other good answers here, let me point out another reason to not use shift when you mean divide or multiply. I have never once seen someone introduce a bug by forgetting the relative precedence of multiplication and addition. I have seen bugs introduced when maintenance programmers forgot that "multiplying" via a shift is logically a multiplication but not syntactically of the same precedence as multiplication. x * 2 + z and x << 1 + z are very different!

If you're working on numbers then use arithmetic operators like + - * / %. If you're working on arrays of bits, use bit twiddling operators like & ^ | >> . Don't mix them; an expression that has both bit twiddling and arithmetic is a bug waiting to happen.
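
A minimal example of that precedence trap (my own illustration, not from the original answer): << binds more loosely than +, so the "shifted" version silently computes something else.

#include <stdio.h>

int main(void)
{
    int x = 3, z = 4;
    printf("%d\n", x * 2 + z);    /* (x * 2) + z  = 10 */
    printf("%d\n", x << 1 + z);   /* x << (1 + z) = 3 << 5 = 96 */
    return 0;
}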

裸钻 2024-11-22 16:07:28

This depends on the processor and the compiler. Some compilers already optimize code this way, others don't.
So you need to check each time your code needs to be optimized this way.

Unless you desperately need to optimize, I would not scramble my source code just to save an assembly instruction or processor cycle.

萤火眠眠 2024-11-22 16:07:28

Is it actually faster to use say (i<<3)+(i<<1) to multiply with 10 than using i*10 directly?

It might or might not be on your machine - if you care, measure in your real-world usage.

A case study - from 486 to core i7

Benchmarking is very difficult to do meaningfully, but we can look at a few facts. From http://www.penguin.cz/~literakl/intel/s.html#SAL and http://www.penguin.cz/~literakl/intel/i.html#IMUL we get an idea of the x86 clock cycles needed for arithmetic shift and multiplication. Say we stick to the "486" (the newest one listed), 32-bit registers and immediates: IMUL takes 13-42 cycles and IDIV 44. Each SAL takes 2 and each ADD takes 1, so even with a few of those together, shifting superficially looks like a winner.

These days, with the core i7:

(from http://software.intel.com/en-us/forums/showthread.php?t=61481)

The latency is 1 cycle for an integer addition and 3 cycles for an integer multiplication. You can find the latencies and throughput in Appendix C of the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", which is located on http://www.intel.com/products/processor/manuals/.

(from some Intel blurb)

Using SSE, the Core i7 can issue simultaneous add and multiply instructions, resulting in a peak rate of 8 floating-point operations (FLOP) per clock cycle

That gives you an idea of how far things have come. The optimisation trivia - like bit shifting versus * - that was taken seriously even into the 90s is just obsolete now. Bit-shifting is still faster, but for non-power-of-two mul/div, by the time you do all your shifts and add the results it's slower again. Then, more instructions means more cache faults, more potential issues in pipelining, and more use of temporary registers may mean more saving and restoring of register content from the stack... it quickly gets too complicated to quantify all the impacts definitively, but they're predominantly negative.

functionality in source code vs implementation

More generally, your question is tagged C and C++. As 3rd generation languages, they're specifically designed to hide the details of the underlying CPU instruction set. To satisfy their language Standards, they must support multiplication and shifting operations (and many others) even if the underlying hardware doesn't. In such cases, they must synthesize the required result using many other instructions. Similarly, they must provide software support for floating point operations if the CPU lacks it and there's no FPU. Modern CPUs all support * and <<, so this might seem absurdly theoretical and historical, but the significant thing is that the freedom to choose implementation goes both ways: even if the CPU has an instruction that implements the operation requested in the source code in the general case, the compiler's free to choose something else that it prefers because it's better for the specific case the compiler's faced with.

Examples (with a hypothetical assembly language)

source           literal approach         optimised approach
#define N 0
int x;           .word x                xor registerA, registerA
x *= N;          move x -> registerA
                 move x -> registerB
                 A = B * immediate(0)
                 store registerA -> x
  ...............do something more with x...............

Instructions like exclusive or (xor) have no relationship to the source code, but xor-ing anything with itself clears all the bits, so it can be used to set something to 0. Source code that implies memory addresses may not result in any memory actually being used.

These kinds of hacks have been used for as long as computers have been around. In the early days of 3GLs, to secure developer uptake the compiler output had to satisfy the existing hardcore hand-optimising assembly-language dev. community that the produced code wasn't slower, more verbose or otherwise worse. Compilers quickly adopted lots of great optimisations - they became a better centralised store of them than any individual assembly language programmer could possibly be, though there's always the chance that they miss a specific optimisation that happens to be crucial in a specific case - humans can sometimes nut it out and grope for something better, while compilers just do as they've been told until someone feeds that experience back into them.

So, even if shifting and adding is still faster on some particular hardware, then the compiler writer's likely to have worked out exactly when it's both safe and beneficial.

Maintainability

If your hardware changes you can recompile and it'll look at the target CPU and make another best choice, whereas you're unlikely to ever want to revisit your "optimisations" or list which compilation environments should use multiplication and which should shift. Think of all the non-power-of-two bit-shifted "optimisations" written 10+ years ago that are now slowing down the code they're in as it runs on modern processors...!

Thankfully, good compilers like GCC can typically replace a series of bitshifts and arithmetic with a direct multiplication when any optimisation is enabled (i.e. ...main(...) { return (argc << 4) + (argc << 2) + argc; } -> imull $21, 8(%ebp), %eax) so a recompilation may help even without fixing the code, but that's not guaranteed.

Strange bitshifting code implementing multiplication or division is far less expressive of what you were conceptually trying to achieve, so other developers will be confused by that, and a confused programmer's more likely to introduce bugs or remove something essential in an effort to restore seeming sanity. If you only do non-obvious things when they're really tangibly beneficial, and then document them well (but don't document other stuff that's intuitive anyway), everyone will be happier.

General solutions versus partial solutions

If you have some extra knowledge, such as that your int will really only be storing values x, y and z, then you may be able to work out some instructions that work for those values and get you your result more quickly than when the compiler doesn't have that insight and needs an implementation that works for all int values. For example, consider your question:

Multiplication and division can be achieved using bit operators...

You illustrate multiplication, but how about division?

int x;
x >> 1;   // divide by 2?

According to the C++ Standard 5.8:

-3- The value of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 divided by the quantity 2 raised to the power E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.

So, your bit shift has an implementation defined result when x is negative: it may not work the same way on different machines. But, / works far more predictably. (It may not be perfectly consistent either, as different machines may have different representations of negative numbers, and hence different ranges even when there are the same number of bits making up the representation.)

You may say "I don't care... that int is storing the age of the employee, it can never be negative". If you have that kind of special insight, then yes - your >> safe optimisation might be passed over by the compiler unless you explicitly do it in your code. But, it's risky and rarely useful as much of the time you won't have this kind of insight, and other programmers working on the same code won't know that you've bet the house on some unusual expectations of the data you'll be handling... what seems a totally safe change to them might backfire because of your "optimisation".

Is there any sort of input that can't be multiplied or divided in this way?

Yes... as mentioned above, negative numbers have implementation defined behaviour when "divided" by bit-shifting.

诗酒趁年少 2024-11-22 16:07:28

Just tried compiling this on my machine:

int a = ...;
int b = a * 10;

When disassembled, it produces this output:

MOV EAX,DWORD PTR SS:[ESP+1C] ; Move a into EAX
LEA EAX,DWORD PTR DS:[EAX+EAX*4] ; Multiply by 5 without shift !
SHL EAX, 1 ; Multiply by 2 using shift

This version is faster than your hand-optimized code with pure shifting and addition.

You really never know what the compiler is going to come up with, so it's better to simply write a normal multiplication and let it optimize the way it wants to, except in very precise cases where you know the compiler cannot optimize.

何处潇湘 2024-11-22 16:07:28

Shifting is generally a lot faster than multiplying at an instruction level, but you may well be wasting your time doing premature optimisations. The compiler may well perform these optimisations at compile time. Doing it yourself will affect readability and possibly have no effect on performance. It's probably only worth doing things like this if you have profiled and found this to be a bottleneck.

The division trick known as 'magic division' can actually yield huge payoffs. Again you should profile first to see if it's needed. But if you do use it, there are useful programs around to help you figure out what instructions are needed for the same division semantics. Here is an example: http://www.masm32.com/board/index.php?topic=12421.0

An example which I have lifted from the OP's thread on MASM32:

include ConstDiv.inc
...
mov eax,9999999
; divide eax by 100000
cdiv 100000
; edx = quotient

Would generate:

mov eax,9999999
mov edx,0A7C5AC47h
add eax,1
.if !CARRY?
    mul edx
.endif
shr edx,16
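
For readers who don't use MASM, here is a rough C sketch of the same idea (my own illustration, not from the linked thread). The constant 0xCCCCCCCD is ceil(2^35 / 10), so multiplying by it and shifting right by 35 gives exact unsigned division by 10 for every 32-bit value - essentially what an optimising compiler emits for x / 10 on its own; the checking loop is only there to demonstrate the identity.

#include <stdint.h>
#include <stdio.h>

/* Unsigned division by 10 via multiplication by the reciprocal. */
static uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

int main(void)
{
    for (uint32_t x = 0; x < 1000000u; ++x) {
        if (div10(x) != x / 10) {
            printf("mismatch at %u\n", x);
            return 1;
        }
    }
    printf("ok\n");
    return 0;
}
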
澜川若宁 2024-11-22 16:07:28

Shift and integer multiply instructions have similar performance on most modern CPUs - integer multiply instructions were relatively slow back in the 1980s but in general this is no longer true. Integer multiply instructions may have higher latency, so there may still be cases where a shift is preferable. Ditto for cases where you can keep more execution units busy (although this can cut both ways).

Integer division is still relatively slow though, so using a shift instead of division by a power of 2 is still a win, and most compilers will implement this as an optimisation. Note however that for this optimisation to be valid the dividend needs to be either unsigned or must be known to be positive. For a negative dividend the shift and divide are not equivalent!

#include <stdio.h>

int main(void)
{
    int i;

    for (i = 5; i >= -5; --i)
    {
        printf("%d / 2 = %d, %d >> 1 = %d\n", i, i / 2, i, i >> 1);
    }
    return 0;
}

Output:

5 / 2 = 2, 5 >> 1 = 2
4 / 2 = 2, 4 >> 1 = 2
3 / 2 = 1, 3 >> 1 = 1
2 / 2 = 1, 2 >> 1 = 1
1 / 2 = 0, 1 >> 1 = 0
0 / 2 = 0, 0 >> 1 = 0
-1 / 2 = 0, -1 >> 1 = -1
-2 / 2 = -1, -2 >> 1 = -1
-3 / 2 = -1, -3 >> 1 = -2
-4 / 2 = -2, -4 >> 1 = -2
-5 / 2 = -2, -5 >> 1 = -3

So if you want to help the compiler then make sure the variable or expression in the dividend is explicitly unsigned.
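
As a small sketch of what that means in practice (my illustration; the exact instruction sequences vary by compiler and target): for an unsigned dividend the compiler can use a bare shift, while a signed dividend needs an extra fix-up to preserve the round-toward-zero behaviour shown above.

unsigned udiv2(unsigned x) { return x / 2; }   /* typically a single shift */
int      sdiv2(int x)      { return x / 2; }   /* typically a shift plus a sign fix-up */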

套路撩心 2024-11-22 16:07:28

It completely depends on target device, language, purpose, etc.

Pixel crunching in a video card driver? Very likely, yes!

.NET business application for your department? Absolutely no reason to even look into it.

For a high performance game for a mobile device it might be worth looking into, but only after easier optimizations have been performed.

玩套路吗 2024-11-22 16:07:28

Don't do it unless you absolutely need to and your code's intent requires shifting rather than multiplication/division.

On a typical day you could potentially save a few machine cycles (or lose some, since the compiler knows better what to optimize), but the cost isn't worth it - you spend time on minor details rather than the actual job, maintaining the code becomes harder, and your co-workers will curse you.

You might need to do it for high-load computations, where each saved cycle means minutes of runtime. But you should optimize one place at a time and do performance tests each time to see whether you really made it faster or broke the compiler's logic.

伴随着你 2024-11-22 16:07:28

As far as I know, on some machines multiplication can need up to 16 to 32 machine cycles. So yes, depending on the machine type, bitshift operators are faster than multiplication/division.

However, certain machines do have a math processor with special instructions for multiplication/division.

梦途 2024-11-22 16:07:28

In the case of signed integers and right shift vs division, it can make a difference. For negative numbers, the shift rounds towards negative infinity whereas division rounds towards zero. Of course the compiler will change the division to something cheaper, but it will usually change it to something that has the same rounding behavior as division, because it is either unable to prove that the variable won't be negative or it simply doesn't care.
So if you can prove that a number won't be negative or if you don't care which way it will round, you can do that optimization in a way that is more likely to make a difference.
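
One way to act on that knowledge (a hypothetical sketch of my own; half_fast assumes the caller never passes a negative value) is to make the non-negativity visible to the compiler, for example by routing the division through an unsigned type:

/* Valid only when i >= 0; the unsigned division is a plain shift. */
int half_fast(int i)
{
    return (int)((unsigned)i / 2);
}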

一腔孤↑勇 2024-11-22 16:07:28

I agree with the marked answer by Drew Hall. The answer could use some additional notes though.

For the vast majority of software developers the processor and compiler are no longer relevant to the question. Most of us are far beyond the 8088 and MS-DOS. It is perhaps only relevant for those who are still developing for embedded processors...

At my software company, math (add/sub/mul/div) should be used for all mathematics, while shift should be used when converting between data types, e.g. ushort to byte as n>>8 and not n/256.
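
For example (my own sketch in C, using uint16_t/uint8_t for ushort/byte): when splitting a 16-bit value into bytes, the shift/mask form says "I am working with bits", whereas /256 and %256 would dress the same bit manipulation up as arithmetic.

#include <stdint.h>

uint8_t high_byte(uint16_t n) { return (uint8_t)(n >> 8);   }
uint8_t low_byte (uint16_t n) { return (uint8_t)(n & 0xFF); }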

许你一世情深 2024-11-22 16:07:28

Python test performing the same multiplication 10 million times against the same random numbers.

>>> from timeit import timeit
>>> setup_str = 'import scipy; from scipy import random; scipy.random.seed(0)'
>>> N = 10*1000*1000
>>> timeit('x=random.randint(65536);', setup=setup_str, number=N)
1.894096851348877 # Time from generating the random #s and no opperati

>>> timeit('x=random.randint(65536); x*2', setup=setup_str, number=N)
2.2799630165100098
>>> timeit('x=random.randint(65536); x << 1', setup=setup_str, number=N)
2.2616429328918457

>>> timeit('x=random.randint(65536); x*10', setup=setup_str, number=N)
2.2799630165100098
>>> timeit('x=random.randint(65536); (x << 3) + (x<<1)', setup=setup_str, number=N)
2.9485139846801758

>>> timeit('x=random.randint(65536); x // 2', setup=setup_str, number=N)
2.490908145904541
>>> timeit('x=random.randint(65536); x / 2', setup=setup_str, number=N)
2.4757170677185059
>>> timeit('x=random.randint(65536); x >> 1', setup=setup_str, number=N)
2.2316000461578369

So in doing a shift rather than multiplication/division by a power of two in Python, there's a slight improvement (~10% for division; ~1% for multiplication). If it's a non-power of two, there's likely a considerable slowdown.

Again, these numbers will change depending on your processor and your compiler (or interpreter -- done in Python for simplicity).

As with everyone else, don't prematurely optimize. Write very readable code, profile if it's not fast enough, and then try to optimize the slow parts. Remember, your compiler is much better at optimization than you are.

梦幻的心爱 2024-11-22 16:07:28

There are optimizations the compiler can't do because they only work for a reduced set of inputs.

Below is C++ sample code that does a faster division using a 64-bit "multiplication by the reciprocal". Both numerator and denominator must be below a certain threshold. Note that it must be compiled to use 64-bit instructions to actually be faster than normal division.

#include <stdio.h>
#include <chrono>

static const unsigned s_bc = 32;
static const unsigned long long s_p = 1ULL << s_bc;
static const unsigned long long s_hp = s_p / 2;

static unsigned long long s_f;
static unsigned long long s_fr;

static void fastDivInitialize(const unsigned d)
{
    s_f = s_p / d;
    s_fr = s_f * (s_p - (s_f * d));
}

static unsigned fastDiv(const unsigned n)
{
    return (s_f * n + ((s_fr * n + s_hp) >> s_bc)) >> s_bc;
}

static bool fastDivCheck(const unsigned n, const unsigned d)
{
    // 32 to 64 cycles latency on modern cpus
    const unsigned expected = n / d;

    // At least 10 cycles latency on modern cpus
    const unsigned result = fastDiv(n);

    if (result != expected)
    {
        printf("Failed for: %u/%u != %u\n", n, d, expected);
        return false;
    }

    return true;
}

int main()
{
    unsigned result = 0;

    // Make sure to verify it works for your expected set of inputs
    const unsigned MAX_N = 65535;
    const unsigned MAX_D = 40000;

    const double ONE_SECOND_COUNT = 1000000000.0;

    auto t0 = std::chrono::steady_clock::now();
    unsigned count = 0;
    printf("Verifying...\n");
    for (unsigned d = 1; d <= MAX_D; ++d)
    {
        fastDivInitialize(d);
        for (unsigned n = 0; n <= MAX_N; ++n)
        {
            count += !fastDivCheck(n, d);
        }
    }
    auto t1 = std::chrono::steady_clock::now();
    printf("Errors: %u / %u (%.4fs)\n", count, MAX_D * (MAX_N + 1), (t1 - t0).count() / ONE_SECOND_COUNT);

    t0 = t1;
    for (unsigned d = 1; d <= MAX_D; ++d)
    {
        fastDivInitialize(d);
        for (unsigned n = 0; n <= MAX_N; ++n)
        {
            result += fastDiv(n);
        }
    }
    t1 = std::chrono::steady_clock::now();
    printf("Fast division time: %.4fs\n", (t1 - t0).count() / ONE_SECOND_COUNT);

    t0 = t1;
    count = 0;
    for (unsigned d = 1; d <= MAX_D; ++d)
    {
        for (unsigned n = 0; n <= MAX_N; ++n)
        {
            result += n / d;
        }
    }
    t1 = std::chrono::steady_clock::now();
    printf("Normal division time: %.4fs\n", (t1 - t0).count() / ONE_SECOND_COUNT);

    getchar();
    return result;
}
灼疼热情 2024-11-22 16:07:28

I think that in the one case where you want to multiply or divide by a power of two, you can't go wrong with using bitshift operators, even if the compiler converts them to a MUL/DIV, because some processors microcode (really, a macro) them anyway, so for those cases you will achieve an improvement, especially if the shift is more than 1. Or more explicitly, if the CPU has no bitshift operators, it will be a MUL/DIV anyway, but if the CPU has bitshift operators, you avoid a microcode branch and this is a few instructions fewer.

I am writing some code right now that requires a lot of doubling/halving operations because it is working on a dense binary tree, and there is one more operation that I suspect might be more optimal than an addition - a left (power-of-two multiply) shift with an addition. This can be replaced with a left shift and an xor if the shift is wider than the number of bits you want to add; an easy example is (i<<1)^1, which adds one to a doubled value. This does not of course apply to a right shift (power-of-two divide) because only a left (little endian) shift fills the gap with zeros.

In my code, these multiply/divide by two and powers of two operations are very intensively used and because the formulae are quite short already, each instruction that can be eliminated can be a substantial gain. If the processor does not support these bitshift operators, no gain will happen but neither will there be a loss.

Also, in the algorithms I am writing, they visually represent the movements that occur, so in that sense they are in fact more clear. The left-hand side of a binary tree is bigger, and the right is smaller. As well as that, in my code, odd and even numbers have a special significance: all left-hand children in the tree are odd, and all right-hand children, and the root, are even. In some cases, which I haven't encountered yet but may, x&1 may be a more optimal operation compared to x%2: x&1 on an even number will produce zero, but will produce 1 for an odd number.

Going a bit further than just odd/even identification, if I get zero for x&3 I know that 4 is a factor of our number, and the same for x&7 for 8, and so on. I know that these cases probably have limited utility, but it's nice to know that you can avoid a modulus operation and use a bitwise logic operation instead, because bitwise operations are almost always the fastest, and least likely to be ambiguous to the compiler.

I am pretty much inventing the field of dense binary trees, so I expect that people may not grasp the value of this comment, as very rarely do people want to perform factorisations on only powers of two, or only multiply/divide by powers of two.
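
To make the even/odd and divisibility remarks concrete, here is a small sketch of my own (using unsigned values, since for negative signed operands & and % disagree):

#include <assert.h>

void mask_examples(unsigned x, unsigned i)
{
    assert(((x & 1) == 0) == (x % 2 == 0));   /* even test       */
    assert(((x & 3) == 0) == (x % 4 == 0));   /* divisible by 4  */
    assert(((x & 7) == 0) == (x % 8 == 0));   /* divisible by 8  */
    assert(((i << 1) ^ 1) == (i << 1) + 1);   /* xor sets the low bit of a doubled value, same as adding 1 */
}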

月寒剑心 2024-11-22 16:07:28

Whether it is actually faster depends on the hardware and compiler actually used.

染火枫林 2024-11-22 16:07:28

If you compare the output for the x+x, x*2 and x<<1 syntax with the gcc compiler, you get the same result in x86 assembly: https://godbolt.org/z/JLpp0j

        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], edi
        mov     eax, DWORD PTR [rbp-4]
        add     eax, eax
        pop     rbp
        ret

So you can consider gcc smart enough to determine its own best solution independently of what you typed.
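
For reference, a plausible reconstruction of the compared functions (the contents of the godbolt link are an assumption on my part, not reproduced here):

int by_add(int x)   { return x + x;  }
int by_mul(int x)   { return x * 2;  }
int by_shift(int x) { return x << 1; }

All three spell "double x", and gcc lowers each of them to the same add eax, eax sequence shown above.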

听闻余生 2024-11-22 16:07:28

I too wanted to see if I could beat the house. This is a more general bitwise approach for any-number-by-any-number multiplication. The macros I made are roughly 25% to 100% slower than a normal * multiplication. As said by others, if the multiplier is close to a power of two, or made up of only a few power-of-two terms, you might win: X*23, made up of (X<<4)+(X<<2)+(X<<1)+X, is going to be slower than X*65, made up of (X<<6)+X.

#include <stdio.h>
#include <stdlib.h>   /* for rand() and srand() */
#include <time.h>

#define MULTIPLYINTBYMINUS(X,Y) (-((X >> 30) & 1)&(Y<<30))+(-((X >> 29) & 1)&(Y<<29))+(-((X >> 28) & 1)&(Y<<28))+(-((X >> 27) & 1)&(Y<<27))+(-((X >> 26) & 1)&(Y<<26))+(-((X >> 25) & 1)&(Y<<25))+(-((X >> 24) & 1)&(Y<<24))+(-((X >> 23) & 1)&(Y<<23))+(-((X >> 22) & 1)&(Y<<22))+(-((X >> 21) & 1)&(Y<<21))+(-((X >> 20) & 1)&(Y<<20))+(-((X >> 19) & 1)&(Y<<19))+(-((X >> 18) & 1)&(Y<<18))+(-((X >> 17) & 1)&(Y<<17))+(-((X >> 16) & 1)&(Y<<16))+(-((X >> 15) & 1)&(Y<<15))+(-((X >> 14) & 1)&(Y<<14))+(-((X >> 13) & 1)&(Y<<13))+(-((X >> 12) & 1)&(Y<<12))+(-((X >> 11) & 1)&(Y<<11))+(-((X >> 10) & 1)&(Y<<10))+(-((X >> 9) & 1)&(Y<<9))+(-((X >> 8) & 1)&(Y<<8))+(-((X >> 7) & 1)&(Y<<7))+(-((X >> 6) & 1)&(Y<<6))+(-((X >> 5) & 1)&(Y<<5))+(-((X >> 4) & 1)&(Y<<4))+(-((X >> 3) & 1)&(Y<<3))+(-((X >> 2) & 1)&(Y<<2))+(-((X >> 1) & 1)&(Y<<1))+(-((X >> 0) & 1)&(Y<<0))
#define MULTIPLYINTBYSHIFT(X,Y) (((((X >> 30) & 1)<<31)>>31)&(Y<<30))+(((((X >> 29) & 1)<<31)>>31)&(Y<<29))+(((((X >> 28) & 1)<<31)>>31)&(Y<<28))+(((((X >> 27) & 1)<<31)>>31)&(Y<<27))+(((((X >> 26) & 1)<<31)>>31)&(Y<<26))+(((((X >> 25) & 1)<<31)>>31)&(Y<<25))+(((((X >> 24) & 1)<<31)>>31)&(Y<<24))+(((((X >> 23) & 1)<<31)>>31)&(Y<<23))+(((((X >> 22) & 1)<<31)>>31)&(Y<<22))+(((((X >> 21) & 1)<<31)>>31)&(Y<<21))+(((((X >> 20) & 1)<<31)>>31)&(Y<<20))+(((((X >> 19) & 1)<<31)>>31)&(Y<<19))+(((((X >> 18) & 1)<<31)>>31)&(Y<<18))+(((((X >> 17) & 1)<<31)>>31)&(Y<<17))+(((((X >> 16) & 1)<<31)>>31)&(Y<<16))+(((((X >> 15) & 1)<<31)>>31)&(Y<<15))+(((((X >> 14) & 1)<<31)>>31)&(Y<<14))+(((((X >> 13) & 1)<<31)>>31)&(Y<<13))+(((((X >> 12) & 1)<<31)>>31)&(Y<<12))+(((((X >> 11) & 1)<<31)>>31)&(Y<<11))+(((((X >> 10) & 1)<<31)>>31)&(Y<<10))+(((((X >> 9) & 1)<<31)>>31)&(Y<<9))+(((((X >> 8) & 1)<<31)>>31)&(Y<<8))+(((((X >> 7) & 1)<<31)>>31)&(Y<<7))+(((((X >> 6) & 1)<<31)>>31)&(Y<<6))+(((((X >> 5) & 1)<<31)>>31)&(Y<<5))+(((((X >> 4) & 1)<<31)>>31)&(Y<<4))+(((((X >> 3) & 1)<<31)>>31)&(Y<<3))+(((((X >> 2) & 1)<<31)>>31)&(Y<<2))+(((((X >> 1) & 1)<<31)>>31)&(Y<<1))+(((((X >> 0) & 1)<<31)>>31)&(Y<<0))
int main()
{
    int randomnumber=23;
    int randomnumber2=23;
    int checknum=23;
    clock_t start, diff;
    srand(time(0));
    start = clock();
    for(int i=0;i<1000000;i++)
    {
        randomnumber = rand() % 10000;
        randomnumber2 = rand() % 10000;
        checknum=MULTIPLYINTBYMINUS(randomnumber,randomnumber2);
        if (checknum!=randomnumber*randomnumber2)
        {
            printf("s %i and %i and %i",checknum,randomnumber,randomnumber2);
        }
    }
    diff = clock() - start;
    int msec = diff * 1000 / CLOCKS_PER_SEC;
    printf("MULTIPLYINTBYMINUS Time %d milliseconds", msec);
    start = clock();
    for(int i=0;i<1000000;i++)
    {
        randomnumber = rand() % 10000;
        randomnumber2 = rand() % 10000;
        checknum=MULTIPLYINTBYSHIFT(randomnumber,randomnumber2);
        if (checknum!=randomnumber*randomnumber2)
        {
            printf("s %i and %i and %i",checknum,randomnumber,randomnumber2);
        }
    }
    diff = clock() - start;
    msec = diff * 1000 / CLOCKS_PER_SEC;
    printf("MULTIPLYINTBYSHIFT Time %d milliseconds", msec);
    start = clock();
    for(int i=0;i<1000000;i++)
    {
        randomnumber = rand() % 10000;
        randomnumber2 = rand() % 10000;
        checknum= randomnumber*randomnumber2;
        if (checknum!=randomnumber*randomnumber2)
        {
            printf("s %i and %i and %i",checknum,randomnumber,randomnumber2);
        }
    }
    diff = clock() - start;
    msec = diff * 1000 / CLOCKS_PER_SEC;
    printf("normal * Time %d milliseconds", msec);
    return 0;
}