AND 比整数模运算更快?
可以将:
- i % m
重新表达为:
- i & (m-1)
其中,
- i 是无符号整数
- m 是 2 的幂
我的问题是:AND 运算更快吗?现代 CPU 不支持单指令硬件中的整数模吗?我对 ARM 感兴趣,但在其指令集中没有看到模运算。
It is possible to re-express:
- i % m
as:
- i & (m-1)
where,
- i is an unsigned integer
- m is a power of 2
My question is: is the AND operation any faster? Don't modern CPUs support integer modulo in hardware in a single instruction? I'm interested in ARM, but don't see the modulo operation in its instruction set.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
如今它比“单一指令”更复杂。现代 CPU 是复杂的野兽,需要将指令分解为发出/执行/延迟。它通常还取决于除法/模数的宽度 - 涉及多少位。
无论如何,我不知道 32 位除法在任何内核(无论是否为 ARM)上都会出现单周期延迟。在“现代”ARM 上有整数除法指令,但仅在某些实现上,尤其是在最常见的实现上 - Cortex A8 和 A9 上没有。
在某些情况下,编译器可以省去将除法/模数转换为位移位/掩码运算的麻烦。但是,只有当该值在编译时已知时,这才有可能。在您的情况下,如果编译器可以肯定地看到“m”始终是二的幂,那么它会将其优化为位操作,但如果它是传递给函数的变量(或否则计算),那么它不能,并且将诉诸完全除法/模数。这种代码构造通常有效(但并非总是有效 - 取决于优化器的智能程度):
技巧是让编译器知道“page_size”是 2 的幂。我知道 gcc 和变体会对此进行特殊处理,但我不确定其他编译器。
作为任何核心(无论是否为 ARM)(甚至 x86)的经验法则,优先选择移位/掩码而不是除法/取模, 特别是对于不是编译时常量的任何内容。即使您的核心有硬件鸿沟,手动执行也会更快。
(此外,有符号除法必须向 0 截断,并且 div / 余数必须能够产生负数,因此即使
x % 4
也比x & 3
更昂贵对于有符号的int x
。)It's more complicated than "single instruction" these days. Modern CPUs are complex beasts and need their instructions broken down into issue/execute/latency. It also usually depends on the width of the divide/modulo - how many bits are involved.
In any case, I'm not aware of 32 bit division being single cycle latency on any core, ARM or not. On "modern" ARM there are integer divide instructions, but only on some implementations, and most notably not on the most common ones - Cortex A8 and A9.
In some cases, the compiler can save you the trouble of converting a divide/modulo into bit shift/mask operations. However, this is only possible if the value is known at compile time. In your case, if the compiler can see for sure that 'm' is always a power a two, then it'll optimize it to bit ops, but if it's a variable passed into a function (or otherwise computed), then it can't, and will resort to a full divide/modulo. This kind of code construction often works (but not always - depends how smart your optimizer is):
The trick is to let the compiler know that the "page_size" is a power of two. I know that gcc and variants will special-case this, but I'm not sure about other compilers.
As a rule of thumb for any core - ARM or not (even x86), prefer bit shift/mask to divide/modulo, especially for anything that isn't a compile-time constant. Even if your core has hardware divide, it'll be faster to do it manually.
(Also, signed division has to truncate towards 0, and div / remainder have be able to produce negative numbers, so even
x % 4
is more expensive thanx & 3
for signedint x
.)您可能对 Embedded Live:ARM Cortex-M 架构嵌入式程序员指南Cortex-M 架构。
ARM Cortex-M 系列具有无符号和有符号除法指令 UDIV 和 SDIV,需要 2 到 12 个周期。没有MOD指令,但通过{S,U}DIV后跟乘减指令MLS获得等效结果,需要2个周期,总共4-14个周期。
AND 指令是单周期的,因此速度快 4-14 倍。
You may be interested in Embedded Live: Embedded Programmers' Guide to ARM’s Cortex-M Architecture.
The ARM Cortex-M family has unsigned and singed division instructions, UDIV and SDIV, which take 2 to 12 cycles. There is no MOD instruction, but equivalent result is obtained by a {S,U}DIV followed by the multiply-and-subtract instruction MLS, which takes 2 cycles, for a total of 4-14 cycles.
The AND instruction is single cycle, therefore 4-14x faster.
ARM 非常通用。有很多不同的 ARM,并且有些 ARM 没有除法指令(正如 Ray Toal 已经提到的,模数通常作为除法实现的附加结果来实现)。因此,如果您不想调用非常慢的除法子例程,则逻辑运算要快得多(正如 cyco130 提到的,任何好的编译器都会自行识别它并自行生成逻辑运算 - 因此为了程序代码的清晰性我会留在除法(除非你编写汇编程序,那么你当然必须自己编写它,然后你应该进行逻辑运算)。
ARM is very generic. There are lot of different ARMs and there are ARMs which do NOT have a division instruction (as Ray Toal already mentioned, modulo is usually implemented as additional result of the division implementation). So if you dont want to call a very slow division subroutine, the logical operation is much faster (and as cyco130 mentioned, any good compiler would recognize it on its own and generate the logical operation on its own - so for clearness of the program code I would stay with the division (except you program assembler, then you have of course to program it yourself, and then you should take the logical operation).
如果 m 在编译时已知(或者甚至不知道),则可以使用神奇的“乘法逆元”乘法来重新表达整数除法和模数。除法的结果在高 32 位,余数(模数)在低 32 位:
http ://www.hackersdelight.org/magic.htm
以下链接声称它是标准编译器强度降低:
http://www.flounder.com/multiplicative_inverse.htm
If m is known at compile time (or even it it isn't) integer division and modulo can be re-expressed using multiplication by a magic "multiplicative inverse." The result of the division ends up in the high 32 bits and the remainder (modulus) in the lower 32 bits:
http://www.hackersdelight.org/magic.htm
The following link claims that it is a standard compiler strength reduction:
http://www.flounder.com/multiplicative_inverse.htm
如果您使用的是启用了优化的不错的 C 编译器,它已经将其优化为更快的速度,这种技术称为“强度缩减”。如果您正在进行手写汇编,唯一可靠的测试方法就是对其进行基准测试。但请注意,即使同一处理器的不同型号也可能会产生不同的结果。
If you are using a decent C compiler with optimizations enabled, it will already optimize this to whatever is faster, a technique called "strength reduction". If you're doing hand-written assembly, only sure way to test is to benchmark it. But beware, even different models of the same processor could give different results.
根据http://www.coranac.com/tonc/text/asm.htm、ARM没有除法指令。如果这是真的,那么我也不希望它有 MOD 指令。
According to http://www.coranac.com/tonc/text/asm.htm, the ARM has no division instruction. If that's true, then I wouldn't expect it to have a
MOD
instruction, either.