Is AND faster than integer modulo?

Posted 2024-12-08 18:52:28

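It is possible to re-express:

i % m

as:

i & (m-1)

where

i is an unsigned integer
m is a power of 2

My question is: is the AND operation any faster? Do modern CPUs not support integer modulo in hardware as a single instruction? I'm interested in ARM, but I don't see a modulo operation in its instruction set.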
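For reference, the identity in the question can be sketched in C as below. The helper names `is_pow2` and `mod_pow2` are made up for illustration and are not from the thread; the mask form is only valid under the two stated conditions (unsigned `i`, power-of-two `m`).

```c
#include <assert.h>
#include <stdbool.h>

/* True when m is a nonzero power of two (exactly one bit set). */
bool is_pow2(unsigned m) {
    return m != 0 && (m & (m - 1)) == 0;
}

/* Hypothetical helper: computes i % m with a mask.
   Only valid for unsigned i and power-of-two m. */
unsigned mod_pow2(unsigned i, unsigned m) {
    assert(is_pow2(m));   /* the identity does not hold otherwise */
    return i & (m - 1);
}
```

For example, `mod_pow2(13u, 8u)` and `13u % 8u` both give 5, because `m - 1` is 7 (binary 111) and the mask keeps exactly the low three bits.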

梦初启 2024-12-15 18:52:28

It's more complicated than "single instruction" these days. Modern CPUs are complex beasts, and the cost of an instruction has to be broken down into issue, execute, and latency. It also usually depends on the width of the divide/modulo - how many bits are involved.

In any case, I'm not aware of 32 bit division being single cycle latency on any core, ARM or not. On "modern" ARM there are integer divide instructions, but only on some implementations, and most notably not on the most common ones - Cortex A8 and A9.

In some cases, the compiler can save you the trouble of converting a divide/modulo into bit shift/mask operations. However, this is only possible if the value is known at compile time. In your case, if the compiler can see for sure that 'm' is always a power of two, then it'll optimize it to bit ops, but if it's a variable passed into a function (or otherwise computed), then it can't, and will resort to a full divide/modulo. This kind of code construction often works (but not always - it depends on how smart your optimizer is):

unsigned page_size_bits = 12;     // optimization works even without const here

unsigned foo(unsigned address) {
  unsigned page_size = 1U << page_size_bits;
  return address / page_size;
}

The trick is to let the compiler know that the "page_size" is a power of two. I know that gcc and variants will special-case this, but I'm not sure about other compilers.
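The same construction applies to the modulo case the question asks about. The sketch below is illustrative only: `page_shift` and the function names are hypothetical, and whether a particular compiler actually reduces the `/` and `%` versions to the shift/mask versions is something to confirm by inspecting the generated assembly.

```c
#include <assert.h>

/* Hypothetical global, playing the role of page_size_bits above. */
unsigned page_shift = 12;

/* Written with / and %: the divisor is visibly 1U << n, so the
   optimizer has a chance to recognize a power of two. */
unsigned page_number(unsigned address) {
    unsigned page_size = 1U << page_shift;
    return address / page_size;
}

unsigned page_offset(unsigned address) {
    unsigned page_size = 1U << page_shift;
    return address % page_size;
}

/* The hand-written shift/mask forms that give the same results. */
unsigned page_number_shift(unsigned address) {
    return address >> page_shift;
}

unsigned page_offset_mask(unsigned address) {
    return address & ((1U << page_shift) - 1U);
}

/* Checks that both spellings agree for one address. */
void check_page_math(unsigned address) {
    assert(page_number(address) == page_number_shift(address));
    assert(page_offset(address) == page_offset_mask(address));
}
```

With `page_shift` at 12, address `0x12345` splits into page number `0x12` and offset `0x345` either way.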

As a rule of thumb for any core - ARM or not (even x86), prefer bit shift/mask to divide/modulo, especially for anything that isn't a compile-time constant. Even if your core has hardware divide, it'll be faster to do it manually.

(Also, signed division has to truncate towards 0, and div / remainder have to be able to produce negative numbers, so even x % 4 is more expensive than x & 3 for signed int x.)
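A small sketch of why the signed case costs more. It assumes C99 or later (where `%` truncates toward zero) and a two's-complement `int`; the function names are hypothetical. The fix-up in `rem4_via_mask` is one possible way to recover the `%` result from the mask, not necessarily what any given compiler emits.

```c
#include <assert.h>

/* Signed remainder: the result takes the sign of x (C99 and later). */
int rem4(int x) {
    return x % 4;
}

/* Plain mask: always yields 0..3, even for negative x
   (assuming two's-complement representation). */
int mask4(int x) {
    return x & 3;
}

/* Mask plus the extra work needed to match x % 4 for negative x. */
int rem4_via_mask(int x) {
    int r = x & 3;
    if (x < 0 && r != 0) {
        r -= 4;           /* move the result into -3..-1 */
    }
    assert(r == x % 4);   /* sanity check against the real operator */
    return r;
}
```

For `x = -5`, `rem4` returns -1 while `mask4` returns 3, so the two are not interchangeable for signed values; the extra compare and subtract is the overhead the answer refers to.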

冧九 2024-12-15 18:52:28

You may be interested in Embedded Live: Embedded Programmers' Guide to ARM's Cortex-M Architecture.

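The ARM Cortex-M family (the ARMv7-M cores such as the Cortex-M3 and M4) has unsigned and signed divide instructions, UDIV and SDIV, which take 2 to 12 cycles. There is no MOD instruction, but an equivalent result is obtained with a {S,U}DIV followed by the multiply-and-subtract instruction MLS, which takes 2 cycles, for a total of 4-14 cycles.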

The AND instruction is single-cycle, so it is 4-14 times faster.
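The divide-then-multiply-subtract sequence corresponds to the arithmetic below, written in C as a rough illustration. The function name is hypothetical, and the comments mapping each line to UDIV and MLS describe the intended correspondence, not guaranteed compiler output.

```c
#include <assert.h>

/* Remainder without a MOD instruction: r = n - (n / d) * d. */
unsigned umod_div_mls(unsigned n, unsigned d) {
    assert(d != 0);          /* division by zero is undefined in C */
    unsigned q = n / d;      /* corresponds to UDIV */
    return n - q * d;        /* corresponds to MLS (multiply and subtract) */
}
```

For instance, `umod_div_mls(100u, 7u)` computes `q = 14` and returns `100 - 14 * 7 = 2`.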

瑕疵 2024-12-15 18:52:28

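ARM is very generic. There are a lot of different ARMs, and some of them do not have a division instruction (as Ray Toal already mentioned, modulo is usually produced as an additional result of the division implementation). So if you don't want to call a very slow division subroutine, the logical operation is much faster. As cyco130 mentioned, any good compiler will recognize the pattern and generate the logical operation on its own, so for the sake of code clarity I would stay with the division. The exception is assembler, where you have to write it yourself anyway, and then you should use the logical operation.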
