n multiplications vs. n/2 × (1 multiplication + 2 additions): which is better?
I have a C program that does n multiplications (a single multiplication over n iterations), and I found another approach that does n/2 iterations of (1 multiplication + 2 additions). I know both are O(n) in complexity, but in terms of CPU cycles, which is faster?
2 Answers
Test on your computer. Or, look at the specs for your processor and guess.
The old logic no longer applies: on modern processors, an integer multiplication can be very cheap; on some newish Intel processors it takes 3 clock cycles, while an addition takes 1 cycle on those same processors. However, in a modern pipelined processor, the stalls created by data dependencies might cause additions to take longer.
My guess is that N additions + N/2 multiplications is slower than N multiplications if you are doing a fold type operation, and I would guess the reverse for a map type operation. But this is only a guess.
Test if you want the truth.
However: Most algorithms this simple are memory-bound, and both will be the same speed.
First of all, follow Dietrich Epp's first advice: measuring is (at least for complex optimization problems) the only way to be sure.
Now if you want to figure out why one is faster than the other, we can try. There are two different important performance measures: latency and reciprocal throughput. A short summary of the two: latency is the number of clock cycles from when an instruction's inputs are ready until its result is ready; reciprocal throughput is the average number of cycles per instruction when many independent instructions of the same kind execute back to back.
For Sandy Bridge, the reciprocal throughput of an add r, r/i (where r = register, i = immediate, m = memory) is 0.33, while its latency is 1. An imul r, r has a latency of 3 and a reciprocal throughput of 1. So as you see, it completely depends on your specific algorithm: if you can just replace one imul with two independent adds, this particular part of your algorithm could get a theoretical speedup of 50% (and in the best case obviously a speedup of ~350%). But on the other hand, if your adds introduce a problematic dependency, one imul could be just as fast as one add.
Also note that we've ignored all the additional complications like memory and cache behavior (things which will generally have a much, MUCH larger influence on the execution time) or intricate stuff like µop fusion and whatnot. In general the only people that should care about this stuff are compiler writers - it's much simpler to just measure the result of their efforts ;)
Anyway, if you want a good listing of this stuff, see here (the above description of latency and reciprocal throughput is also from that particular document).