n multiplications vs. n/2 × (1 multiplication + 2 additions): which is better?
I have a C program that does n multiplications (a single multiplication over n iterations), and I found another approach that does n/2 iterations of (1 multiplication + 2 additions). I know both are O(n) in complexity, but in terms of CPU cycles, which is faster?
2 Answers
Test on your computer. Or, look at the specs for your processor and guess.
The old logic no longer applies: on modern processors, an integer multiplication can be very cheap; on some newish Intel processors it takes 3 clock cycles, while an addition takes 1 cycle on those same processors. However, in a modern pipelined processor, the stalls created by data dependencies might cause additions to take longer.
My guess is that N additions + N/2 multiplications is slower than N multiplications if you are doing a fold type operation, and I would guess the reverse for a map type operation. But this is only a guess.
Test if you want the truth.
However: Most algorithms this simple are memory-bound, and both will be the same speed.
First of all, follow Dietrich Epp's first advice: measuring is (at least for complex optimization problems) the only way to be sure.
Now if you want to figure out why one is faster than the other, we can try. There are two different important performance measures: latency and reciprocal throughput. A short summary of the two: latency is the number of clock cycles from when an instruction's inputs are ready until its result is ready; reciprocal throughput is the average number of cycles per instruction when many independent instructions of the same kind execute back to back.
For Sandy Bridge, the reciprocal throughput of an add r, r/i (where r = register, i = immediate, m = memory) is 0.33, while its latency is 1. An imul r, r has a latency of 3 and a reciprocal throughput of 1. So as you see, it completely depends on your specific algorithm: if you can just replace one imul with two independent adds, this particular part of your algorithm could get a theoretical speedup of 50% (and in the best case obviously a speedup of ~350%). But on the other hand, if your adds introduce a problematic dependency, one imul could be just as fast as one add.
Also note that we've ignored all the additional complications like memory and cache behavior (things which will generally have a much, MUCH larger influence on the execution time) or intricate stuff like µop fusion and whatnot. In general the only people that should care about this stuff are compiler writers - it's much simpler to just measure the result of their efforts ;)
Anyway, if you want a good listing of this stuff, see here (the above description of latency and reciprocal throughput is also from that particular document).