Performance of operator | compared to operator +
Is there any major difference between | and + that would affect the code's performance in the long run, or are both O(1)? The code I am working with is something like this:

uint64_t dostuff(uint64_t a, uint64_t b) {
    // the max values of the inputs are 2^32 - 1
    // lots of stuff involving boolean operators
    // that have no way of being substituted by
    // arithmetic operators
    return (a << 32) + b;
    // or
    return (a << 32) | b;
}

The code will be used many times, so I want to speed it up as much as possible.
Answers (8)
No performance difference on any modern computer.
The two operators have different meanings, though. If the bit is already set, | will do nothing, but + will clear that bit and all the following non-zero bits and set the next zero bit to 1.

Both are certainly O(1), since O(1) means a constant. They are probably not the same constant; Big-O notation is meant to describe asymptotic behavior independent of constants.
Oh yes, one more thing: always profile before you optimize. You'll find out very quickly that time isn't being spent where you think. Always!
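To make that carry behaviour concrete, here is a minimal sketch (the values are arbitrary, chosen only for the example) showing that | leaves an already-set bit alone while + clears it and ripples a carry into the next zero bit:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t x = 0x0F;  /* binary 0000 1111 */

    /* OR-ing in bit 0, which is already set, changes nothing. */
    printf("x | 1 = 0x%llx\n", (unsigned long long)(x | 1));  /* prints 0xf  */

    /* Adding 1 clears bit 0 and the following set bits, then sets
       the next zero bit: 0000 1111 + 1 = 0001 0000.               */
    printf("x + 1 = 0x%llx\n", (unsigned long long)(x + 1));  /* prints 0x10 */

    return 0;
}

In the question's code the low 32 bits of (a << 32) are all zero and b fits in 32 bits, so no carry can occur and the two forms produce the same value.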
Use |. + can only add to the operation time, for obvious reasons.
Both are a single instruction. As for electronic propagation times, I have no idea which one is faster.
You can test for speed yourself, I guess, but seeing as the difference will probably be linear in the number of calls (if it is detectable at all) and affected by noisy factors, it may be a bit difficult.
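If you do want to measure it anyway, a minimal benchmark sketch might look like the one below. The iteration count is an arbitrary assumption, and the volatile sink is only there to keep the compiler from optimizing the work away; on a modern out-of-order CPU the two timings will almost certainly be indistinguishable from noise.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t combine_add(uint64_t a, uint64_t b) { return (a << 32) + b; }
static uint64_t combine_or (uint64_t a, uint64_t b) { return (a << 32) | b; }

int main(void) {
    volatile uint64_t sink = 0;           /* prevents the loops from being removed */
    const uint64_t iters = 100000000ULL;  /* arbitrary iteration count */

    clock_t t0 = clock();
    for (uint64_t i = 0; i < iters; ++i)
        sink += combine_add(i & 0xFFFFFFFFu, ~i & 0xFFFFFFFFu);
    clock_t t1 = clock();
    for (uint64_t i = 0; i < iters; ++i)
        sink += combine_or(i & 0xFFFFFFFFu, ~i & 0xFFFFFFFFu);
    clock_t t2 = clock();

    printf("+ : %f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("| : %f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}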
The best answer here is not trying to predict which one is better, but to benchmark it or check the assembly code. I would guess that both will be optimized to the same kind of instruction, and in any case the number of CPU cycles taken by both is likely to be equal.
But I strongly suggest that you check the ASM and benchmark both solutions.
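As a sketch of how one might check, put the two candidates side by side in a file (the file and function names here are made up for the example):

#include <stdint.h>

uint64_t dostuff_add(uint64_t a, uint64_t b) { return (a << 32) + b; }
uint64_t dostuff_or (uint64_t a, uint64_t b) { return (a << 32) | b; }

Then compile with something like gcc -O2 -S dostuff.c and compare the two functions in the generated dostuff.s. On x86-64 at -O2 both variants are typically compiled to a shift followed by a single add/or (or one lea), which is why the answers above expect no measurable difference.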
If there's any advantage, it's going to be in favor of the or. In reality, however, there's unlikely to be any difference on any reasonably modern CPU (or even anything but a really ancient one).
Basically, an or just sets the bit, and that's all. One two-input OR gate is all that's needed, so you get exactly one gate of propagation delay.
An adder is a bit more complex: computing the current bit requires a three-input XOR, and an XOR is normally composed of two levels of gates. In addition, it generates a carry that has to be used as an input to the adder for the next bit. A "ripple carry adder" therefore needs as many carry-propagation stages as there are bits being added. There are cleverer ways of handling the problem, where the carries are handled separately from the rest of the addition so you get a lower propagation delay, but in the worst case even these don't help.
Most of that only matters if you're designing a CPU yourself, though. If you're using a typical CPU, the gates in the functional units run fast enough that it can/will do a full add in one clock cycle. Some reasonably recent CPUs can even do two adds per clock cycle in a single functional unit.
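To make the ripple-carry idea concrete, here is a purely illustrative bit-by-bit sketch (not how the hardware is actually built) showing that OR combines each bit position independently, while addition has to thread a carry from each bit position into the next:

#include <stdint.h>
#include <stdio.h>

/* Bitwise OR: every output bit depends only on the same bit of a and b. */
static uint64_t or_bits(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 64; ++i) {
        uint64_t bit = ((a >> i) & 1) | ((b >> i) & 1);
        r |= bit << i;
    }
    return r;
}

/* Ripple-carry addition: each output bit also depends on the carry
   produced by the bit below it, so the carry "ripples" upward.     */
static uint64_t ripple_add(uint64_t a, uint64_t b) {
    uint64_t r = 0, carry = 0;
    for (int i = 0; i < 64; ++i) {
        uint64_t x = (a >> i) & 1, y = (b >> i) & 1;
        uint64_t sum = x ^ y ^ carry;                 /* three-input XOR */
        carry = (x & y) | (x & carry) | (y & carry);  /* carry-out       */
        r |= sum << i;
    }
    return r;
}

int main(void) {
    uint64_t a = 0x00000001FFFFFFFFULL, b = 5;
    printf("or_bits   : 0x%016llx\n", (unsigned long long)or_bits(a, b));
    printf("ripple_add: 0x%016llx\n", (unsigned long long)ripple_add(a, b));
    printf("a + b     : 0x%016llx\n", (unsigned long long)(a + b));
    return 0;
}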
The | and + are different mathematical operations. Given the same operands, a | b and a + b will in general yield different answers (for example, 1 | 1 is 1, while 1 + 1 is 2).
Technically, the | operation is faster since it only uses OR gates inside the processor, while the addition operation requires more gates.
The performance gained by using | over + is usually swamped by the time required to fetch data into and out of the processor. In other words, the net performance gain is negligible. (The time difference is usually in the range of nanoseconds.)
However, the difference in maintenance cost between the two forms can be much greater. When arithmetic is needed rather than bit twiddling (or vice versa), tracking down the resulting runtime error can be painful.
Use the proper operator for the proper purpose, and give the testing and maintenance groups a break. This kind of micro-optimization is not worthwhile.
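As a sketch of the kind of maintenance bug being hinted at (the names pack_or, pack_add, id and payload are made up for the example): the two forms agree only as long as the low operand really fits in 32 bits, and once that precondition is violated, + silently carries into the high field while | does not.

#include <stdint.h>
#include <stdio.h>

/* Pack a 32-bit id into the high half and a payload into the low half. */
static uint64_t pack_or (uint64_t id, uint64_t payload) { return (id << 32) | payload; }
static uint64_t pack_add(uint64_t id, uint64_t payload) { return (id << 32) + payload; }

int main(void) {
    /* As long as the payload fits in 32 bits, both forms agree. */
    printf("ok : or=0x%016llx add=0x%016llx\n",
           (unsigned long long)pack_or (7, 0xFFFFFFFFu),
           (unsigned long long)pack_add(7, 0xFFFFFFFFu));

    /* If the payload ever has bits above 31, + carries into the id
       field (id becomes 8), while | merely ORs into it -- two
       different wrong answers, and the + one is harder to spot.    */
    uint64_t bad = 0x100000001ULL;
    printf("bad: or=0x%016llx add=0x%016llx\n",
           (unsigned long long)pack_or (7, bad),
           (unsigned long long)pack_add(7, bad));
    return 0;
}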
This is platform-specific (and likely compiler-specific). On the SPUs of the PS3, dynamic ORs are quite expensive if I remember correctly. I'm not sure of the numbers, but I think the operation ends up being split into multiple operations, causing the cost to expand to several instructions. On x86/x64 or most modern CISC architectures, it is quite likely that either one is just a single instruction and very unlikely to cause any pipeline stalls or other costly behavior.
Edit:
The reason for the cost is that the Cell processor only has one general-purpose register, which means it can't load both variables into standard registers and perform the optimization. Instead the values have to be loaded into the AltiVec register set where the operation is done, and the result then has to be fetched from the AltiVec registers into the GPR through a mask in order to retrieve it.
If you are pushing these operations onto a PS3, or onto the GPU of any modern computer, you might want to look into how those processors behave. GPUs might have similar issues, since they are also RISC processors dedicated to SIMD operations.