Optimization headache: removing the ifs from a lookup table

I'm trying to optimize the following code, which is the bottleneck in my application. What it does: it takes the double values value1 and value2 and tries to find the maximum, including a correction factor. If the difference between the two values is greater than 5.0 (the LUT is scaled by a factor of 10), I can just take the maximum of the two values. If the difference is smaller than 5.0, I use the correction factor from the LUT.

Does anyone know a better approach for this code? I have no idea where I'm losing the time: is it all the ifs, or the multiplication by 10?
double value1, value2;
// Lookup table for ln(1 + exp(-abs(x))), indexed by x * 10; the function
// is almost 0 for x > 5 and symmetrical around 0.
// LUT[0] is x = 0.0, LUT[40] is x = 4.0.
const logValue LUT[50] = { ... };

if (value1 > value2)
{
    if (value1 - value2 >= 5.0)
    {
        return value1;
    }
    else
    {
        return value1 + LUT[(uint8)((value1 - value2) * 10)];
    }
}
else
{
    if (value2 - value1 >= 5.0)
    {
        return value2;
    }
    else
    {
        return value2 + LUT[(uint8)((value2 - value1) * 10)];
    }
}
A couple of minutes of playing with Excel produces an approximation equation that looks like it might have the accuracy you need, so you can do away with the lookup table altogether. You still need one condition to make sure the parameters to the equation remain within the range that it was optimized for.
P.S. I'm not guaranteeing this is faster, only that it's a different enough approach to be worth benchmarking.
P.P.S. It's possible to change the constraints to come up with slightly different constants; there are a lot of variations. Here's another set I did where the difference between your table and the formula will always be less than 0.008, and each value will be less than the one preceding it.
Edit: I tested this code (second formula) with 100 passes against a million random numbers between 0 and 10, along with the original code from the question, MSalters' currently accepted answer, and a brute-force implementation of max(value1, value2) + log(1.0 + exp(-abs(value1 - value2))). I tried it on a dual-core AMD Athlon and an Intel quad-core i7, and the results were roughly consistent. Here's a typical run:

Processors have gotten unbelievably fast over the years, so fast now that they can do a couple of floating-point multiplies and divides faster than they can look up a value in memory. Not only is this approach faster on a modern x86, it's also more accurate; the approximation errors in the equation are much smaller than the step errors caused by truncating the input for the lookup.
Naturally results can still vary based on your processor and compiler; benchmarking is still required for your own particular target.
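For reference when reproducing such a benchmark, the brute-force expression quoted above can be written down directly. This sketch is mine, not the answer's benchmark code, and it makes a useful correctness baseline for any approximation:

```cpp
#include <algorithm>
#include <cmath>

// Brute-force reference from the text: max(value1, value2) plus the
// exact correction term ln(1 + exp(-|value1 - value2|)).
double smooth_max_exact(double value1, double value2)
{
    return std::max(value1, value2)
         + std::log(1.0 + std::exp(-std::fabs(value1 - value2)));
}
```

Any faster variant can be checked against this function over the input range before trusting the benchmark numbers.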
It probably goes down both paths equally often, enough that you're causing a lot of pipelining problems for your processor.

Have you tried profiling?

I'd also suggest trying the standard library and seeing if that helps (for example, whether it's able to use any processor-specific instructions):
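The answer doesn't show which library calls it had in mind; one plausible reading (an assumption on my part, including the name smooth_max) uses std::fmax and std::log1p:

```cpp
#include <cmath>

// Hypothetical standard-library formulation: std::fmax often compiles to
// a single branchless instruction (e.g. maxsd on x86), and std::log1p
// evaluates ln(1 + x) accurately even when x is tiny.
double smooth_max(double value1, double value2)
{
    return std::fmax(value1, value2)
         + std::log1p(std::exp(-std::fabs(value1 - value2)));
}
```

Note this version has no branches at all, which sidesteps the pipelining concern entirely at the cost of an exp and a log1p per call.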
I'd probably have written the code a bit differently to handle the value2 < value1 case:

I am going to assume that when the function is called, you'll most likely hit the part where you have to use the lookup table rather than the >= 5.0 branches. In that case, it's better to guide the compiler towards this. Try it and let me know if it improves your program's performance.
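The answer's code block isn't reproduced above; a sketch consistent with its description (my reconstruction: compute the difference once, and test the common LUT case first so the compiler lays it out as the fall-through path) might look like:

```cpp
#include <cmath>

// Sketch only: 'lut' holds ln(1 + exp(-x)) sampled at steps of 0.1,
// as in the question. The expected common case (diff < 5.0) is tested
// first so it becomes the straight-line path.
double smooth_max_lut(double value1, double value2, const double lut[50])
{
    double big  = value1;
    double diff = value1 - value2;
    if (diff < 0.0) { big = value2; diff = -diff; }
    if (diff < 5.0)                    // expected common case
        return big + lut[(int)(diff * 10.0)];
    return big;
}
```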
The only reason this code would be the bottleneck in your application is that you are calling it many, many times. Are you sure you need to? Perhaps the algorithm higher up in the code can be changed to use the comparison less often?
I've done some very quick tests, but please profile the code yourself to verify the effect.

Changing the LUT[] to a static variable got me a 600% speed-up (from 3.5 s to 0.6 s), which is close to the absolute minimum of the test I used (0.4 s). See if that works, then re-profile to determine whether any further optimization is needed.

For reference, I was simply timing the execution of this loop (100 million iterations of the inner loop) in VC++ 2010:
You are computing value1 - value2 quite a few times in your function. Just do it once.

That cast to a uint8_t may also be problematic. As far as performance is concerned, the best integral type to use for a conversion from double to integer is an int, and the best integral type to use as an array index is also an int.

Note that the above guarantees that the LUT index will be between 0 (inclusive) and 50 (exclusive). There's no need for a uint8_t here.

Edit
After some playing around with some variations, this is a fairly fast LUT-based approximation to log(exp(value1) + exp(value2)):

The integral type IndexType is one of the keys to speeding things up. I tested with clang and g++, and both indicated that casting to an intptr_t (long on my computer) and using an intptr_t as an index into the LUT is faster than other integral types. It is considerably faster than some; for example, unsigned long long and uint8_t are incredibly poor choices on my computer.

The type is not just a hint, at least with the compilers I used. Those compilers did exactly what the code told them to do with regard to the conversion from floating-point type to integral type, regardless of the optimization level.
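The answer's code wasn't preserved above; a sketch matching its description (my reconstruction: one subtraction, an intptr_t index, and an integer comparison against 50 rather than a floating-point one against 5.0) might be:

```cpp
#include <cmath>
#include <cstdint>

// Reconstruction sketch, not the answer's actual code. IndexType is the
// integral type the answer found fastest for the double-to-index cast.
typedef std::intptr_t IndexType;

double smooth_max_lut(double value1, double value2, const double lut[50])
{
    double big  = value1 > value2 ? value1 : value2;
    double diff = value1 > value2 ? value1 - value2 : value2 - value1;
    IndexType idx = static_cast<IndexType>(diff * 10.0);
    return idx < 50 ? big + lut[idx] : big;   // integer compare vs. 50
}
```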
Another speed bump results from comparing the integral type to 50 as opposed to comparing the floating point type to 5.0.
One last speed bump: not all compilers are created equal. On my computer (YMMV), g++ -O3 generates considerably slower code (25% slower on this problem!) than clang -O3 does, which in turn generates code that is a bit slower than that generated by clang -O4.

I also played with a rational-function approximation approach (similar to Mark Ransom's answer), but the above obviously does not use such an approach.