Optimization headache: removing the ifs from a lookup table

I'm trying to optimize the following code, which is the bottleneck in my application. What it does: it takes the double values value1 and value2 and tries to find the maximum, including a correction factor. If the difference between the two values is greater than 5.0 (the LUT is scaled by a factor of 10), I can just take the maximum of the two values. If the difference is smaller than 5.0, I use the correction factor from the LUT.

Does anyone know a better approach for this code? I have no idea where I'm losing the time: is it all the ifs, or the multiplication by 10?
double value1, value2;
// Lookup table for ln(1 + exp(-abs(x))), indexed by x * 10; the function
// is almost 0 for x > 5 and symmetrical around 0.
// LUT[0] is x = 0.0, LUT[40] is x = 4.0.
const logValue LUT[50] = { ... };

if (value1 > value2)
{
    if (value1 - value2 >= 5.0)
    {
        return value1;
    }
    else
    {
        return value1 + LUT[(uint8)((value1 - value2) * 10)];
    }
}
else
{
    if (value2 - value1 >= 5.0)
    {
        return value2;
    }
    else
    {
        return value2 + LUT[(uint8)((value2 - value1) * 10)];
    }
}
A couple of minutes of playing with Excel produces an approximation equation that looks like it might have the accuracy you need, so you can do away with the lookup table altogether. You still need one condition to make sure the parameters to the equation remain within the range that it was optimized for.
P.S. I'm not guaranteeing this is faster, only that it's a different enough approach to be worth benchmarking.
P.P.S. It's possible to change the constraints to come up with slightly different constants; there are a lot of variations. Here's another set I did where the difference between your table and the formula will always be less than 0.008, and each value will be less than the one preceding it.
Edit: I tested this code (second formula) with 100 passes against a million random numbers between 0 and 10, along with the original code from the question, MSalters' currently accepted answer, and a brute-force implementation of max(value1, value2) + log(1.0 + exp(-abs(value1 - value2))). I tried it on a dual-core AMD Athlon and an Intel quad-core i7, and the results were roughly consistent. Here's a typical run:

Processors have gotten unbelievably fast over the years, so fast now that they can do a couple of floating-point multiplies and divides faster than they can look up a value in memory. Not only is this approach faster on a modern x86, it's also more accurate; the approximation errors in the equation are much smaller than the step errors caused by truncating the input for the lookup.
Naturally results can still vary based on your processor and compiler; benchmarking is still required for your own particular target.
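For reference when reproducing such a benchmark, the brute-force expression quoted above can be written down directly. This sketch is mine, not the answer's benchmark code, and it makes a useful correctness baseline for any approximation:

```cpp
#include <algorithm>
#include <cmath>

// Brute-force reference from the text: max(value1, value2) plus the
// exact correction term ln(1 + exp(-|value1 - value2|)).
double smooth_max_exact(double value1, double value2)
{
    return std::max(value1, value2)
         + std::log(1.0 + std::exp(-std::fabs(value1 - value2)));
}
```

Any faster variant can be checked against this function over the input range before trusting the benchmark numbers.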
It probably goes down both paths equally often, enough that you're causing a lot of pipelining problems for your processor.

Have you tried profiling?

I'd also suggest trying the standard library and seeing if that helps (for example, whether it's able to use any processor-specific instructions):
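The answer doesn't show which library calls it had in mind; one plausible reading (an assumption on my part, including the name smooth_max) uses std::fmax and std::log1p:

```cpp
#include <cmath>

// Hypothetical standard-library formulation: std::fmax often compiles to
// a single branchless instruction (e.g. maxsd on x86), and std::log1p
// evaluates ln(1 + x) accurately even when x is tiny.
double smooth_max(double value1, double value2)
{
    return std::fmax(value1, value2)
         + std::log1p(std::exp(-std::fabs(value1 - value2)));
}
```

Note this version has no branches at all, which sidesteps the pipelining concern entirely at the cost of an exp and a log1p per call.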
I'd probably have written the code a bit differently to handle the value2 < value1 case:

I am going to assume that when the function is called, you'll most likely hit the part where you have to use the lookup table rather than the >= 5.0 branches. In that case, it's better to guide the compiler towards this. Try it and let me know if it improves your program's performance.
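The answer's code block isn't reproduced above; a sketch consistent with its description (my reconstruction: compute the difference once, and test the common LUT case first so the compiler lays it out as the fall-through path) might look like:

```cpp
#include <cmath>

// Sketch only: 'lut' holds ln(1 + exp(-x)) sampled at steps of 0.1,
// as in the question. The expected common case (diff < 5.0) is tested
// first so it becomes the straight-line path.
double smooth_max_lut(double value1, double value2, const double lut[50])
{
    double big  = value1;
    double diff = value1 - value2;
    if (diff < 0.0) { big = value2; diff = -diff; }
    if (diff < 5.0)                    // expected common case
        return big + lut[(int)(diff * 10.0)];
    return big;
}
```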
The only reason this code would be the bottleneck in your application is that you are calling it many, many times. Are you sure you need to? Perhaps the algorithm higher up in the code can be changed to use the comparison less often?
I've done some very quick tests, but please profile the code yourself to verify the effect.

Changing the LUT[] to a static variable got me a 600% speed-up (from 3.5 s to 0.6 s), which is close to the absolute minimum of the test I used (0.4 s). See if that works, then re-profile to determine whether any further optimization is needed.

For reference, I was simply timing the execution of this loop (100 million iterations of the inner loop) in VC++ 2010:
You are computing value1 - value2 quite a few times in your function. Just do it once.

That cast to a uint8_t may also be problematic. As far as performance is concerned, the best integral type to use for a conversion from double to integer is an int, and the best integral type to use as an array index is also an int.

Note that the above guarantees that the LUT index will be between 0 (inclusive) and 50 (exclusive). There's no need for a uint8_t here.

Edit
After some playing around with some variations, this is a fairly fast LUT-based approximation to log(exp(value1) + exp(value2)):

The integral type IndexType is one of the keys to speeding things up. I tested with clang and g++, and both indicated that casting to an intptr_t (long on my computer) and using an intptr_t as an index into the LUT is faster than other integral types. It is considerably faster than some; for example, unsigned long long and uint8_t are incredibly poor choices on my computer.

The type is not just a hint, at least with the compilers I used. Those compilers did exactly what the code told them to do with regard to the conversion from floating-point type to integral type, regardless of the optimization level.
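The answer's code wasn't preserved above; a sketch matching its description (my reconstruction: one subtraction, an intptr_t index, and an integer comparison against 50 rather than a floating-point one against 5.0) might be:

```cpp
#include <cmath>
#include <cstdint>

// Reconstruction sketch, not the answer's actual code. IndexType is the
// integral type the answer found fastest for the double-to-index cast.
typedef std::intptr_t IndexType;

double smooth_max_lut(double value1, double value2, const double lut[50])
{
    double big  = value1 > value2 ? value1 : value2;
    double diff = value1 > value2 ? value1 - value2 : value2 - value1;
    IndexType idx = static_cast<IndexType>(diff * 10.0);
    return idx < 50 ? big + lut[idx] : big;   // integer compare vs. 50
}
```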
Another speed bump results from comparing the integral type to 50 as opposed to comparing the floating point type to 5.0.
One last speed bump: not all compilers are created equal. On my computer (YMMV), g++ -O3 generates considerably slower code (25% slower on this problem!) than clang -O3 does, which in turn generates code that is a bit slower than that generated by clang -O4.

I also played with a rational-function approximation approach (similar to Mark Ransom's answer), but the above obviously does not use such an approach.