Approximating log10[x^k0 + k1]

Posted on 2024-10-12 09:11:28

Greetings. I'm trying to approximate the function

Log10[x^k0 + k1], where 0.21 < k0 < 21, 0 < k1 < ~2000, and x is an integer < 2^14.

k0 and k1 are constants. For practical purposes, you can assume k0 = 2.12 and k1 = 2660. The desired accuracy is a relative error of 5*10^-4.

This function is virtually identical to a scaled Log[x] (namely k0 * Log10[x]), except near 0, where it differs a lot.
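As a baseline for judging any approximation, here is a minimal reference sketch in Python. It assumes the example constants given above; the helper names `target`, `scaled_log`, and `relative_error` are made up for illustration. It compares the function against k0 * log10(x) to show where the two agree to within the stated tolerance and where they do not.

```python
import math

K0 = 2.12          # example exponent from the post
K1 = 2660.0        # example offset from the post
TOLERANCE = 5e-4   # target relative error from the post

def target(x):
    """Reference value of log10(x^K0 + K1) in double precision."""
    return math.log10(x ** K0 + K1)

def scaled_log(x):
    """K0 * log10(x): what the target approaches once x^K0 dominates K1."""
    return K0 * math.log10(x)

def relative_error(approx, exact):
    return abs(approx - exact) / abs(exact)

for x in (1, 8, 64, 512, 4096, 16383):
    err = relative_error(scaled_log(x), target(x))
    verdict = "within tolerance" if err < TOLERANCE else "outside tolerance"
    print(f"x={x:6d}  target={target(x):.5f}  plain-log error={err:.2e}  {verdict}")
```

The same `relative_error` helper can be reused to score a lookup table or a polynomial against `target`.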

I have already come up with a SIMD implementation that is ~1.15x faster than a simple lookup table, but I would like to improve it if possible, which I think is very hard due to the lack of efficient instructions.

My SIMD implementation uses 16-bit fixed-point arithmetic to evaluate a 3rd-degree polynomial (I use a least-squares fit). The polynomial uses different coefficients for different input ranges. There are 8 ranges, and range i spans 64*2^i to 64*2^(i + 1).
The rationale behind this is that the derivatives of Log[x] drop rapidly with x, meaning a polynomial will fit it more accurately, since polynomials are an exact fit for functions whose derivatives are 0 beyond a certain order.
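The range scheme can be sketched in Python as follows. This is not the implementation described above: it uses double-precision floats instead of 16-bit fixed point, and cubic interpolation at Chebyshev nodes instead of a least-squares fit, purely because that is short to write without extra libraries. The names `TABLE`, `NODES`, and `approx` are hypothetical. It only covers 64 <= x < 2^14, the span of the 8 ranges.

```python
import math

K0, K1 = 2.12, 2660.0   # example constants from the post

# Four Chebyshev nodes on [1, 2], shared by every range after normalisation.
NODES = [1.5 + 0.5 * math.cos((2 * j + 1) * math.pi / 8) for j in range(4)]

def target(x):
    return math.log10(x ** K0 + K1)

def divided_differences(xs, ys):
    """Newton-form coefficients of the polynomial through the points (xs, ys)."""
    coeffs = list(ys)
    for level in range(1, len(xs)):
        for j in range(len(xs) - 1, level - 1, -1):
            coeffs[j] = (coeffs[j] - coeffs[j - 1]) / (xs[j] - xs[j - level])
    return coeffs

# One cubic per range i, built in the variable t = x / (64 * 2^i), 1 <= t < 2.
TABLE = [
    divided_differences(NODES, [target((64 << i) * t) for t in NODES])
    for i in range(8)
]

def approx(x):
    i = x.bit_length() - 7          # range index: 64..127 -> 0, ..., 8192..16383 -> 7
    t = x / (64 << i)               # normalised position inside the range
    c = TABLE[i]
    acc = c[3]
    for j in (2, 1, 0):             # Horner-style evaluation of the Newton form
        acc = acc * (t - NODES[j]) + c[j]
    return acc

max_rel_err = max(abs(approx(x) - target(x)) / target(x) for x in range(64, 1 << 14))
print(f"max relative error over 64 <= x < 2^14: {max_rel_err:.2e}")
```

Printing the measured maximum lets it be compared directly with the 5*10^-4 target; a fixed-point version would add quantisation error on top of this.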

SIMD table lookups are done very efficiently with a single _mm_shuffle_epi8(). I use SSE's float-to-int conversion to get the exponent and significand used for the fixed-point approximation. I also software-pipelined the loop to get a ~1.25x speedup, so further code-level optimizations are unlikely to help much.
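For readers less familiar with these two tricks, here is a scalar Python model of them. It is only an illustration of the idea, not the SSE code: `exponent_and_significand` reinterprets the float32 bit pattern of the input, and `shuffle_epi8` models the per-byte behaviour of the byte shuffle (an index selects one of 16 table bytes, and an index with its high bit set yields 0). The function names and the table contents are invented for the example.

```python
import struct

def exponent_and_significand(x):
    """Convert x to float32 and split its bits into
    (unbiased exponent, 23-bit mantissa field)."""
    bits = struct.unpack("<I", struct.pack("<f", float(x)))[0]
    exponent = ((bits >> 23) & 0xFF) - 127
    mantissa = bits & 0x7FFFFF
    return exponent, mantissa

def range_index(x):
    """Range i covers 64*2^i <= x < 64*2^(i+1), so i = exponent - 6."""
    exponent, _ = exponent_and_significand(x)
    return exponent - 6

def shuffle_epi8(table, indices):
    """Scalar model of a 16-byte shuffle used as a table lookup."""
    return [0 if i & 0x80 else table[i & 0x0F] for i in indices]

# Hypothetical byte table: one placeholder coefficient byte per range, padded to 16.
coeff_bytes = [10, 20, 30, 40, 50, 60, 70, 80] + [0] * 8
inputs = [64, 100, 128, 1000, 5000, 16383]
looked_up = shuffle_epi8(coeff_bytes, [range_index(x) for x in inputs])
print(looked_up)
```

Every integer below 2^14 is exactly representable in float32, so the exponent field gives the range index without rounding concerns.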

What I'm asking is whether there's a more efficient approximation at a higher level.
For example:

Can this function be decomposed into functions with a limited domain, like
log2((2^x) * significand) = x + log2(significand)

hence eliminating the need to deal with different ranges (table lookups)? The main problem, I think, is that adding the k1 term kills all those nice log properties that we know and love, making it impossible. Or is it?
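One identity that bears on this: factoring x^k0 out of the sum gives log10(x^k0 + k1) = k0 * log10(x) + log10(1 + k1 * x^(-k0)) for x >= 1, so the k1 term becomes an additive correction that shrinks as x grows. The Python sketch below only checks that identity numerically with the example constants; the function names are hypothetical, and it makes no claim about whether this form is faster in SIMD.

```python
import math

K0, K1 = 2.12, 2660.0   # example constants from the post

def direct(x):
    return math.log10(x ** K0 + K1)

def split(x):
    """Pure-log part plus a correction that decays towards 0 for large x."""
    pure_log = K0 * math.log10(x)
    correction = math.log1p(K1 / x ** K0) / math.log(10.0)
    return pure_log + correction

worst = max(abs(direct(x) - split(x)) for x in range(1, 1 << 14))
print(f"largest difference between the two forms: {worst:.1e}")
```

The pure-log part can use the usual exponent/significand split; the correction still depends on x through x^(-k0), so it would need its own approximation.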

An iterative method? I don't think so, because Newton's method for log[x] is already a complicated expression.
Exploiting locality of neighboring pixels? If all 8 inputs fall in the same approximation range, then I can look up a single set of coefficients instead of looking up separate coefficients for each element. Thus, I can use this as a fast common case, and use a slower, general code path when it doesn't hold. But for my data, the range needs to be ~2000 wide before this property holds 70% of the time, which doesn't seem to make this method competitive.
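The locality idea can be measured on real data with a few lines of Python. This sketch assumes the power-of-two ranges described earlier and inputs in 64 <= x < 2^14; `shared_range_fraction` is a made-up helper that reports how often a group of 8 consecutive inputs would qualify for the single-lookup fast path.

```python
def range_index(x):
    """Range i covers 64*2^i <= x < 64*2^(i+1); valid for 64 <= x < 2^14."""
    return x.bit_length() - 7

def shared_range_fraction(pixels, group=8):
    """Fraction of complete groups whose elements all share one range index."""
    groups = [pixels[i:i + group] for i in range(0, len(pixels) - group + 1, group)]
    shared = sum(1 for g in groups if len({range_index(p) for p in g}) == 1)
    return shared / len(groups)

# Toy data: the first group sits entirely in range 0, the second spans all 8 ranges.
pixels = [70, 80, 90, 100, 110, 120, 127, 64,
          64, 128, 256, 512, 1024, 2048, 4096, 8192]
print(shared_range_fraction(pixels))
```

Running it over a representative image gives the hit rate of the fast path directly, which is the number that decides whether the extra branch pays off.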

Please give me your opinions, especially if you're an applied mathematician, even if you say it can't be done. Thanks.
