近似 log10[x^k0 + k1]

发布于 2024-10-12 09:11:28 字数 1092 浏览 11 评论 0原文

问候。我试图近似函数

Log10[x^k0 + k1],其中 .21 < k0< 21、0< k1< ~2000,x为整数<2000 2^14。

k0 & k1 是常数。出于实际目的,您可以假设 k0 = 2.12,k1 = 2660。所需的精度为 5*10^-4 相对误差。

该函数实际上与 Log[x] 相同,只是接近 0 时差异很大。

我已经提出了一个 SIMD 实现,它比简单的查找表快约 1.15 倍,但希望尽可能改进它,我认为由于缺乏有效的指令,这非常困难。

我的 SIMD 实现使用 16 位定点算术来计算三阶多项式(我使用最小二乘拟合)。多项式针对不同的输入范围使用不同的系数。有 8 个范围,范围 i 跨越 (64)2^i 到 (64)2^(i + 1)。 其背后的原理是 Log[x] 的导数随 x 快速下降,这意味着多项式将更准确地拟合它,因为多项式精确拟合导数为 0 超出特定阶数的函数。

使用单个 _mm_shuffle_epi8() 即可非常高效地完成 SIMD 表查找。我使用 SSE 的 float 到 int 转换来获取用于定点近似的指数和有效数。我还对循环进行了软件管道化以获得约 1.25 倍的加速,因此可能不太可能进行进一步的代码优化。

我要问的是是否有更高级别的更有效的近似? 例如:

  1. 这个函数能否分解为具有有限域的函数,例如 log2((2^x) *significand) = x + log2(significand)

因此消除了处理不同范围(表查找)的需要。我认为主要问题是添加 k1 项会杀死我们所知道和喜爱的所有那些好的日志属性,使其成为不可能。或者是吗?

  1. 迭代法?不这么认为是因为 log[x] 的牛顿法已经是一个复杂的表达式

  2. 利用相邻像素的局部性? - 如果 8 个输入的范围落在相同的近似范围内,那么我可以查找单个系数,而不是查找每个元素的单独系数。因此,我可以将其用作快速的常见情况,并在不是时使用较慢的通用代码路径。但对于我的数据,在该属性在 70% 的时间内保持不变之前,范围需要约为 2000,这似乎并不使该方法具有竞争力。

请给我一些意见,特别是如果你是一名应用数学家,即使你说这是不可能完成的。谢谢。

Greetings. I'm trying to approximate the function

Log10[x^k0 + k1], where .21 < k0 < 21, 0 < k1 < ~2000, and x is integer < 2^14.

k0 & k1 are constant. For practical purposes, you can assume k0 = 2.12, k1 = 2660. The desired accuracy is 5*10^-4 relative error.

This function is virtually identical to Log[x], except near 0, where it differs a lot.

I already have came up with a SIMD implementation that is ~1.15x faster than a simple lookup table, but would like to improve it if possible, which I think is very hard due to lack of efficient instructions.

My SIMD implementation uses 16bit fixed point arithmetic to evaluate a 3rd degree polynomial (I use least squares fit). The polynomial uses different coefficients for different input ranges. There are 8 ranges, and range i spans (64)2^i to (64)2^(i + 1).
The rational behind this is the derivatives of Log[x] drop rapidly with x, meaning a polynomial will fit it more accurately since polynomials are an exact fit for functions that have a derivative of 0 beyond a certain order.

SIMD table lookups are done very efficiently with a single _mm_shuffle_epi8(). I use SSE's float to int conversion to get the exponent and significand used for the fixed point approximation. I also software pipelined the loop to get ~1.25x speedup, so further code optimizations are probably unlikely.

What I'm asking is if there's a more efficient approximation at a higher level?
For example:

  1. Can this function be decomposed into functions with a limited domain like
    log2((2^x) * significand) = x + log2(significand)

hence eliminating the need to deal with different ranges (table lookups). The main problem I think is adding the k1 term kills all those nice log properties that we know and love, making it not possible. Or is it?

  1. Iterative method? don't think so because the Newton method for log[x] is already a complicated expression

  2. Exploiting locality of neighboring pixels? - if the range of the 8 inputs fall in the same approximation range, then I can look up a single coefficient, instead of looking up separate coefficients for each element. Thus, I can use this as a fast common case, and use a slower, general code path when it isn't. But for my data, the range needs to be ~2000 before this property hold 70% of the time, which doesn't seem to make this method competitive.

Please, give me some opinion, especially if you're an applied mathematician, even if you say it can't be done. Thanks.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

不醒的梦 2024-10-19 09:11:28

您应该能够通过使用切比雪夫近似来改进最小二乘拟合。 (这个想法是,你正在寻找一个范围内最坏情况偏差最小的近似值;最小二乘则寻找平方差之和最小的近似值。)我猜这不会产生巨大的差异对于你的问题,但我不确定——希望它可以在某种程度上减少你需要分割的范围数量。

如果已经有 log(x) 的快速实现,也许可以计算 P(x) * log(x),其中 P(x) 是通过切比雪夫近似选择的多项式。 (而不是尝试将整个函数作为多项式近似来完成 - 需要更少的范围缩小。)

我在这里是一个业余爱好者 - 只是尝试一下,因为还没有很多答案。

You should be able to improve on least-squares fitting by using Chebyshev approximation. (The idea is, you're looking for the approximation whose worst-case deviation in a range is least; least-squares instead looks for the one whose summed squared difference is least.) I would guess this doesn't make a huge difference for your problem, but I'm not sure -- hopefully it could reduce the number of ranges you need to split into, somewhat.

If there's already a fast implementation of log(x), maybe compute P(x) * log(x) where P(x) is a polynomial chosen by Chebyshev approximation. (Instead of trying to do the whole function as a polynomial approx -- to need less range-reduction.)

I'm an amateur here -- just dipping my toe in as there aren't a lot of answers already.

遗心遗梦遗幸福 2024-10-19 09:11:28

一项观察:
您可以找到一个表达式来表示 x 需要多大,作为 k0 和 k1 的函数,使得项 x^k0 足以主导 k1 进行近似:

x^k0 +k1 ~= x^k0,允许您近似评估函数为

k0*Log(x)。

这将处理所有高于某个值的 x。

One observation:
You can find an expression for how large x needs to be as a function of k0 and k1, such that the term x^k0 dominates k1 enough for the approximation:

x^k0 +k1 ~= x^k0, allowing you to approximately evaluate the function as

k0*Log(x).

This would take care of all x's above some value.

月下凄凉 2024-10-19 09:11:28

我最近读到了 sRGB 模型如何将物理三刺激值压缩为存储的 RGB 值。

它基本上与我尝试近似的函数非常相似,只是它是分段定义的:

k0 x, x < 0.0031308

k1 x^0.417 - k2 否则

我被告知 Log[x^k0 + k1] 中的常量加法是为了使函数的开头更加线性。但这可以通过分段近似轻松实现。这将使近似更加“均匀”——只有 2 个近似范围。由于不再需要计算近似范围索引(整数对数)和执行 SIMD 系数查找,因此计算成本应该会更便宜。

目前,我的结论是这将是最好的方法,尽管它不能精确地近似该函数。困难的部分是提出这一改变并说服人们使用它。

I recently read how the sRGB model compresses physical tri stimulus values into stored RGB values.

It basically is very similar to the function I try to approximate, except that it's defined piece wise:

k0 x, x < 0.0031308

k1 x^0.417 - k2 otherwise

I was told the constant addition in Log[x^k0 + k1] was to make the beginning of the function more linear. But that can easily be achieved with a piece wise approximation. That would make the approximation a lot more "uniform" - with only 2 approximation ranges. This should be cheaper to compute due to no longer needing to compute an approximation range index (integer log) and doing SIMD coefficient lookup.

For now, I conclude this will be the best approach, even though it doesn't approximate the function precisely. The hard part will be proposing this change and convincing people to use it.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文