Avoid calling floor()

Published 2024-08-23 15:18:24


I am working on a piece of code where I need to deal with uvs (2D texture coordinates) that are not necessarily in the 0 to 1 range. As an example, sometimes I will get a uv with a u component that is 1.2. In order to handle this I am implementing a wrapping which causes tiling by doing the following:

u -= floor(u)
v -= floor(v)

Doing this causes 1.2 to become 0.2 which is the desired result. It also handles negative cases, such as -0.4 becoming 0.6.

However, these calls to floor are rather slow. I have profiled my application using Intel VTune and I am spending a huge amount of cycles just doing this floor operation.

Having done some background reading on the issue, I have come up with the following function which is a bit faster but still leaves a lot to be desired (I am still incurring type conversion penalties, etc).

int inline fasterfloor( const float x ) { return x > 0 ? (int) x : (int) x - 1; }

I have seen a few tricks that are accomplished with inline assembly but nothing that seems to work exactly correct or have any significant speed improvement.

Does anyone know any tricks for handling this kind of scenario?


Comments (9)

定格我的天空 2024-08-30 15:18:24


Old question, but I came across it and it made me convulse slightly that it hasn't been satisfactorily answered.

TL;DR: **Don't** use inline assembly, intrinsics, or any of the other given solutions for this! Instead, compile with fast/unsafe math optimizations ("-ffast-math -funsafe-math-optimizations -fno-math-errno" in g++). The reason why floor() is so slow is because it changes global state if the cast would overflow (FLT_MAX does not fit in a scalar integer type of any size), which also makes it impossible to vectorize unless you disable strict IEEE-754 compatibility, which you should probably not rely on anyway. Compiling with these flags disables the problem behavior.

Some remarks:

  1. Inline assembly with scalar registers is not vectorizable, which drastically inhibits performance when compiling with optimizations. It also requires that any relevant values currently stored in vector registers be spilled to the stack and reloaded into scalar registers, which defeats the purpose of hand-optimization.

  2. Inline assembly using the SSE cvttss2si with the method you've outlined is actually slower on my machine than a simple for loop with compiler optimizations.
    This is likely because your compiler will allocate registers and avoid pipeline stalls better if you allow it to vectorize whole blocks of code together. For a short piece of code like this with few internal dependent chains and almost no chance of register spillage it has very little chance to do worse than hand-optimized code surrounded by asm().

  3. Inline assembly is unportable, unsupported in Visual Studio 64-bit builds, and insanely hard to read. Intrinsics suffer from the same caveats as well as the ones listed above.

  4. All the other listed ways are simply incorrect, which is arguably worse than being slow, and they give in each case such a marginal performance improvement that it doesn't justify the coarseness of the approach. (int)(x+16.0)-16.0 is so bad I won't even touch it, but your method is also wrong because it gives floor(-1) as -2. It's also a very bad idea to include branches in math code when it's so performance critical that the standard library won't do the job for you. So your (incorrect) way should look more like ((int) x) - (x<0.0), maybe with an intermediate so you don't have to perform the fpu move twice. Branches can cause a cache miss, which will completely negate any increase in performance; also, if math errno is disabled, then casting to int is the biggest remaining bottleneck of any floor() implementation. If you /really/ don't care about getting correct values for negative integers, it may be a reasonable approximation, but I wouldn't risk it unless you know your use case very well.

  5. I tried using bitwise casting and rounding-via-bitmask, like what SUN's newlib implementation does in fmodf, but it took a very long time to get right and was several times slower on my machine, even without the relevant compiler optimization flags. Very likely, they wrote that code for some ancient CPU where floating point operations were comparatively very expensive and there were no vector extensions, let alone vector conversion operations; this is no longer the case on any common architectures AFAIK. SUN is also the birthplace of the fast inverse sqrt() routine used by Quake 3; there is now an instruction for that on most architectures. One of the biggest pitfalls of micro-optimizations is that they become outdated quickly.

时光暖心i 2024-08-30 15:18:24


So you want a really fast float->int conversion? AFAIK int->float conversion is fast, but on at least MSVC++ a float->int conversion invokes a small helper function, ftol(), which does some complicated stuff to ensure a standards compliant conversion is done. If you don't need such strict conversion, you can do some assembly hackery, assuming you're on an x86-compatible CPU.

Here's a function for a fast float-to-int, using MSVC++ inline assembly syntax (note that fistp uses the FPU's current rounding mode, round-to-nearest by default, rather than rounding down; it should give you the right idea anyway):

inline int ftoi_fast(float f)
{
    int i;

    __asm
    {
        fld f
        fistp i
    }

    return i;
}

On MSVC++ 64-bit you'll need an external .asm file since the 64 bit compiler rejects inline assembly. That function basically uses the raw x87 FPU instructions for load float (fld) then store float as integer (fistp). (Note of warning: you can change the rounding mode used here by directly tweaking registers on the CPU, but don't do that, you'll break a lot of stuff, including MSVC's implementation of sin and cos!)

If you can assume SSE support on the CPU (or there's an easy way to make an SSE-supporting codepath) you can also try:

#include <emmintrin.h>

inline int ftoi_sse1(float f)
{
    return _mm_cvtt_ss2si(_mm_load_ss(&f));     // SSE1 instructions for float->int
}

...which is basically the same (load float then store as integer) but using SSE instructions, which are a bit faster.

One of those should cover the expensive float-to-int case, and any int-to-float conversions should still be cheap. Sorry to be Microsoft-specific here but this is where I've done similar performance work and I got big gains this way. If portability/other compilers are an issue you'll have to look at something else, but these functions compile to maybe two instructions taking <5 clocks, as opposed to a helper function that takes 100+ clocks.

壹場煙雨 2024-08-30 15:18:24


The operation you want can be expressed using the fmod function (fmodf for floats rather than doubles):

#include <math.h>
u = fmodf(u, 1.0f);

Chances are reasonably good that your compiler will do this in the most efficient way that works.

Alternately, how concerned are you about last-bit precision? Can you put a lower bound on your negative values, such as knowing that they're never below -16.0? If so, something like this will save you a conditional, which is quite likely to be useful if it's not something that can be reliably branch-predicted with your data:

u = (u + 16.0);  // Does not affect fractional part aside from roundoff errors.
u -= (int)u;     // Recovers fractional part if positive.

(For that matter, depending on what your data looks like and the processor you're using, if a large fraction of them are negative but a very small fraction are below -16.0, you might find that adding 16.0f before doing your conditional int-casting gives you a speedup because it makes your conditional predictable. Or your compiler may be doing that with something other than a conditional branch, in which case it's not useful; it's hard to say without testing and looking at the generated assembly.)

沧桑㈠ 2024-08-30 15:18:24


If you are using Visual C++, check the "Enable Intrinsic Functions" compiler setting. If enabled it should make most math functions faster (including floor). Downside is that handling of edge cases (like NaN) could be incorrect, but for a game, you might not care.

冬天旳寂寞 2024-08-30 15:18:24


Another silly idea that might just work if the range is small...

Extract the exponent from the float using bitwise operations, then use a lookup table to find a mask that wipes unwanted bits from the mantissa. Use this to find the floor (wipe bits below the point) to avoid renormalising issues.

EDIT: I deleted this as "too silly, plus with a +ve vs. -ve issue". Since it got upvoted anyway, it's undeleted and I'll leave it to others to decide how silly it is.

£烟消云散 2024-08-30 15:18:24


What is the maximum input range of your u, v values ? If it's a fairly small range, e.g. -5.0 to +5.0, then it will be quicker to repeatedly add/subtract 1.0 until you get within range, rather than calling expensive functions such as floor.

背叛残局 2024-08-30 15:18:24


If the range of values that may occur is sufficiently small, perhaps you can binary-search the floor value. For example, if values -2 <= x < 2 can occur...

if (u >= 0.0)
{
  if (u < 1.0)
  {
    //  floor is 0
  }
  else
  {
    //  floor is 1
  }
}
else
{
  if (u < -1.0)
  {
    //  floor is -2
  }
  else
  {
    //  floor is -1
  }
}

I make no guarantees about this - I don't know how the efficiency of comparisons compares with floor - but it may be worth trying.

带刺的爱情 2024-08-30 15:18:24


This one doesn't solve the casting cost, but it should be mathematically correct:

int inline fasterfloor( const float x ) { return x < 0 ? ((int) x == x ? (int) x : (int) x - 1) : (int) x; }

坦然微笑 2024-08-30 15:18:24


If you are looping and using u and v as index coordinates, instead of flooring a float to get the coordinates, keep both a float and int of the same value and increment them together. This will give you a corresponding integer to use when needed.
