Why don't GCC and Clang optimize multiplication by 2^n with a float to paddd, even with -ffast-math?
Considering this function,
float mulHalf(float x) {
    return x * 0.5f;
}
the following function produces the same result with normal input/output.
#include <immintrin.h>

float mulHalf_opt(float x) {
    __m128i e = _mm_set1_epi32(-1 << 23);
    __asm__ ("paddd\t%0, %1" : "+x"(x) : "xm"(e));
    return x;
}
This is the assembly output with -O3 -ffast-math.
mulHalf:
mulss xmm0, DWORD PTR .LC0[rip]
ret
mulHalf_opt:
paddd xmm0, XMMWORD PTR .LC1[rip]
ret
-ffast-math enables -ffinite-math-only, which "assumes that arguments and results are not NaNs or +-Infs" [1]. So the compiled output of mulHalf might better use paddd when -ffast-math is on, if doing so produces faster code within the tolerance of -ffast-math.
I got the following tables from the Intel Intrinsics Guide.
(MULSS)
Architecture Latency Throughput (CPI)
Skylake 4 0.5
Broadwell 3 0.5
Haswell 5 0.5
Ivy Bridge 5 1
(PADDD)
Architecture Latency Throughput (CPI)
Skylake 1 0.33
Broadwell 1 0.5
Haswell 1 0.5
Ivy Bridge 1 0.5
Clearly, paddd is a faster instruction. Then I thought maybe it's because of the bypass delay between integer and floating-point units.
This answer shows a table from Agner Fog.
Processor Bypass delay, clock cycles
Intel Core 2 and earlier 1
Intel Nehalem 2
Intel Sandy Bridge and later 0-1
Intel Atom 0
AMD 2
VIA Nano 2-3
Seeing this, paddd still seems like a winner, especially on CPUs later than Sandy Bridge, but specifying -march for recent CPUs just changes mulss to vmulss, which has similar latency/throughput.

Why don't GCC and Clang optimize multiplication by 2^n with a float to paddd, even with -ffast-math?
This fails for an input of 0.0f, which -ffast-math doesn't rule out. (Even though technically that's a special case of a subnormal that just happens to also have a zero mantissa.) Integer subtraction would wrap to an all-ones exponent field and flip the sign bit, so you'd get 0.0f * 0.5f producing -Inf, which is simply not acceptable.

@chtz points out that the +0.0f case can be repaired by using psubusw, but that still fails for -0.0f -> +Inf. So unfortunately that's not usable either, even with -ffast-math allowing the "wrong" sign of zero. And being fully wrong for infinities and NaNs is also undesirable even with fast-math.

Other than that, yes, I think this would work, and pay for itself in bypass latency vs. ALU latency on CPUs other than Nehalem, even if used between other FP instructions.
The 0.0 behaviour is a showstopper. Besides that, the underflow behaviour is a lot less desirable than with FP multiply for other inputs, e.g. producing a subnormal even when FTZ (flush to zero on output) is set. Code that reads it with DAZ set (denormals are zero) would still handle it properly, but the FP bit-pattern might also be wrong for a number with the minimum normalized exponent (encoded as 1) and a non-zero mantissa. e.g. you could get a bit-pattern of 0x00000001 as a result of multiplying a normalized number by 0.5f.

Even if not for the 0.0f showstopper, this weirdness might be more than GCC would be willing to inflict on people. So I wouldn't expect it even for cases where GCC can prove non-zero, unless it could also prove far from FLT_MIN. That may be rare enough not to be worth looking for.

You can certainly do it manually when you know it's safe, although it's much more convenient with SIMD intrinsics. I'd expect rather bad asm from scalar type-punning, probably 2x movd around an integer sub, instead of keeping the value in an XMM register for paddd when you only want the low scalar FP element.

Godbolt for several attempts, including straightforward intrinsics which clang compiles to just a memory-source paddd like we hoped. Clang's shuffle optimizer sees that the upper elements are "dead" (_mm_cvtss_f32 only reads the bottom one) and is able to treat them as "don't care". And a plain scalar version. I haven't tested to see if it can auto-vectorize, but it might conceivably do so. Without that, GCC and clang both do movd / add / movd (or sub) to bounce the value to a GP-integer register.