手工降低浮点划分的强度

发布于 2025-02-08 15:35:44 字数 2002 浏览 2 评论 0原文

在本学期我们在计算机科学方面的最后一项任务之一中，我们必须对某些代码片段进行降低强度。他们中的大多数只是直截了当的，尤其是在研究编译器输出时。但是，即使在编译器的帮助下，我也无法解决其中一个。

我们的教授给了我们以下提示：

提示：询问IEEE 754单精制浮点数是如何在内存中表示。

这是代码段：（ a是类型double*），

for (int i = 0; i < N; ++i) {
    a[i] += i / 5.3;   
}

首先，我尝试查看此剪辑的编译器输出，以 godbolt 。我试图在没有任何优化的情况下对其进行编译：（注意：我仅复制了for循环中的相关部分）

    mov     eax, DWORD PTR [rbp-4]
    cdqe
    lea     rdx, [0+rax*8]
    mov     rax, QWORD PTR [rbp-16]
    add     rax, rdx
    movsd   xmm1, QWORD PTR [rax]
    cvtsi2sd        xmm0, DWORD PTR [rbp-4]    //division relevant
    movsd   xmm2, QWORD PTR .LC0[rip]          //division relevant
    divsd   xmm0, xmm2                         //division relevant
    mov     eax, DWORD PTR [rbp-4]
    cdqe
    lea     rdx, [0+rax*8]
    mov     rax, QWORD PTR [rbp-16]
    add     rax, rdx
    addsd   xmm0, xmm1
    movsd   QWORD PTR [rax], xmm0

，并使用-O3：

.L2:
        pshufd  xmm0, xmm2, 238             //division relevant
        cvtdq2pd        xmm1, xmm2          //division relevant
        movupd  xmm6, XMMWORD PTR [rax]
        add     rax, 32
        cvtdq2pd        xmm0, xmm0          //division relevant
        divpd   xmm1, xmm3                  //division relevant
        movupd  xmm5, XMMWORD PTR [rax-16]
        paddd   xmm2, xmm4
        divpd   xmm0, xmm3                  //division relevant
        addpd   xmm1, xmm6
        movups  XMMWORD PTR [rax-32], xmm1
        addpd   xmm0, xmm5
        movups  XMMWORD PTR [rax-16], xmm0
        cmp     rax, rbp
        jne     .L2

我评论了汇编代码的部门部分。但是，此输出并不能帮助我理解如何在片段上施加强度降低。（也许进行了太多的优化以完全理解输出）

第二，我试图理解Float Part Part Part 5.3的位表示。这是：

0   |10000001|01010011001100110011010
Sign|Exponent|Mantissa

但这也对我没有帮助。

原文

In one of our last assignments in Computer Science this term we have to apply strength reduction on some code fragments. Most of them were just straight forward, especially with looking into compiler output. But one of them I wont be able to solve, even with the help of the compiler.

Our profs gave us the following hint:

Hint: Inquire how IEEE 754 single-precision floating-point numbers are
represented in memory.

Here is the code snippet: (a is of type double*)

for (int i = 0; i < N; ++i) {
    a[i] += i / 5.3;   
}

At first I tried to look into the compiler output for this snipped on godbolt. I tried to compile it without any optimization: (note: I copied only the relevant part in the for loop)

    mov     eax, DWORD PTR [rbp-4]
    cdqe
    lea     rdx, [0+rax*8]
    mov     rax, QWORD PTR [rbp-16]
    add     rax, rdx
    movsd   xmm1, QWORD PTR [rax]
    cvtsi2sd        xmm0, DWORD PTR [rbp-4]    //division relevant
    movsd   xmm2, QWORD PTR .LC0[rip]          //division relevant
    divsd   xmm0, xmm2                         //division relevant
    mov     eax, DWORD PTR [rbp-4]
    cdqe
    lea     rdx, [0+rax*8]
    mov     rax, QWORD PTR [rbp-16]
    add     rax, rdx
    addsd   xmm0, xmm1
    movsd   QWORD PTR [rax], xmm0

and with -O3:

.L2:
        pshufd  xmm0, xmm2, 238             //division relevant
        cvtdq2pd        xmm1, xmm2          //division relevant
        movupd  xmm6, XMMWORD PTR [rax]
        add     rax, 32
        cvtdq2pd        xmm0, xmm0          //division relevant
        divpd   xmm1, xmm3                  //division relevant
        movupd  xmm5, XMMWORD PTR [rax-16]
        paddd   xmm2, xmm4
        divpd   xmm0, xmm3                  //division relevant
        addpd   xmm1, xmm6
        movups  XMMWORD PTR [rax-32], xmm1
        addpd   xmm0, xmm5
        movups  XMMWORD PTR [rax-16], xmm0
        cmp     rax, rbp
        jne     .L2

I commented the division part of the assembly code. But this output does not help me understanding how to apply strength reduction on the snippet. (Maybe there are too many optimizations going on to fully understand the output)

Second, I tried to understand the bit representation of the float part 5.3.
Which is:

0   |10000001|01010011001100110011010
Sign|Exponent|Mantissa

But this does not help me either.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

悲歌长辞 2025-02-15 15:35:44

如果我们采用wikipedia的定义

强度降低是一种编译器的优化，昂贵的操作被等效但价格较低的操作代替

然后我们可以通过将昂贵的浮点划分转换为浮点倍数，再加上两个浮点多重贴（FMA），在此使用强度降低。假设double映射到IEEE-754 binary64，则浮点数计算的默认舍入模式是圆头到达的，或者是> code> int是一种32位类型，我们可以通过简单的详尽测试来证明转换正确：

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <math.h>

int main (void)
{
    const double rcp_5p3 = 1.0 / 5.3; // 0x1.826a439f656f2p-3
    int i = INT_MAX;
    do {
        double ref = i / 5.3;
        double res = fma (fma (-5.3, i * rcp_5p3, i), rcp_5p3, i * rcp_5p3);
        if (res != ref) {
            printf ("error: i=%2d  res=%23.13a  ref=%23.13a\n", i, res, ref);
            return EXIT_FAILURE;
        }
        i--;
    } while (i >= 0);
    return EXIT_SUCCESS;
}

X86-64和ARM64（例如X86-64和ARM64）的大多数现代实例具有对FMA的硬件支持，因此fma（）可以直接映射到适当的硬件指令。应通过查看产生的二进制的拆卸来确认这一点。如果缺乏对FMA的硬件支持，则显然不应应用转换，因为fma（）的软件实现很慢，有时在功能上不正确。

这里的基本思想是，在数学上，除法等效于与倒数的乘法。但是，对于有限的浮动点算术不一定是正确的。上面的代码试图通过在FMA的帮助下确定幼稚方法中的误差并在必要时应用校正来改善位准确计算的可能性。有关文献参考的背景，请参阅此早期问题。

据我所知，尚无数学上一般证明的算法来确定与哪些股息配对上述转换是安全的（也就是说，要提供位准确的结果），这就是为什么严格必要的详尽测试的原因证明转换是有效的。

在评论中， pascal cuoq 指出，还有一种替代算法，可以使用潜在的优势 - 浮点数浮点数除名通过将除数的倒数预先计算到本地精度，特别是作为双双的编译时间。有关背景，请参见N. Brisebarre和J.-M. Muller，“通过arbirary Precision常数正确圆形乘法”，计算机上的IEEE交易，57（2）：162-174，2008年2月，该交易也提供了指导，该指南如何确定该转换是否安全。特定常数。由于本案很简单，因此我再次使用详尽的测试表明它是安全的。在适用的情况下，这将使分区降低到一个FMA，再加上一个倍数：

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <mathimf.h>

int main (void)
{
    const double rcp_5p3_hi =  1.8867924528301888e-1; //  0x1.826a439f656f2p-3
    const double rcp_5p3_lo = -7.2921377017921457e-18;// -0x1.0d084b1883f6e0p-57
    int i = INT_MAX;
    do {
        double ref = i / 5.3;
        double res = fma (i, rcp_5p3_hi, i * rcp_5p3_lo);
        if (res != ref) {
            printf ("i=%2d  res=%23.13a  ref=%23.13a\n", i, res, ref);
            return EXIT_FAILURE;
        }
        i--;
    } while (i >= 0);
    return EXIT_SUCCESS;
}

If we adopt Wikipedia's definition that

strength reduction is a compiler optimization where expensive operations are replaced with equivalent but less expensive operations

then we can apply strength reduction here by converting the expensive floating-point division into a floating-point multiply plus two floating-point multiply-adds (FMAs). Assuming that double is mapped to IEEE-754 binary64, the default rounding mode for floating-point computation is round-to-nearest-or-even, and that int is a 32-bit type, we can prove the transformation correct by simple exhaustive test:

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <math.h>

int main (void)
{
    const double rcp_5p3 = 1.0 / 5.3; // 0x1.826a439f656f2p-3
    int i = INT_MAX;
    do {
        double ref = i / 5.3;
        double res = fma (fma (-5.3, i * rcp_5p3, i), rcp_5p3, i * rcp_5p3);
        if (res != ref) {
            printf ("error: i=%2d  res=%23.13a  ref=%23.13a\n", i, res, ref);
            return EXIT_FAILURE;
        }
        i--;
    } while (i >= 0);
    return EXIT_SUCCESS;
}

Most modern instances of common processors architectures like x86-64 and ARM64 have hardware support for FMA, such that fma() can be mapped directly to the appropriate hardware instruction. This should be confirmed by looking at the disassembly of the binary generated. Where hardware support for FMA is lacking the transformation obviously should not be applied, as software implementations of fma() are slow and sometimes functionally incorrect.

The basic idea here is that mathematically, division is equivalent to multiplication with the reciprocal. However, that is not necessarily true for finite-precision floating-point arithmetic. The code above tries to improve the likelihood of bit-accurate computation by determining the error in the naive approach with the help of FMA and applying a correction where necessary. For background including literature references see this earlier question.

To the best of my knowledge, there is not yet a general mathematically proven algorithm to determine for which divisors paired with which dividends the above transformation is safe (that is, delivers bit-accurate results), which is why an exhaustive test is strictly necessary to show that the transformation is valid.

In comments, Pascal Cuoq points out that there is an alternative algorithm to potentially strength-reduce floating-point division with a compile-time constant divisor, by precomputing the reciprocal of the divisor to more than native precision and specifically as a double-double. For background see N. Brisebarre and J.-M. Muller, "Correctly rounded multiplication by arbirary precision constant", IEEE Transactions on Computers, 57(2): 162-174, February 2008, which also provides guidance how to determine whether that transformation is safe for any particular constant. Since the present case is simple, I again used exhaustive test to show it is safe. Where applicable, this will reduce the division down to one FMA plus a multiply:

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <mathimf.h>

int main (void)
{
    const double rcp_5p3_hi =  1.8867924528301888e-1; //  0x1.826a439f656f2p-3
    const double rcp_5p3_lo = -7.2921377017921457e-18;// -0x1.0d084b1883f6e0p-57
    int i = INT_MAX;
    do {
        double ref = i / 5.3;
        double res = fma (i, rcp_5p3_hi, i * rcp_5p3_lo);
        if (res != ref) {
            printf ("i=%2d  res=%23.13a  ref=%23.13a\n", i, res, ref);
            return EXIT_FAILURE;
        }
        i--;
    } while (i >= 0);
    return EXIT_SUCCESS;
}

回复收藏 0 原文

雄赳赳气昂昂 2025-02-15 15:35:44

要涵盖另一个方面：由于类型int的所有值都可以完全表示为double（但不是float），因此可以骑行在评估i / 5.3 < / code>时，通过引入一个从0.0到n的浮点变量来评估i / 5.3 < / code>时发生的int to-double转换：

double fp_i = 0;
for (int i = 0; i < N; fp_i += 1, i++)
    a[i] += fp_i / 5.3;

但是，这会杀死自动化，并引入了一系列相关的链条浮点添加。浮点添加通常是3或4个周期，因此，即使CPU可以更快地将指令分配到循环中，最后一次迭代将在至少（N-1）*3周期后退休。值得庆幸的是，浮点部门没有完全管道，X86 CPU可以派遣浮点数划分的速率大致匹配或超过加法说明的延迟。

这留下了杀死矢量化的问题。可以通过手动展开循环并引入两个独立链来将其带回，但是使用AVX，您需要四个链才能进行完整的矢量化：

double fp_i0 = 0, fp_i1 = 1;
int i = 0;
for (; i+1 < N; fp_i0 += 2, fp_i1 += 2, i += 2) {
    double t0 = a[i], t1 = a[i+1];
    a[i]   = t0 + fp_i0 / 5.3;
    a[i+1] = t1 + fp_i1 / 5.3;
}
if (i < N)
    a[i] += i / 5.3;

To cover another aspect: since all values of type int are exactly representable as double (but not as float), it is possible to get rid of int-to-double conversion that happens in the loop when evaluating i / 5.3 by introducing a floating-point variable that counts from 0.0 to N:

double fp_i = 0;
for (int i = 0; i < N; fp_i += 1, i++)
    a[i] += fp_i / 5.3;

However, this kills autovectorization, and introduces a chain of dependent floating-point additions. Floating point addition is typically 3 or 4 cycles, so the last iteration will retire after at least (N-1)*3 cycles, even if the CPU could dispatch the instructions in the loop faster. Thankfully, floating-point division is not fully pipelined, and the rate at which an x86 CPU can dispatch floating-point division roughly matches or exceeds latency of the addition instruction.

This leaves the problem of killed vectorization. It's possible to bring it back by manually unrolling the loop and introducing two independent chains, but with AVX you'd need four chains for full vectorization:

double fp_i0 = 0, fp_i1 = 1;
int i = 0;
for (; i+1 < N; fp_i0 += 2, fp_i1 += 2, i += 2) {
    double t0 = a[i], t1 = a[i+1];
    a[i]   = t0 + fp_i0 / 5.3;
    a[i+1] = t1 + fp_i1 / 5.3;
}
if (i < N)
    a[i] += i / 5.3;

回复收藏 0 原文

予囚 2025-02-15 15:35:44

警告：几天后，我意识到此答案是不正确的，因为它在计算o / 5.3 < / code>的计算中忽略了下流的后果（对亚正常或零）。在这种情况下，将结果乘以两个功率是“精确的”，但并不产生结果，即将较大整数除以5.3具有。

i/5.3 < /代码>仅需要计算i的奇数。对于i的偶数值，您可以简单地乘以2.0（i/2）/5.3的值，该值已经在循环中以前计算出来。

剩下的困难是以0和n-1之间的每个索引重新排序迭代，并且该程序无需记录任意号码分区结果。

实现此目的的一种方法是在所有奇数上迭代o小于n，然后在计算o/5.3之后，以处理索引<代码> o ，还处理表单o*2 ** p的所有索引。

if (N > 0) {
  a[0] += 0.0; // this is needed for strict IEEE 754 compliance lol
  for (int o = 1; o < N; o += 2) {
    double d = o / 5.3;
    int i = o;
    do { 
      a[i] += d;
      i += i;
      d += d;
    } while (i < N);
  }
}

注意：这没有使用提供的提示“询问在内存中如何表示IEEE 754单精度浮点数”。我认为我非常了解单一精确的浮点数在内存中的表示，但是我看不出这是相关的，尤其是因为代码中没有单一精确值或计算以进行优化。我认为该问题的表达方式存在错误，但是以上方面仍然是对这个问题的部分答案。

我还忽略了n的值的溢出问题，这些值接近上述代码段中的int_max，因为代码已经足够复杂。

另外，上面的转换仅取代了两个分区。它通过使代码不可证实（也不太适合缓存）来做到这一点。在您的问题中，gcc -o3已经表明，自动矢量化可以应用于您的教授建议的起点，这可能比抑制一半的部门更有益。这个答案中的转型唯一好处是，这是一种降低力量，您的教授要求。

CAVEAT: After a few days I realized that this answer is incorrect in that it ignores the consequence of underflow (to subnormal or to zero) in the computation of o / 5.3. In this case, multiplying the result by a power of two is “exact” but does not produce the result that dividing the larger integer by 5.3 would have.

i / 5.3 only needs to be computed for odd values of i.
For even values of i, you can simply multiply by 2.0 the value of (i/2)/5.3, which was already computed earlier in the loop.

The remaining difficulty is to reorder the iterations in a way such that each index between 0 and N-1 is handled exactly once and the program does not need to record an arbitrary number of division results.

One way to achieve this is to iterate on all odd numbers o less than N, and after computing o / 5.3 in order to handle index o, to also handle all indexes of the form o * 2**p.

if (N > 0) {
  a[0] += 0.0; // this is needed for strict IEEE 754 compliance lol
  for (int o = 1; o < N; o += 2) {
    double d = o / 5.3;
    int i = o;
    do { 
      a[i] += d;
      i += i;
      d += d;
    } while (i < N);
  }
}

Note: this does not use the provided hint “Inquire how IEEE 754 single-precision floating-point numbers are represented in memory”. I think I know pretty well how single-precision floating-point numbers are represented in memory, but I do not see how that is relevant, especially since there are no single-precision values or computations in the code to optimize. I think there is a mistake in the way the problem is expressed, but still the above is technically a partial answer to the question as phrased.

I also ignored overflow problems for values of N that come close to INT_MAX in the code snippet above, since the code is already complicated enough.

As an additional note, the above transformation only replaces one division out of two. It does this by making the code unvectorizable (and also less cache-friendly). In your question, gcc -O3 has already shown that automatic vectorization could be applied to the starting point that your professor suggested, and that is likely to be more beneficial than suppressing half the divisions can be. The only good thing about the transformation in this answer is that it is a sort of strength reduction, which your professor requested.

回复收藏 0 原文

~没有更多了~