为什么以下循环展开会导致错误的结果？

发布于 2025-01-24 18:06:46 字数 1991 浏览 2 评论 0原文

我目前正在尝试优化一些我为三角调整的24x24矩阵的程序编写的MIPS汇编程序。我目前的目标是利用延迟的分支和手动循环展开，尝试减少周期。 注意：我使用的所有矩阵算术使用了32位单个精度。

算法的一部分涉及我要展开的以下循环（n永远是24），

...
    float inv = 1/A[k][k]
    for (j = k + 1; j < N; j++) {
        /* divide by pivot element */
        A[k][j] = A[k][j] * inv;
    }
...

但

...
    float inv = 1/A[k][k]
    for (j = k + 1; j < N; j +=2) {
        /* divide by pivot element */
        A[k][j]     = A[k][j]     * inv;
        A[k][j + 1] = A[k][j + 1] * inv;
    }
...

它会生成结果不正确，我不知道为什么。有趣的是，具有环状展开的版本正确地生成了矩阵的第一行，但其余版本不正确。不正确展开的不循环的版本可以三角矩阵进行三角调节。

这是我尝试这样做的尝试。

...

# No loop unrolling
loop_2:
    move    $a3, $t2          # column number b = j (getelem A[k][j])
    jal     getelem           # Addr of A[k][j] in $v0 and val in $f0
    addiu   $t2, $t2, 1       ## j += 2
    mul.s   $f0, $f0, $f2     # Perform A[k][j] * inv
    bltu    $t2, 24, loop_2   # if j < N, jump to loop_2
    swc1    $f0, 0($v0)       ## Perform A[k][j] := A[k][j] * inv

    # The matrix triangulates without problem with this original code.

...

...

# One loop unrolling
loop_2:
    move    $a3, $t2         # column number b = j (getelem A[k][j])
    jal     getelem          # Addr of A[k][j] in $v0 and val in $f0
    addiu   $t2, $t2, 2      ## j += 2
    lwc1    $f1, 4($v0)      # $f1 <- A[k][j + 1]
    mul.s   $f0, $f0, $f2    # Perform A[k][j] * inv
    mul.s   $f1, $f1, $f2    # Perform A[k][j+1] * inv
    swc1    $f0, 0($v0)      # Perform A[k][j] := A[k][j] * inv
    bltu    $t2, 24, loop_2  # if j < N, jump to loop_2
    swc1    $f1, 4($v0)      ## Perform A[k][j + 1] := A[k][j + 1] * inv

    # The first row in the resulting matrix is correct, but the remaining ones not when using this once unrolled loop code.

...

原文

I am currently trying to optimize some MIPS assembler that I've written for a program that triangulates a 24x24 matrix. My current goal is to utilize delayed branching and manual loop unrolling to try and cut down on the cycles. Note: I am using 32-bit single precision for all the matrix arithmetic.

Part of the algorithm involves the following loop that I'm trying to unroll (N will always be 24)

...
    float inv = 1/A[k][k]
    for (j = k + 1; j < N; j++) {
        /* divide by pivot element */
        A[k][j] = A[k][j] * inv;
    }
...

I want

...
    float inv = 1/A[k][k]
    for (j = k + 1; j < N; j +=2) {
        /* divide by pivot element */
        A[k][j]     = A[k][j]     * inv;
        A[k][j + 1] = A[k][j + 1] * inv;
    }
...

but it generates the incorrect result and I don't know why. The interesting thing is that the version with loop unrolling generates the first row of matrix correctly but the remaining ones incorrect. The version without loop unrolling correctly triangulates the matrix.

Here is my attempt at doing it.

...

# No loop unrolling
loop_2:
    move    $a3, $t2          # column number b = j (getelem A[k][j])
    jal     getelem           # Addr of A[k][j] in $v0 and val in $f0
    addiu   $t2, $t2, 1       ## j += 2
    mul.s   $f0, $f0, $f2     # Perform A[k][j] * inv
    bltu    $t2, 24, loop_2   # if j < N, jump to loop_2
    swc1    $f0, 0($v0)       ## Perform A[k][j] := A[k][j] * inv

    # The matrix triangulates without problem with this original code.

...

...

# One loop unrolling
loop_2:
    move    $a3, $t2         # column number b = j (getelem A[k][j])
    jal     getelem          # Addr of A[k][j] in $v0 and val in $f0
    addiu   $t2, $t2, 2      ## j += 2
    lwc1    $f1, 4($v0)      # $f1 <- A[k][j + 1]
    mul.s   $f0, $f0, $f2    # Perform A[k][j] * inv
    mul.s   $f1, $f1, $f2    # Perform A[k][j+1] * inv
    swc1    $f0, 0($v0)      # Perform A[k][j] := A[k][j] * inv
    bltu    $t2, 24, loop_2  # if j < N, jump to loop_2
    swc1    $f1, 4($v0)      ## Perform A[k][j + 1] := A[k][j + 1] * inv

    # The first row in the resulting matrix is correct, but the remaining ones not when using this once unrolled loop code.

...

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

毅然前行 2025-01-31 18:06:46

展开的C循环条件是越野车。

J＆lt; n; j += 2可以使用j = n-1，
开始循环主体
在a [k] [n-1]（fine）和a [k] [n] [n]（不罚款）上访问数组。

一种常见方法是J＆lt; n-1 ，或一般j＆lt; n-（unloll-1）。但是对于无符号n，您还必须在开始循环之前单独检查n＆gt; = enroll，因为n-1可以包装到一个巨大的无符号值。

保持J＆lt;限制通常对C编译器与J + 1＆lt； n这是他们必须计算的单独的东西。并且还可以阻止编译器证明该循环不是无限的无符号计数（例如size_t），因为它定义得很好，因此n = uint_max可能会导致条件总是在真正取决于起点。（例如j = uint_max-2 make uint_max-1＆lt; uint_max和j+= 2 make 0＆lt; uint_max，也为true。）因此，这与使用j＆lt; = limit对未签名计数器使用类似的问题。编译器非常喜欢知道循环何时可能是无限的。对于某些人来说，如果在第一次迭代之前无法计算的Trip计数，则可以禁用自动矢量化。

如果j是从0开始的，则如果保证n是揭开因素的倍数，则可以摆脱草率的条件。但是，正如内特指出的那样，这是不同的。

MIPS ASM的效率

通常是循环展开点是性能。循环内部对辅助功能的非内部呼叫正在击败目的。

JAL Getelem我假设有一堆乘以乘以用指针和两个整数重做索引的东西吗？请注意，您正在沿着连续的内存扫描，因此您可以增加指针。

计算一个终端排列以进行比较，因此您的MIPS循环看起来像

 # some checking outside the loop, maybe with a bxx to the end of it.
 looptop:                  # do{

    lwc1   $f2, 0($t0)
    lwc1   $f3, 4($t0)
    addiu  $t0, $t0, 4*2      # p+=2     advance by 8 bytes, 2 floats
    ...
    swc1   something, 0($t0)
    swc1   something, 4($t0)
    bne    $t0, $t1        # }while(p!=endp)

   # maybe another condition to check if you should run one last iteration.

MIPS bltu只是一个伪建筑（sltu/bnez）;这就是为什么更好地计算精确的终端排列的原因，以便您可以将单个机器指令用作循环分支的原因。

是的，这可能意味着将迭代的计数降低到2个倍数以确保正确性。或执行标量迭代和舍入 up 到2的倍数。例如x ++/x＆amp; = - 2;

带有软件管道的>负载和分裂，但还没有商店，如果奇怪的话，您可以让圆形上的循环重做该元素。（如果分支机构错误预测的机会超过FP的乘以乘以和多余的商店。）尚未完全考虑到这一点，但是Simd与第一个不规则的向量进行了类似的想法，然后是潜在的派对偏置的矢量。（Simd矢量化就像展开一样，但是随后您将其滚回到一个有4个元素的单个指令中。）

The unrolled C loop condition is buggy.

j < N; j +=2 can start the loop body with j = N-1,
accessing the array at A[k][N-1] (fine) and A[k][N] (not fine).

One common method is j < N-1, or in general j < N-(unroll-1). But for unsigned N, you also have to separately check N >= unroll before starting the loop, because N-1 could wrap to a huge unsigned value.

Keeping the j < limit is generally good for C compilers vs. j + 1 < N which is a separate thing they'd have to calculate. And can also stop a compiler from proving that the loop isn't infinite for unsigned counts (like size_t), because that's well-defined as wrapping around, so N = UINT_MAX could lead to the condition always being true depending on the starting point. (e.g. j = UINT_MAX-2 makes UINT_MAX-1 < UINT_MAX, and j+=2 makes 0 < UINT_MAX, also true.) So it's a similar problem to using j <= limit for unsigned counters. Compilers really like to know when a loop is potentially infinite. For some, that it disables auto-vectorization if the trip-count isn't calculable ahead of the first iteration.

If j was starting at 0, you can get away with a sloppy condition if N was guaranteed to be a multiple of the unroll factor. But here it's different, as Nate points out.

efficiency of your MIPS asm

generally the point of loop unrolling is performance. A non-inline call to a helper function inside the loop is kind of defeating the purpose.

jal getelem I assume does a bunch of multiplies and stuff to redo the indexing with a pointer and two integers? Notice that you're scanning along contiguous memory in one row, so you can just increment a pointer.

Calculate an end-pointer to compare against, so your MIPS loop can look like

 # some checking outside the loop, maybe with a bxx to the end of it.
 looptop:                  # do{

    lwc1   $f2, 0($t0)
    lwc1   $f3, 4($t0)
    addiu  $t0, $t0, 4*2      # p+=2     advance by 8 bytes, 2 floats
    ...
    swc1   something, 0($t0)
    swc1   something, 4($t0)
    bne    $t0, $t1        # }while(p!=endp)

   # maybe another condition to check if you should run one last iteration.

MIPS bltu is only a pseudo-instruction (sltu/bnez); that's why it's better to calculate an exact end-pointer so you can use a single machine instruction as the loop branch.

And yes, this might mean rounding the iteration count down to a multiple of 2 to ensure correctness. Or doing a scalar iteration and rounding up to a multiple of 2. e.g. x++ / x&=-2;

With software pipelining, e.g. doing a load and divide but not a store yet, you could maybe let the rounding-up have the loop redo that element if odd. (If the chance of a branch mispredict costs more than an FP multiply and a redundant store.) Haven't fully thought this through, but it's a similar idea to SIMD doing a first unaligned vector, then a potentially-partially-overlapping aligned vector. (SIMD vectorization is like unrolling, but then you roll back up into a single instruction that does 4 elements, for example.)

回复收藏 0 原文

~没有更多了~