高效的手动循环展开
我有这个 C 代码:
for (k = 0; k < n_n; k++) {
if (k == i || k == j) continue;
dd=q2_vect[k]-q1_vect;
d2=dd*dd;
if (d2<0) {
a=1;
break;
}
}
出于编译器优化原因(在 Cell 处理器的 SPE 上),我需要手动取消循环,所以我尝试了:
dd=q2_vect[0]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;
dd=q2_vect[1]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;
dd=q2_vect[2]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;
.....
.....
// end
goto notdone;
done:
ok=0;
notdone:
.....
但我不知道如何处理
if (k == i || k == j) continue;
和 与 lopp 依赖的事实每次在“n_n”上运行时,我应该手动编写代码多次,以达到最大值“n_n”。
您认为如何解决?
I have this C code:
for (k = 0; k < n_n; k++) {
if (k == i || k == j) continue;
dd=q2_vect[k]-q1_vect;
d2=dd*dd;
if (d2<0) {
a=1;
break;
}
}
For compiler optimization reasons (on the SPE of the Cell processor), I need to unloop this by hand, so I tried:
dd=q2_vect[0]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;
dd=q2_vect[1]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;
dd=q2_vect[2]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;
.....
.....
// end
goto notdone;
done:
ok=0;
notdone:
.....
but I do not know how to deal with the
if (k == i || k == j) continue;
and with the fact that the lopp depends on each run on "n_n", and by hand I should write the code so many times as the maximal value "n_n" would get.
How do you think it can be fixed?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
你确定所写的代码是正确的吗?如果
dd
是有符号整数类型,则当前代码具有未定义的行为,并且如果d2
为无符号或 ifdd
则永远不会满足 if 中的条件> 和 d2 是浮点类型。看起来您正在对除i
或j
之外的第一个索引k
进行损坏的搜索,其中对表达式q2_vect[ 进行平方k]-q1_vect
溢出。至于有效地跳过
i
和j
迭代,我会只查看展开的“循环”停止的位置,并在k+1
处重新启动它如果k
等于i
或j
。这是假设循环中的代码没有副作用/运行总计,这与所写的一样,但我希望您可能想让代码执行其他操作(例如对平方求和)。最后,当您似乎根本没有工作代码时,我高度怀疑您希望手动展开循环。任何好的编译器都可以为您展开循环,但通常您想要执行的循环展开类型会使性能变得更糟而不是更好。我认为你最好先让你的代码正常工作,然后测量(并查看编译器生成的 asm),并且只有在你确定存在问题之后才尝试改进。
Are you sure the code as written is correct? The current code has undefined behavior if
dd
is a signed integer type, and the condition in the if is never satisfied ifd2
is unsigned or ifdd
andd2
are floating point types. It looks like you're doing a broken search for the first indexk
other thani
orj
where squaring the expressionq2_vect[ k]-q1_vect
overflows.As for efficiently skipping the
i
andj
iterations, I would instead just look at where the unrolled "loop" stopped, and restart it atk+1
ifk
was equal toi
orj
. This is assuming the code in your loop has no side effects/running total, which is true as written, but I expect you might have meant for the code to do something else (like summing the squares).Finally, I am highly skeptical of your wish to unroll the loop manually when you don't even seem to have working code to begin with. Any good compiler can unroll the loop for you, but often the type of loop unrolling you're looking to do makes performance worse rather than better. I think you'd do better getting your code to work correctly first, then measuring (and looking at the compiler-generated asm), and only trying to improve on that after you've determined there's a problem.
编写的这段代码相当不适合 SPE,因为它的分支太多。此外,有关所涉及变量类型的信息也会有所帮助;编写的测试看起来相当晦涩(即使使用
>0
修复),但代码看起来可能是使用某种重载operator -
的向量类的 C++表示两个向量的向量减法和运算符 * 来计算点积。在 SPE 上处理此类简单循环的第一件事是让它们无分支(至少是内部循环;即展开几次并仅在每 N 次迭代时检查是否提前退出)并使用 SIMD 指令:SPE 只有 SIMD 指令,因此不在循环中使用 SIMD 处理会立即浪费 75% 的可用寄存器空间和计算能力。同样,SPE 一次只能加载对齐的 qword(16 字节),使用较小的数据类型需要额外的工作来打乱寄存器的内容,以便您尝试加载的值最终出现在“首选槽”中。
您可以通过使用以下无分支形式重写循环的第一部分来摆脱 if (k == i || k == j) (这是伪代码。它立即适用于整数,但您需要使用内在函数来获取浮点上的按位运算):
这里,
cmp_equal
对应于相应的 SPE 内在函数(语义:cmp_equal(a,b) == (a = = b) ? ~0u : 0 )。当
k == i
或k == j
时,这会强制d2
为零。要避免内部循环中的
if (d2 > 0)
分支,请执行以下操作:并且仅每隔几个循环检查
a
是否为非零(提前退出)迭代。如果为d2
计算的所有值都是非负的(如果您的类型是整数、浮点数或实值向量类,则会出现这种情况),您可以进一步简化。只要这样做:最后,只有当所有单独的项都非零时,
a
才会非零。但要小心整数溢出(如果您使用的是整数)和 NaN(如果您使用的是浮点数)。如果您必须处理这些情况,上述简化将破坏代码。This code as written is fairly unsuitable for SPEs since it's so branch-heavy. Also, information on the types of the variables involved would help; the test as written seems fairly obscure (even with the
>0
fix), but the code looks like it might be C++ using some sort of vector class that overloadsoperator -
to mean vector subtraction andoperator *
of two vectors to compute a dot product.The first thing to do with such simple loops on SPEs is to get them branch-free (at least the inner loop; i.e. unroll a couple of times and only check for early exit every N iterations) and use SIMD instructions: SPEs only have SIMD instructions, so not using SIMD processing in your loops instantly wastes 75% of your available register space and computational power. Similarly, SPEs can only load aligned qwords (16 bytes) at a time, using smaller data types requires extra work to shuffle the contents of registers around so that the value you're trying to load ends up in the "preferred slot".
You get rid of the
if (k == i || k == j)
by rewriting the first part of the loop using the following branch-free form (this is pseudocode. It's immediately applicable for ints, but you'll need to use intrinsics to get bitwise ops on floats):Here,
cmp_equal
corresponds to the respective SPE intrinsics (semantics:cmp_equal(a,b) == (a == b) ? ~0u : 0
). This forcesd2
to zero whenk == i
ork == j
.To avoid the
if (d2 > 0)
branch in the inner loop, do the following:and only check if
a
is nonzero (to early-out) every few loop iterations. If all values computed ford2
are nonnegative (will be the case if your type is ints, floats or a real-valued vector class), you can simplify this further. Just do:In the end,
a
will only be nonzero if all of the individual terms were nonzero. But be careful with integer overflows (if you're using ints) and NaNs (if you're using floats). If you have to handle these cases, the above simplification will break the code.通常循环展开意味着使循环包含几次迭代,从而减少运行次数。例如,
可以展开为
Usually loop unrolling means making the loop contain a few iterations, such that it is run fewer times. For example,
could be unrolled to
对于第一个问题,当条件满足时,您需要不“执行”循环体。对于这个特定问题,您只需将该条件的逻辑否定放在
if
语句的条件中即可。通常,展开是按一个因素进行的;展开的代码仍然存在于循环中(除非已知循环范围非常小)。此外,您需要在循环之外完成工作的“余数”(对应于问题大小除以展开因子的余数)。
因此,循环展开的示例:
可以按 2 倍展开:
其中第二个循环执行“余数”(它还将
i
设置为与展开的循环相同的值) ,但如果此后不需要i
,则整行可以只是if (i < n) etc
对于这种情况)。For the first problem, you need to not "execute" the loop body when the condition is met. For this particular problem, you can just place the logical negation of that condition inside the
if
statement's condition.Normally unrolling is by a factor; the unrolled code still lives in a loop (unless the loop bounds are known to be very small). Furthermore, you'll need to do the "remainder" of the work (corresponding to the remainder of the problem size divided by the unroll factor) outside the loop.
So, an example of loop unrolling:
can be unrolled by a factor of 2 to:
where the second loop does the "remainder" (it also sets
i
to be the same thing as the unrolled loop would have, but ifi
is not needed after this, the whole line can just beif (i < n) etc
for this case).假设 n_n 是编译时常量,则循环可以像这样简单地展开:
本质上, continue 之后的所有内容都进入 if 语句的
else
部分编辑:因为
n_n
不是编译时常量,您仍然可以通过在循环中多次运行循环来展开循环,然后以 switch-case 语句结束。事实上,你可以像这样组合它们,这称为 Duff 的设备。assuming n_n is a compile-time constant, the loop may be trivially unrolled like so:
essentially, everything after the continue goes into the
else
portion of an if statementEdit: since
n_n
is not a compile time constant, you can still unroll the loop by doing several runs through the loop in a loop, and then finish with a switch-case statement. in fact, you can combine them like so, this is called the Duff's device.展开这个循环在这里没有多大帮助。内循环软件展开有助于指令的软件流水线操作,以在运行时实现更高的 IPC。在这里,它可能会通过展开来破坏逻辑。
Unrolling this loop will not help here much. Inner loop software unrolling helps with software pipelining of instructions to achieve higher IPC at run-time. Here it could corrupt the logic by unrolling.