高效的手动循环展开

发布于 2024-10-04 15:18:37 字数 684 浏览 2 评论 0原文

我有这个 C 代码:

for (k = 0; k < n_n; k++) {
    if (k == i || k == j) continue;
    dd=q2_vect[k]-q1_vect;
    d2=dd*dd;
    if (d2<0) {
        a=1;
        break;
    }       
}  

出于编译器优化原因(在 Cell 处理器的 SPE 上),我需要手动取消循环,所以我尝试了:

dd=q2_vect[0]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;

dd=q2_vect[1]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;

dd=q2_vect[2]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;

.....
.....

// end
goto notdone;

done: 
ok=0;

notdone:
.....

但我不知道如何处理

if (k == i || k == j) continue;

和 与 lopp 依赖的事实每次在“n_n”上运行时,我应该手动编写代码多次,以达到最大值“n_n”。

您认为如何解决?

I have this C code:

for (k = 0; k < n_n; k++) {
    if (k == i || k == j) continue;
    dd=q2_vect[k]-q1_vect;
    d2=dd*dd;
    if (d2<0) {
        a=1;
        break;
    }       
}  

For compiler optimization reasons (on the SPE of the Cell processor), I need to unloop this by hand, so I tried:

dd=q2_vect[0]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;

dd=q2_vect[1]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;

dd=q2_vect[2]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;

.....
.....

// end
goto notdone;

done: 
ok=0;

notdone:
.....

but I do not know how to deal with the

if (k == i || k == j) continue;

and with the fact that the lopp depends on each run on "n_n", and by hand I should write the code so many times as the maximal value "n_n" would get.

How do you think it can be fixed?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

攒眉千度 2024-10-11 15:18:37

你确定所写的代码是正确的吗?如果 dd 是有符号整数类型,则当前代码具有未定义的行为,并且如果 d2 为无符号或 if dd 则永远不会满足 if 中的条件> 和 d2 是浮点类型。看起来您正在对除 ij 之外的第一个索引 k 进行损坏的搜索,其中对表达式 q2_vect[ 进行平方k]-q1_vect 溢出。

至于有效地跳过 ij 迭代,我会只查看展开的“循环”停止的位置,并在 k+1 处重新启动它如果k等于ij。这是假设循环中的代码没有副作用/运行总计,这与所写的一样,但我希望您可能想让代码执行其他操作(例如对平方求和)。

最后,当您似乎根本没有工作代码时,我高度怀疑您希望手动展开循环。任何好的编译器都可以为您展开循环,但通常您想要执行的循环展开类型会使性能变得更糟而不是更好。我认为你最好先让你的代码正常工作,然后测量(并查看编译器生成的 asm),并且只有在你确定存在问题之后才尝试改进。

Are you sure the code as written is correct? The current code has undefined behavior if dd is a signed integer type, and the condition in the if is never satisfied if d2 is unsigned or if dd and d2 are floating point types. It looks like you're doing a broken search for the first index k other than i or j where squaring the expression q2_vect[ k]-q1_vect overflows.

As for efficiently skipping the i and j iterations, I would instead just look at where the unrolled "loop" stopped, and restart it at k+1 if k was equal to i or j. This is assuming the code in your loop has no side effects/running total, which is true as written, but I expect you might have meant for the code to do something else (like summing the squares).

Finally, I am highly skeptical of your wish to unroll the loop manually when you don't even seem to have working code to begin with. Any good compiler can unroll the loop for you, but often the type of loop unrolling you're looking to do makes performance worse rather than better. I think you'd do better getting your code to work correctly first, then measuring (and looking at the compiler-generated asm), and only trying to improve on that after you've determined there's a problem.

风流物 2024-10-11 15:18:37

编写的这段代码相当不适合 SPE,因为它的分支太多。此外,有关所涉及变量类型的信息也会有所帮助;编写的测试看起来相当晦涩(即使使用 >0 修复),但代码看起来可能是使用某种重载 operator - 的向量类的 C++表示两个向量的向量减法和运算符 * 来计算点积。

在 SPE 上处理此类简单循环的第一件事是让它们无分支(至少是内部循环;即展开几次并仅在每 N 次迭代时检查是否提前退出)并使用 SIMD 指令:SPE 只有 SIMD 指令,因此不在循环中使用 SIMD 处理会立即浪费 75% 的可用寄存器空间和计算能力。同样,SPE 一次只能加载对齐的 qword(16 字节),使用较小的数据类型需要额外的工作来打乱寄存器的内容,以便您尝试加载的值最终出现在“首选槽”中。

您可以通过使用以下无分支形式重写循环的第一部分来摆脱 if (k == i || k == j) (这是伪代码。它立即适用于整数,但您需要使用内在函数来获取浮点上的按位运算):

dd = q2_vect[k] - q1_vect;
d2 = dd * dd;
d2 &= ~(cmp_equal(k, i) | cmp_equal(k, j));

这里,cmp_equal 对应于相应的 SPE 内在函数(语义:cmp_equal(a,b) == (a = = b) ? ~0u : 0 )。当k == ik == j时,这会强制d2为零。

要避免内部循环中的 if (d2 > 0) 分支,请执行以下操作:

a |= cmp_greater(d2, 0);

并且仅每隔几个循环检查 a 是否为非零(提前退出)迭代。如果为 d2 计算的所有值都是非负的(如果您的类型是整数、浮点数或实值向量类,则会出现这种情况),您可以进一步简化。只要这样做:

a |= d2;

最后,只有当所有单独的项都非零时,a 才会非零。但要小心整数溢出(如果您使用的是整数)和 NaN(如果您使用的是浮点数)。如果您必须处理这些情况,上述简化将破坏代码。

This code as written is fairly unsuitable for SPEs since it's so branch-heavy. Also, information on the types of the variables involved would help; the test as written seems fairly obscure (even with the >0 fix), but the code looks like it might be C++ using some sort of vector class that overloads operator - to mean vector subtraction and operator * of two vectors to compute a dot product.

The first thing to do with such simple loops on SPEs is to get them branch-free (at least the inner loop; i.e. unroll a couple of times and only check for early exit every N iterations) and use SIMD instructions: SPEs only have SIMD instructions, so not using SIMD processing in your loops instantly wastes 75% of your available register space and computational power. Similarly, SPEs can only load aligned qwords (16 bytes) at a time, using smaller data types requires extra work to shuffle the contents of registers around so that the value you're trying to load ends up in the "preferred slot".

You get rid of the if (k == i || k == j) by rewriting the first part of the loop using the following branch-free form (this is pseudocode. It's immediately applicable for ints, but you'll need to use intrinsics to get bitwise ops on floats):

dd = q2_vect[k] - q1_vect;
d2 = dd * dd;
d2 &= ~(cmp_equal(k, i) | cmp_equal(k, j));

Here, cmp_equal corresponds to the respective SPE intrinsics (semantics: cmp_equal(a,b) == (a == b) ? ~0u : 0). This forces d2 to zero when k == i or k == j.

To avoid the if (d2 > 0) branch in the inner loop, do the following:

a |= cmp_greater(d2, 0);

and only check if a is nonzero (to early-out) every few loop iterations. If all values computed for d2 are nonnegative (will be the case if your type is ints, floats or a real-valued vector class), you can simplify this further. Just do:

a |= d2;

In the end, a will only be nonzero if all of the individual terms were nonzero. But be careful with integer overflows (if you're using ints) and NaNs (if you're using floats). If you have to handle these cases, the above simplification will break the code.

谎言月老 2024-10-11 15:18:37

通常循环展开意味着使循环包含几次迭代,从而减少运行次数。例如,

for(i=0;i<count;i++) {
    printf("%d", i);
}

可以展开为

i=0;
if(count%2==1) {
    printf("%d", i);
    i=1;
}
while(i<count) {
    printf("%d", i);
    printf("%d", i+1);
    i+=2;
}

Usually loop unrolling means making the loop contain a few iterations, such that it is run fewer times. For example,

for(i=0;i<count;i++) {
    printf("%d", i);
}

could be unrolled to

i=0;
if(count%2==1) {
    printf("%d", i);
    i=1;
}
while(i<count) {
    printf("%d", i);
    printf("%d", i+1);
    i+=2;
}
稀香 2024-10-11 15:18:37

对于第一个问题,当条件满足时,您需要不“执行”循环体。对于这个特定问题,您只需将该条件的逻辑否定放在 if 语句的条件中即可。

通常,展开是按一个因素进行的;展开的代码仍然存在于循环中(除非已知循环范围非常小)。此外,您需要在循环之外完成工作的“余数”(对应于问题大小除以展开因子的余数)。

因此,循环展开的示例:

for (i = 0; i < n; ++i) do_something(i);

可以按 2 倍展开:

for (i = 0; i < n-1; i += 2) { do_something(i); do_something(i+1); }
for (; i < n; ++i) do_something(i);

其中第二个循环执行“余数”(它还将 i 设置为与展开的循环相同的值) ,但如果此后不需要 i ,则整行可以只是 if (i < n) etc 对于这种情况)。

For the first problem, you need to not "execute" the loop body when the condition is met. For this particular problem, you can just place the logical negation of that condition inside the if statement's condition.

Normally unrolling is by a factor; the unrolled code still lives in a loop (unless the loop bounds are known to be very small). Furthermore, you'll need to do the "remainder" of the work (corresponding to the remainder of the problem size divided by the unroll factor) outside the loop.

So, an example of loop unrolling:

for (i = 0; i < n; ++i) do_something(i);

can be unrolled by a factor of 2 to:

for (i = 0; i < n-1; i += 2) { do_something(i); do_something(i+1); }
for (; i < n; ++i) do_something(i);

where the second loop does the "remainder" (it also sets i to be the same thing as the unrolled loop would have, but if i is not needed after this, the whole line can just be if (i < n) etc for this case).

与风相奔跑 2024-10-11 15:18:37

假设 n_n 是编译时常量,则循环可以像这样简单地展开:

do
{ 
  k=0
  if (k == i || k == j) 
    ;
  else
  {
    dd=q2_vect[ k]-q1_vect;
    d2=dd*dd;
    if (d2<0)
    {
      a=1;
      break;
    }
  }

  k=1
  if (k == i || k == j) 
    ;
  else
  {
    dd=q2_vect[ k]-q1_vect;
    d2=dd*dd;
    if (d2<0)
    {
      a=1;
      break;
    }
  }

  /* and so on, n_n times */

  k= n_n-1
  if (k == i || k == j) 
    ;
  else
  {
    dd=q2_vect[ k]-q1_vect;
    d2=dd*dd;
    if (d2<0)
    {
      a=1;
      break;
    }
  }

} while (0);

本质上, continue 之后的所有内容都进入 if 语句的 else 部分

编辑:因为 n_n 不是编译时常量,您仍然可以通过在循环中多次运行循环来展开循环,然后以 switch-case 语句结束。事实上,你可以像这样组合它们,这称为 Duff 的设备。

#define LOOP_BODY              \
do{                            \  
  if (k == i || k == j)        \
    ;                          \
  else                         \
  {                            \
    dd=q2_vect[ k]-q1_vect;    \
    d2=dd*dd;                  \
    if (d2<0)                  \
    {                          \
      a=1;                     \
      break;                   \
    }                          \
  } while (0)          


k = 0;
switch(n_n % 8)
{
  case 0: for (; k < n_n; k++) { LOOP_BODY; k++; 
  case 7:                        LOOP_BODY; k++;
  case 6:                        LOOP_BODY; k++;
  case 5:                        LOOP_BODY; k++;
  case 4:                        LOOP_BODY; k++;
  case 3:                        LOOP_BODY; k++;
  case 2:                        LOOP_BODY; k++;    
  case 1:                        LOOP_BODY; k++;}
}

assuming n_n is a compile-time constant, the loop may be trivially unrolled like so:

do
{ 
  k=0
  if (k == i || k == j) 
    ;
  else
  {
    dd=q2_vect[ k]-q1_vect;
    d2=dd*dd;
    if (d2<0)
    {
      a=1;
      break;
    }
  }

  k=1
  if (k == i || k == j) 
    ;
  else
  {
    dd=q2_vect[ k]-q1_vect;
    d2=dd*dd;
    if (d2<0)
    {
      a=1;
      break;
    }
  }

  /* and so on, n_n times */

  k= n_n-1
  if (k == i || k == j) 
    ;
  else
  {
    dd=q2_vect[ k]-q1_vect;
    d2=dd*dd;
    if (d2<0)
    {
      a=1;
      break;
    }
  }

} while (0);

essentially, everything after the continue goes into the else portion of an if statement

Edit: since n_n is not a compile time constant, you can still unroll the loop by doing several runs through the loop in a loop, and then finish with a switch-case statement. in fact, you can combine them like so, this is called the Duff's device.

#define LOOP_BODY              \
do{                            \  
  if (k == i || k == j)        \
    ;                          \
  else                         \
  {                            \
    dd=q2_vect[ k]-q1_vect;    \
    d2=dd*dd;                  \
    if (d2<0)                  \
    {                          \
      a=1;                     \
      break;                   \
    }                          \
  } while (0)          


k = 0;
switch(n_n % 8)
{
  case 0: for (; k < n_n; k++) { LOOP_BODY; k++; 
  case 7:                        LOOP_BODY; k++;
  case 6:                        LOOP_BODY; k++;
  case 5:                        LOOP_BODY; k++;
  case 4:                        LOOP_BODY; k++;
  case 3:                        LOOP_BODY; k++;
  case 2:                        LOOP_BODY; k++;    
  case 1:                        LOOP_BODY; k++;}
}
偷得浮生 2024-10-11 15:18:37

展开这个循环在这里没有多大帮助。内循环软件展开有助于指令的软件流水线操作,以在运行时实现更高的 IPC。在这里,它可能会通过展开来破坏逻辑。

Unrolling this loop will not help here much. Inner loop software unrolling helps with software pipelining of instructions to achieve higher IPC at run-time. Here it could corrupt the logic by unrolling.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文