How can I quickly find the largest element of a sum of vectors?

Posted 2024-08-03 23:32:04 | 716 characters | 9 views | 0 comments

I have the following code in the innermost loop of my program:

struct V {
  float val [200]; // 0 <= val[i] <= 1
};

V a[600];
V b[250];
V c[250];
V d[350];
V e[350];

// ... init values in a,b,c,d,e ...

int findmax(int ai, int bi, int ci, int di, int ei) {
  float best_val = 0.0;
  int best_ii = -1;

  for (int ii = 0; ii < 200; ii++) {
    float act_val =
      a[ai].val[ii] +
      b[bi].val[ii] +
      c[ci].val[ii] +
      d[ci].val[ii] +
      e[ci].val[ii];

    if (act_val > best_val) {
      best_val = act_val;
      best_ii = ii;
    }
  }

  return best_ii;
}

I don't care whether the answer is some clever algorithm (that would be the most interesting), some C++ trick, intrinsics, or assembler. I just need to make the findmax function more efficient.

Big thanks in advance.

Edit:
It seems that the branch is the slowest operation (misprediction?).

年少掌心 2024-08-10 23:32:04

This might help a bit if the compiler is having difficulty short cutting the jumps:

int findmax(int ai, int bi, int ci, int di, int ei) {
  float best_val = 0.0;
  int best_ii = -1;

  float* a_it = &a[ai].val[0];
  float* b_it = &b[bi].val[0];
  float* c_it = &c[ci].val[0];
  float* d_it = &d[di].val[0]; // assume typo ci->di
  float* e_it = &e[ei].val[0]; // assume typo ci->ei

  for (int ii = 0; ii < 200; ii++) {
    float act_val = *(a_it++) + *(b_it++) + *(c_it++) + *(d_it++) + *(e_it++);
    // Select the index first: both selects must compare against the previous best_val.
    best_ii  =  (act_val <= best_val) ? best_ii : ii; // becomes _fsel
    best_val =  (act_val <= best_val) ? best_val : act_val; // becomes _fsel
  }

  return best_ii;
}

Generating a sum table might be faster in terms of cache misses:

int findmax(int ai, int bi, int ci, int di, int ei) {
  float best_val = 0.0;
  int best_ii = -1;

  float* its[] = {&a[ai].val[0], &b[bi].val[0], &c[ci].val[0], &d[di].val[0], &e[ei].val[0]};

  V sums;
  for (int ii = 0; ii < 200; ii++) {
    sums.val[ii] = *(its[0]++); // post-increment, so element 0 is not skipped
  }

  for (int iter = 1; iter < 5; ++iter) {
    for (int ii = 0; ii < 200; ii++) {
      sums.val[ii] += *(its[iter]++);
    }
  }

  for (int ii = 0; ii < 200; ii++) {
    // Select the index first: both selects must compare against the previous best_val.
    best_ii  =  (sums.val[ii] <= best_val) ? best_ii : ii; // becomes _fsel
    best_val =  (sums.val[ii] <= best_val) ? best_val : sums.val[ii]; // becomes _fsel
  }
  return best_ii;
}

梨涡少年 2024-08-10 23:32:04

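Hmm, I don't see any obvious room for algorithmic optimization. In theory, one could sum the five vectors only until it is clear that the maximum can no longer be reached, but that would add a lot of overhead to what is just a sum of five numbers. You could try using multiple threads and assigning ranges to them, but with only 200 very short work items you have to account for the thread-creation overhead.

So I tend to say that using assembler with MMX or SSE instructions on x86, or a (machine-specific) C++ library that provides access to these instructions, is your best bet.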
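
The early-exit idea mentioned in this answer can be sketched as follows. This is only an illustration, not a recommendation: the function name findmax_pruned and the rows array are hypothetical, and the sketch assumes the question's guarantee that every value lies in [0, 1], so that a partial sum over k vectors plus (5 - k) bounds the full sum (up to floating-point rounding). As the answer itself expects, the extra comparison may well cost more than it saves, so it would need measuring.

```cpp
#include <cassert>

struct V {
  float val[200];  // 0 <= val[i] <= 1, as stated in the question
};

V a[600], b[250], c[250], d[350], e[350];

// Hypothetical pruned variant: stop adding terms for a column as soon as
// even the most optimistic completion cannot beat the current best.
int findmax_pruned(int ai, int bi, int ci, int di, int ei) {
  assert(ai >= 0 && bi >= 0 && ci >= 0 && di >= 0 && ei >= 0);

  const float* rows[5] = {a[ai].val, b[bi].val, c[ci].val, d[di].val, e[ei].val};
  float best_val = 0.0f;
  int best_ii = -1;

  for (int ii = 0; ii < 200; ++ii) {
    float partial = 0.0f;
    int k = 0;
    // Each remaining term is at most 1, so partial + (5 - k) is an upper bound.
    while (k < 5 && partial + static_cast<float>(5 - k) > best_val) {
      partial += rows[k][ii];
      ++k;
    }
    // Only a fully summed column can become the new best.
    if (k == 5 && partial > best_val) {
      best_val = partial;
      best_ii = ii;
    }
  }
  return best_ii;
}
```

Pruning only starts to skip work once best_val exceeds 1, and it replaces one unpredictable branch with several, which is the overhead the answer warns about.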

轮廓§ 2024-08-10 23:32:04

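I don't see any way to do this without examining every sum, which makes this an O(n) problem. But since your data is laid out linearly, the Intel/AMD MMX or SSE instructions might help. See the following link for Microsoft's implementation of the intrinsics: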

http://msdn.microsoft.com/en-us/library/y0dh78ez(VS.71).aspx
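
As a rough sketch of what the intrinsics route could look like, here is one possible SSE2 version. It is not taken from the thread: the name findmax_sse is hypothetical, it assumes an x86 target with SSE2, and it relies on 200 being a multiple of 4. Each of the four lanes tracks its own running maximum and index with mask selects instead of a branch, and the lanes are combined at the end, preferring the lowest index on ties so the result matches the scalar loop.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics (includes the SSE header as well)

struct V {
  float val[200];  // 0 <= val[i] <= 1, as stated in the question
};

V a[600], b[250], c[250], d[350], e[350];

// Hypothetical SSE2 variant: four sums per iteration, no branch in the hot loop.
int findmax_sse(int ai, int bi, int ci, int di, int ei) {
  assert(ai >= 0 && bi >= 0 && ci >= 0 && di >= 0 && ei >= 0);

  const float* pa = a[ai].val;
  const float* pb = b[bi].val;
  const float* pc = c[ci].val;
  const float* pd = d[di].val;
  const float* pe = e[ei].val;

  __m128 best_val = _mm_setzero_ps();
  __m128i best_idx = _mm_set1_epi32(-1);
  __m128i idx = _mm_set_epi32(3, 2, 1, 0);  // lane k starts at index k
  const __m128i step = _mm_set1_epi32(4);

  for (int ii = 0; ii < 200; ii += 4) {
    // Same addition order as the scalar code, so rounding is identical.
    __m128 sum = _mm_add_ps(_mm_loadu_ps(pa + ii), _mm_loadu_ps(pb + ii));
    sum = _mm_add_ps(sum, _mm_loadu_ps(pc + ii));
    sum = _mm_add_ps(sum, _mm_loadu_ps(pd + ii));
    sum = _mm_add_ps(sum, _mm_loadu_ps(pe + ii));

    // All-ones in every lane where the new sum beats that lane's best.
    __m128i gt = _mm_castps_si128(_mm_cmpgt_ps(sum, best_val));
    best_val = _mm_max_ps(best_val, sum);
    best_idx = _mm_or_si128(_mm_and_si128(gt, idx), _mm_andnot_si128(gt, best_idx));
    idx = _mm_add_epi32(idx, step);
  }

  // Combine the four lanes; ties go to the lowest index, like the scalar loop.
  float vals[4];
  int idxs[4];
  _mm_storeu_ps(vals, best_val);
  _mm_storeu_si128(reinterpret_cast<__m128i*>(idxs), best_idx);

  float bv = 0.0f;
  int best = -1;
  for (int k = 0; k < 4; ++k) {
    if (idxs[k] != -1 && (vals[k] > bv || (vals[k] == bv && idxs[k] < best))) {
      bv = vals[k];
      best = idxs[k];
    }
  }
  return best;
}
```

Unaligned loads are used because nothing in the question guarantees 16-byte alignment of the arrays; if the data can be aligned, _mm_load_ps is the aligned counterpart.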

谁把谁当真 2024-08-10 23:32:04

Unless the compiler optimizes them away for you, computing a[ai], etc., in the loop will cost you some time (however slight), given that they are fixed for the duration of findmax. In light of that, you might try something like:

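#include <limits>

int findmax(int ai, int bi, int ci, int di, int ei) {
    // lowest() is the most negative float; min() would be the smallest positive one.
    float    best_val = std::numeric_limits<float>::lowest();
    int      best_ii = 0;
    // Distinct names: a local called `a` would shadow the global array in its own initializer.
    const V& va = a[ai];
    const V& vb = b[bi];
    const V& vc = c[ci];
    const V& vd = d[di];
    const V& ve = e[ei];

    for (int ii = 0; ii < 200; ++ii) {
        float act_val = va.val[ii] + vb.val[ii] + vc.val[ii] +
                        vd.val[ii] + ve.val[ii];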

        if (act_val > best_val) {
            best_val = act_val;
            best_ii = ii;
        }
    }

    return best_ii;
}

Other means of improving the code might be to alter the way the data is represented, leading to a different (but much faster) findmax algorithm.

永言不败 2024-08-10 23:32:04

Try to iterate over all vectors at once. Here's an example for two vectors:

for (float *ap = a[ai].val, *bp = b[bi].val; ap - a[ai].val < 200; ap++, bp++) {
    float act_val = *ap + *bp;
    // check for max and return if necessary
}
