How can I quickly find the largest element of a sum of vectors?

Posted 2024-08-03 23:32:04, 716 words, 7 views, 0 comments

I have the following code in the innermost loop of my program:

struct V {
  float val [200]; // 0 <= val[i] <= 1
};

V a[600];
V b[250];
V c[250];
V d[350];
V e[350];

// ... init values in a,b,c,d,e ...

int findmax(int ai, int bi, int ci, int di, int ei) {
  float best_val = 0.0;
  int best_ii = -1;

  for (int ii = 0; ii < 200; ii++) {
    float act_val =
      a[ai].val[ii] +
      b[bi].val[ii] +
      c[ci].val[ii] +
      d[ci].val[ii] +
      e[ci].val[ii];

    if (act_val > best_val) {
      best_val = act_val;
      best_ii = ii;
    }
  }

  return best_ii;
}

I don't care whether the answer is some clever algorithm (that would be the most interesting), some C++ tricks, intrinsics, or assembler. But I need to make the findmax function more efficient.

Big thanks in advance.

Edit:
It seems that the branch is the slowest operation (misprediction?).

Comments (7)

年少掌心 2024-08-10 23:32:04

This might help a bit if the compiler is having difficulty eliminating the jumps:

int findmax(int ai, int bi, int ci, int di, int ei) {
  float best_val = 0.0f;
  int best_ii = -1;

  const float* a_it = &a[ai].val[0];
  const float* b_it = &b[bi].val[0];
  const float* c_it = &c[ci].val[0];
  const float* d_it = &d[di].val[0]; // assume typo ci->di
  const float* e_it = &e[ei].val[0]; // assume typo ci->ei

  for (int ii = 0; ii < 200; ii++) {
    float act_val = *(a_it++) + *(b_it++) + *(c_it++) + *(d_it++) + *(e_it++);
    // Update the index first: both selects must compare against the old best_val,
    // otherwise the second comparison is always true and best_ii never changes.
    best_ii  =  (act_val <= best_val) ? best_ii : ii; // becomes _fsel
    best_val =  (act_val <= best_val) ? best_val : act_val; // becomes _fsel
  }

  return best_ii;
}

Generating a sum table might be faster in terms of cache misses:

int findmax(int ai, int bi, int ci, int di, int ei) {
  float best_val = 0.0f;
  int best_ii = -1;

  // One read pointer per source vector (a, b, c, d, e).
  const float* its[] = {&a[ai].val[0], &b[bi].val[0], &c[ci].val[0], &d[di].val[0], &e[ei].val[0] };

  V sums;
  for (int ii = 0; ii < 200; ii++) {
    sums.val[ii] = *(its[0]++); // post-increment so that element 0 is not skipped
  }

  for (int iter = 1; iter < 5; ++iter) {
    for (int ii = 0; ii < 200; ii++) {
      sums.val[ii] += *(its[iter]++);
    }
  }

  for (int ii = 0; ii < 200; ii++) {
    // Update the index before the value so both selects see the old best_val.
    best_ii  =  (sums.val[ii] <= best_val) ? best_ii : ii; // becomes _fsel
    best_val =  (sums.val[ii] <= best_val) ? best_val : sums.val[ii]; // becomes _fsel
  }
  return best_ii;
}

梨涡少年 2024-08-10 23:32:04

Well, I see no obvious room for algorithmic optimizations. Theoretically, one could compute the sum of the five vectors only until it is obvious that the maximum cannot be reached, but this would add far too much overhead for summing just five numbers. You could try using multiple threads and assigning ranges to them, but you have to think about the thread creation overhead when you have only 200 very short work items.
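To make the early-exit idea concrete, here is a minimal sketch. It assumes the bound 0 <= val[i] <= 1 from the question and that ci, di and ei index c, d and e respectively; the name findmax_pruned is hypothetical. After adding the first two terms, the remaining three can contribute at most 3, so an element whose upper bound does not exceed the current best can be skipped.

```cpp
#include <cassert>

struct V {
  float val[200];  // 0 <= val[i] <= 1
};

V a[600], b[250], c[250], d[350], e[350];

// Hypothetical pruned variant: skips the last three loads when the
// element cannot beat the current best even in the most favourable case.
int findmax_pruned(int ai, int bi, int ci, int di, int ei) {
  assert(ai >= 0 && ai < 600 && bi >= 0 && bi < 250 && ci >= 0 && ci < 250);
  assert(di >= 0 && di < 350 && ei >= 0 && ei < 350);

  const float* pa = a[ai].val;
  const float* pb = b[bi].val;
  const float* pc = c[ci].val;
  const float* pd = d[di].val;
  const float* pe = e[ei].val;

  float best_val = 0.0f;
  int best_ii = -1;

  for (int ii = 0; ii < 200; ++ii) {
    const float partial = pa[ii] + pb[ii];
    // Upper bound on the full sum. It is added in the same order as the
    // real sum so that rounding cannot make the real sum exceed the bound.
    const float bound = ((partial + 1.0f) + 1.0f) + 1.0f;
    if (bound <= best_val) continue;  // cannot become the new maximum

    const float act_val = ((partial + pc[ii]) + pd[ii]) + pe[ii];
    if (act_val > best_val) {
      best_val = act_val;
      best_ii = ii;
    }
  }
  return best_ii;
}
```

Note that this adds a second data-dependent branch per element, so on data where few elements are skipped it may well be slower than the plain loop; measuring on real inputs is the only way to know.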

So I tend to say that using assembler with MMX or SSE instructions on x86, or maybe a (machine-specific) C++ library providing access to these instructions, is your best bet.
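As an illustration of the SSE route, here is a rough sketch using the compiler intrinsics from <xmmintrin.h>. It is x86-only, assumes ci, di and ei index c, d and e respectively, and the name findmax_sse is hypothetical. It sums four elements at a time, tracks the running maximum with _mm_max_ps (no branch in the hot loop), and then does one short scalar scan to recover the index.

```cpp
#include <algorithm>
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

struct V {
  float val[200];  // 0 <= val[i] <= 1
};

V a[600], b[250], c[250], d[350], e[350];

// Hypothetical SSE variant of findmax; returns the first index of the
// largest sum, or -1 if no sum is greater than zero.
int findmax_sse(int ai, int bi, int ci, int di, int ei) {
  assert(ai >= 0 && ai < 600 && bi >= 0 && bi < 250 && ci >= 0 && ci < 250);
  assert(di >= 0 && di < 350 && ei >= 0 && ei < 350);

  const float* pa = a[ai].val;
  const float* pb = b[bi].val;
  const float* pc = c[ci].val;
  const float* pd = d[di].val;
  const float* pe = e[ei].val;

  float sums[200];
  __m128 vmax = _mm_setzero_ps();

  // 200 is a multiple of 4, so no scalar tail is needed.
  for (int ii = 0; ii < 200; ii += 4) {
    // Unaligned loads: the alignment of the arrays is not known here.
    __m128 s = _mm_add_ps(_mm_loadu_ps(pa + ii), _mm_loadu_ps(pb + ii));
    s = _mm_add_ps(s, _mm_loadu_ps(pc + ii));
    s = _mm_add_ps(s, _mm_loadu_ps(pd + ii));
    s = _mm_add_ps(s, _mm_loadu_ps(pe + ii));
    _mm_storeu_ps(sums + ii, s);
    vmax = _mm_max_ps(vmax, s);  // per-lane running maximum
  }

  // Reduce the four lanes to a single maximum.
  float lanes[4];
  _mm_storeu_ps(lanes, vmax);
  const float best_val =
      std::max(std::max(lanes[0], lanes[1]), std::max(lanes[2], lanes[3]));
  if (best_val <= 0.0f) return -1;

  // Second pass: first index whose sum equals the maximum.
  for (int ii = 0; ii < 200; ++ii) {
    if (sums[ii] == best_val) return ii;
  }
  return -1;  // not reached for non-NaN input
}
```

The two-pass structure trades a small temporary buffer for a branch-free main loop; tracking the index inside the vector loop is also possible but needs integer blends and is harder to get right.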

轮廓§ 2024-08-10 23:32:04

I don't see any way to do this without examining each sum, making this an O(n) problem. But since your data are laid out linearly, the Intel/AMD MMX or SSE instructions might help. See this link for Microsoft's implementation of intrinsics:

http://msdn.microsoft.com/en-us/library/y0dh78ez(VS.71).aspx

谁把谁当真 2024-08-10 23:32:04

Unless the compiler optimizes them out for you, computing a[ai], etc., in the loop will cost you some time (however slight), given that they are fixed for the duration of findmax. In light of that, you might try something like:

#include <limits>

int findmax(int ai, int bi, int ci, int di, int ei) {
    // lowest() (C++11) is the most negative float; min() is the smallest
    // positive one and would not work as a "minus infinity" start value.
    float    best_val = std::numeric_limits<float>::lowest();
    int      best_ii = 0;
    // Distinct local names: "const V& a(a[ai])" would shadow the global
    // array inside its own initializer and fail to compile.
    const V& va = a[ai];
    const V& vb = b[bi];
    const V& vc = c[ci];
    const V& vd = d[di];
    const V& ve = e[ei];

    for (int ii = 0; ii < 200; ++ii) {
        float act_val = va.val[ii] + vb.val[ii] + vc.val[ii] +
                        vd.val[ii] + ve.val[ii];

        if (act_val > best_val) {
            best_val = act_val;
            best_ii = ii;
        }
    }

    return best_ii;
}

Other means of improving the code might be to alter the way the data is represented, leading to a different (but much faster) findmax algorithm.

永言不败 2024-08-10 23:32:04

Try iterating over all the vectors at once. Here's an example for two vectors:

for (float *ap = a[ai].val, *bp = b[bi].val; ap - a[ai].val < 200; ap++, bp ++) {
    float act_val = *ap + *bp;
    // check for max and return if necessary
}
恋你朝朝暮暮 2024-08-10 23:32:04

Take a look at loop unwinding (and Duff's device for a specific, but far more complicated, example). Those are the only real algorithm optimizations I can come up with.

Loop_unwinding

Duff's_device
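A hand-unrolled version of the loop might look like the sketch below. It assumes ci, di and ei index c, d and e respectively; findmax_unrolled and consider are hypothetical names. Each of the four slots keeps its own running maximum, so the four comparisons per iteration do not depend on each other, and the slots are merged at the end with ties resolved toward the smaller index to match the original loop.

```cpp
#include <cassert>

struct V {
  float val[200];  // 0 <= val[i] <= 1
};

V a[600], b[250], c[250], d[350], e[350];

// Hypothetical helper: keep the larger value together with its index.
static inline void consider(float v, int ii, float& best_val, int& best_ii) {
  if (v > best_val) {
    best_val = v;
    best_ii = ii;
  }
}

int findmax_unrolled(int ai, int bi, int ci, int di, int ei) {
  assert(ai >= 0 && ai < 600 && bi >= 0 && bi < 250 && ci >= 0 && ci < 250);
  assert(di >= 0 && di < 350 && ei >= 0 && ei < 350);

  const float* pa = a[ai].val;
  const float* pb = b[bi].val;
  const float* pc = c[ci].val;
  const float* pd = d[di].val;
  const float* pe = e[ei].val;
  auto sum = [&](int i) { return pa[i] + pb[i] + pc[i] + pd[i] + pe[i]; };

  // One independent (value, index) pair per unrolled slot.
  float bv[4] = {0.0f, 0.0f, 0.0f, 0.0f};
  int bx[4] = {-1, -1, -1, -1};

  // 200 is a multiple of 4, so the unrolled body covers every element.
  for (int ii = 0; ii < 200; ii += 4) {
    consider(sum(ii + 0), ii + 0, bv[0], bx[0]);
    consider(sum(ii + 1), ii + 1, bv[1], bx[1]);
    consider(sum(ii + 2), ii + 2, bv[2], bx[2]);
    consider(sum(ii + 3), ii + 3, bv[3], bx[3]);
  }

  // Merge the four slots; on equal values prefer the smaller index.
  float best_val = 0.0f;
  int best_ii = -1;
  for (int k = 0; k < 4; ++k) {
    if (bv[k] > best_val ||
        (bv[k] == best_val && bx[k] >= 0 && bx[k] < best_ii)) {
      best_val = bv[k];
      best_ii = bx[k];
    }
  }
  return best_ii;
}
```

An optimizing compiler may already unroll the original loop on its own, so whether this helps depends on the compiler and flags in use.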

泡沫很甜 2024-08-10 23:32:04

You can't really get that much faster than that without additional information about the data (values) stored in a, b, c, d, and e. You have to inspect every sum to determine which one is the greatest.

It gets a little worse for Nth-element queries, but fortunately you didn't ask about that.
