STL 算法中跨函数边界的循环重新排序

发布于 2024-10-08 01:10:44 字数 667 浏览 2 评论 0原文

为了简单起见,我们假设我有一个由 N 矩阵组成的向量,每行 M 行。我正在使用 STL std::accumulate 来计算所有矩阵的总和。我传递一个二元函子,它接受两个矩阵(通过引用)并返回它们的总和(通过引用)。完全披露:我正在使用 libstdc++ 并行模式。在函子内部,我逐个循环行以计算总和。

尽管每个矩阵都太大而无法放入缓存,但一行却非常适合。因此,对循环进行重新排序是有利的,以便外部循环在 M 行上索引,内部循环在 N 矩阵上索引。除了定义内联函子之外,我还能做些什么来鼓励这种跨函数边界循环重新排序。我当然可以重构代码,但我理想地希望保留使用 STL 算法所提供的简单结构。如果有一些特定于 gcc 的东西,我也不介意。

我实际上并没有处理矩阵,这只是一个例子,但同样的问题结构也适用。主要问题是性能问题。解释实际场景会太麻烦,但核心问题是:STL 的累积需要在嵌套循环之间进行排序,这对缓存不太友好,因为它试图在移动到下一个对象之前完成两个对象的添加。单个对象太大而无法保存在缓存中,但它的一部分可以。因此,如果一次计算一个“部分”的“加法”(针对所有对象),则可以加快执行速度。手动重新排序循环可以显着提高 FLOPS。但我理想地希望编译器进行重新排序,以便我可以在 STL 级别进行编码(尽可能)。所以我正在寻找技巧来做到这一点。

For simplicity lets assume that I have a vector of N matrices each of M rows. I am using STL std::accumulate to compute the sum of all the matrices. I pass a binary functor that accepts two matrices (by reference) and returns their sum (by reference). Full disclosure: I am using libstdc++ parallel mode. Inside the functor I loop over the rows individually to compute the sum.

Though each matrix is too large to fit in the cache, a row fits in very nicely. So it would be advantageous to re-order the loops so that the outer loop indexes over the M rows and the inner one over the N matrices. In addition to defining the functor inline, is there anything else I can do to encourage such a cross-function-boundary loop re-ordering. I can of course restructure the code, but I would ideally like to keep the simple structure that the use of STL algorithms afford. If there is something that is gcc specific I wouldnt mind that either.

I am not actually dealing with matrices, that was just an example, but the same problem structure applies. The main issue is that of performance. Explaining the actual scenario would be too cumbersome, but the core problem is this: STL's accumulate entails an ordering among the nested loops that isnt very cache friendly because it tries to complete the addition of two objects before moving on to the next object. A single object is too big to be held in cache, but parts of it can be. So the execution can be sped up if one computes the 'additions' one 'part' at a time (over all objects). Hand reordering the loops leads to substantial improvement in FLOPS. But I would ideally like the compiler to do the re-ordering so that I can code at the STL level (as far as possible). So I am looking for tricks to do this.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

娜些时光,永不杰束 2024-10-15 01:10:44
class Matrix;
class Row;
struct SumNRow {
  int _rowidx;
//  Row _tempRow; //For return by reference left out for simplicity
  SumNRow(int iRowIdx): _rowIdx(iRowIdx) {}
  Row operator(const Matrix & iMarix1, const Matrix iMatrix2) {
    return iMarix1[_rowIdx] + iMatrix2[_rowIdx];
  }
};

template<class MatrixIterator>
void sum(const MatrixIterator & iMarixStart, const MatrixIterator & iMatrixEnd, Matrix & oMarix) {
  for (int i = 0; i < iMarixStart->rowCount(); ++i) {
    oMarix[i]=std::accumulate(iMarixStart, iMatrixEnd, SumNRow(i));
  }
}
class Matrix;
class Row;
struct SumNRow {
  int _rowidx;
//  Row _tempRow; //For return by reference left out for simplicity
  SumNRow(int iRowIdx): _rowIdx(iRowIdx) {}
  Row operator(const Matrix & iMarix1, const Matrix iMatrix2) {
    return iMarix1[_rowIdx] + iMatrix2[_rowIdx];
  }
};

template<class MatrixIterator>
void sum(const MatrixIterator & iMarixStart, const MatrixIterator & iMatrixEnd, Matrix & oMarix) {
  for (int i = 0; i < iMarixStart->rowCount(); ++i) {
    oMarix[i]=std::accumulate(iMarixStart, iMatrixEnd, SumNRow(i));
  }
}
旧故 2024-10-15 01:10:44

我无法想象编译器会解决这个问题,除非所有内容都是内联的并且 M 和 N 是常量。即便如此,这也将是一个漫长的过程。

为了保持 STL 算法风格,请在累加上使用 foreach M 并让函子对一行求和。

I can't imagine the compiler figuring this out unless everything was inlined and M and N were constant. Even then, it would be a stretch.

To keep STL algorithmic style, use a foreach M over the accumulate and have the functor just sum a row.

薄暮涼年 2024-10-15 01:10:44

编写新算法,或将内容包装在 for 循环或 std::for_each() 调用中。这比寻找适应 std::accumulate() 的方法要容易得多。我认为这里唯一的其他选择是向库引入一个新的抽象级别,超越迭代器。编写一个新算法或引入一个额外的循环会更容易。

Write a new algorithm, or wrap things in a for-loop or a std::for_each() call. It's going to be much easier than finding ways to adapt std::accumulate(). I think that the only other alternative here is to introduce a new abstraction level to the library, that is beyond iterators. It's easier to just write a new algorithm or introduce an extra loop.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文