特征矩阵的矩阵乘法，用于列的子集

发布于 2025-01-26 00:30:32 字数 3201 浏览 3 评论 0原文

eigen :: matrix在一组随机的列索引上的矩阵乘法的最快方法是什么？

Eigen::MatrixXd mat = Eigen::MatrixXd::Random(100, 1000);
// vector of random indices (linspaced here for brevity)
Eigen::VectorXi idx = VectorXi::LinSpaced(8,1000,9);

我正在使用RCPPEIGEN和R，它仍在3.x版本的Eigen上（不支持（）带有索引数组），无论如何，我的理解是（）操作员仍然执行深层副本。

现在，我正在进行深层副本，并在IDX中仅适用于列的数据，

template <typename T>
inline Eigen::Matrix<T, -1, -1> subset_cols(const Eigen::Matrix<T, -1, -1>& x, const std::vector<size_t>& cols) {
    Eigen::Matrix<T, -1, -1> y(x.rows(), cols.size());
    for (size_t i = 0; i < cols.size(); ++i)
        y.col(i) = x.col(cols[i]);
    return y;
}

然后进行矩阵乘法：

Eigen::MatrixXd sub_mat = subset_cols(mat, idx);
Eigen::MatrixXd a = sub_mat * sub_mat.transpose();

a是我想要的。

必须有某种方法可以避免深层复制，而是使用eigen :: Map？

编辑5/9/22： 回复@markus，他建议使用原始数据访问和eigen :: Map提出了一种方法。所提出的解决方案比深拷贝的矩阵乘法慢一点。此处的基准测试是使用RCPP代码和R：

//[[Rcpp::depends(RcppClock)]]
#include <RcppClock.h>

//[[Rcpp::export]]
void bench(Eigen::MatrixXd mat, Eigen::VectorXi idx){
  Rcpp::Clock clock;
  size_t reps = 100;
  while(reps-- > 0){
    clock.tick("copy");
    Eigen::MatrixXd sub_mat = subset_cols(mat, idx);
    Eigen::MatrixXd a = sub_mat * sub_mat.transpose();
    clock.tock("copy");
    
    clock.tick("map");
    double *b_raw = new double[mat.rows() * mat.rows()];
    Eigen::Map<Eigen::MatrixXd> b(b_raw, mat.rows(), mat.rows());
    subset_AAt(b_raw, mat, idx);
    clock.tock("map");
  }
  clock.stop("clock");
}

以下是100列矩阵的三个运行。我们在（1）10列的子集上进行矩阵乘法，（2）1000列的子集和（3）10000列的子集。

R：

bench(
  matrix(runif(100000 * 100), 100, 100000), 
  sample(100000, 10) - 1)

# Unit: microseconds 
# ticker   mean     sd   min    max neval
#    copy  31.65  4.376 30.15  69.46   100
#     map 113.46 21.355 68.54 166.29   100

bench(
  matrix(runif(100000 * 100), 100, 100000), 
  sample(100000, 1000) - 1)

#  Unit: milliseconds 
#  ticker  mean     sd   min   max neval
#    copy 2.361 0.5789 1.972  4.86   100
#     map 9.495 2.4201 7.962 19.90   100

bench(
  matrix(runif(100000 * 100), 100, 100000), 
  sample(100000, 10000) - 1)

#  Unit: milliseconds 
#  ticker   mean     sd    min   max neval
#    copy  23.04  2.774  20.95  42.4   100
#     map 378.14 19.424 351.56 492.0   100

我在几台机器上进行了基准测试，结果相似。以上结果来自良好的HPC节点。

编辑：5/10/2022 这是一个代码片段，该代码片段执行矩阵乘法，以迅速使用不直接使用特征Blas的任何代码：

template <typename T>
Eigen::Matrix<T, -1, -1> subset_AAt(const Eigen::Matrix<T, -1, -1>& A, const Eigen::VectorXi& cols) {
  const size_t n = A.rows();
  Eigen::Matrix<T, -1, -1> AAt(n, n);
  for (size_t k = 0; k < cols.size(); ++k) {
    const T* A_data = A.data() + cols(k) * n;
    for (size_t i = 0; i < n; ++i) {
      T tmp_i = A_data[i];
      for (size_t j = 0; j <= i; ++j) {
        AAt(i * n + j) += tmp_i * A_data[j];
      }
    }
  }
  return AAt;
}

原文

What is the fastest method for matrix multiplication of an Eigen::Matrix over a random set of column indices?

Eigen::MatrixXd mat = Eigen::MatrixXd::Random(100, 1000);
// vector of random indices (linspaced here for brevity)
Eigen::VectorXi idx = VectorXi::LinSpaced(8,1000,9);

I'm using RcppEigen and R, which is still on a 3.x version of Eigen (no support for () with index arrays), and regardless, my understanding is that the () operator still performs a deep copy.

Right now I'm doing a deep copy and generating a new matrix with data only for columns in idx:

template <typename T>
inline Eigen::Matrix<T, -1, -1> subset_cols(const Eigen::Matrix<T, -1, -1>& x, const std::vector<size_t>& cols) {
    Eigen::Matrix<T, -1, -1> y(x.rows(), cols.size());
    for (size_t i = 0; i < cols.size(); ++i)
        y.col(i) = x.col(cols[i]);
    return y;
}

and then doing matrix multiplication:

Eigen::MatrixXd sub_mat = subset_cols(mat, idx);
Eigen::MatrixXd a = sub_mat * sub_mat.transpose();

a is what I want.

There must be some way to avoid a deep copy and instead use Eigen::Map?

Edit 5/9/22:
In reply to @Markus, who proposed an approach using raw data access and Eigen::Map. The proposed solution is a bit slower than matrix multiplication of a deep copy. Benchmarking here is done with Rcpp code and R:

//[[Rcpp::depends(RcppClock)]]
#include <RcppClock.h>

//[[Rcpp::export]]
void bench(Eigen::MatrixXd mat, Eigen::VectorXi idx){
  Rcpp::Clock clock;
  size_t reps = 100;
  while(reps-- > 0){
    clock.tick("copy");
    Eigen::MatrixXd sub_mat = subset_cols(mat, idx);
    Eigen::MatrixXd a = sub_mat * sub_mat.transpose();
    clock.tock("copy");
    
    clock.tick("map");
    double *b_raw = new double[mat.rows() * mat.rows()];
    Eigen::Map<Eigen::MatrixXd> b(b_raw, mat.rows(), mat.rows());
    subset_AAt(b_raw, mat, idx);
    clock.tock("map");
  }
  clock.stop("clock");
}

Here are three runs of a 100,000-column matrix with 100 rows. We are doing matrix multiplication on (1) a subset of 10 columns, (2) a subset of 1000 columns, and (3) a subset of 10000 columns.

bench(
  matrix(runif(100000 * 100), 100, 100000), 
  sample(100000, 10) - 1)

# Unit: microseconds 
# ticker   mean     sd   min    max neval
#    copy  31.65  4.376 30.15  69.46   100
#     map 113.46 21.355 68.54 166.29   100

bench(
  matrix(runif(100000 * 100), 100, 100000), 
  sample(100000, 1000) - 1)

#  Unit: milliseconds 
#  ticker  mean     sd   min   max neval
#    copy 2.361 0.5789 1.972  4.86   100
#     map 9.495 2.4201 7.962 19.90   100

bench(
  matrix(runif(100000 * 100), 100, 100000), 
  sample(100000, 10000) - 1)

#  Unit: milliseconds 
#  ticker   mean     sd    min   max neval
#    copy  23.04  2.774  20.95  42.4   100
#     map 378.14 19.424 351.56 492.0   100

I benchmarked on a few machines with similar results. Above results are from a good HPC node.

Edit: 5/10/2022
Here is a code snippet that performs matrix multiplication for a subset of columns as quickly as any code not directly using the Eigen BLAS:

template <typename T>
Eigen::Matrix<T, -1, -1> subset_AAt(const Eigen::Matrix<T, -1, -1>& A, const Eigen::VectorXi& cols) {
  const size_t n = A.rows();
  Eigen::Matrix<T, -1, -1> AAt(n, n);
  for (size_t k = 0; k < cols.size(); ++k) {
    const T* A_data = A.data() + cols(k) * n;
    for (size_t i = 0; i < n; ++i) {
      T tmp_i = A_data[i];
      for (size_t j = 0; j <= i; ++j) {
        AAt(i * n + j) += tmp_i * A_data[j];
      }
    }
  }
  return AAt;
}

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

筱果果 2025-02-02 00:30:32

利用对称性

您可以利用所得矩阵将是对称的：

Mat sub_mat = subset_cols(mat, idx); // From your original post
Mat a = Mat::Zero(numRows, numRows);
a.selfadjointView<Eigen::Lower>().rankUpdate(sub_mat); // (1)
a.triangularView<Eigen::Upper>() = a.transpose(); // (2)

line （1）将计算a += sub_mat * sub_mat.mat.transpose（）。（2）然后将下部写入上部。另请参见文档（在这里）。当然，如果您只能与下部生活在一起，则可以省略步骤（2）。

对于100x100000矩阵MAT，服用10列时，我的速度约为

1.1倍，
在服用100列时〜1.5倍，
1000列时，我使用1000列时，使用〜1.7倍。

在Windows上使用 MSVC和Linux使用具有完整优化和AVX的Clang。

启用并行化速度

加快计算的方法是启用 noreflow noreferrer“> parallealisization> parallealisization 通过与OpenMP合并。艾根照顾其余的。但是，利用对称性的代码确实从中受益。但是原始代码

Eigen::MatrixXd sub_mat = subset_cols(mat, idx);
Eigen::MatrixXd a = sub_mat * sub_mat.transpose();

确实如此。

的速度约为

对于100x100000矩阵mat，使用clang on Linux上的clang，运行4个线程（在4个真实内核上）并与一个线程进行比较，我在服用时 1.0倍。 10列，即
当服用100列时，〜1.8倍
〜1.8倍〜2.0倍，

换句话说，换句话说，4个或更多的核心优于上面显示的对称方法，除了少量的列。仅使用2个核心总是较慢。请注意，使用 smt 有时会损害我的测试中的性能，有时值得注意。

其他笔记

我已经在评论中写了这一点，但为了完整的目的：
eigen :: map由于步伐是不符号的，因此行不通。使用切片给我比与Clang和Clang和Clang和Linux上的复制方法更好的10％海湾合作委员会，但在MSVC上有些糟糕。另外，正如您指出的那样，它在特征的3.3分支上不可用。有一个自定义方式在我的表演中始终更糟测试。
另外，在我的测试中，与复制方法相比，它没有保存任何内存。

我认为很难击败有关性能的复制方法本身，因为特征矩阵默认情况下，这意味着复制几列相当便宜。此外，在不真正了解细节的情况下，我怀疑eigen可以将其优化的全部功能投入到完整的矩阵上，以计算产品并转置，而无需处理视图或类似的内容。这可能会给特征提供更多的矢量化或缓存位置的机会。

除此之外，不仅应打开优化，还应使用最高的指令集。在我的测试中打开AVX的性能提高了〜1.5倍。不幸的是，我无法测试AVX512。

Exploiting symmetry

You can exploit that the resulting matrix will be symmetric like so:

Mat sub_mat = subset_cols(mat, idx); // From your original post
Mat a = Mat::Zero(numRows, numRows);
a.selfadjointView<Eigen::Lower>().rankUpdate(sub_mat); // (1)
a.triangularView<Eigen::Upper>() = a.transpose(); // (2)

Line (1) will compute a += sub_mat * sub_mat.transpose() for the lower part only. (2) will then write the lower part to the upper part. Also see the documentation (here and here).
Of course, if you can live with only the lower part, step (2) can be omitted.

For a 100x100000 matrix mat, I get a speed up of a factor of roughly

~1.1x when taking 10 columns,
~1.5x when taking 100 columns,
~1.7x when taking 1000 columns

both on Windows using MSVC and on Linux using clang with full optimizations and AVX.

Enabling parallelization

Another way to speed up the computation is to enable parallelization by compiling with OpenMP. Eigen takes care of the rest. The code above that exploits the symmetry does not benefit from it, however. But the original code

Eigen::MatrixXd sub_mat = subset_cols(mat, idx);
Eigen::MatrixXd a = sub_mat * sub_mat.transpose();

does.

For a 100x100000 matrix mat, using clang on Linux, running with 4 threads (on 4 real cores) and comparing to a single thread, I get a speed up of a factor of roughly

~1.0x when taking 10 columns, i.e. no speed up at all
~1.8x when taking 100 columns
~2.0x when taking 1000 columns

In other words, 4 cores or more outperform the symmetric method shown above except for a very small number of columns. Using only 2 cores was always slower. Note that using SMT hurt the performance in my tests, sometimes notably.

Other notes

I already wrote this in the comment, but for the sake of completeness:
Eigen::Map will not work because the strides are non-equidistant. Using slicing gives me ~10% better performance than your copying method on Linux with clang and gcc, but somewhat worse on MSVC. Also, as you noted, it is not available on the 3.3 branch of Eigen. There is a custom way to mimic it, but it performed always worse in my tests.
Also, in my tests, it did not save any memory compared to the copying method.

I think it is hard to beat the copying method itself regarding performance because the Eigen matrices are column major by default, meaning that copying a few columns is rather cheap. Moreover, without really knowing details, I suspect that Eigen can then throw the full might of its optimization on the full matrix to compute the product and transpose without having to deal with views or anything like this. This might give Eigen more chances for vectorization or cache locality.

Apart from this, not only optimizations should be turned on but also the highest possible instruction set should be used. Turning on AVX in my tests improved the performance by ~1.5x. Unfortunately, I cannot test AVX512.

回复收藏 0 原文

眉黛浅 2025-02-02 00:30:32

如果任何人都发现这一点有所帮助，我可以使用OpenMP和三角形索引在接受的问题中击败特征代码的性能。在这种情况下，我正在使用rcpp :: numericMatrix，但是您可以插入eigen :: matrixxd in：通过：

    Rcpp::NumericMatrix Rcpp_AAt(const Rcpp::NumericMatrix& mat) {
    const size_t n = mat.cols();
    const size_t n_vals = n / 2 * (1 + n) - n;
    Rcpp::NumericMatrix res(n, n);
    #pragma omp parallel for
    for (size_t k = 0; k < (n_vals + n); ++k) {
        // k is linear index
        if (k >= n_vals) {
            size_t i = k - n_vals;
            double tmp = 0;
            for (size_t row = 0; row < mat.rows(); ++row)
                tmp += mat(row, i) * mat(row, i);
            res(i, i) = tmp;
        } else {
            size_t i = n - 2 - std::floor(std::sqrt(-8 * k + 4 * n * (n - 1) - 7) / 2.0 - 0.5);
            size_t j = k + i + 1 - n * (n - 1) / 2 + (n - i) * ((n - i) - 1) / 2;
            double tmp = 0;
            for (size_t row = 0; row < mat.rows(); ++row)
                tmp += mat(row, i) * mat(row, j);
            res(i, j) = tmp;
            res(j, i) = tmp;
        }
    }
    return res;
}

通过使用三角形索引，我们允许OpenMP oppawn oppawn Off Off所有组合组合的线程，这比一次在一个列之间平行（出于明显的原因）更有效。 Eigen使用多线程，所以我认为这是公平的游戏。

In case anyone finds this helpful down the road, I was able to beat the performance of the Eigen code in the accepted question using OpenMP and triangular indexing. In this case I'm using Rcpp::NumericMatrix, but you could plug Eigen::MatrixXd right in:

    Rcpp::NumericMatrix Rcpp_AAt(const Rcpp::NumericMatrix& mat) {
    const size_t n = mat.cols();
    const size_t n_vals = n / 2 * (1 + n) - n;
    Rcpp::NumericMatrix res(n, n);
    #pragma omp parallel for
    for (size_t k = 0; k < (n_vals + n); ++k) {
        // k is linear index
        if (k >= n_vals) {
            size_t i = k - n_vals;
            double tmp = 0;
            for (size_t row = 0; row < mat.rows(); ++row)
                tmp += mat(row, i) * mat(row, i);
            res(i, i) = tmp;
        } else {
            size_t i = n - 2 - std::floor(std::sqrt(-8 * k + 4 * n * (n - 1) - 7) / 2.0 - 0.5);
            size_t j = k + i + 1 - n * (n - 1) / 2 + (n - i) * ((n - i) - 1) / 2;
            double tmp = 0;
            for (size_t row = 0; row < mat.rows(); ++row)
                tmp += mat(row, i) * mat(row, j);
            res(i, j) = tmp;
            res(j, i) = tmp;
        }
    }
    return res;
}

By using triangular indexing, we are allowing for OpenMP to spawn off threads for all combinations of columns, which is more efficient than just parallelizing across one column at a time (for obvious reasons). Eigen uses multithreading, so I figure this is fair game.

回复收藏 0 原文

~没有更多了~