Intel MKL for matrix multiplication when rows are not contiguous in memory

Posted on 2025-01-26 18:32:58

Our hardware is an Intel Xeon Phi, so we are encouraged to get the most out of it by replacing hand-written linear algebra operations (e.g. square matrix multiplication) with Intel MKL.

The question is what the correct MKL usage is for this case, since our matrices' rows not being contiguous in memory may rule out certain functions, e.g. cblas_dgemm.

Is this a use-case for sparse BLAS?

Example of a matrix with non-contiguous rows:

#include <cstdlib>   // for malloc/free
#include <iostream>

int main()
{

        // construct this matrix:
        //
        // ( 1 2 3 )
        // ( 4 5 6 )

        const int NCOLS = 3;

        // allocate two perhaps-not-contiguous blocks of memory
        double *row1 = (double*)malloc(NCOLS * sizeof(double));
        double *row2 = (double*)malloc(NCOLS * sizeof(double));

        // initialize them to the desired values, i.e.
        row1[0] = 1;
        row1[1] = 2;
        row1[2] = 3;
        row2[0] = 4;
        row2[1] = 5;
        row2[2] = 6;

        // allocate a block of two memory elements the size of a pointer
        double **matrix = (double**)malloc(2 * sizeof(double*));

        // set them to point to the two (perhaps-not-contiguous) previous blocks
        matrix[0] = &row1[0];
        matrix[1] = &row2[0];

        // print
        for (auto j=0; j<2; j++)
        {
                for (auto i=0; i<3; i++)
                {
                        std::cout << matrix[j][i] << ",";
                }
                std::cout << "\n";
        }
}

Comments (1)

月竹挽风 2025-02-02 18:32:58

Dense BLAS operations can work on matrices with a given fixed stride, but in your case the stride is not constant. Sparse BLAS routines are meant for matrices containing a lot of zeros, which is apparently not your case (at least not in the provided example).

Since your matrices are huge in practice (20k x 20k), the fastest solution is to copy the matrix rows into one big contiguous array. Indeed, dense BLAS operations would not be efficient on arrays of arrays even if they were supported (which is AFAIK not the case anyway). This is because arrays of arrays are generally not stored efficiently in cache (if you are lucky, the rows may happen to be allocated nearly contiguously) and they require more complex indexing.
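As a rough illustration, here is a minimal sketch of that packing approach, assuming square N x N matrices stored as arrays of row pointers like in your example; pack_rows and multiply are illustrative names, and the call assumes MKL's standard CBLAS interface (mkl.h):

#include <cstring>
#include <vector>
#include <mkl.h>   // assumed: MKL's CBLAS interface (cblas_dgemm)

// Copy an array-of-row-pointers matrix into one contiguous row-major buffer.
std::vector<double> pack_rows(double **rows, int nrows, int ncols)
{
        std::vector<double> buf((size_t)nrows * ncols);
        for (int i = 0; i < nrows; i++)
                std::memcpy(buf.data() + (size_t)i * ncols, rows[i], ncols * sizeof(double));
        return buf;
}

// C = A * B for square N x N matrices held as arrays of row pointers.
void multiply(double **A, double **B, double **C, int N)
{
        std::vector<double> a = pack_rows(A, N, N);
        std::vector<double> b = pack_rows(B, N, N);
        std::vector<double> c((size_t)N * N);

        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N,
                    1.0, a.data(), N, b.data(), N,
                    0.0, c.data(), N);

        // scatter the contiguous result back into the row-pointer layout
        for (int i = 0; i < N; i++)
                std::memcpy(C[i], c.data() + (size_t)i * N, N * sizeof(double));
}

The packing and scattering are the O(n**2) copies discussed below, while the DGEMM call does the O(n**3) work.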

For example, on an x86-64 PC like mine with ~40 GB/s of RAM bandwidth and an i5-9600KF processor capable of ~300 GFlops on such matrices, an O(n**3) matrix multiplication would require about 2 * 20_000**3 / 300e9 = 53.3 seconds. Meanwhile an optimal copy of one matrix would require about 2 * 8 * 20_000**2 / 40e9 = 0.16 seconds. Thus, the time of the copy is negligible compared to the actual matrix multiplication time. That being said, three copies are certainly needed (the two input matrices and the output one). Additionally, fast BLAS implementations use the Strassen algorithm, which has a better asymptotic complexity and should be about 2~3x faster in this case. Still, the time of all the copies (~0.5 s) is very small compared to the actual matrix multiplication time (>17.8 s).
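For reference, a small sketch reproducing that back-of-the-envelope arithmetic (the bandwidth and GFlops figures are the estimates above, not measured values):

#include <cstdio>

int main()
{
        const double n     = 20000;   // matrix dimension from the question's use case
        const double flops = 300e9;   // ~300 GFlop/s (i5-9600KF estimate above)
        const double bw    = 40e9;    // ~40 GB/s RAM bandwidth

        // naive O(n**3) multiply needs ~2*n^3 flops; one copy reads and writes n^2 doubles
        std::printf("matmul: %.1f s\n", 2.0 * n * n * n / flops);   // ~53.3 s
        std::printf("copy  : %.2f s\n", 2.0 * 8.0 * n * n / bw);    // ~0.16 s
        return 0;
}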

On a KNL Xeon Phi, the MCDRAM reaches a throughput of ~400 GB/s and the main RAM reaches ~90 GB/s, while the processor can reach 3 TFlops. Thus, if you store the matrices in MCDRAM, the result should be 1.8~5.3 seconds for the matrix multiplication and 0.05 seconds for all the copies. If the matrices are stored in the slow DRAM, the copy time is 0.21 seconds, which is significantly bigger but still small compared to the computation time. If you do not have enough space to store the matrices in MCDRAM, then you can split them into big tiles (e.g. 10k x 10k) and compute each tile separately (using copies and a BLAS DGEMM call for each tile).
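A minimal sketch of that tiled approach, assuming the same row-pointer layout and a tile size T that divides N (pack_tile and multiply_tiled are illustrative names; only three T x T contiguous buffers need to stay in fast memory at a time):

#include <algorithm>
#include <cstring>
#include <vector>
#include <mkl.h>   // assumed: MKL's CBLAS interface (cblas_dgemm)

// Copy the T x T tile whose top-left corner is (r0, c0) into a contiguous buffer.
void pack_tile(double **M, int r0, int c0, int T, double *buf)
{
        for (int i = 0; i < T; i++)
                std::memcpy(buf + (size_t)i * T, M[r0 + i] + c0, T * sizeof(double));
}

// C = A * B for N x N row-pointer matrices, processed tile by tile.
void multiply_tiled(double **A, double **B, double **C, int N, int T)
{
        std::vector<double> a((size_t)T * T), b((size_t)T * T), c((size_t)T * T);

        for (int i = 0; i < N; i += T)
        for (int j = 0; j < N; j += T)
        {
                std::fill(c.begin(), c.end(), 0.0);
                for (int k = 0; k < N; k += T)
                {
                        pack_tile(A, i, k, T, a.data());
                        pack_tile(B, k, j, T, b.data());
                        // c += a * b
                        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                                    T, T, T, 1.0, a.data(), T, b.data(), T,
                                    1.0, c.data(), T);
                }
                // write the finished C tile back into the row-pointer layout
                for (int r = 0; r < T; r++)
                        std::memcpy(C[i + r] + j, c.data() + (size_t)r * T, T * sizeof(double));
        }
}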

If you want to achieve higher performance, you can dedicate a few threads of the Xeon Phi processor to copying blocks so as to overlap the copy time with the computation time. However, this will certainly make the code much more complex for a small improvement.
