OpenMP: increasing the number of threads increases the execution time

Posted 2025-01-27 13:09:40

I'm implementing sparse matrix multiplication (element type std::complex) after converting the matrices to CSR (compressed sparse row) format, and I'm using OpenMP for this. What I noticed is that increasing the number of threads doesn't necessarily improve performance; sometimes it does exactly the opposite! Why is that the case, and what can I do to solve the issue?

typedef std::vector < std::vector < std::complex < int >>> matrix;

struct CSR {
    std::vector<std::complex<int>> values; //non-zero values
    std::vector<int> row_ptr; //pointers of rows
    std::vector<int> cols_index; //indices of columns
    int rows; //number of rows
    int cols; //number of columns
    int NNZ; //number of non_zero elements
};

const matrix multiply_omp (const CSR& A,
    const CSR& B,const unsigned int num_threds=4) {
    if (A.cols != B.rows)
        throw "Error";
    CSR B_t = sparse_transpose(B);
    omp_set_num_threads(num_threds);
    matrix result(A.rows, std::vector < std::complex < int >>(B.cols, 0));
    #pragma omp parallel
    {
        int i, j, k, l;
        #pragma omp for
        for (i = 0; i < A.rows; i++) {
            for (j = 0; j < B_t.rows; j++) {
                std::complex < int > sum(0, 0);
                for (k = A.row_ptr[i]; k < A.row_ptr[i + 1]; k++)
                    for (l = B_t.row_ptr[j]; l < B_t.row_ptr[j + 1]; l++)
                        if (A.cols_index[k] == B_t.cols_index[l]) {
                            sum += A.values[k] * B_t.values[l];
                            break;
                        }
                if (sum != std::complex < int >(0, 0)) {
                    result[i][j] += sum;
                }
            }
        }
    }
    return result;
}


Comments (1)

坏尐絯 2025-02-03 13:09:40

You can try to improve the scaling of this algorithm, but I would use a better algorithm. You are allocating a dense matrix (wrongly, but that's beside the point) for the product of two sparse matrices. That's wasteful, since quite often the product of two sparse matrices will not be dense by a long shot.
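To make the point about the dense result concrete, here is a minimal sketch of a product that stores its output in CSR form instead. It uses a row-wise scheme (commonly attributed to Gustavson), which is a technique picked for this example and not something taken from the question. The function name multiply_sparse, the per-thread scratch buffers, and the schedule(dynamic, 16) clause are all assumptions for illustration; the CSR struct is copied from the question. It reads B directly, so no transpose is needed.

```cpp
#include <algorithm>
#include <complex>
#include <stdexcept>
#include <utility>
#include <vector>

// Same layout as the CSR struct in the question.
struct CSR {
    std::vector<std::complex<int>> values; // non-zero values
    std::vector<int> row_ptr;              // start offset of each row
    std::vector<int> cols_index;           // column index of each value
    int rows;
    int cols;
    int NNZ;
};

// Hypothetical name: row-wise product C = A * B with a CSR result.
CSR multiply_sparse(const CSR& A, const CSR& B) {
    if (A.cols != B.rows)
        throw std::invalid_argument("dimension mismatch");
    using C = std::complex<int>;

    // One (column, value) list per output row, so each thread only
    // writes to the rows it owns.
    std::vector<std::vector<std::pair<int, C>>> row_out(A.rows);

    #pragma omp parallel
    {
        // Per-thread scratch: dense accumulator for one output row and
        // the list of columns touched while building that row.
        std::vector<C> acc(B.cols, C(0, 0));
        std::vector<char> touched(B.cols, 0);
        std::vector<int> cols;

        #pragma omp for schedule(dynamic, 16)
        for (int i = 0; i < A.rows; i++) {
            cols.clear();
            for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; k++) {
                const int r = A.cols_index[k]; // row of B to scale and add
                for (int l = B.row_ptr[r]; l < B.row_ptr[r + 1]; l++) {
                    const int c = B.cols_index[l];
                    if (!touched[c]) { touched[c] = 1; cols.push_back(c); }
                    acc[c] += A.values[k] * B.values[l];
                }
            }
            std::sort(cols.begin(), cols.end()); // keep column indices sorted
            for (int c : cols) {
                if (acc[c] != C(0, 0)) row_out[i].emplace_back(c, acc[c]);
                acc[c] = C(0, 0); // reset scratch for the next row
                touched[c] = 0;
            }
        }
    }

    // Serial pass: concatenate the per-row lists into one CSR matrix.
    CSR out;
    out.rows = A.rows;
    out.cols = B.cols;
    out.row_ptr.assign(A.rows + 1, 0);
    for (int i = 0; i < A.rows; i++) {
        for (const auto& e : row_out[i]) {
            out.cols_index.push_back(e.first);
            out.values.push_back(e.second);
        }
        out.row_ptr[i + 1] = static_cast<int>(out.values.size());
    }
    out.NNZ = static_cast<int>(out.values.size());
    return out;
}
```

The work per output row is proportional to the number of multiplications that row actually needs, and the memory for the result is proportional to its nonzeros. The dynamic schedule is a guess at how to balance rows of uneven length; whether it helps depends on the matrices, so it is worth measuring.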

Your algorithm also has the wrong time complexity. For every nonzero in a row of A you rescan the matching row of the transposed B from its start, so your complexity has an extra factor of roughly the average number of nonzeros per row. A better algorithm would assume that the indices in each row are sorted, and then keep a pointer for how far you got into that row.
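The "keep a pointer" idea can be sketched as a two-pointer merge over one row of A and one row of the transposed B. This is only an illustration: the helper name sparse_row_dot is hypothetical, and it assumes the column indices inside every row are sorted in ascending order, which the question does not state. The CSR struct is copied from the question.

```cpp
#include <complex>
#include <vector>

// Same layout as the CSR struct in the question.
struct CSR {
    std::vector<std::complex<int>> values; // non-zero values
    std::vector<int> row_ptr;              // start offset of each row
    std::vector<int> cols_index;           // column index of each value
    int rows;
    int cols;
    int NNZ;
};

// Hypothetical helper: dot product of row i of A with row j of B_t.
// Assumption: column indices within each row are sorted ascending.
std::complex<int> sparse_row_dot(const CSR& A, int i, const CSR& B_t, int j) {
    std::complex<int> sum(0, 0);
    int k = A.row_ptr[i];                 // position in row i of A
    int l = B_t.row_ptr[j];               // position in row j of B_t
    const int k_end = A.row_ptr[i + 1];
    const int l_end = B_t.row_ptr[j + 1];

    // Advance whichever side has the smaller column index; neither
    // position ever moves backwards, so each entry is visited once.
    while (k < k_end && l < l_end) {
        const int ca = A.cols_index[k];
        const int cb = B_t.cols_index[l];
        if (ca == cb) {
            sum += A.values[k] * B_t.values[l];
            ++k;
            ++l;
        } else if (ca < cb) {
            ++k;
        } else {
            ++l;
        }
    }
    return sum;
}
```

Replacing the two innermost loops of multiply_omp with a call like sparse_row_dot(A, i, B_t, j) would make the cost per (i, j) pair proportional to the sum of the two row lengths instead of their product.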

Read the literature on GraphBLAS for references to efficient algorithms.
