使用 CUDA Thrust 同时对多个数组进行排序

发布于 2024-12-18 05:10:39 字数 929 浏览 3 评论 0原文

我需要按相同的键对 GPU 上已有的 20+ 数组进行排序，每个数组的长度相同。我不能直接使用 sort_by_key() 因为它也会对键进行排序（使它们无法对下一个数组进行排序）。这是我尝试的方法：

thrust::device_vector<int>  indices(N); 
thrust::sequence(indices.begin(),indices.end());
thrust::sort_by_key(keys.begin(),keys.end(),indices.begin());

thrust::gather(indices.begin(),indices.end(),a_01,a_01);
thrust::gather(indices.begin(),indices.end(),a_02,a_02);
...
thrust::gather(indices.begin(),indices.end(),a_20,a_20);

这似乎不起作用，因为 gather() 期望输出的数组与输入的数组不同，即这有效：

thrust::gather(indices.begin(),indices.end(),a_01,o_01);
...

但是，我宁愿不分配20+ 个额外数组。我知道有一个使用推力::元组、推力::zip_iterator和推力::sort_by_keys()的解决方案，类似于此处。但是，我只能在一个元组中组合最多 10 个数组，因此我需要再次复制键向量。您将如何完成这项任务？

原文

I need to sort 20+ arrays, already on the GPU, each of the same length, by the same keys. I can not use sort_by_key() directly since it sorts the keys as well (making them useless to sort the next array). Here is what I tried instead:

thrust::device_vector<int>  indices(N); 
thrust::sequence(indices.begin(),indices.end());
thrust::sort_by_key(keys.begin(),keys.end(),indices.begin());

thrust::gather(indices.begin(),indices.end(),a_01,a_01);
thrust::gather(indices.begin(),indices.end(),a_02,a_02);
...
thrust::gather(indices.begin(),indices.end(),a_20,a_20);

This does not seem to work since gather() expects a different array for the output than for the input, i.e. this works:

thrust::gather(indices.begin(),indices.end(),a_01,o_01);
...

However, I would prefer to not allocate 20+ extra arrays for this task. I know that there is a solution using a thrust::tuple, thrust::zip_iterator and thrust::sort_by_keys(), similiar to here. However, I can only combine up to 10 arrays in a tuple, s.t. I would need to duplicate the key vector again. How would you tackle this task?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

-柠檬树下少年和吉他 2024-12-25 05:10:39

我认为对多个数组进行排序的经典方法是所谓的背对背方法，该方法使用两次thrust::stable_sort_by_key。您需要创建一个键向量，以便同一数组中的元素具有相同的键。例如：

Elements: 10.5 4.3 -2.3 0. 55. 24. 66.
Keys:      0    0    0  1   1   1   1

在本例中，我们有两个数组，第一个包含 3 元素，第二个包含 4 元素。

您首先需要调用 thrust::stable_sort_by_key 将矩阵值作为键，

thrust::stable_sort_by_key(d_matrix.begin(),
                           d_matrix.end(),
                           d_keys.begin(),
                           thrust::less<float>());

之后，您就知道了，

Elements: -2.3 0 4.3 10.5 24. 55. 66.
Keys:       0  1  0    0   1   1   1

这意味着数组元素是有序的，而键不是有序的。然后您需要一秒钟来调用 thrust::stable_sort_by_key

thrust::stable_sort_by_key(d_keys.begin(),
                           d_keys.end(),
                           d_matrix.begin(),
                           thrust::less<int>());

以便根据键执行排序。完成这一步后，您就得到了

Elements: -2.3 4.3 10.5 0 24. 55. 66.
Keys:       0   0   0   1  1   1   1

最终想要的结果。

下面是一个完整的工作示例，它考虑以下问题：单独对矩阵的每一行进行排序。这是一种特殊情况，其中所有数组都具有相同的长度，但该方法适用于可能具有不同长度的数组。

#include <cublas_v2.h>

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <thrust/sequence.h>

#include <stdio.h>
#include <iostream>

#include "Utilities.cuh"

/**************************************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX - NEEDED FOR APPROACH #1 */
/**************************************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {

    T Ncols; // --- Number of columns

    __host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}

    __host__ __device__ T operator()(T i) { return i / Ncols; }
};

/********/
/* MAIN */
/********/
int main()
{
    const int Nrows = 5;     // --- Number of rows
    const int Ncols = 8;     // --- Number of columns

    // --- Random uniform integer distribution between 10 and 99
    thrust::default_random_engine rng;
    thrust::uniform_int_distribution<int> dist(10, 99);

    // --- Matrix allocation and initialization
    thrust::device_vector<float> d_matrix(Nrows * Ncols);
    for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist(rng);

    // --- Print result
    printf("Original matrix\n");
    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "]\n";
    }

    /*************************/
    /* BACK-TO-BACK APPROACH */
    /*************************/
    thrust::device_vector<float> d_keys(Nrows * Ncols);

    // --- Generate row indices
    thrust::transform(thrust::make_counting_iterator(0),
                      thrust::make_counting_iterator(Nrows*Ncols),
                      thrust::make_constant_iterator(Ncols),
                      d_keys.begin(),
                      thrust::divides<int>());

    // --- Back-to-back approach
    thrust::stable_sort_by_key(d_matrix.begin(),
                               d_matrix.end(),
                               d_keys.begin(),
                               thrust::less<float>());

    thrust::stable_sort_by_key(d_keys.begin(),
                               d_keys.end(),
                               d_matrix.begin(),
                               thrust::less<int>());

    // --- Print result
    printf("\n\nSorted matrix\n");
    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "]\n";
    }

    return 0;
}

I think that the classical way to sort multiple arrays is the so-called back-to-back approach which uses uses thrust::stable_sort_by_key two times. You need to create a keys vector such that elements within the same array have the same key. For example:

Elements: 10.5 4.3 -2.3 0. 55. 24. 66.
Keys:      0    0    0  1   1   1   1

In this case we have two arrays, the first with 3 elements and the second with 4 elements.

You first need to call thrust::stable_sort_by_key having the matrix values as the keys like

thrust::stable_sort_by_key(d_matrix.begin(),
                           d_matrix.end(),
                           d_keys.begin(),
                           thrust::less<float>());

After that, you have

Elements: -2.3 0 4.3 10.5 24. 55. 66.
Keys:       0  1  0    0   1   1   1

which means that the array elements are ordered, while the keys are not. Then you need a second to call thrust::stable_sort_by_key

thrust::stable_sort_by_key(d_keys.begin(),
                           d_keys.end(),
                           d_matrix.begin(),
                           thrust::less<int>());

so performing a sorting according to the keys. After that step, you have

Elements: -2.3 4.3 10.5 0 24. 55. 66.
Keys:       0   0   0   1  1   1   1

which is the final desired result.

Below, a full working example which considers the following problem: separately order each row of a matrix. This is a particular case in which all the arrays have the same length, but the approach works with arrays having possibly different lengths.

#include <cublas_v2.h>

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <thrust/sequence.h>

#include <stdio.h>
#include <iostream>

#include "Utilities.cuh"

/**************************************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX - NEEDED FOR APPROACH #1 */
/**************************************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {

    T Ncols; // --- Number of columns

    __host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}

    __host__ __device__ T operator()(T i) { return i / Ncols; }
};

/********/
/* MAIN */
/********/
int main()
{
    const int Nrows = 5;     // --- Number of rows
    const int Ncols = 8;     // --- Number of columns

    // --- Random uniform integer distribution between 10 and 99
    thrust::default_random_engine rng;
    thrust::uniform_int_distribution<int> dist(10, 99);

    // --- Matrix allocation and initialization
    thrust::device_vector<float> d_matrix(Nrows * Ncols);
    for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist(rng);

    // --- Print result
    printf("Original matrix\n");
    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "]\n";
    }

    /*************************/
    /* BACK-TO-BACK APPROACH */
    /*************************/
    thrust::device_vector<float> d_keys(Nrows * Ncols);

    // --- Generate row indices
    thrust::transform(thrust::make_counting_iterator(0),
                      thrust::make_counting_iterator(Nrows*Ncols),
                      thrust::make_constant_iterator(Ncols),
                      d_keys.begin(),
                      thrust::divides<int>());

    // --- Back-to-back approach
    thrust::stable_sort_by_key(d_matrix.begin(),
                               d_matrix.end(),
                               d_keys.begin(),
                               thrust::less<float>());

    thrust::stable_sort_by_key(d_keys.begin(),
                               d_keys.end(),
                               d_matrix.begin(),
                               thrust::less<int>());

    // --- Print result
    printf("\n\nSorted matrix\n");
    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "]\n";
    }

    return 0;
}

回复收藏 0 原文

两仪 2024-12-25 05:10:39

好吧，如果您可以操作指向 device_vector 的指针，那么您实际上只需要分配一个额外的数组：

thrust::device_vector<int>  indices(N); 
thrust::sequence(indices.begin(),indices.end());
thrust::sort_by_key(keys.begin(),keys.end(),indices.begin());

thrust::device_vector<int> temp(N);
thrust::device_vector<int> *sorted = &temp;
thrust::device_vector<int> *pa_01 = &a_01;
thrust::device_vector<int> *pa_02 = &a_02;
...
thrust::device_vector<int> *pa_20 = &a_20;

thrust::gather(indices.begin(), indices.end(), *pa_01, *sorted);
pa_01 = sorted; sorted = &a_01;
thrust::gather(indices.begin(), indices.end(), *pa_02, *sorted);
pa_02 = sorted; sorted = &a_02;
...
thrust::gather(indices.begin(), indices.end(), *pa_20, *sorted);
pa_20 = sorted; sorted = &a_20;

或者类似的东西无论如何都应该可以工作。您需要修复它，以便临时设备向量在超出范围时不会自动释放——我建议使用 cudaMalloc 分配 CUDA 设备指针，然后用 device_ptr 包装它们，而不是使用自动 device_vectors。

Well, you really only need to allocate one extra array if you are OK with manipulating pointers to device_vector instead:

thrust::device_vector<int>  indices(N); 
thrust::sequence(indices.begin(),indices.end());
thrust::sort_by_key(keys.begin(),keys.end(),indices.begin());

thrust::device_vector<int> temp(N);
thrust::device_vector<int> *sorted = &temp;
thrust::device_vector<int> *pa_01 = &a_01;
thrust::device_vector<int> *pa_02 = &a_02;
...
thrust::device_vector<int> *pa_20 = &a_20;

thrust::gather(indices.begin(), indices.end(), *pa_01, *sorted);
pa_01 = sorted; sorted = &a_01;
thrust::gather(indices.begin(), indices.end(), *pa_02, *sorted);
pa_02 = sorted; sorted = &a_02;
...
thrust::gather(indices.begin(), indices.end(), *pa_20, *sorted);
pa_20 = sorted; sorted = &a_20;

Or something like that should work anyway. You would need to fix it so the temp device vector is not automatically deallocated when it goes out of scope -- I suggest allocating the CUDA device pointers using cudaMalloc and then wrapping them with device_ptr instead of using automatic device_vectors.

回复收藏 0 原文

~没有更多了~