使用 CUDA Thrust 同时对多个数组进行排序
我需要按相同的键对 GPU 上已有的 20+
数组进行排序,每个数组的长度相同。我不能直接使用 sort_by_key()
因为它也会对键进行排序(使它们无法对下一个数组进行排序)。这是我尝试的方法:
thrust::device_vector<int> indices(N);
thrust::sequence(indices.begin(),indices.end());
thrust::sort_by_key(keys.begin(),keys.end(),indices.begin());
thrust::gather(indices.begin(),indices.end(),a_01,a_01);
thrust::gather(indices.begin(),indices.end(),a_02,a_02);
...
thrust::gather(indices.begin(),indices.end(),a_20,a_20);
这似乎不起作用,因为 gather()
期望输出的数组与输入的数组不同,即这有效:
thrust::gather(indices.begin(),indices.end(),a_01,o_01);
...
但是,我宁愿不分配20+ 个额外数组。我知道有一个使用推力::元组、推力::zip_iterator和推力::sort_by_keys()的解决方案,类似于此处。但是,我只能在一个元组中组合最多 10 个数组,因此我需要再次复制键向量。您将如何完成这项任务?
I need to sort 20+
arrays, already on the GPU, each of the same length, by the same keys. I can not use sort_by_key()
directly since it sorts the keys as well (making them useless to sort the next array). Here is what I tried instead:
thrust::device_vector<int> indices(N);
thrust::sequence(indices.begin(),indices.end());
thrust::sort_by_key(keys.begin(),keys.end(),indices.begin());
thrust::gather(indices.begin(),indices.end(),a_01,a_01);
thrust::gather(indices.begin(),indices.end(),a_02,a_02);
...
thrust::gather(indices.begin(),indices.end(),a_20,a_20);
This does not seem to work since gather()
expects a different array for the output than for the input, i.e. this works:
thrust::gather(indices.begin(),indices.end(),a_01,o_01);
...
However, I would prefer to not allocate 20+
extra arrays for this task. I know that there is a solution using a thrust::tuple, thrust::zip_iterator and thrust::sort_by_keys(), similiar to here. However, I can only combine up to 10
arrays in a tuple, s.t. I would need to duplicate the key vector again. How would you tackle this task?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
我认为对多个数组进行排序的经典方法是所谓的背对背方法,该方法使用两次
thrust::stable_sort_by_key
。您需要创建一个键向量,以便同一数组中的元素具有相同的键。例如:在本例中,我们有两个数组,第一个包含
3
元素,第二个包含4
元素。您首先需要调用
thrust::stable_sort_by_key
将矩阵值作为键,之后,您就知道了,
这意味着数组元素是有序的,而键不是有序的。然后您需要一秒钟来调用
thrust::stable_sort_by_key
以便根据键执行排序。完成这一步后,您就得到了
最终想要的结果。
下面是一个完整的工作示例,它考虑以下问题:单独对矩阵的每一行进行排序。这是一种特殊情况,其中所有数组都具有相同的长度,但该方法适用于可能具有不同长度的数组。
I think that the classical way to sort multiple arrays is the so-called back-to-back approach which uses uses
thrust::stable_sort_by_key
two times. You need to create a keys vector such that elements within the same array have the same key. For example:In this case we have two arrays, the first with
3
elements and the second with4
elements.You first need to call
thrust::stable_sort_by_key
having the matrix values as the keys likeAfter that, you have
which means that the array elements are ordered, while the keys are not. Then you need a second to call
thrust::stable_sort_by_key
so performing a sorting according to the keys. After that step, you have
which is the final desired result.
Below, a full working example which considers the following problem: separately order each row of a matrix. This is a particular case in which all the arrays have the same length, but the approach works with arrays having possibly different lengths.
好吧,如果您可以操作指向
device_vector
的指针,那么您实际上只需要分配一个额外的数组:或者类似的东西无论如何都应该可以工作。您需要修复它,以便临时设备向量在超出范围时不会自动释放——我建议使用 cudaMalloc 分配 CUDA 设备指针,然后用 device_ptr 包装它们,而不是使用自动 device_vectors。
Well, you really only need to allocate one extra array if you are OK with manipulating pointers to
device_vector
instead:Or something like that should work anyway. You would need to fix it so the temp device vector is not automatically deallocated when it goes out of scope -- I suggest allocating the CUDA device pointers using cudaMalloc and then wrapping them with device_ptr instead of using automatic device_vectors.