GPU 上的排序列表交集

发布于 2024-12-10 06:53:58 字数 158 浏览 0 评论 0原文

我知道如何使用 O(n+m) 算法在 CPU 上交叉两个排序列表,其中 n 和 m 是两个列表的长度。然而,是否有一种好的算法可以在 GPU 上交叉两个列表来避免写入冲突。我担心在相交时,两个线程可能会尝试写入同一输出缓冲区,从而导致冲突。我不是在寻找图书馆。我想知道基本思想+一些代码(如果可能的话)

i know how to intersect two sorted lists on the CPU using an O(n+m) algorithm where n and m are the length of the 2 lists. However, is there a good algorithm for intersecting two lists on the GPU that avoids write conflicts. I am scared that while intersecting, two threads may try to write to the same output buffer resulting in conflict. I am not looking for a library. I want to know the basic idea + some code if possible

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

痕至 2024-12-17 06:53:58

我知道您可能不希望您的代码与库绑定。不过,我认为 Thrust 有一种算法可以完全满足您的需求,前提是您在传统的 C 数组中处理列表。

在这里查看 thrust::merge http:// /wiki.thrust.googlecode.com/hg/html/group__merging.html

-- 编辑 --

对您的问题进行更多思考后,直接在 GPU 上相交两个列表似乎非常有用写起来很复杂CUDA。

以下代码片段是一个替代解决方案(我之前的解决方案和@jmsu 的合并)。这是如何使两个以降序存储的整数列表相交的示例。这些列表存储在设备内存中,但计算不能在内核中执行。因此,如果可能的话,您需要在内核调用之间使用它:

#include <thrust/set_operations.h>
#include <ostream>

int main() {

    int A_host[] = {11, 9, 5, 3};
    int B_host[] = {14, 12, 10, 5, 1};
    int sizeA = 4;
    int sizeB = 5;
    int sizeC = (sizeA < sizeB) ? sizeA : sizeB;

    int C_host[sizeC];

    int* A_device;
    int* B_device;
    int* C_device;

    cudaMalloc( (void**) &A_device, sizeof(int) * sizeA);
    cudaMalloc( (void**) &B_device, sizeof(int) * sizeB);
    cudaMalloc( (void**) &C_device, sizeof(int) * sizeC);

    cudaMemcpy( A_device, A_host, sizeof(int) * sizeA, cudaMemcpyHostToDevice);
    cudaMemcpy( B_device, B_host, sizeof(int) * sizeB, cudaMemcpyHostToDevice);
    cudaMemset( C_device, 0, sizeof(int) * sizeC);

    // add an alias to thrust::device_ptr<int> to be more readable
    typedef thrust::device_ptr<int> ptrI;

    thrust::set_intersection(ptrI(A_device), ptrI(A_device + sizeA), ptrI(B_device), ptrI(B_device + sizeB), ptrI(C_device), thrust::greater<int>() );
    cudaMemcpy( C_host, C_device, sizeof(int) * sizeC, cudaMemcpyDeviceToHost);


    std::copy(C_host, C_host + sizeC, std::ostream_iterator<int> (std::cout, " ") );
}

I understand that you might not want your code to be tied to a library. However, I think Thrust has an algorithm that does exactly what you want, provided that you handle your list in a traditional C-array.

Have a look at thrust::merge here http://wiki.thrust.googlecode.com/hg/html/group__merging.html

-- edit --

Having thought a little more on your question, intersecting two lists directly on the GPU seems very complex to write with CUDA.

The following code snippet is an alternative solution (a merge of my previous solution and of @jmsu's). This is an example how to intersect two integers lists stored in a decreasing order. The lists are stored in the device memory, but the computation cannot be performed within a kernel. Thus, you need to use it between kernel calls, if it is possible:

#include <thrust/set_operations.h>
#include <ostream>

int main() {

    int A_host[] = {11, 9, 5, 3};
    int B_host[] = {14, 12, 10, 5, 1};
    int sizeA = 4;
    int sizeB = 5;
    int sizeC = (sizeA < sizeB) ? sizeA : sizeB;

    int C_host[sizeC];

    int* A_device;
    int* B_device;
    int* C_device;

    cudaMalloc( (void**) &A_device, sizeof(int) * sizeA);
    cudaMalloc( (void**) &B_device, sizeof(int) * sizeB);
    cudaMalloc( (void**) &C_device, sizeof(int) * sizeC);

    cudaMemcpy( A_device, A_host, sizeof(int) * sizeA, cudaMemcpyHostToDevice);
    cudaMemcpy( B_device, B_host, sizeof(int) * sizeB, cudaMemcpyHostToDevice);
    cudaMemset( C_device, 0, sizeof(int) * sizeC);

    // add an alias to thrust::device_ptr<int> to be more readable
    typedef thrust::device_ptr<int> ptrI;

    thrust::set_intersection(ptrI(A_device), ptrI(A_device + sizeA), ptrI(B_device), ptrI(B_device + sizeB), ptrI(C_device), thrust::greater<int>() );
    cudaMemcpy( C_host, C_device, sizeof(int) * sizeC, cudaMemcpyDeviceToHost);


    std::copy(C_host, C_host + sizeC, std::ostream_iterator<int> (std::cout, " ") );
}
未蓝澄海的烟 2024-12-17 06:53:58

做这样的事情怎么样:
如果数组 B 中存在值 A[i],则将其存储在 C[i] 中,否则 C[i]:=DUMMY。

然后进行并行数组压缩?
有一些工具可以做到这一点 - 例如检查此处 - 一个库以及一篇描述所用算法的论文。

How about doing something like this:
If value A[i] exists in the array B, store it in C[i], otherwise C[i]:=DUMMY.

And then perform parallel array compaction?
There are tools to do exactly that - check here for example - a library and a paper describing the algorithm used.

装纯掩盖桑 2024-12-17 06:53:58

遵循 Thrust 的想法,正如 @jHackTheRipper 所说,你不需要整个 Thrust,你可以只使用你需要的。

要使用 Thrust 执行此操作,您可以使用 Thrust::set_intersection

http://wiki.thrust.googlecode.com/hg/html/group_set_operations.html#ga17277fec1491c8a916c9908a5ae40807

Thrust 文档中的示例代码:

#include <thrust/set_operations.h>
...
int A1[6] = {1, 3, 5, 7, 9, 11};
int A2[7] = {1, 1, 2, 3, 5,  8, 13};

int result[7];

int *result_end = thrust::set_intersection(A1, A1 + 6, A2, A2 + 7, result);
// result is now {1, 3, 5}

执行它在GPU上,您可以将数组复制到设备内存并使用thrust::device_ptr,或者更好的方法是使用推力::设备向量。它们与 STL 向量兼容。

thrust::host_vector<int> h_list1;
thrust::host_vector<int> h_list2;
// insert code to populate the lists...

thrust::device_vector<int> d_list1 = h_list1; // copy list1 from host to device
thrust::device_vector<int> d_list2 = h_list2; // copy list2 from host to device

thrust::device_vector<int> d_result;

thrust::set_intersection(d_list1.begin(), d_list1.end(), d_list2.begin(), d_list2.end(), d_result.begin());

thrust::host_vector<int> h_result = d_result; // copy result from device to host

我还没有检查过代码,但应该与此接近。 Thrust 网站提供了很好的示例来帮助您入门。

Following the Thrust idea, as @jHackTheRipper said you don't need the whole of Thrust, you can use only what you need.

To do it with Thrust you can use the thrust::set_intersection

http://wiki.thrust.googlecode.com/hg/html/group_set_operations.html#ga17277fec1491c8a916c9908a5ae40807

Example code from the Thrust documentation:

#include <thrust/set_operations.h>
...
int A1[6] = {1, 3, 5, 7, 9, 11};
int A2[7] = {1, 1, 2, 3, 5,  8, 13};

int result[7];

int *result_end = thrust::set_intersection(A1, A1 + 6, A2, A2 + 7, result);
// result is now {1, 3, 5}

To execute it on the GPU you can copy the arrays to device memory and use thrust::device_ptr or the better way is to use thrust::device_vector. These are compatible with STL vectors.

thrust::host_vector<int> h_list1;
thrust::host_vector<int> h_list2;
// insert code to populate the lists...

thrust::device_vector<int> d_list1 = h_list1; // copy list1 from host to device
thrust::device_vector<int> d_list2 = h_list2; // copy list2 from host to device

thrust::device_vector<int> d_result;

thrust::set_intersection(d_list1.begin(), d_list1.end(), d_list2.begin(), d_list2.end(), d_result.begin());

thrust::host_vector<int> h_result = d_result; // copy result from device to host

I haven't checked the code but it should be something close to this. The thrust website has great examples to get you started.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文