Using a device variable from multiple threads in CUDA

Posted on 2024-08-25 05:47:45

I am playing around with CUDA.

At the moment I have a problem. I am testing a large array for particular responses, and when I get a response, I have to copy the data into another array.

For example, my test array of 6 elements looks like this:
[ ][ ][v1][ ][ ][v2]

Result must look like this:
[v1][v2]

The problem is: how do I calculate the address in the second array at which to store each result? All elements of the first array are checked in parallel.

I am thinking of declaring a device variable int addr = 0. Every time I find a response, I will increment addr. But I am not sure about that, because it means addr may be accessed by multiple threads at the same time. Will that cause problems? Or will a thread wait until another thread finishes using that variable?
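
As a rough sketch of the shared-counter idea described above: a plain `addr++` executed by many threads is a data race, and threads do not wait for one another on their own. A minimal version, assuming the increment is done with `atomicAdd` (which returns the value before the increment, so each hit gets its own slot), might look like this. The predicate `is_response`, the kernel name, and the host helper `compact_with_atomic` are hypothetical names for illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical predicate (an assumption): a "response" is any non-zero element.
__device__ bool is_response(int v) { return v != 0; }

// Each thread tests one element; a hit reserves an output slot atomically.
__global__ void compact_atomic(const int* in, int n, int* out, int* addr) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    if (is_response(in[idx])) {
        // atomicAdd returns the counter value *before* the increment, so
        // every hit gets a distinct slot. A plain (*addr)++ would be a race.
        int slot = atomicAdd(addr, 1);
        out[slot] = in[idx];
    }
}

// Hypothetical host helper: returns the compacted elements.
// The output order is NOT guaranteed to match the input order.
std::vector<int> compact_with_atomic(const std::vector<int>& h_in) {
    int n = static_cast<int>(h_in.size());
    if (n == 0) return {};

    int* d_in = nullptr;
    int* d_out = nullptr;
    int* d_addr = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMalloc(&d_addr, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_addr, 0, sizeof(int));  // addr starts at 0

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    compact_atomic<<<blocks, threads>>>(d_in, n, d_out, d_addr);

    // This blocking copy also waits for the kernel to finish.
    int count = 0;
    cudaMemcpy(&count, d_addr, sizeof(int), cudaMemcpyDeviceToHost);

    std::vector<int> h_out(count);
    if (count > 0) {
        cudaMemcpy(h_out.data(), d_out, count * sizeof(int),
                   cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_addr);
    return h_out;
}
```

Calling `compact_with_atomic({0, 0, 7, 0, 0, 9})` should give back the two values 7 and 9, but which one lands in slot 0 depends on thread scheduling. If the result has to keep the input order, or if many threads hit and contend on the single counter, a scan-based compaction is the usual alternative.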

Comments (2)

私野 2024-09-01 05:47:45

It is not as trivial as it seems. I just finished implementing one, so I can tell you what you need: read the scan article in GPU Gems 3, in particular Section 39.3.1, Stream Compaction.

To implement your own, start from the LargeArrayScan example in the SDK, which gives you just the prescan. Assume you have the following arrays in device memory, all of size N: dev_selection_array (an array of 1s and 0s, where 1 means select and 0 means discard), dev_elements_array (the elements to select from), dev_prescan_array, and dev_result_array. Then you do:

prescan(dev_prescan_array,dev_selection_array, N);
scatter(dev_result_array, dev_prescan_array,
         dev_selection_array, dev_elements_array, N);

where the scatter kernel is:

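#include <cstddef>  // std::size_t

// The flags and scan results are indices, so they are typed as unsigned int
// here rather than T (an assumption; the original used const T* for both).
template <typename T>
__global__ void scatter_kernel(T* dev_result_array,
                               const unsigned int* dev_prescan_array,
                               const unsigned int* dev_selection_array,
                               const T* dev_elements_array,
                               std::size_t size) {
    // One thread per input element.
    std::size_t idx = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (idx >= size) return;
    // For a selected element, the prescan value is its index in the result.
    if (dev_selection_array[idx] == 1) {
        dev_result_array[dev_prescan_array[idx]] = dev_elements_array[idx];
    }
}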
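
To make the prescan-then-scatter flow concrete, here is a minimal end-to-end sketch. It is not the SDK's LargeArrayScan code: as an assumption, the prescan step is swapped for `thrust::exclusive_scan`, the selection rule (keep non-zero values) is made up for illustration, and `mark_kernel`, `scatter_ints` and `compact_with_scan` are hypothetical names. Error checking is omitted.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Hypothetical selection rule (an assumption): flag non-zero elements with 1.
__global__ void mark_kernel(const int* elements, unsigned int* selection,
                            std::size_t size) {
    std::size_t idx = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (idx < size) selection[idx] = (elements[idx] != 0) ? 1u : 0u;
}

// Scatter: a selected element goes to the index given by the exclusive scan.
__global__ void scatter_ints(int* result, const unsigned int* prescan,
                             const unsigned int* selection,
                             const int* elements, std::size_t size) {
    std::size_t idx = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (idx >= size) return;
    if (selection[idx] == 1u) result[prescan[idx]] = elements[idx];
}

// Hypothetical host helper: returns the selected elements in input order.
std::vector<int> compact_with_scan(const std::vector<int>& h_elements) {
    std::size_t n = h_elements.size();
    if (n == 0) return {};

    thrust::device_vector<int> elements(h_elements.begin(), h_elements.end());
    thrust::device_vector<unsigned int> selection(n);
    thrust::device_vector<unsigned int> prescan(n);
    thrust::device_vector<int> result(n);

    unsigned int threads = 256;
    unsigned int blocks = static_cast<unsigned int>((n + threads - 1) / threads);

    mark_kernel<<<blocks, threads>>>(
        thrust::raw_pointer_cast(elements.data()),
        thrust::raw_pointer_cast(selection.data()), n);

    // Exclusive scan: prescan[i] = number of selected elements before i.
    thrust::exclusive_scan(selection.begin(), selection.end(), prescan.begin());

    scatter_ints<<<blocks, threads>>>(
        thrust::raw_pointer_cast(result.data()),
        thrust::raw_pointer_cast(prescan.data()),
        thrust::raw_pointer_cast(selection.data()),
        thrust::raw_pointer_cast(elements.data()), n);

    // Number kept = last scan value + last flag (reads copy back to the host).
    unsigned int last_scan = prescan[n - 1];
    unsigned int last_flag = selection[n - 1];
    std::size_t count = static_cast<std::size_t>(last_scan) + last_flag;

    std::vector<int> h_result(count);
    thrust::copy(result.begin(), result.begin() + count, h_result.begin());
    return h_result;
}
```

With the input `{0, 0, 7, 0, 0, 9}` the flags are `{0, 0, 1, 0, 0, 1}` and the exclusive scan is `{0, 0, 0, 1, 1, 1}`, so 7 is written to index 0 and 9 to index 1. Because every selected element gets its index from the scan, no two threads write to the same slot and the input order is preserved.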

For other nice applications of the prescan, see the paper Ble93.

Have fun!

追星践月 2024-09-01 05:47:45

You're talking about classic stream compaction. I would generally recommend looking at Thrust or CUDPP (both have documentation on compaction). Both are open source; if you want to roll your own, I would also suggest looking at the 'scan' SDK sample.
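
For the Thrust route, a minimal sketch could use `thrust::copy_if`, which copies only the elements that satisfy a predicate and keeps their relative order. The predicate `IsResponse` (keep non-zero values) and the helper `compact_with_thrust` are hypothetical names chosen for illustration.

```cuda
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>

// Hypothetical predicate (an assumption): a "response" is any non-zero value.
struct IsResponse {
    __host__ __device__ bool operator()(int v) const { return v != 0; }
};

// Hypothetical host helper: compacts h_in on the GPU and returns the result.
std::vector<int> compact_with_thrust(const std::vector<int>& h_in) {
    thrust::device_vector<int> in(h_in.begin(), h_in.end());
    thrust::device_vector<int> out(in.size());

    // copy_if is stable: kept elements retain their relative order.
    // It returns an iterator to the end of the written output range.
    auto out_end = thrust::copy_if(in.begin(), in.end(), out.begin(),
                                   IsResponse());

    std::vector<int> h_out(out_end - out.begin());
    thrust::copy(out.begin(), out_end, h_out.begin());
    return h_out;
}
```

This keeps all of the scan and scatter details inside the library, which is usually the easier starting point unless you specifically want to write the kernels yourself.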
