How would you implement this in CUDA? (Offsets into a sorted integer vector)

Posted 2024-12-16 13:32:24

I have a sorted integer array on the device, e.g.:

[0,0,0,1,1,2,2]

And I want, in another array, the offset of the first occurrence of each distinct element:

[0,3,5]

(since the first 0 is at position 0, the first 1 at position 3 and so on)
I know beforehand how many different elements there will be. How would you implement this efficiently in CUDA? I'm not asking for code, but for a high-level description of the algorithm you would implement to compute this transformation. I already had a look at the various functions in the Thrust namespace, but could not think of any combination of Thrust functions to achieve this. Also, does this transformation have a widely accepted name?

Comments (4)

手长情犹 2024-12-23 13:32:24

You can solve this in Thrust using thrust::unique_by_key_copy with thrust::counting_iterator. The idea is to treat your integer array as the keys argument to unique_by_key_copy and to use a sequence of ascending integers (i.e., a counting_iterator) as the values. For each run of equal keys, unique_by_key_copy keeps the first corresponding value, which is exactly the index of that key's first occurrence:

#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/unique.h>
#include <thrust/copy.h>
#include <iterator>
#include <iostream>

int main()
{
  thrust::device_vector<int> keys(7);
  keys[0] = 0; keys[1] = 0; keys[2] = 0;
  keys[3] = 1; keys[4] = 1; keys[5] = 2; keys[6] = 2;

  std::cout << "keys before unique_by_key_copy: [ ";
  thrust::copy(keys.begin(), keys.end(), std::ostream_iterator<int>(std::cout," "));
  std::cout << "]" << std::endl;

  thrust::device_vector<int> offsets(3);

  thrust::unique_by_key_copy(keys.begin(), keys.end(),          // keys
                             thrust::make_counting_iterator(0), // [0, 1, 2, 3, ...] are the values
                             thrust::make_discard_iterator(),   // discard the compacted keys
                             offsets.begin());                  // the offsets are the values

  std::cout << "offsets after unique_by_key_copy: [ ";
  thrust::copy(offsets.begin(), offsets.end(), std::ostream_iterator<int>(std::cout," "));
  std::cout << "]" << std::endl;

  return 0;
}

Here's the output:

$ nvcc test.cu -run
keys before unique_by_key_copy: [ 0 0 0 1 1 2 2 ]
offsets after unique_by_key_copy: [ 0 3 5 ]

终难愈 2024-12-23 13:32:24

Although I've never used the Thrust library, what about this possible approach (simple but maybe effective; sketched below as a minimal kernel):

// input:  your sorted array (in device memory)
// offset: the offset of the first value of each element, initialized with -1
__global__ void mark_offsets(const int *input, int *offset, int n)
{
    // each thread will check one index position
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;

    if (id == 0 || input[id] > input[id-1]) // bingo! here begins a new value
    {
        int oid = input[id];  // use the integer value as index
        offset[oid] = id;     // mark the offset with the beginning of the new value
    }
}
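
A hedged usage sketch (the kernel name mark_offsets, the launch configuration, and the pointer names d_input and d_offset are illustrative, not part of the original answer):

    int threads = 256;
    int blocks  = (N + threads - 1) / threads;   // N is the length of the input array
    mark_offsets<<<blocks, threads>>>(d_input, d_offset, N);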

In your example the output will be:

[0,3,5]

But if the input array is:

[0,0,0,2,2,4,4]

Then the output will be:

[0,-1, 3, -1, 5]

Now, if Thrust can do it for you, apply remove_if(offset[i] == -1) to compact the array (a sketch of that step follows below).
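
For reference, a minimal sketch of that compaction step with Thrust, assuming the marked offsets live in a thrust::device_vector<int> (the function name compact_offsets is illustrative); thrust::remove with the -1 sentinel plays the role of the remove_if described above:

#include <thrust/device_vector.h>
#include <thrust/remove.h>

// drop every slot that still holds the -1 placeholder
void compact_offsets(thrust::device_vector<int> &offsets)
{
  auto new_end = thrust::remove(offsets.begin(), offsets.end(), -1);
  offsets.erase(new_end, offsets.end());  // keep only the real offsets
}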

This approach wastes a lot of memory for the offset array, but since you don't know how many offsets you are going to find, the worst case uses as much memory as the input array.

On the other hand, the small number of instructions per thread compared to the global memory loads means this implementation will be limited by memory bandwidth. There are optimizations for this case, such as processing several values per thread (see the sketch below).
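
A minimal sketch of that last idea, reusing the marking kernel above but with a grid-stride loop so each thread handles several positions (the kernel name is illustrative):

__global__ void mark_offsets_strided(const int *input, int *offset, int n)
{
    // stride over the whole array so each thread processes several positions
    for (int id = blockIdx.x * blockDim.x + threadIdx.x;
         id < n;
         id += gridDim.x * blockDim.x)
    {
        if (id == 0 || input[id] > input[id-1])
            offset[input[id]] = id;   // record where this value first appears
    }
}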

My 2 cents!

快乐很简单 2024-12-23 13:32:24

Scan is the algorithm you're looking for. If you don't have an implementation lying around, the Thrust library would be a good resource. (Look for thrust::scan)

Scan (or "parallel prefix sum") takes an input array and generates an output where each element is the sum of the inputs to that point: [1 5 3 7] => [1 6 9 16]

If you scan predicates (0 or 1, depending on an evaluated condition), where the predicate checks whether a given element differs from the preceding element, then you compute the output index of the element in question. Your example array:

[0 0 0 1 1 2 2]
[0 0 0 1 0 1 0] <= predicates
[0 0 0 1 1 2 2] <= scanned predicates

Now you can use the scanned predicates as indices to write your output.
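
A minimal Thrust sketch of this predicate/scan/scatter approach, assuming (as in the question) that the number of distinct values is known beforehand. It deviates slightly from the description above: the first flag is set to 1 and an exclusive scan is used, so position 0 is handled uniformly. The names flags, slots and num_groups are illustrative:

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/scan.h>
#include <thrust/scatter.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/copy.h>
#include <iterator>
#include <iostream>

int main()
{
  // the example array [0,0,0,1,1,2,2]
  thrust::device_vector<int> keys(7);
  keys[0] = 0; keys[1] = 0; keys[2] = 0;
  keys[3] = 1; keys[4] = 1; keys[5] = 2; keys[6] = 2;
  const int n = static_cast<int>(keys.size());
  const int num_groups = 3;  // known beforehand, as stated in the question

  // predicates: 1 where a new value begins (position 0 always does)
  thrust::device_vector<int> flags(n, 0);
  flags[0] = 1;
  thrust::transform(keys.begin() + 1, keys.end(), keys.begin(),
                    flags.begin() + 1, thrust::not_equal_to<int>());

  // exclusive scan turns the flags [1 0 0 1 0 1 0] into output slots [0 1 1 1 2 2 3]
  thrust::device_vector<int> slots(n);
  thrust::exclusive_scan(flags.begin(), flags.end(), slots.begin());

  // scatter each position i into offsets[slots[i]], but only where a new value begins
  thrust::device_vector<int> offsets(num_groups);
  thrust::scatter_if(thrust::make_counting_iterator(0),
                     thrust::make_counting_iterator(n),
                     slots.begin(),   // map: which output slot each position targets
                     flags.begin(),   // stencil: only group starts actually write
                     offsets.begin());

  std::cout << "offsets: [ ";
  thrust::copy(offsets.begin(), offsets.end(),
               std::ostream_iterator<int>(std::cout, " "));
  std::cout << "]" << std::endl;  // prints: offsets: [ 0 3 5 ]
  return 0;
}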

止于盛夏 2024-12-23 13:32:24

Good question, and the answer depends on what you need to do with the result afterwards. Let me explain.

Since this problem can be solved in O(n) (where n is the input length) on the CPU, you will suffer from the overhead of memory allocation and copying (Host -> Device for the input and Device -> Host for the result). This will lead to performance degradation compared with a simple CPU solution.

Even if your array is already in device memory, each computation block needs to read it into local memory or registers (at least access device memory), and this can't be done significantly faster than on the CPU.

In general, CUDA accelerates performance well if:

  1. The asymptotic complexity of the computation is high compared to the input data length. For example, the input data length is n and the complexity is O(n^2) or O(n^3).

  2. There is a way to split the task into independent or weakly dependent subtasks.

So if I were you, I would not try to do computations of this kind on CUDA if it can be avoided. And if it must be a standalone function, or an output-format conversion for some other function, I would do it on the CPU.

If it's part of some more complex algorithm, the answer is more complicated. If I were in your place, I would try to somehow change the [0,3,5] format, because it limits how well you can exploit CUDA's computational power: you can't effectively split your task into independent blocks. For example, suppose I process 10 integers in one computation thread and the next 10 integers in another; the second one doesn't know where to place its outputs until the first one has finished. Maybe I would split the array into subarrays and store the answer for each subarray separately. It depends heavily on what computations you are doing.
