How would you implement this in CUDA? (Offsets into a sorted integer vector)
I have a sorted integer array on the device, e.g.:
[0,0,0,1,1,2,2]
And I want the offsets to each element in another array:
[0,3,5]
(since the first 0 is at position 0, the first 1 at position 3 and so on)
I know how many distinct elements there will be beforehand. How would you implement this efficiently in CUDA? I'm not asking for code, but for a high-level description of the algorithm you would implement to compute this transformation. I already had a look at the various functions in the thrust namespace, but could not think of any combination of thrust functions to achieve this. Also, does this transformation have a widely accepted name?

4 Answers
You can solve this in Thrust using thrust::unique_by_key_copy with thrust::counting_iterator. The idea is to treat your integer array as the keys argument to unique_by_key_copy and to use a sequence of ascending integers (i.e., a counting_iterator) as the values. unique_by_key_copy will compact the values array into the indices of each unique key; for your example the output is [0, 3, 5].
Although I've never used the thrust library, what about this possible approach (simple but maybe effective): launch one thread per input element; thread i writes its own index i into offset[i] if it holds the first occurrence of a value (i == 0 or input[i] != input[i-1]), and -1 otherwise. In your example the output would be:
[0, -1, -1, 3, -1, 5, -1]
Now, if thrust can do it for you, remove_if(offset[i] == -1) and compact the array.
This approach wastes a lot of memory for the offset array, but since you don't know how many offsets you are going to find, the worst case will use as much memory as the input array.
On the other hand, the few instructions per thread compared to the global memory loads will make this implementation memory-bandwidth bound. There are some optimizations for this case, such as processing several values per thread.
My 2 cents!
Scan is the algorithm you're looking for. If you don't have an implementation lying around, the Thrust library would be a good resource. (Look for thrust::scan)
Scan (or "parallel prefix sum") takes an input array and generates an output where each element is the sum of the inputs to that point: [1 5 3 7] => [1 6 9 16]
If you scan predicates (0 or 1 depending on an evaluated condition), where the predicate checks whether a given element is the same as the preceding element, then you compute the output index of the element in question. Your example array:
[0 0 0 1 1 2 2]
[0 0 0 1 0 1 0] <= predicates
[0 0 0 1 1 2 2] <= scanned predicates
Now you can use the scanned predicates as indices to write your output.
Good question and the answer depends on what you need to do with it after. Let me explain.
Since this problem can be solved in O(n) (where n is the input length) on the CPU, you will suffer from the memory allocation and copying (Host -> Device for the input and Device -> Host for the result) overheads. This leads to performance degradation compared to a simple CPU solution.
Even if your array is already in device memory, each computation block needs to read it into local memory or registers (at least access device memory), and that can't be done significantly faster than on the CPU.
In general, CUDA accelerates performance well if:
The asymptotic complexity of the computation is high compared to the input data length. For example, the input data length is n and the complexity is O(n^2) or O(n^3).
There is a way to split the task into independent or weakly dependent subtasks.
So if I were you, I would not try to do computations of this kind on CUDA if possible. And if it must be a standalone function, or an output-format conversion for some other function, I would do it on the CPU.
If it's part of some more complex algorithm, the answer is more complicated. If I were in your place, I would try to somehow change the
[0,3,5]
format, because it limits how well you can utilize CUDA's computation power. You can't effectively split your task into independent blocks. For example, if I process 10 integers in one computation thread and the next 10 integers in another, the second one doesn't know where to place its outputs until the first one has finished. Maybe I would split the array into subarrays and store the answer for each subarray separately. It highly depends on what computations you are doing.