Creating a linked list with CUDA
Is it possible to create a linked list on a GPU using CUDA?
I am trying to do this and I am encountering some difficulties.
If I can't allocate dynamic memory in a CUDA kernel, then how can I create a new node and add it to the linked list?
Comments (4)
You really don't want to do this if you can help it - the best thing you can do if you can't get away from linked lists is to emulate them via arrays and use array indices rather than pointers for your links.
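Here is a minimal sketch of that index-based approach (the Node struct, the NULL_INDEX sentinel, and the kernel name are illustrative, not from the original answer): all nodes live in one pre-allocated array, and next stores an array index instead of a pointer.

```cuda
// Emulating a linked list with a pre-allocated node array and index links.
#include <cstdio>
#include <cuda_runtime.h>

#define NULL_INDEX (-1)        // plays the role of a NULL pointer

struct Node {
    int value;
    int next;                  // index of the next node in the array
};

// A single thread walks the list starting at `head` and sums the values.
__global__ void sumList(const Node* nodes, int head, int* result)
{
    int sum = 0;
    for (int i = head; i != NULL_INDEX; i = nodes[i].next)
        sum += nodes[i].value;
    *result = sum;
}

int main()
{
    // Build a 4-node list 0 -> 1 -> 2 -> 3 on the host.
    Node h_nodes[4];
    for (int i = 0; i < 4; ++i) {
        h_nodes[i].value = i + 1;
        h_nodes[i].next  = (i < 3) ? i + 1 : NULL_INDEX;
    }

    Node* d_nodes = nullptr;
    int*  d_result = nullptr;
    int   h_result = 0;
    cudaMalloc((void**)&d_nodes, sizeof(h_nodes));
    cudaMalloc((void**)&d_result, sizeof(int));
    cudaMemcpy(d_nodes, h_nodes, sizeof(h_nodes), cudaMemcpyHostToDevice);

    sumList<<<1, 1>>>(d_nodes, 0, d_result);
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", h_result);   // expected: 10

    cudaFree(d_nodes);
    cudaFree(d_result);
    return 0;
}
```

Because the links are plain indices into a single array, the same traversal loop works unchanged on the host after copying the array back.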
There are some valid use cases for linked lists on a GPU. Consider using a Skip List as an alternative as they provide faster operations. There are examples of highly concurrent Skip List algorithms available via Google searches.
Check out this link http://www.cse.iitk.ac.in/users/mainakc/lockfree.html/ for CUDA code, plus PDF and PPT presentations on a number of lock-free CUDA data structures.
Linked lists can be constructed in parallel using a reduction algorithm approach. This assumes that ALL members are known at construction time. Each thread starts by connecting 2 nodes. Then half the threads connect the 2-node segments together, and so on, halving the number of threads each iteration. This will build a list in log2(N) time.
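As a rough sketch of that log2(N) construction (assuming the nodes already sit in device memory in the desired list order; the struct and function names are illustrative), each pass splices neighbouring segments whose size doubles every iteration:

```cuda
// Reduction-style construction of an index-linked list over a node array.
#include <cuda_runtime.h>

#define NULL_INDEX (-1)

struct Node { int value; int next; };   // index-based node, as in the sketch above

// Every node starts out as its own 1-node segment.
__global__ void initNext(Node* nodes, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) nodes[i].next = NULL_INDEX;
}

// One reduction step with segment size `stride`: thread t splices the segment
// starting at 2*stride*t onto the segment starting at 2*stride*t + stride.
__global__ void linkStep(Node* nodes, int n, int stride)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int left  = 2 * stride * t;          // first node of the left segment
    int right = left + stride;           // first node of the right segment
    if (right >= n) return;
    nodes[right - 1].next = right;       // tail of left segment -> head of right
}

// Host side: log2(N) launches, doubling the segment size each time.
void buildList(Node* d_nodes, int n)
{
    const int block = 256;
    initNext<<<(n + block - 1) / block, block>>>(d_nodes, n);
    for (int stride = 1; stride < n; stride *= 2) {
        int threads = (n + 2 * stride - 1) / (2 * stride);
        linkStep<<<(threads + block - 1) / block, block>>>(d_nodes, n, stride);
    }
    cudaDeviceSynchronize();
}
```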
Memory allocation is a constraint. Pre-allocate all the nodes in an array on the host. Then you can use array subscripts in place of pointers. That has the advantage that the list traversal is valid on the GPU and the host.
For concurrency you need to use CUDA atomic operations: atomic add/increment to count the nodes used from the node array, and compare-and-swap to set the links between nodes.
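A hedged sketch of how those two atomics could fit together (the pool layout, names, and retry loop are mine, not from the answer): atomicAdd hands out slots from the pre-allocated node array, and atomicCAS publishes each new node at the head of a shared list.

```cuda
// Concurrent push-front onto an index-linked list: atomicAdd claims a node
// from a pre-allocated pool, atomicCAS publishes the new head link.
#include <cuda_runtime.h>

#define NULL_INDEX (-1)

struct Node { int value; int next; };

struct NodePool {
    Node* nodes;       // pre-allocated array of nodes
    int*  counter;     // how many nodes have been handed out so far
    int   capacity;
};

// Each thread claims one node and pushes it onto the list whose head index
// is stored at *head (initialise *head to NULL_INDEX and *counter to 0).
__global__ void pushFront(NodePool pool, int* head, const int* values, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Claim a slot; atomicAdd returns the counter's previous value.
    int idx = atomicAdd(pool.counter, 1);
    if (idx >= pool.capacity) return;          // pool exhausted, drop the insert

    pool.nodes[idx].value = values[tid];

    // Lock-free push: retry the compare-and-swap until the head is updated.
    int oldHead = *head;
    do {
        pool.nodes[idx].next = oldHead;        // tentatively link to current head
        __threadfence();                       // make the node visible before publishing
        oldHead = atomicCAS(head, oldHead, idx);
    } while (pool.nodes[idx].next != oldHead); // CAS succeeded when these match
}
```

Note that the list order produced this way depends on thread scheduling, which is usually acceptable when the list is just a shared work collection.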
Again, carefully consider the use case and access patterns. Using one large linked list is very serial. Using 100 to 100s of small linked lists is more parallel. I expect the memory access to be uncoalesced unless care is taken to allocate connected nodes in adjacent memory locations.
I agree with Paul, linked lists are a very 'serial' way of thinking. Forget what you've learned about serial operations and just do everything at once : )
Take a look at Thrust for ways of doing common operations.
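For instance, a minimal Thrust snippet (the data here is made up purely for illustration) sorts and sums a device vector with no hand-written kernels or pointer chasing:

```cuda
// Sort and reduce on the GPU using Thrust, with no explicit kernels.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>

int main()
{
    thrust::device_vector<int> d(4);
    d[0] = 3; d[1] = 1; d[2] = 4; d[3] = 2;

    thrust::sort(d.begin(), d.end());                 // parallel sort on the device
    int sum = thrust::reduce(d.begin(), d.end(), 0);  // parallel reduction

    printf("sum = %d\n", sum);                        // expected: 10
    return 0;
}
```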