Thrust inside user-written kernels
I am a newbie to Thrust. I see that all Thrust presentations and examples only show host code.
I would like to know if I can pass a device_vector to my own kernel? How?
If yes, what are the operations permitted on it inside kernel/device code?
5 Answers
As it was originally written, Thrust is purely a host-side abstraction. It cannot be used inside kernels. You can pass the device memory encapsulated inside a thrust::device_vector to your own kernel, and you can also use device memory not allocated by Thrust within Thrust algorithms by instantiating a thrust::device_ptr with the bare CUDA device memory pointer. Both patterns are shown in the sketch below.
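The answer's original code block did not survive in this copy; what follows is a minimal sketch of both patterns, with illustrative names (someKernel and the sizes are made up):

```cuda
// The answer's code block was lost in this copy; a minimal sketch of both
// patterns it describes, with illustrative names (someKernel is hypothetical).
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/reduce.h>
#include <cstdio>

__global__ void someKernel(float* data, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;   // ordinary device code on raw memory
}

int main()
{
    // 1) Pass memory owned by a device_vector to a hand-written kernel.
    thrust::device_vector<float> d_vec(1000, 1.0f);
    float* raw = thrust::raw_pointer_cast(d_vec.data());
    someKernel<<<(1000 + 255) / 256, 256>>>(raw, d_vec.size());
    cudaDeviceSynchronize();

    // 2) Wrap bare CUDA memory in a device_ptr so Thrust algorithms can
    //    consume memory that Thrust did not allocate.
    float* bare;
    cudaMalloc(&bare, 1000 * sizeof(float));
    cudaMemset(bare, 0, 1000 * sizeof(float));
    thrust::device_ptr<float> dp(bare);
    float sum = thrust::reduce(dp, dp + 1000);
    printf("sum = %f\n", sum);

    cudaFree(bare);
    return 0;
}
```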
Edited four and a half years later to add that, as per @JackOLantern's answer, Thrust 1.8 adds a sequential execution policy, which means you can run single-threaded versions of Thrust's algorithms on the device. Note that it still isn't possible to directly pass a Thrust device vector to a kernel, and device vectors can't be directly used in device code.
Note that it is also possible to use the thrust::device execution policy in some cases to have parallel Thrust execution launched by a kernel as a child grid. This requires separate compilation/device linkage and hardware which supports dynamic parallelism. I am not certain whether this is actually supported in all Thrust algorithms or not, but it certainly works with some.
Edit: Dynamic parallelism in Thrust was deprecated with Thrust 1.15.0. See Using thrust::device execution policy in device code should fail to compile for the reasoning and alternatives.

This is an update to my previous answer.

Starting from Thrust 1.8.1, CUDA Thrust primitives can be combined with the thrust::device execution policy to run in parallel within a single CUDA thread, exploiting CUDA dynamic parallelism. Below, an example is reported.
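The example code was lost in this copy; the following is a reconstructed sketch of the idea, not the answer's original listing. Matrix sizes and kernel names are illustrative:

```cuda
// A reconstructed sketch: each GPU thread reduces one matrix row by calling
// thrust::reduce from device code, once with thrust::seq (sequential, within
// the thread) and once with thrust::device (a child grid launched via dynamic
// parallelism). Build with e.g. nvcc -rdc=true -lcudadevrt, on hardware that
// supports dynamic parallelism; note the thrust::device variant was
// deprecated in Thrust 1.15.0 (see the edit above).
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#include <cstdio>

__global__ void rowReduceSeq(const float* d_M, float* d_sums, int Ncols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    // Sequential reduction performed entirely within this thread.
    d_sums[row] = thrust::reduce(thrust::seq,
                                 d_M + row * Ncols, d_M + (row + 1) * Ncols);
}

__global__ void rowReduceDev(const float* d_M, float* d_sums, int Ncols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    // Parallel reduction launched from this thread as a child grid.
    d_sums[row] = thrust::reduce(thrust::device,
                                 d_M + row * Ncols, d_M + (row + 1) * Ncols);
}

int main()
{
    const int Nrows = 64, Ncols = 1024;
    thrust::device_vector<float> d_M(Nrows * Ncols, 1.0f);
    thrust::device_vector<float> d_sums(Nrows);

    const float* pM = thrust::raw_pointer_cast(d_M.data());
    float*       pS = thrust::raw_pointer_cast(d_sums.data());

    rowReduceSeq<<<1, Nrows>>>(pM, pS, Ncols);   // one thread per row
    rowReduceDev<<<1, Nrows>>>(pM, pS, Ncols);
    cudaDeviceSynchronize();

    printf("row 0 sum = %f (expected %d)\n", (float)d_sums[0], Ncols);
    return 0;
}
```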
The above example performs reductions of the rows of a matrix in the same sense as Reduce matrix rows with CUDA, but it is done differently from the above post, namely, by calling CUDA Thrust primitives directly from user-written kernels. Also, the above example serves to compare the performance of the same operations when done with two execution policies, namely, thrust::seq and thrust::device. Some graphs (not reproduced here) show the difference in performance, evaluated on a Kepler K20c and on a Maxwell GeForce GTX 850M.
I would like to provide an updated answer to this question.
Starting from Thrust 1.8, CUDA Thrust primitives can be combined with the thrust::seq execution policy to run sequentially within a single CUDA thread (or sequentially within a single CPU thread). Below, an example is reported.

If you want parallel execution within a thread, then you may consider using CUB, which provides reduction routines that can be called from within a threadblock, provided that your card enables dynamic parallelism.
Here is the example with Thrust:
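The original listing was lost in this copy; this is a reconstructed sketch of the kind of example the answer refers to, with an illustrative kernel name and sizes:

```cuda
// A single CUDA thread runs thrust::reduce sequentially via the thrust::seq
// execution policy; the same reduction is repeated on the host for comparison.
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#include <cstdio>

__global__ void test(const float* d_A, int N)
{
    // Runs entirely within this one thread; no parallelism involved.
    float sum = thrust::reduce(thrust::seq, d_A, d_A + N);
    printf("Device-side result = %f\n", sum);
}

int main()
{
    const int N = 16;
    float h_A[N], hostSum = 0.0f;
    for (int i = 0; i < N; ++i) { h_A[i] = (float)i; hostSum += h_A[i]; }

    float* d_A;
    cudaMalloc(&d_A, N * sizeof(float));
    cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);

    test<<<1, 1>>>(d_A, N);   // single thread
    cudaDeviceSynchronize();

    printf("Host-side result   = %f\n", hostSum);
    cudaFree(d_A);
    return 0;
}
```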
If you mean to use the data allocated/processed by Thrust, yes you can; just get the raw pointer of the allocated data. If you want to allocate Thrust vectors inside a kernel, I have never tried that, but I don't think it will work; and even if it does work, I don't think it will provide any benefit.
Nowadays Thrust comes as a part of the CCCL (CUDA C++ Core Libraries), which also includes libcu++ with its non-owning cuda::std::span. Sadly, interfacing between a Thrust vector and a libcu++ span still needs that ugly thrust::raw_pointer_cast. But passing a cuda::std::span with pointer and size from a Thrust vector to a custom kernel or device functor is the closest we will probably get to passing the actual vector, because span has all the nice member functions we are used to, like .begin(), .end(), .size(), etc., and the cuda:: version marks them __host__ __device__ as well, so they can be used both in host and in device code. Unlike thrust::device_vector, which mediates host access through thrust::device_ptr, a span does not handle access to device memory from host code.

I hope that CCCL will be updated to allow directly initializing a cuda::std::span with a Thrust vector, i.e. cuda::std::span<float> my_span{my_vec};. While span already has a constructor taking a range (i.e. passing e.g. a std::vector should work), it still struggles with Thrust's wrappers like thrust::device_ptr.
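A minimal sketch of the approach described above, assuming CCCL headers are available (kernel and variable names are illustrative):

```cuda
// Wrap a thrust::device_vector's memory in a cuda::std::span and hand that
// span to a custom kernel. Requires the CCCL headers shipped with CUDA 12+.
#include <thrust/device_vector.h>
#include <cuda/std/span>
#include <cstdio>

__global__ void scale(cuda::std::span<float> data, float factor)
{
    // .size() and operator[] work in device code because the cuda:: span
    // marks its members __host__ __device__.
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < data.size()) data[i] *= factor;
}

int main()
{
    thrust::device_vector<float> d_vec(1024, 1.0f);

    // raw_pointer_cast is still needed to escape thrust::device_ptr.
    cuda::std::span<float> d_span{thrust::raw_pointer_cast(d_vec.data()),
                                  d_vec.size()};

    scale<<<(d_vec.size() + 255) / 256, 256>>>(d_span, 2.0f);
    cudaDeviceSynchronize();

    printf("d_vec[0] = %f\n", (float)d_vec[0]);   // expect 2.0
    return 0;
}
```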