Mixing custom memory management and Thrust in CUDA
In my project, I have implemented a custom memory allocator to avoid unnecessary calls to cudaMalloc once the application has "warmed up". Moreover, I use custom kernels for basic array filling, arithmetic operations between arrays, etc., and would like to simplify my code by using Thrust and getting rid of these kernels. Every array on the device is created and accessed through raw pointers (for now) and I'd like to use device_vector and Thrust's methods on these objects, but I find myself converting between raw pointers and device_ptr<> all the time, somewhat cluttering up my code.

My rather vague question: How would/do you organize the usage of custom memory management, Thrust's array methods, and calls to custom kernels in the most readable way?
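(For concreteness, the conversion clutter described above looks roughly like this; my_kernel and example are illustrative names, not code from the actual project:)

    #include <thrust/device_ptr.h>
    #include <thrust/fill.h>

    // Hypothetical custom kernel operating on a raw device pointer.
    __global__ void my_kernel(float* data, int n);

    // d_buf is assumed to come from the custom allocator as a raw pointer.
    void example(float* d_buf, int n)
    {
        // Wrap the raw pointer so Thrust algorithms accept it...
        thrust::device_ptr<float> p = thrust::device_pointer_cast(d_buf);
        thrust::fill(p, p + n, 0.0f);

        // ...then unwrap it again to launch the custom kernel.
        my_kernel<<<(n + 255) / 256, 256>>>(thrust::raw_pointer_cast(p), n);
    }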
1 Answer
Like all standard C++ containers, you can customize how thrust::device_vector allocates storage by providing it with your own "allocator". By default, thrust::device_vector's allocator is thrust::device_malloc_allocator, which allocates (deallocates) storage with cudaMalloc (cudaFree) when Thrust's backend system is CUDA.

Occasionally, it is desirable to customize the way device_vector allocates memory, such as in the OP's case, who would like to sub-allocate storage within a single large allocation performed at program initialization. This can avoid the overhead which may be incurred by many individual calls to the underlying allocation scheme, in this case cudaMalloc (a sketch of such a pool allocator appears at the end of this answer).

A simple way to provide device_vector with a custom allocator is to inherit from device_malloc_allocator. One could in principle author an entire allocator from scratch, but with an inheritance approach, only the allocate and deallocate member functions need to be provided. Once the custom allocator is defined, it can be provided to device_vector as its second template parameter.

This example code demonstrates how to provide a custom allocator which prints a message upon allocation and deallocation:
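(The code listing itself did not survive extraction; the following is a reconstruction consistent with the surrounding description. The message strings and the Before/After markers are assumptions:)

    #include <thrust/device_malloc_allocator.h>
    #include <thrust/device_vector.h>
    #include <iostream>

    template<typename T>
    struct my_allocator : thrust::device_malloc_allocator<T>
    {
        typedef thrust::device_malloc_allocator<T> super_t;
        typedef typename super_t::pointer   pointer;
        typedef typename super_t::size_type size_type;

        // Announce the allocation, then defer to the base class (cudaMalloc).
        pointer allocate(size_type n)
        {
            std::cout << "my_allocator::allocate(): Hello, world!" << std::endl;
            return super_t::allocate(n);
        }

        // Announce the deallocation, then defer to the base class (cudaFree).
        void deallocate(pointer p, size_type n)
        {
            std::cout << "my_allocator::deallocate(): Hello, world!" << std::endl;
            super_t::deallocate(p, n);
        }
    };

    int main()
    {
        // The custom allocator goes in as device_vector's second template parameter.
        thrust::device_vector<int, my_allocator<int> > vec;

        std::cout << "Before vector resize" << std::endl;
        vec.resize(10, 13);
        std::cout << "After vector resize" << std::endl;

        return 0; // vec is destroyed here, invoking deallocate() once
    }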
Here's the output:
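(The output listing was also lost in extraction; the reconstruction above prints:)

    Before vector resize
    my_allocator::allocate(): Hello, world!
    After vector resize
    my_allocator::deallocate(): Hello, world!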
In this example, note that we hear from my_allocator::allocate() once upon vec.resize(10,13). my_allocator::deallocate() is invoked once when vec goes out of scope as it destroys its elements.
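Extending the same inheritance approach to the sub-allocation scenario from the question, a bump-pointer pool could look roughly like this (a sketch under stated assumptions: deallocate is a no-op and the whole pool is freed at once, which suits a "warm up once, then reuse" workload; device_pool and pool_allocator are illustrative names, not Thrust API):

    #include <thrust/device_malloc_allocator.h>
    #include <thrust/device_vector.h>
    #include <cuda_runtime.h>
    #include <cstddef>
    #include <new>

    // One big cudaMalloc at startup; suballocate() bumps an offset into it.
    struct device_pool
    {
        char*       base;
        std::size_t offset;
        std::size_t size;

        void init(std::size_t bytes)
        {
            cudaMalloc(reinterpret_cast<void**>(&base), bytes);
            size = bytes;
            offset = 0;
        }

        void* suballocate(std::size_t bytes)
        {
            // Keep 256-byte alignment, matching cudaMalloc's guarantee.
            std::size_t aligned = (bytes + 255) & ~std::size_t(255);
            if (offset + aligned > size) throw std::bad_alloc();
            void* p = base + offset;
            offset += aligned;
            return p;
        }

        void release() { cudaFree(base); base = 0; offset = size = 0; }
    };

    device_pool g_pool; // assumed to be initialized during program warm-up

    template<typename T>
    struct pool_allocator : thrust::device_malloc_allocator<T>
    {
        typedef thrust::device_malloc_allocator<T> super_t;
        typedef typename super_t::pointer   pointer;
        typedef typename super_t::size_type size_type;

        // Carve storage out of the pool instead of calling cudaMalloc.
        pointer allocate(size_type n)
        {
            return pointer(static_cast<T*>(g_pool.suballocate(n * sizeof(T))));
        }

        // Individual frees are no-ops; the pool is released wholesale.
        void deallocate(pointer, size_type) {}
    };

    int main()
    {
        g_pool.init(1 << 20); // 1 MiB pool, sized for the application
        {
            // No cudaMalloc here beyond the pool's single upfront call.
            thrust::device_vector<float, pool_allocator<float> > v(1000, 1.0f);
        }
        g_pool.release();
        return 0;
    }

One caveat worth stating: a bump pointer never reuses freed space, so this only pays off when allocations are long-lived or the pool is reset between phases; a production version would need a real free list.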