Mixing custom memory management and Thrust in CUDA

In my project, I have implemented a custom memory allocator to avoid unnecessary calls to cudaMalloc once the application has "warmed up". Moreover, I use custom kernels for basic array filling, arithmetic operations between arrays, etc., and would like to simplify my code by using Thrust and getting rid of these kernels. Every array on the device is created and accessed through raw pointers (for now) and I'd like to use device_vector and Thrust's methods on these objects, but I find myself converting between raw pointers and device_ptr<> all the time, somewhat cluttering up my code.

My rather vague question: How would/do you organize the usage of custom memory management, Thrust's array methods and calls to custom kernels in the most readable way?
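
For context, the back-and-forth between raw pointers and device_ptr<> described above looks roughly like the following minimal sketch (the cudaMalloc call here merely stands in for the custom allocator, and the names are illustrative):

#include <thrust/device_ptr.h>
#include <thrust/memory.h>
#include <thrust/fill.h>
#include <thrust/reduce.h>
#include <cuda_runtime.h>
#include <iostream>

int main()
{
  const int n = 10;

  // a raw device pointer, as it would come from a custom allocator
  float* d_data = 0;
  cudaMalloc(reinterpret_cast<void**>(&d_data), n * sizeof(float));

  // wrap the raw pointer so Thrust algorithms can operate on it
  thrust::device_ptr<float> d_ptr = thrust::device_pointer_cast(d_data);
  thrust::fill(d_ptr, d_ptr + n, 1.0f);
  float sum = thrust::reduce(d_ptr, d_ptr + n);   // 10.0f

  // and back to a raw pointer wherever a custom kernel expects one
  float* raw = thrust::raw_pointer_cast(d_ptr);

  std::cout << "sum = " << sum << std::endl;

  cudaFree(raw);
  return 0;
}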

淡忘如思 2025-01-04 23:27:41

As with all standard C++ containers, you can customize how thrust::device_vector allocates storage by providing it with your own "allocator". By default, thrust::device_vector's allocator is thrust::device_malloc_allocator, which allocates (deallocates) storage with cudaMalloc (cudaFree) when Thrust's backend system is CUDA.

Occasionally, it is desirable to customize the way device_vector allocates memory, as in the OP's case: they would like to sub-allocate storage within a single large allocation performed at program initialization. This avoids the overhead that many individual calls to the underlying allocation scheme, in this case cudaMalloc, can incur.

A simple way to provide device_vector a custom allocator is to inherit from device_malloc_allocator. One could in principle author an entire allocator from scratch, but with an inheritance approach, only the allocate and deallocate member functions need to be provided. Once the custom allocator is defined, it can be provided to device_vector as its second template parameter.

This example code demonstrates how to provide a custom allocator which prints a message upon allocation and deallocation:

#include <thrust/device_malloc_allocator.h>
#include <thrust/device_vector.h>
#include <iostream>

template<typename T>
  struct my_allocator : thrust::device_malloc_allocator<T>
{
  // shorthand for the name of the base class
  typedef thrust::device_malloc_allocator<T> super_t;

  // get access to some of the base class's typedefs

  // note that because we inherited from device_malloc_allocator,
  // pointer is actually thrust::device_ptr<T>
  typedef typename super_t::pointer   pointer;

  typedef typename super_t::size_type size_type;

  // customize allocate
  pointer allocate(size_type n)
  {
    std::cout << "my_allocator::allocate(): Hello, world!" << std::endl;

    // defer to the base class to allocate storage for n elements of type T
    // in practice, you'd do something more interesting here
    return super_t::allocate(n);
  }

  // customize deallocate
  void deallocate(pointer p, size_type n)
  {
    std::cout << "my_allocator::deallocate(): Hello, world!" << std::endl;

    // defer to the base class to deallocate n elements of type T at address p
    // in practice, you'd do something more interesting here
    super_t::deallocate(p,n);
  }
};

int main()
{
  // create a device_vector which uses my_allocator
  thrust::device_vector<int, my_allocator<int> > vec;

  // create 10 ints
  vec.resize(10, 13);

  return 0;
}

Here's the output:

$ nvcc my_allocator_test.cu -arch=sm_20 -run
my_allocator::allocate(): Hello, world!
my_allocator::deallocate(): Hello, world!

In this example, note that my_allocator::allocate() is invoked once, upon vec.resize(10, 13). my_allocator::deallocate() is invoked once when vec goes out of scope and destroys its elements.
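
Building on this pattern, a rough sketch of the sub-allocation idea from the question could look like the code below. This is only an illustration under simplifying assumptions: device_arena, arena_allocator and the bump-pointer scheme (which never recycles individual blocks and only reclaims memory when the whole arena is released) are invented for this sketch and are not part of Thrust.

#include <thrust/device_malloc_allocator.h>
#include <thrust/device_vector.h>
#include <cuda_runtime.h>
#include <new>       // std::bad_alloc
#include <cstddef>

// One large device allocation, performed up front, that later allocations
// are carved out of. Deallocation of individual blocks is a no-op here.
struct device_arena
{
  char*       base;
  std::size_t capacity;
  std::size_t offset;

  void init(std::size_t bytes)
  {
    cudaMalloc(reinterpret_cast<void**>(&base), bytes);
    capacity = bytes;
    offset   = 0;
  }

  void* suballocate(std::size_t bytes)
  {
    // keep every block 256-byte aligned, matching cudaMalloc's guarantee
    std::size_t aligned = (bytes + 255) & ~std::size_t(255);
    if (offset + aligned > capacity) throw std::bad_alloc();
    void* p = base + offset;
    offset += aligned;
    return p;
  }

  void release()
  {
    cudaFree(base);
    base = 0; capacity = 0; offset = 0;
  }
};

device_arena g_arena;  // initialized once, e.g. at program start-up

template<typename T>
  struct arena_allocator : thrust::device_malloc_allocator<T>
{
  typedef thrust::device_malloc_allocator<T> super_t;
  typedef typename super_t::pointer   pointer;
  typedef typename super_t::size_type size_type;

  pointer allocate(size_type n)
  {
    // sub-allocate from the arena instead of calling cudaMalloc
    T* raw = static_cast<T*>(g_arena.suballocate(n * sizeof(T)));
    return pointer(raw);   // pointer is thrust::device_ptr<T>
  }

  void deallocate(pointer p, size_type n)
  {
    // a real pool would recycle this block; the bump scheme keeps it
    // reserved until g_arena.release() frees the whole arena
    (void)p; (void)n;
  }
};

int main()
{
  g_arena.init(1 << 20);   // a single 1 MB allocation up front

  {
    // both vectors draw their storage from the arena, no further cudaMalloc
    thrust::device_vector<float, arena_allocator<float> > a(1000, 1.0f);
    thrust::device_vector<float, arena_allocator<float> > b(1000, 2.0f);
  }

  g_arena.release();
  return 0;
}

A thrust::device_vector<float, arena_allocator<float> > then draws its storage from the single up-front allocation, while custom kernels can still receive raw pointers via thrust::raw_pointer_cast(vec.data()).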
