How do I use CUDA constant memory in a programmer-friendly way?

Published 2024-09-29 13:30:17


I'm working on a number crunching app using the CUDA framework. I have some static data that should be accessible to all threads, so I've put it in constant memory like this:

__device__ __constant__ CaseParams deviceCaseParams;

I use the call cudaMemcpyToSymbol to transfer these params from the host to the device:

void copyMetaData(CaseParams* caseParams)
{
    cudaMemcpyToSymbol("deviceCaseParams", caseParams, sizeof(CaseParams));
}

which works.

Anyways, it seems (by trial and error, and also from reading posts on the net) that for some sick reason, the declaration of deviceCaseParams and the copy operation on it (the call to cudaMemcpyToSymbol) must be in the same file. At the moment I have these two in a .cu file, but I really want to have the parameter struct in a .cuh file so that any implementation could see it if it wants to. That means that I also have to have the copyMetaData function in a header file, but this messes up linking (symbol already defined) since both .cpp and .cu files include this header (and thus both the MS C++ compiler and nvcc compile it).

Does anyone have any advice on design here?

Update: See the comments
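One common way to resolve the layout described in the question is to put only the struct definition and a plain host-side prototype in the header, and keep the `__constant__` symbol plus the copy wrapper in exactly one .cu file. The sketch below assumes that split; the `CaseParams` fields are placeholders, not from the original question:

```cpp
// CaseParams.cuh -- safe to include from both .cpp and .cu files:
// only the struct and an ordinary host function prototype live here.
struct CaseParams {
    int   numSamples;   // hypothetical fields for illustration
    float threshold;
};

void copyMetaData(const CaseParams* caseParams);  // defined in one .cu file

// deviceCaseParams.cu -- the ONLY translation unit that owns the symbol.
#include "CaseParams.cuh"

__device__ __constant__ CaseParams deviceCaseParams;

void copyMetaData(const CaseParams* caseParams)
{
    // Passing the symbol itself rather than the string "deviceCaseParams":
    // the string form was later deprecated and removed from the runtime API.
    cudaMemcpyToSymbol(deviceCaseParams, caseParams, sizeof(CaseParams));
}
```

With this layout, .cpp files call `copyMetaData` through the prototype and never see any CUDA-specific declarations, so only nvcc ever compiles the symbol definition and the "symbol already defined" link error goes away. Note that kernels reading `deviceCaseParams` still need to live in that same .cu file unless you compile with relocatable device code (`-rdc=true`) and declare the symbol `extern` in the header.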


Comments (2)

心舞飞扬 2024-10-06 13:30:17


With an up-to-date CUDA (e.g. 3.2) you should be able to do the memcpy from within a different translation unit if you're looking up the symbol at runtime (i.e. by passing a string as the first arg to cudaMemcpyToSymbol as you are in your example).

Also, with Fermi-class devices you can just malloc the memory (cudaMalloc), copy to the device memory, and then pass the argument as a const pointer. The compiler will recognise if you are accessing the data uniformly across the warps and if so will use the constant cache. See the CUDA Programming Guide for more info. Note: you would need to compile with -arch=sm_20.
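The pointer-based alternative described above can be sketched as follows. `CaseParams`, the kernel body, and the launch configuration are all illustrative assumptions, not code from the answer:

```cpp
// Sketch: ordinary device memory + a const pointer argument.
// On Fermi-class hardware (-arch=sm_20), uniform reads through such a
// pointer can be served from the constant cache automatically.
struct CaseParams { int numSamples; float threshold; };

__global__ void crunch(const CaseParams* __restrict__ params, float* out)
{
    // Every thread in the warp reads the same params fields,
    // so the access pattern is uniform across the warp.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < params->numSamples && out[i] < params->threshold)
        out[i] = 0.0f;
}

void launch(const CaseParams& hostParams, float* devOut, int numBlocks)
{
    CaseParams* devParams = nullptr;
    cudaMalloc(&devParams, sizeof(CaseParams));
    cudaMemcpy(devParams, &hostParams, sizeof(CaseParams),
               cudaMemcpyHostToDevice);

    crunch<<<numBlocks, 256>>>(devParams, devOut);

    cudaFree(devParams);
}
```

Because the parameters travel through an ordinary pointer, nothing here depends on which translation unit declares what, which sidesteps the symbol-visibility problem from the question entirely.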

巴黎盛开的樱花 2024-10-06 13:30:17


If you're using pre-Fermi CUDA, you will have found out by now that this problem doesn't just apply to constant memory, it applies to anything you want on the CUDA side of things. The only two ways I have found around this are to either:

  1. Write everything CUDA in a single file (.cu), or
  2. If you need to break out code into separate files, restrict yourself to headers which your single .cu file then includes.

If you need to share code between CUDA and C/C++, or have some common code you share between projects, option 2 is the only choice. It seems very unnatural to start with, but it solves the problem. You still get to structure your code, just not in a typical C-like way. The main overhead is that every time you do a build you compile everything. The plus side of this (which I think is possibly why it works this way) is that the CUDA compiler has access to all the source code in one hit, which is good for optimisation.
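Option 2 amounts to a single umbrella translation unit. A minimal sketch of what that file might look like (all header names are hypothetical):

```cpp
// everything.cu -- the single translation unit that nvcc compiles.
// Each header contains both declarations AND definitions, since no
// other .cu file will ever provide them.
#include "CaseParams.cuh"       // shared structs, also included by .cpp code
#include "constantMemory.cuh"   // __constant__ symbols + copy helpers
#include "kernels.cuh"          // __global__ function definitions
#include "hostInterface.cuh"    // plain host entry points called from .cpp
```

The .cpp side of the project then links only against the host entry points, so the device symbols, kernels, and their cudaMemcpyToSymbol calls all end up in the same translation unit, which is exactly the constraint pre-Fermi CUDA imposed.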
