How to use CUDA constant memory in a programmer-pleasant way?
I'm working on a number crunching app using the CUDA framework. I have some static data that should be accessible to all threads, so I've put it in constant memory like this:
__device__ __constant__ CaseParams deviceCaseParams;
I use the call cudaMemcpyToSymbol to transfer these params from the host to the device:
void copyMetaData(CaseParams* caseParams)
{
    cudaMemcpyToSymbol("deviceCaseParams", caseParams, sizeof(CaseParams));
}
which works.
Anyway, it seems (by trial and error, and also from reading posts on the net) that for some sick reason, the declaration of deviceCaseParams and the copy operation on it (the call to cudaMemcpyToSymbol) must be in the same file. At the moment I have these two in a .cu file, but I really want to have the parameter struct in a .cuh file so that any implementation can see it if it wants to. That means that I also have to put the copyMetaData function in a header file, but this messes up linking (symbol already defined), since both .cpp and .cu files include this header (and thus both the MS C++ compiler and nvcc compile it).
Does anyone have any advice on design here?
Update: See the comments
Comments (2)
With an up-to-date CUDA (e.g. 3.2) you should be able to do the memcpy from within a different translation unit if you're looking up the symbol at runtime (i.e. by passing a string as the first argument to cudaMemcpyToSymbol, as you are in your example).
Also, with Fermi-class devices you can just allocate the memory with cudaMalloc, copy the data to device memory, and then pass the argument as a const pointer. The compiler will recognise whether you are accessing the data uniformly across the warps and, if so, will use the constant cache. See the CUDA Programming Guide for more info. Note: you would need to compile with -arch=sm_20.
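The cudaMalloc route described above might look roughly like the sketch below. It is only an illustration under assumptions: the fields of CaseParams, the kernel applyParams, and the helper runWithParams are hypothetical names that do not appear in the question, and whether the loads actually go through the constant cache depends on the architecture and compiler version.

```cuda
#include <cuda_runtime.h>

// Hypothetical parameter struct; the real CaseParams is not shown in the question.
struct CaseParams {
    float scale;
    int   offset;
};

// Every thread reads the same address through a const pointer,
// so the access to *params is uniform across each warp.
__global__ void applyParams(const CaseParams* __restrict__ params, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = params->scale * static_cast<float>(i + params->offset);
    }
}

// Hypothetical host helper: copies the parameters into ordinary device memory,
// launches the kernel, and copies the n results back into hostOut.
cudaError_t runWithParams(const CaseParams& hostParams, float* hostOut, int n)
{
    CaseParams* devParams = nullptr;
    float*      devOut    = nullptr;

    cudaError_t err = cudaMalloc(&devParams, sizeof(CaseParams));
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&devOut, n * sizeof(float));
    if (err != cudaSuccess) { cudaFree(devParams); return err; }

    err = cudaMemcpy(devParams, &hostParams, sizeof(CaseParams), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int threads = 128;
        const int blocks  = (n + threads - 1) / threads;
        applyParams<<<blocks, threads>>>(devParams, devOut, n);
        err = cudaGetLastError();  // reports launch failures
    }
    if (err == cudaSuccess) {
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(hostOut, devOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    }

    cudaFree(devOut);
    cudaFree(devParams);
    return err;
}
```

Because the parameters are passed as a plain pointer, the kernel and the copy no longer need to share a translation unit, which sidesteps the symbol-visibility problem from the question.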
If you're using pre-Fermi CUDA, you will have found out by now that this problem doesn't just apply to constant memory, it applies to anything you want on the CUDA side of things. The only two ways I have found around this are to either:
If you need to share code between CUDA and C/C++, or have some common code you share between projects, option 2 is the only choice. It seems very unnatural to start with, but it solves the problem. You still get to structure your code, just not in a typical C-like way. The main overhead is that every time you do a build you compile everything. The plus side of this (which I think is possibly why it works this way) is that the CUDA compiler has access to all the source code in one hit, which is good for optimisation.
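For the layout the question asks about (the struct visible to every file, the constant symbol and its copy kept together in one .cu file), one possible arrangement is sketched below. The file names and the fields of CaseParams are hypothetical, and the two files are shown in a single listing for brevity. The sketch passes the symbol itself to cudaMemcpyToSymbol rather than its name as a string, because string-based symbol lookup was removed in CUDA 5.0.

```cuda
// ---- caseparams.cuh (hypothetical header; included by both .cpp and .cu files) ----
// In a real header this part would sit behind an include guard.
#include <cuda_runtime.h>

struct CaseParams {   // hypothetical fields; the real struct is not shown in the question
    float scale;
    int   offset;
};

// Prototype only. Because the header holds no definition, including it from
// several translation units does not cause "symbol already defined" at link time.
cudaError_t copyMetaData(const CaseParams* caseParams);

// ---- caseparams.cu (hypothetical source file; compiled by nvcc only) ----
// The constant symbol and the function that fills it live in the same file.
__constant__ CaseParams deviceCaseParams;

cudaError_t copyMetaData(const CaseParams* caseParams)
{
    // The symbol is passed directly, not as the string "deviceCaseParams".
    return cudaMemcpyToSymbol(deviceCaseParams, caseParams, sizeof(CaseParams));
}

// A kernel defined in the same .cu file can read the constant symbol directly.
__global__ void readParams(float* out)
{
    out[0] = deviceCaseParams.scale + static_cast<float>(deviceCaseParams.offset);
}
```

Host code in a .cpp file then only needs the header and a call such as copyMetaData(&params). Kernels in other .cu files would not see deviceCaseParams in this sketch; sharing the symbol across files requires an extern declaration and relocatable device code (nvcc -rdc=true, available from CUDA 5.0).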