Entry function uses too much shared data (0x8020 bytes + 0x10 bytes system, 0x4000 max) - CUDA error

Posted 2024-12-29 11:30:09

I am using a Tesla C2050, which has compute capability 2.0 and 48 KB of shared memory. But when I try to use this shared memory, the nvcc compiler gives me the following error:

Entry function '_Z4SAT3PhPdii' uses too much shared data (0x8020 bytes + 0x10 bytes system, 0x4000 max)

SAT1 is the naive implementation of a scan algorithm, and because I am operating on images with sizes on the order of 4096x2160, I have to use double to calculate the cumulative sum. Although the Tesla C2050 does not support double, it nevertheless does the task by demoting it to float. For an image width of 4096, the shared memory size comes out to be greater than 16 KB, but that is well within the 48 KB limit.
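For reference, the numbers in the error message can be checked by hand. This assumes the kernel declares one shared `double` per pixel of a 4096-pixel row; the post does not show the kernel, and the remaining 0x20 bytes of the 0x8020 figure are not explained by it.

```latex
\begin{aligned}
4096 \times 8\,\text{B} &= 32768\,\text{B} = \texttt{0x8000} \\
\texttt{0x8020} + \texttt{0x10} &= 32800\,\text{B} + 16\,\text{B} = 32816\,\text{B} \\
\texttt{0x4000} = 16384\,\text{B} \;(16\,\text{KB}) &< 32816\,\text{B} < 49152\,\text{B} = \texttt{0xC000} \;(48\,\text{KB})
\end{aligned}
```

So the request exceeds the 16 KB cap the compiler is enforcing but would fit in a 48 KB configuration.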

Can anybody help me understand what is happening here? I am using CUDA Toolkit 3.0.

你在看孤独的风景 2025-01-05 11:30:09

By default, Fermi cards run in a compatibility mode, with 16 KB of shared memory and 48 KB of L1 cache per multiprocessor. If you need more shared memory, the API call cudaThreadSetCacheConfig can be used to switch the GPU to 48 KB of shared memory and 16 KB of L1 cache. You must then also compile the code for compute capability 2.0 to avoid the code generation error you are seeing.
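A minimal sketch of requesting the larger shared-memory split before launching a kernel that needs more than 16 KB. Everything named here (`rowScanNaive`, `ROW_WIDTH`, `THREADS`, `rowBufferBytes`) is hypothetical and is not the poster's SAT kernel; the scan itself is deliberately naive (one thread walks each row). It assumes a Fermi-era toolkit that still targets sm_20 (support for compute capability 2.x was removed in CUDA 9) and is built with `nvcc -arch=sm_20`. Note that the CUDA documentation describes the cache configuration as a preference that the runtime may override, and that cudaThreadSetCacheConfig and cudaThreadSynchronize were later deprecated in favour of cudaDeviceSetCacheConfig and cudaDeviceSynchronize.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ROW_WIDTH 4096   // hypothetical: widest image row one block handles
#define THREADS   256    // hypothetical block size

// Bytes of static shared memory the kernel below declares for its row buffer.
static size_t rowBufferBytes(int width) { return (size_t)width * sizeof(double); }

// Hypothetical naive row scan: one block per image row, inclusive prefix sum
// of 8-bit pixels accumulated in double precision.
__global__ void rowScanNaive(const unsigned char* img, double* out, int width, int height)
{
    __shared__ double row[ROW_WIDTH];  // 4096 * 8 = 32768 bytes: above the sm_1x cap
    int y = blockIdx.x;
    if (y >= height || width > ROW_WIDTH) return;  // uniform per block, so no sync hazard

    for (int x = threadIdx.x; x < width; x += blockDim.x)  // cooperative load
        row[x] = img[y * width + x];
    __syncthreads();

    if (threadIdx.x == 0)                 // naive: a single thread walks the row
        for (int x = 1; x < width; ++x)
            row[x] += row[x - 1];
    __syncthreads();

    for (int x = threadIdx.x; x < width; x += blockDim.x)  // cooperative store
        out[y * width + x] = row[x];
}

int main()
{
    const int width = ROW_WIDTH, height = 2160;
    const size_t pixels = (size_t)width * height;

    // Prefer 48 KB shared / 16 KB L1 for this context (a preference, not a guarantee).
    cudaThreadSetCacheConfig(cudaFuncCachePreferShared);

    unsigned char* dImg = 0;
    double* dOut = 0;
    cudaMalloc(&dImg, pixels);
    cudaMalloc(&dOut, pixels * sizeof(double));
    cudaMemset(dImg, 1, pixels);  // every pixel = 1

    rowScanNaive<<<height, THREADS>>>(dImg, dOut, width, height);
    cudaError_t err = cudaGetLastError();             // launch-configuration errors
    if (err == cudaSuccess) err = cudaThreadSynchronize();  // execution errors
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    double last = 0.0;
    cudaMemcpy(&last, dOut + width - 1, sizeof(double), cudaMemcpyDeviceToHost);
    printf("row 0 total = %.0f (buffer = %lu bytes)\n", last,
           (unsigned long)rowBufferBytes(width));

    cudaFree(dImg);
    cudaFree(dOut);
    return 0;
}
```

With every pixel set to 1, the last prefix sum of row 0 should be 4096. Compiling the same file for an sm_1x target should reproduce the "too much shared data" error, because the 32768-byte buffer alone exceeds 0x4000.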

Also, your Tesla C2050 does support double precision. If you are getting compiler warnings about doubles being demoted, it means you are not compiling your code for the correct architecture. Add

-arch=sm_20

to your nvcc arguments, and the GPU toolchain will compile for your Fermi card, including double-precision support and other Fermi-specific hardware features such as the larger shared memory size.
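One way to catch a wrong target early is sketched below, under the assumption that the check lives in a .cu file compiled by nvcc. The `#if` guard relies on `__CUDA_ARCH__`, which nvcc defines only during device compilation, so building for an sm_1x target stops with an explicit message instead of silently demoting doubles. The host part queries the device with cudaGetDeviceProperties; the helper `supportsFermiFeatures` and the choice of device 0 are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stop the build if the device pass targets pre-Fermi hardware (sm_1x), where
// shared memory per block is capped at 16 KB and, below sm_13, double is
// demoted to float. __CUDA_ARCH__ is undefined in the host compilation pass.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 200
#error "This code needs a Fermi (sm_20) or newer target: compile with -arch=sm_20"
#endif

// Hypothetical helper: compute capability 2.x and later include Fermi features.
static bool supportsFermiFeatures(int major) { return major >= 2; }

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("%s: compute capability %d.%d, %lu bytes of shared memory per block\n",
           prop.name, prop.major, prop.minor, (unsigned long)prop.sharedMemPerBlock);

    if (!supportsFermiFeatures(prop.major)) {
        fprintf(stderr, "Device predates Fermi; sm_20 code will not run on it.\n");
        return 1;
    }
    return 0;
}
```

Built with `nvcc -arch=sm_20 checkarch.cu -o checkarch` (the file name is arbitrary), the program reports what the card actually provides, which makes it easy to compare against the 0x4000 limit in the compiler error.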
