How can I estimate the relative performance of CUDA GPUs?

Posted on 2024-12-05 16:07:31

How can I estimate the CUDA performance of cards that I don't own, e.g. new cards?

For instance, I found an incomplete CUDA example whose author wrote that it takes 0.7 s on his GeForce 8600 GT. On my Quadro FX 580 it takes 1.7 s.

My question is: is the code I used to fill the gaps faulty, or is the GeForce 8600 GT really more than twice as fast?

The kernel is memory bound, but my card has the higher memory bandwidth. I don't know what conclusions to draw from this.

Name                  Quadro FX 580     GeForce 8600 GT
CUDA cores                       32                  32
Core clock (MHz)                450                 540
Memory clock (MHz)              400                 700
Memory BW (GB/s)               25.6                22.4
Shader clock (MHz)             ????                1180
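
As a rough sanity check, the runtime of a memory-bound kernel can be scaled by the ratio of peak memory bandwidths. The sketch below assumes runtime is inversely proportional to peak bandwidth, which ignores access patterns, latency, and transfer overheads; `estimate_time_s` is a hypothetical helper name, not part of any CUDA API.

```cpp
#include <cassert>

// First-order model for a memory-bound kernel (an assumption, not a
// measurement): runtime scales inversely with peak memory bandwidth.
double estimate_time_s(double known_time_s, double known_bw_gbs, double target_bw_gbs) {
    assert(known_time_s >= 0.0 && known_bw_gbs > 0.0 && target_bw_gbs > 0.0);
    return known_time_s * known_bw_gbs / target_bw_gbs;
}
```

With the figures from the table, `estimate_time_s(0.7, 22.4, 25.6)` gives about 0.61 s for the Quadro FX 580. The measured 1.7 s is far from that, so under this simple model the gap is unlikely to come from raw bandwidth alone.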

Comments (1)

一萌ing 2024-12-12 16:07:31

I just want to give you some pointers to possible sources of error. First, time your code with CUDA events (`cudaEvent_t`) rather than the CUDA profiler, as CUDA events are more accurate. Second, check what the author measured: only the computation time, or also the time to transfer data to and from the GPU? Are you both measuring the same thing?
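
A minimal sketch of kernel-only timing with CUDA events might look like the following. The kernel `copy_kernel` and the helper `time_copy_kernel_ms` are hypothetical stand-ins for the code being measured; the event calls themselves are from the CUDA runtime API.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical memory-bound kernel: copies one array into another.
__global__ void copy_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Returns the kernel-only time in milliseconds, or -1.0f on a CUDA error.
float time_copy_kernel_ms(int n) {
    assert(n > 0);
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1.0f; }
    cudaMemset(d_in, 0, bytes);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time start-up costs are not included in the timing.
    copy_kernel<<<blocks, threads>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                 // timed region begins
    copy_kernel<<<blocks, threads>>>(d_in, d_out, n);
    cudaEventRecord(stop, 0);                  // timed region ends
    cudaEventSynchronize(stop);                // wait until the kernel has finished

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess) {
        cudaEventElapsedTime(&ms, start, stop);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return ms;
}
```

Here the host-to-device and device-to-host copies are deliberately outside the timed region. To compare against a figure that includes transfers, the two `cudaEventRecord` calls would have to be moved so that they enclose the `cudaMemcpy` calls as well.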

Third, the CUDA architecture is changing quite fast. For example, on cards with compute capability (cc) 1.x, using shared memory is recommended for better performance; on cards with cc 2.x, however, each multiprocessor has an L1 cache that makes global memory accesses considerably faster. So you may also want to compare the architecture of the two cards and their compute capabilities.
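
The compute capability and a few related properties can be read at run time with `cudaGetDeviceProperties`. A short sketch, where `print_device_summary` is a hypothetical helper name and the choice of fields is only illustrative:

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Prints a one-line summary per CUDA device.
// Returns the number of devices found, or -1 on a CUDA error.
int print_device_summary() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;
    assert(count >= 0);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return -1;
        std::printf("%s: compute capability %d.%d, %d multiprocessors, "
                    "%zu bytes global memory, %zu bytes shared memory per block\n",
                    prop.name, prop.major, prop.minor, prop.multiProcessorCount,
                    prop.totalGlobalMem, prop.sharedMemPerBlock);
    }
    return count;
}
```

Running this on both machines would show whether the two cards report the same compute capability. For a card you do not own, the same figures have to come from the vendor's specifications instead.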
