Estimating the speed increase when changing NVIDIA GPU models

Posted on 2024-11-19 16:40:08

I am currently developing a CUDA application that will most certainly be deployed on a GPU much better than mine. Given another GPU model, how can I estimate how much faster my algorithm will run on it?

Answers (2)

倾听心声的旋律 2024-11-26 16:40:08

You're going to have a difficult time, for a number of reasons:

  1. Clock rate and memory speed only have a weak relationship to code speed, because there is a lot more going on under the hood (e.g., thread context switching) that gets improved/changed for almost all new hardware.

  2. Caches have been added to new hardware (e.g., Fermi) and unless you model cache hit/miss rates, you'll have a tough time predicting how this will affect the speed.

  3. Floating point performance in general is very dependent on model (e.g.: Tesla C2050 has better performance than the "top of the line" GTX-480).

  4. Register usage per device can change for different devices, and this can also affect performance; occupancy will be affected in many cases (see the occupancy sketch after this list).

  5. Performance can be improved by targeting specific hardware, so even if your algorithm is perfect for your GPU, it could be better if you optimize it for the new hardware.
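
For instance, here is a minimal sketch of how you could check a kernel's theoretical occupancy on every installed device, using the runtime occupancy API available in CUDA 6.5 and later (the kernel and block size below are placeholders, not anything from the original question):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: stands in for whatever kernel you actually care about.
__global__ void myKernel(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    const int blockSize = 256;  // assumed launch configuration
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        // How many blocks of this kernel can be resident per SM on this device?
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, myKernel, blockSize, 0 /* dynamic shared memory */);

        float occupancy = (blocksPerSM * blockSize) /
                          (float)prop.maxThreadsPerMultiProcessor;
        printf("%s: %d resident blocks/SM, ~%.0f%% occupancy at block size %d\n",
               prop.name, blocksPerSM, occupancy * 100.0f, blockSize);
    }
    return 0;
}
```

The same kernel can land at very different occupancies on different devices simply because the register file and shared memory per SM differ.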

Now, that said, you can probably make some predictions if you run your app through one of the profilers (such as the NVIDIA Compute Profiler) and look at your occupancy and your SM utilization. If your GPU has 2 SMs and the one you will eventually run on has 16 SMs, then you will almost certainly see an improvement, though not simply because of the raw SM count.
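
As a starting point for that sort of comparison, a minimal sketch that dumps the device properties most relevant to a back-of-the-envelope estimate (SM count, core clock, theoretical memory bandwidth) could look like this; it is my own illustration, not profiler output:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        // Theoretical peak bandwidth: 2 transfers per clock (DDR) * bus width
        // in bytes * memory clock (the runtime reports clocks in kHz).
        double peakBandwidthGBs =
            2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8.0) / 1.0e6;

        printf("Device %d: %s\n", dev, prop.name);
        printf("  SMs:                %d\n", prop.multiProcessorCount);
        printf("  Core clock:         %.0f MHz\n", prop.clockRate / 1000.0);
        printf("  Peak mem bandwidth: %.1f GB/s\n", peakBandwidthGBs);
    }
    return 0;
}
```

Comparing those numbers between your development card and the deployment card gives you rough upper bounds, but as the points above explain, the real speedup depends on which of them your kernel is actually limited by.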

So, unfortunately, it isn't easy to make the type of predictions you want. If you're writing something open source, you could post the code and ask others to test it with newer hardware, but that isn't always an option.

荒人说梦 2024-11-26 16:40:08

This can be very hard to predict for certain hardware changes and trivial for others. Highlight the differences between the two cards you're considering.

For example, the change could be as trivial as -- if I had purchased one of those EVGA water-cooled behemoths, how much better would it perform than a standard GTX 580? This is just an exercise in computing the differences in the limiting clock speed (memory or GPU clock). I've also encountered this question when wondering if I should overclock my card.
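
A minimal sketch of that exercise, with placeholder clock values rather than real factory specifications (the numbers below are made up for illustration):

```cuda
#include <cstdio>

// Placeholder clocks -- NOT real factory specifications.
struct CardClocks {
    double coreMHz;  // shader/core clock
    double memMHz;   // effective memory data rate
};

int main()
{
    CardClocks stockCard = { 770.0, 4000.0 };  // hypothetical stock card
    CardClocks fancyCard = { 850.0, 4200.0 };  // hypothetical factory-OC card

    // A purely compute-bound kernel scales (at best) with the core clock;
    // a purely memory-bound one scales with the memory clock. Real kernels
    // land somewhere between the two ceilings.
    printf("compute-bound ceiling: %.2fx\n", fancyCard.coreMHz / stockCard.coreMHz);
    printf("memory-bound ceiling:  %.2fx\n", fancyCard.memMHz  / stockCard.memMHz);
    return 0;
}
```

Whichever resource actually limits your kernel sets the ceiling you can expect from the clock bump alone.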

If you're going to a similar architecture, GTX 580 to Tesla C2070, you can make a similar case of differences in clock speeds, but you have to be careful of the single/double precision issue.

If you're doing something much more drastic, say going from a mobile card -- GTX 240M -- to a top of the line card -- Tesla C2070 -- then you may not get any performance improvement at all.

Note: Chris is very correct in his answer, but I wanted to stress this caution because I envision this common work path:

One says to the boss:

  1. So I've heard about this CUDA thing... I think it could make function X much more efficient.
  2. Boss says you can have 0.05% of work time to test out CUDA -- hey, we already have this mobile card, use that.
  3. One year later... So CUDA could get us a threefold speedup. Could I buy a better card to test it out? (A GTX 580 only costs $400 -- less than that intern fiasco...)
  4. You spend the $$, buy the card, and your CUDA code runs slower.
  5. Your boss is now upset. You've wasted time and money.

So what happened? Developing on an old card (think 8800, 9800, or even the mobile GTX 2XX with around 30 cores) leads you to optimize and design your algorithm in a very different way from how you would for a card with 512 cores. Caveat emptor: you get what you pay for -- those awesome cards are awesome -- but your code may not run faster.

Warning issued, so what's the walk-away message? When you get that nicer card, be sure to invest time in tuning, testing, and possibly redesigning your algorithm from the ground up.

OK, so that said, rule of thumb? GPUs get twice as fast every six months. So if you're moving from a card that's two years old to a card that's top of the line, claim to your boss that it will run between 4 and 8 times faster (and if you get the full 16-fold improvement, bravo!!).
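
Spelling out the arithmetic behind that claim (my reading of the rule of thumb, not the author's exact reasoning):

\[
2\ \text{years} = 4\ \text{half-year doublings} \;\Rightarrow\; 2^{4} = 16\times \ \text{(theoretical ceiling)}, \qquad \text{promise } 2^{2}\text{ to }2^{3} = 4\text{ to }8\times \ \text{to stay safe.}
\]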
