GTX 295 vs other nvidia cards for CUDA development
What is the best nvidia video card for CUDA development? A single GTX 295 has 2 GPUs; is it possible to have two GTX 295s and use the 4 GPUs in my CUDA code?
Is it better to get two 480 cards rather than two 295s? Would a Fermi card be better than both?
3 Answers
Direct answer: I would go with one or maybe two GTX 480s. But I think my reasoning is a bit different from @bobince's or @pszilard's.
Background: I just made the same decision you're facing, but our situations may be vastly different.
I'm a statistics graduate student in a department with minimal funding for GPU computing resources; the campus does have one Fermi box hooked up to two nodes that I have access to. But those run Linux -- which I love -- and I really want to use nSight to benchmark and tune my code, which means I need Windows. So I decided to purchase a development box that I dual boot: Ubuntu x64 for production runs, and Win 7 with VS 2010 (a battle I'm presently fighting) and nSight 1.5 for development. That said, back to the reason why I bought two GTX 480s (EVGA is awesome!!) and not two GTX 285s or 295s.
I've spent the past two years developing a couple of CUDA kernels. The trickiest part of the development, for me, is the memory management. I spent the better part of three months trying to squeeze a Cholesky decomposition & back substitution into 16 single-precision registers -- the maximum you can use before either the GTX 285 or 295 incurs a 50% performance penalty (literally 3 weeks going from 17 to 16 registers). For me, the fact that all Fermi architectures have double the registers means that those three months would've gained me about a 10% improvement on a GTX 480 instead of 50% on a GTX 285, and hence probably wouldn't have been worth my time -- in truth it's a bit more subtle than that, but you get the drift.
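To make the register tuning concrete, here is a minimal sketch (the kernel name and body are hypothetical placeholders, not the Cholesky code above) of the two standard knobs for capping register usage: __launch_bounds__ in the source, and --maxrregcount on the nvcc command line.

    #include <cuda_runtime.h>

    // Hypothetical kernel. __launch_bounds__(256, 4) asks the compiler to
    // keep register usage low enough that 4 blocks of 256 threads can be
    // resident per SM. Alternatively, force a hard per-thread cap with
    // "nvcc --maxrregcount=16", and print the count actually used with
    // "nvcc -Xptxas -v".
    __global__ void __launch_bounds__(256, 4) scale(float *a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] *= 2.0f;  // placeholder body
    }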
If you're fairly new to CUDA -- which you probably are, since you're asking -- I would say 32 registers is HUGE. Second, I think the L1 cache of the Fermi architecture can translate directly into faster global memory accesses -- surely it does, but I haven't measured the impact yet. If you don't need global memory as much, you can trade the bigger L1 cache for triple the shared memory -- which was also a tight squeeze for me as the matrix sizes increased.
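For what it's worth, that L1/shared trade-off is a one-line runtime call; this sketch (kernel name and sizes are placeholders of mine) shows how to ask a Fermi card for the larger L1, or for the larger shared memory instead.

    #include <cuda_runtime.h>

    __global__ void my_kernel(float *a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] += 1.0f;  // placeholder body
    }

    int main()
    {
        // On Fermi, each SM's 64 KB of on-chip memory splits as either
        // 48 KB shared + 16 KB L1, or 16 KB shared + 48 KB L1.
        // Prefer the big L1 when the kernel is global-memory bound:
        cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
        // ...or trade it for triple the shared memory instead:
        // cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);

        float *d = 0;
        cudaMalloc(&d, 1024 * sizeof(float));
        my_kernel<<<4, 256>>>(d, 1024);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }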
Then I would agree with @pszilard that if you need double precision, Fermi is definitely the way to go -- though I'd still write your code in single precision first, tune it, and then migrate to double.
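One common way to set up that single-to-double migration (my sketch, not something from the answer) is a compile-time switch on the element type, so the tuned kernel itself stays untouched:

    #include <cuda_runtime.h>

    // Tune with "real" = float, then rebuild with -DUSE_DOUBLE to migrate.
    #ifdef USE_DOUBLE
    typedef double real;
    #else
    typedef float real;
    #endif

    __global__ void axpy(int n, real a, const real *x, real *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] += a * x[i];
    }

    // single precision build:  nvcc -c axpy.cu
    // double precision build:  nvcc -DUSE_DOUBLE -c axpy.cu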
I don't think concurrent kernel execution will matter for you -- it's really cool, and the delays to kernel completion can be orders of magnitude less -- but you're probably going to focus on one kernel first, not parallel kernels. If you want to do streaming or parallel kernels, then you need Fermi -- the 285/295s simply can't do it.
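If you do end up wanting concurrent kernels, the mechanism is CUDA streams; here's a minimal sketch (with trivial placeholder kernels) of two launches that may overlap on Fermi but serialize on a 285/295:

    #include <cuda_runtime.h>

    __global__ void kernel_a(float *x) { x[threadIdx.x] += 1.0f; }
    __global__ void kernel_b(float *y) { y[threadIdx.x] += 2.0f; }

    int main()
    {
        float *a, *b;
        cudaMalloc(&a, 256 * sizeof(float));
        cudaMalloc(&b, 256 * sizeof(float));

        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        // Kernels launched into different streams can run concurrently
        // on compute capability 2.x (Fermi); earlier cards run them
        // one after the other.
        kernel_a<<<1, 256, 0, s0>>>(a);
        kernel_b<<<1, 256, 0, s1>>>(b);

        cudaDeviceSynchronize();
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
        cudaFree(a);
        cudaFree(b);
        return 0;
    }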
And lastly, the drawback of going with the 295s is that you have to write two layers of parallelism: (1) one to distribute blocks (or kernels?) across the cards and (2) the GPU kernel itself. If you're just starting out, it's much easier to keep the parallelism in one place (on a single card) as opposed to fighting two battles at once.
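To illustrate what that first layer looks like, here is a rough sketch (placeholder kernel, fixed sizes) that spreads independent chunks across every visible device. Note it uses the modern one-thread-many-devices style via cudaSetDevice; in the CUDA 3.x era of this question each GPU needed its own host thread, which is exactly the extra battle being described.

    #include <cuda_runtime.h>

    __global__ void work(float *chunk, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) chunk[i] *= 0.5f;  // placeholder for the real kernel
    }

    int main()
    {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);  // two GTX 295s show up as four devices

        // Layer (1): distribute chunks of the problem across the cards.
        for (int d = 0; d < ndev; ++d) {
            cudaSetDevice(d);
            float *chunk;
            cudaMalloc(&chunk, 1024 * sizeof(float));
            work<<<4, 256>>>(chunk, 1024);  // layer (2): the kernel itself
            // (cudaFree and error checks omitted for brevity)
        }
        for (int d = 0; d < ndev; ++d) {
            cudaSetDevice(d);
            cudaDeviceSynchronize();
        }
        return 0;
    }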
P.S. If you haven't written your kernels yet, you might consider getting only one card and waiting six months to see if the landscape changes again -- though I have no idea when the next cards are due to be released.
P.P.S. I absolutely loved running my CUDA kernel on the GTX 480, which I had debugged/designed on a Tesla C1070, and instantly seeing a 2x speed improvement. Money well spent.
Yes. Or quad, if you're totally insane.
Arguable. The 295 as a dual-GPU card has slightly more raw oomph, but the 480, as a 40nm-process card without the dual-GPU overhead, may use its resources better. Benchmarks vary. Of course, the Fermi 4xx range has more modern feature support (3D, DirectX, OpenCL, etc.).
But dual 295s are going to have seriously huge PSU and cooling requirements, and dual 480s run almost as hot. Not to mention the expense. What are you working on that makes you think you'll need this? Have you considered the more mainstream parts, e.g. the 460, which is generally considered to offer better price/performance than the troubled 470-480 (GF100) parts?
Whatever fits in your budget and suits your needs. I know this is a bit vague, but after all it really is as simple as that ;)
Sure, it is. The only drawback is that the 2 GPUs on a GTX 295 share a single PCIe connection. Whether this is relevant for you depends on whether the application needs intensive communication with the host or not.
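A quick way to confirm that setup is to enumerate the devices; a pair of GTX 295s should report four of them (a small sketch of mine, not from the answer):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int n = 0;
        cudaGetDeviceCount(&n);  // expect 4 with two GTX 295s installed
        for (int i = 0; i < n; ++i) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, i);
            printf("device %d: %s, %d multiprocessors, compute %d.%d\n",
                   i, p.name, p.multiProcessorCount, p.major, p.minor);
        }
        return 0;
    }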
From the point of view of raw peak performance, a GTX 295 (which is almost 2x a GTX 280, not counting the shared PCIe) is better than a 480. However, the GF10x series architecture improved on many points compared to the GT200; for details see the "Fermi whitepaper" and the "Fermi Tuning Guide".
If you're planning to use double precision, the GF10x series has much improved double precision support, but it's good to know that on GeForce cards this is capped at 1/8th of the single precision performance (normally it's about half).
Therefore, I would suggest that unless you have a strong reason to get lots of GFlops (Folding@Home?) in the form of soon-to-be-outdated hardware, you get a GTX 480 -- or a 470 if you want to save ~25%.
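If double precision is the deciding factor, one runtime sanity check (my addition, not part of the answer) is the compute capability: hardware doubles first appeared with compute 1.3 on the GT200, and Fermi GeForce parts have them too, just capped as noted above.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);
        // Compute capability 1.3 (GT200) was the first with hardware doubles.
        if (p.major > 1 || (p.major == 1 && p.minor >= 3))
            printf("%s supports hardware double precision (compute %d.%d)\n",
                   p.name, p.major, p.minor);
        else
            printf("%s: no hardware double precision\n", p.name);
        return 0;
    }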