GTX 295 vs other nvidia cards for CUDA development
What is the best nvidia video card for CUDA development? A single GTX 295 has 2 GPUs; is it possible to have two GTX 295s and use the 4 GPUs in my CUDA code?
Is it better to get two 480 cards rather than two 295s? Would a Fermi card be better than both?
3 Answers
Direct answer: I would go with one or maybe two GTX 480s. But I think my reasoning is a bit different from @bobince's or @pszilard's.
Background: I just made the same decision you're facing, but our situations may be vastly different.
I'm a statistics graduate student in a department with minimal funding for GPU computing resources; the campus does have one Fermi box hooked up to two nodes that I have access to. But those run Linux -- which I love -- and I really want to use nSight to benchmark and tune my code, which means I need Windows. So I decided to purchase a development box that I dual boot: Ubuntu x64 for production runs, and Win 7 with VS 2010 (a battle I'm presently fighting) and nSight 1.5 for development. That said, back to the reason why I bought two GTX 480s (EVGA is awesome!!) and not two GTX 285s or 295s.
I've spent the past two years developing a couple of CUDA kernels. The trickiest part of the development, for me, is the memory management. I spent the better part of three months trying to squeeze a Cholesky decomposition & back substitution into 16 single-precision registers -- the maximum you can use before either the GTX 285 or 295 incurs a 50% performance penalty (literally 3 weeks going from 17 to 16 registers). For me, the fact that all Fermi architectures have double the registers means that those three months would've gained me about a 10% improvement on a GTX 480 instead of 50% on a GTX 285, and hence probably wouldn't have been worth my time -- in truth it's a bit more subtle than that, but you get the drift.
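To make the register tuning concrete, here is a minimal sketch (the kernel name and body are hypothetical placeholders, not the Cholesky code above) of the two standard knobs for capping register usage: __launch_bounds__ in the source, and --maxrregcount on the nvcc command line.

    #include <cuda_runtime.h>

    // Hypothetical kernel. __launch_bounds__(256, 4) asks the compiler to
    // keep register usage low enough that 4 blocks of 256 threads can be
    // resident per SM. Alternatively, force a hard per-thread cap with
    // "nvcc --maxrregcount=16", and print the count actually used with
    // "nvcc -Xptxas -v".
    __global__ void __launch_bounds__(256, 4) scale(float *a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] *= 2.0f;  // placeholder body
    }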
If you're fairly new to CUDA -- which you probably are, since you're asking -- I would say 32 registers is HUGE. Second, I think the L1 cache of the Fermi architecture can translate directly into faster global memory accesses -- surely it does, but I haven't measured the impact yet. If you don't need global memory as much, you can trade the bigger L1 cache for triple the shared memory -- which was also a tight squeeze for me as the matrix sizes increased.
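For what it's worth, that L1/shared trade-off is a one-line runtime call; this sketch (kernel name and sizes are placeholders of mine) shows how to ask a Fermi card for the larger L1, or for the larger shared memory instead.

    #include <cuda_runtime.h>

    __global__ void my_kernel(float *a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] += 1.0f;  // placeholder body
    }

    int main()
    {
        // On Fermi, each SM's 64 KB of on-chip memory splits as either
        // 48 KB shared + 16 KB L1, or 16 KB shared + 48 KB L1.
        // Prefer the big L1 when the kernel is global-memory bound:
        cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
        // ...or trade it for triple the shared memory instead:
        // cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);

        float *d = 0;
        cudaMalloc(&d, 1024 * sizeof(float));
        my_kernel<<<4, 256>>>(d, 1024);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }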
Then I would agree with @pszilard that if you need double precision, Fermi is definitely the way to go -- though I'd still write your code in single precision first, tune it, and then migrate to double.
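One common way to set up that single-to-double migration (my sketch, not something from the answer) is a compile-time switch on the element type, so the tuned kernel itself stays untouched:

    #include <cuda_runtime.h>

    // Tune with "real" = float, then rebuild with -DUSE_DOUBLE to migrate.
    #ifdef USE_DOUBLE
    typedef double real;
    #else
    typedef float real;
    #endif

    __global__ void axpy(int n, real a, const real *x, real *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] += a * x[i];
    }

    // single precision build:  nvcc -c axpy.cu
    // double precision build:  nvcc -DUSE_DOUBLE -c axpy.cu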
I don't think concurrent kernel execution will matter for you -- it's really cool, and the delays to kernel completion can be orders of magnitude less -- but you're probably going to focus on one kernel first, not parallel kernels. If you want to do streaming or parallel kernels, then you need Fermi -- the 285/295s simply can't do it.
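If you do end up wanting concurrent kernels, the mechanism is CUDA streams; here's a minimal sketch (with trivial placeholder kernels) of two launches that may overlap on Fermi but serialize on a 285/295:

    #include <cuda_runtime.h>

    __global__ void kernel_a(float *x) { x[threadIdx.x] += 1.0f; }
    __global__ void kernel_b(float *y) { y[threadIdx.x] += 2.0f; }

    int main()
    {
        float *a, *b;
        cudaMalloc(&a, 256 * sizeof(float));
        cudaMalloc(&b, 256 * sizeof(float));

        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        // Kernels launched into different streams can run concurrently
        // on compute capability 2.x (Fermi); earlier cards run them
        // one after the other.
        kernel_a<<<1, 256, 0, s0>>>(a);
        kernel_b<<<1, 256, 0, s1>>>(b);

        cudaDeviceSynchronize();
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
        cudaFree(a);
        cudaFree(b);
        return 0;
    }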
And lastly, the drawback of going with the 295s is that you have to write two layers of parallelism: (1) one to distribute blocks (or kernels?) across the cards and (2) the GPU kernel itself. If you're just starting out, it's much easier to keep the parallelism in one place (on a single card) as opposed to fighting two battles at once.
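To illustrate what that first layer looks like, here is a rough sketch (placeholder kernel, fixed sizes) that spreads independent chunks across every visible device. Note it uses the modern one-thread-many-devices style via cudaSetDevice; in the CUDA 3.x era of this question each GPU needed its own host thread, which is exactly the extra battle being described.

    #include <cuda_runtime.h>

    __global__ void work(float *chunk, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) chunk[i] *= 0.5f;  // placeholder for the real kernel
    }

    int main()
    {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);  // two GTX 295s show up as four devices

        // Layer (1): distribute chunks of the problem across the cards.
        for (int d = 0; d < ndev; ++d) {
            cudaSetDevice(d);
            float *chunk;
            cudaMalloc(&chunk, 1024 * sizeof(float));
            work<<<4, 256>>>(chunk, 1024);  // layer (2): the kernel itself
            // (cudaFree and error checks omitted for brevity)
        }
        for (int d = 0; d < ndev; ++d) {
            cudaSetDevice(d);
            cudaDeviceSynchronize();
        }
        return 0;
    }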
P.S. If you haven't written your kernels yet, you might consider getting only one card and waiting six months to see if the landscape changes again -- though I have no idea when the next cards are due to be released.
P.P.S. I absolutely loved running my CUDA kernel on the GTX 480, which I had debugged/designed on a Tesla C1070, and instantly seeing a 2x speed improvement. Money well spent.
Yes. Or quad, if you're totally insane.
Arguable. The 295 as a dual-GPU card has slightly more raw oomph, but the 480, as a 40nm-process card without the dual-GPU overhead, may use its resources better. Benchmarks vary. Of course, the Fermi 4xx range has more modern feature support (3D, DirectX, OpenCL, etc.).
But dual 295s are going to have seriously huge PSU and cooling requirements, and dual 480s run almost as hot. Not to mention the expense. What are you working on that makes you think you'll need this? Have you considered the more mainstream parts, e.g. the 460, which is generally considered to offer better price/performance than the troubled 470-480 (GF100) parts?
Whatever fits in your budget and suits your needs. I know this is a bit vague, but after all it really is as simple as that ;)
Sure, it is. The only drawback is that the 2 GPUs on a GTX 295 share a single PCIe connection. Whether this is relevant for you depends on whether the application needs intensive communication with the host or not.
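A quick way to confirm that setup is to enumerate the devices; a pair of GTX 295s should report four of them (a small sketch of mine, not from the answer):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int n = 0;
        cudaGetDeviceCount(&n);  // expect 4 with two GTX 295s installed
        for (int i = 0; i < n; ++i) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, i);
            printf("device %d: %s, %d multiprocessors, compute %d.%d\n",
                   i, p.name, p.multiProcessorCount, p.major, p.minor);
        }
        return 0;
    }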
From the point of view of raw peak performance, a GTX 295 (which is almost 2x a GTX 280, not counting the shared PCIe) is better than a 480. However, the GF10x series architecture improved on many points compared to the GT200; for details see the "Fermi whitepaper" and the "Fermi Tuning Guide".
If you're planning to use double precision, the GF10x series has much improved double precision support, but it's good to know that on GeForce cards this is capped at 1/8th of the single precision performance (normally it's about half).
Therefore, I would suggest that unless you have a strong reason to get lots of GFlops (Folding@Home?) in the form of soon-to-be-outdated hardware, you get a GTX 480 -- or a 470 if you want to save ~25%.
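If double precision is the deciding factor, one runtime sanity check (my addition, not part of the answer) is the compute capability: hardware doubles first appeared with compute 1.3 on the GT200, and Fermi GeForce parts have them too, just capped as noted above.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);
        // Compute capability 1.3 (GT200) was the first with hardware doubles.
        if (p.major > 1 || (p.major == 1 && p.minor >= 3))
            printf("%s supports hardware double precision (compute %d.%d)\n",
                   p.name, p.major, p.minor);
        else
            printf("%s: no hardware double precision\n", p.name);
        return 0;
    }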