NVIDIA vs AMD: GPGPU performance
I'd like to hear from people with experience of coding for both. Myself, I only have experience with NVIDIA.
NVIDIA CUDA seems to be a lot more popular than the competition. (Just counting question tags on this forum, 'cuda' outnumbers 'opencl' 3:1, 'nvidia' outnumbers 'ati' 15:1, and there's no tag for 'ati-stream' at all.)
On the other hand, according to Wikipedia, ATI/AMD cards should have a lot more potential, especially per dollar. The fastest NVIDIA card on the market as of today, GeForce 580 ($500), is rated at 1.6 single-precision TFlops. AMD Radeon 6970 can be had for $370 and it is rated at 2.7 TFlops. The 580 has 512 execution units at 772 MHz. The 6970 has 1536 execution units at 880 MHz.
How realistic is that paper advantage of AMD over NVIDIA, and is it likely to be realized in most GPGPU tasks? What happens with integer tasks?
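(For reference, both quoted figures are consistent with one fused multiply-add, i.e. 2 FLOPs, per execution unit per clock, keeping in mind that the GTX 580's shader units run at twice the 772 MHz core clock. A quick sanity check:)

```cuda
#include <stdio.h>

/* Sanity check of the quoted peak numbers, assuming 2 FLOPs (one FMA)
 * per execution unit per cycle. The GTX 580's shaders run at 2x the
 * 772 MHz core clock; the 6970's units run at the listed 880 MHz. */
int main(void) {
    double gtx580 = 512.0  * (2.0 * 772e6) * 2.0;  /* ~1.58e12 */
    double hd6970 = 1536.0 * 880e6         * 2.0;  /* ~2.70e12 */
    printf("GTX 580: %.2f TFlops\n", gtx580 / 1e12);
    printf("HD 6970: %.2f TFlops\n", hd6970 / 1e12);
    return 0;
}
```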
10 Answers
Metaphorically speaking, ATI has a better engine than NVIDIA.
But NVIDIA has a better car :D
This is mostly because NVIDIA has invested a good amount of its resources (in money and people) to develop the important libraries required for scientific computing (BLAS, FFT), and then did a good job again in promoting them. This may be the reason CUDA dominates the tags over here compared to ATI (or OpenCL).
As for the advantage being realized in GPGPU tasks in general, it ends up depending on other issues (specific to the application) such as memory transfer bandwidth, a good compiler, and probably even the driver. NVIDIA having a more mature compiler and a more stable driver on Linux (Linux because its use is widespread in scientific computing) tilts the balance in favor of CUDA, at least for now.
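To make the library point concrete: the BLAS mentioned above is NVIDIA's cuBLAS. A minimal sketch of a single-precision matrix multiply with it (error checking omitted; `d_A`, `d_B`, `d_C` are hypothetical device buffers assumed to be already allocated and filled):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Minimal sketch: C = alpha*A*B + beta*C with cuBLAS (column-major).
void sgemm_example(const float *d_A, const float *d_B, float *d_C, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,            // square matrices for simplicity
                &alpha, d_A, n,     // A, leading dimension n
                d_B, n,             // B
                &beta, d_C, n);     // C
    cublasDestroy(handle);
}
```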
EDIT Jan 12, 2013
It's been two years since I made this post and it still seems to attract views sometimes, so I have decided to clarify a few things.
In short, OpenCL has closed the gap in the past two years. There are new players in the field. But CUDA is still a bit ahead of the pack.
I don't have any strong feelings about CUDA vs. OpenCL; presumably OpenCL is the long-term future, just by dint of being an open standard.
But current-day NVIDIA vs ATI cards for GPGPU (not graphics performance, but GPGPU), that I do have a strong opinion about. And to lead into that, I'll point out that on the current Top 500 list of big clusters, NVIDIA leads AMD 4 systems to 1, and on gpgpu.org, search results (papers, links to online resources, etc) for NVIDIA outnumber results for AMD 6:1.
A huge part of this difference is the amount of online information available. Check out the NVIDIA CUDA Zone versus AMD's GPGPU Developer Central. The amount of material there for developers starting up doesn't even come close to comparing. On NVIDIA's site you'll find tonnes of papers - and contributed code - from people probably working on problems like yours. You'll find tonnes of online classes, from NVIDIA and elsewhere, and very useful documents like the developers' best-practices guide. The availability of free development tools - the profiler, cuda-gdb, etc - overwhelmingly tilts things NVIDIA's way.
(Editor: the information in this paragraph is no longer accurate.) And some of the difference is also hardware. AMD's cards have better specs in terms of peak flops, but to get a significant fraction of that, you not only have to break your problem up onto many completely independent stream processors, each work item also needs to be vectorized. Given that moving one's code to the GPU is hard enough, that extra architectural complexity is enough to make or break some projects.
And the result of all of this is that the NVIDIA user community continues to grow. Of the three or four groups I know thinking of building GPU clusters, none of them are seriously considering AMD cards. And that will mean still more groups writing papers, contributing code, etc on the NVIDIA side.
I'm not an NVIDIA shill; I wish it weren't this way, and that there were two (or more!) equally compelling GPGPU platforms. Competition is good. Maybe AMD will step up its game very soon - and the upcoming fusion products look very compelling. But in giving someone advice about which cards to buy today, and where to spend their time putting effort in right now, I can't in good conscience say that both development environments are equally good.
Edited to add: I guess the above is a little elliptical in terms of answering the original question, so let me make it a bit more explicit. The performance you can get from a piece of hardware is, in an ideal world with infinite time available, dependent only on the underlying hardware and the capabilities of the programming language; but in reality, the performance you can get in a fixed amount of invested time is also strongly dependent on development tools and existing community code bases (e.g., publicly available libraries). Those considerations all point strongly to NVIDIA.
(Editor: the information in this paragraph is no longer accurate.) In terms of hardware, the requirement for vectorization within the SIMD units of AMD cards also makes achieving paper performance even harder than it is on NVIDIA hardware.
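To illustrate what that per-work-item vectorization means in practice, here is a hedged sketch (written in CUDA syntax for familiarity; the OpenCL C version uses float4 the same way). The scalar form is usually enough on NVIDIA's scalar cores; on the VLIW AMD parts of that era, each work item had to expose 4-wide arithmetic to fill the ALU lanes:

```cuda
// Scalar version: one element per thread. On NVIDIA's scalar
// architecture this alone can approach peak throughput.
__global__ void saxpy_scalar(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Vectorized version: four elements per thread via float4. On AMD's
// VLIW hardware (HD 5000/6000 era), this kind of per-work-item
// vectorization was needed to fill the 4-5 ALU lanes of each
// stream processor. n4 is the element count divided by 4.
__global__ void saxpy_vec4(int n4, float a, const float4 *x, float4 *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 xv = x[i], yv = y[i];
        yv.x = a * xv.x + yv.x;
        yv.y = a * xv.y + yv.y;
        yv.z = a * xv.z + yv.z;
        yv.w = a * xv.w + yv.w;
        y[i] = yv;
    }
}
```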
The main difference between AMD's and NVIDIA's architectures is that AMD is optimized for problems where the behavior of the algorithm can be determined at compile time, while NVIDIA is optimized for problems where the behavior of the algorithm can only be determined at run time.
AMD has a relatively simple architecture that allows them to spend more transistors on ALUs. As long as the problem can be fully defined at compile time and mapped to the architecture in a somewhat static or linear way, there is a good chance that AMD will run the algorithm faster than NVIDIA.
On the other hand, NVIDIA's compiler does less analysis at compile time. Instead, NVIDIA has a more advanced architecture in which more transistors are spent on logic that can handle dynamic behavior of the algorithm that only emerges at run time.
I believe the reason most supercomputers that use GPUs go with NVIDIA is that the types of problems scientists are interested in running calculations on generally map better to NVIDIA's architecture than to AMD's.
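As a hedged illustration of the distinction drawn above: in the sketch below, the loop trip count depends on the input data, so it cannot be known at compile time; hardware and schedulers that cope well with this kind of run-time divergence fit the NVIDIA profile described here.

```cuda
// The iteration count of the while loop varies per thread with the
// data, so it only emerges at run time - the kind of dynamic behavior
// this answer says NVIDIA's architecture is built to handle.
__global__ void runtime_behavior(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = x[i];
    int steps = 0;
    while (v > 1.0f && steps < 1000) {  // data-dependent trip count
        v *= 0.5f;
        ++steps;
    }
    y[i] = v;
}
```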
Having spent some time with OpenCL for GCN cards after a few years of CUDA for Fermi and Kepler, I still prefer CUDA as a programming language, and I would choose AMD hardware with CUDA if I had the option.
Main differences between NVIDIA and AMD (OpenCL):
For AMD:
Even with Maxwell, NVIDIA still has longer command latencies, and complex algorithms are likely to be 10x faster on AMD (assuming the same theoretical TFlops) after easy optimizations for both. The gap was up to 60% for Kepler vs GCN. In this sense it's harder to optimize complex kernels for NVIDIA.
Cheap cards.
OpenCL is an open standard, with other vendors available.
For NVIDIA:
Has the Tesla line of hardware, suitable for reliable, high server loads.
The new Maxwell is way more power efficient.
Compiler and tools are way more advanced. AMD still hasn't managed to implement the maxrregcount parameter, with which you can easily control occupancy on various hardware; and their compiler has a lot of random ideas about what optimal code is that change with every version, so you may need to revisit old code every half a year because it suddenly became 40% slower.
At this point, if GPGPU is your goal, CUDA is the only choice, since OpenCL with AMD is not ready for server farms, and it's significantly harder to write efficient code for AMD because the compiler always seems to be "in beta".
I've done some iterative coding in OpenCL, and the results of running it on NVIDIA and ATI were pretty much the same:
Near the same speed on cards of the same value ($).
In both cases, speeds were ~10x-30x compared to a CPU.
I didn't test CUDA, but I doubt it could magically solve my random memory fetch problems. Nowadays, CUDA and OpenCL are more or less the same, and I see more of a future for OpenCL than for CUDA. The main reason is that Intel is launching drivers with OpenCL for their processors. This will be a huge advance in the future (running 16, 32 or 64 OpenCL threads on a CPU is REALLY fast, and really easy to port to GPU).
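For context on the "random memory fetch" point: the pattern below (shown as a CUDA sketch; the OpenCL version is analogous) is bandwidth-bound on any vendor's GPU because each thread's load hits a scattered cache line, so no API choice fixes it "magically":

```cuda
// Coalesced: consecutive threads read consecutive addresses, so the
// hardware can combine them into a few wide memory transactions.
__global__ void gather_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Random gather: each thread reads through an index table, so loads
// land on scattered cache lines. This is the pattern the answer
// describes, and it is slow on NVIDIA and AMD alike.
__global__ void gather_random(const float *in, const int *idx,
                              float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}
```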
I am new to GPGPU, but I have some experience in scientific computing (a PhD in Physics). I am putting together a research team and I want to move towards using GPGPU for my calculations. I had to choose between the available platforms, and I decided on NVIDIA for a couple of reasons: while ATI might be faster on paper, NVIDIA has a more mature platform and more documentation, so it will be possible to get closer to peak performance on it.
NVIDIA also has an academic research support program that one can apply to; I just received a TESLA 2075 card, which I am very happy about. I don't know whether ATI or Intel supports research this way.
What I have heard about OpenCL is that it tries to be everything at once. It is true that your OpenCL code will be more portable, but it is also likely not to exploit the full capabilities of either platform. I'd rather learn a bit more and write programs that utilize the resources better. With the Tesla K10 that just came out this year, NVIDIA is in the 4.5 TFlops range, so it is not clear that NVIDIA is behind... however, Intel MICs could prove to be a real competitor, especially if they succeed in moving the GPGPU unit onto the motherboard. But for now, I chose NVIDIA.
My experience in evaluating OpenCL floating-point performance tends to favor NVIDIA cards. I've worked with a couple of floating-point benchmarks on NVIDIA cards ranging from the 8600M GT to the GTX 460. NVIDIA cards consistently achieve about half of theoretical single-precision peak on these benchmarks.
The ATI cards I have worked with rarely achieve better than one third of single-precision peak.
Note that my experience with ATI is skewed; I've only been able to work with one 5000 series card. My experience is mostly with HD 4000 series cards, which were never well supported. Support for the HD 5000 series cards is much better.
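The benchmarks referred to are of roughly this shape (a hedged CUDA sketch of mine; the actual benchmarks were OpenCL). With enough resident threads to hide the FMA latency, a kernel like this approaches the card's single-precision peak, and the achieved fraction is what the answer is comparing:

```cuda
// Each thread runs a chain of fused multiply-adds: 2 FLOPs/iteration.
// out must be sized to the total number of launched threads.
__global__ void fma_bench(float *out, int iters) {
    float a = 0.001f * threadIdx.x, b = 1.0001f, c = 0.0001f;
    for (int i = 0; i < iters; ++i)
        a = a * b + c;                 // compiles to one FMA
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;  // defeat dead-code elim.
}
// Achieved GFLOPS = total_threads * iters * 2 / elapsed_seconds / 1e9;
// divide by the card's theoretical peak to get the fraction reached.
```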
I would like to add to the debate. For those of us in the software business, we can trade raw single-precision performance for productivity, but even there I don't have to compromise, since, as already pointed out, you cannot achieve as much performance on ATI's hardware using OpenCL as you can if you write in CUDA on NVIDIA's hardware.
And yes, with PGI's announcement of an x86 compiler for CUDA, there won't be any good reason to spend more time and resources writing in OpenCL :)
P.S.: My argument might be biased, since we do almost all our GPGPU work in CUDA. We have an image processing / computer vision library, CUVI (CUDA for Vision and Imaging), which accelerates some core IP/CV functionality on CUDA.
CUDA is certainly more popular than OpenCL as of today, as it was released 3 or 4 years before OpenCL. Since OpenCL was released, NVIDIA has not contributed much to the language, as they concentrate mostly on CUDA; they have not even released an OpenCL 1.2 version for any of their drivers.
As far as heterogeneous computing and handheld devices are concerned, OpenCL will surely gain more popularity in the near future. As of now, the biggest contributor to OpenCL is AMD, which is visible on their site.
In my experience:
If you want the best absolute performance, you need to see who is on the latest hardware iteration and use their stack (including the latest/beta releases).
If you want the best performance for the money, you will be aiming at gamer cards rather than "professional" cards, and the flexibility of targeting different platforms favors OpenCL.
If you are starting out, in particular, CUDA tends to be more polished and to have more tools and libraries.
Finally, my personal take, after appalling "support" from NVIDIA (we got a dead Tesla and it wasn't replaced for months, while a client was waiting): the flexibility to jump ship with OpenCL is worth the risk of slightly lower performance when NVIDIA is ahead in the release cycle.