GPGPU: Still on the bleeding edge?
Is GPGPU ready for production and prototyping use, or would you still consider it mostly a research/bleeding edge technology? I work in the computational biology field and it's starting to attract attention from the more computer science oriented people in the field, but most of the work seems to be porting well-known algorithms. The porting of the algorithm is itself the research project and the vast majority of people in the field don't know much about it.
I do some pretty computationally intensive projects on conventional multicores. I'm wondering how close GPGPU is to being usable enough for prototyping new algorithms, and for everyday production use. From reading Wikipedia, I get the impression that the programming model is strange (heavily SIMD) and somewhat limited (no recursion or virtual functions, though these limitations are slowly being removed; no languages higher level than C or a limited subset of C++), and that there are several competing, incompatible standards. I also get the impression that, unlike regular multicore, fine-grained parallelism is the only game in town. Basic library functions would need to be rewritten. Unlike with conventional multicore, you can't get huge speedups just by parallelizing the outer loop of your program and calling old-school serial library functions.
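(To make the contrast concrete, here is a rough sketch of what I understand the fine-grained model to look like in CUDA; the kernel, names, and sizes are invented for illustration. On a multicore CPU you would split an outer loop across a few threads, whereas here the loop body becomes a kernel and every element gets its own thread.)

```cuda
#include <cuda_runtime.h>

// Hypothetical example: scale every element of an array.
// On a multicore CPU you'd parallelize the outer i-loop across a few threads;
// in CUDA the loop disappears and each element is handled by its own thread.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // the grid may be slightly larger than n
        x[i] = a * x[i];
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(float));  // data lives in GPU memory
    // ... cudaMemcpy the input over, omitted ...
    int threads = 256;
    int blocks = (n + threads - 1) / threads;      // one thread per element
    scale<<<blocks, threads>>>(d_x, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}
```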
How severe are these limitations in practice? Is GPGPU ready for serious use now? If not, how long would you guess it will take?
Edit: One major point I'm trying to wrap my head around is how different the programming model is from that of a regular multicore CPU with lots and lots of really slow cores.
Edit #2: I guess the way I'd summarize the answers I've been given is that GPGPU is practical enough for early adopters in niches that it's extremely well suited for, but still bleeding edge enough not to be considered a "standard" tool like multicore or distributed parallelism, even in those niches where performance is important.
5 Answers
There isn't any question that people can do useful production computations with GPUs.
Mostly the computations that do well here are those that are pretty close to embarrassingly parallel. Both CUDA and OpenCL will let you express these computations in an only moderately painful way. So if you can cast your computation that way, you can do well.
I don't think this restriction will ever be seriously removed; if they could do that, then general CPUs could do it, too. At least I wouldn't hold my breath.
You should be able to tell if your present application is suitable mostly by looking at your existing code. Like most parallel programming languages, you won't know your real performance until you've coded a complete application. Unfortunately there's no substitute for experience.
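A minimal sketch of what "casting your computation that way" can look like, with invented names and sizes: every output element is computed by its own thread and depends only on its own inputs, so the threads never need to communicate.

```cuda
// Sketch: an all-pairs squared-distance matrix, one thread per (i, j) entry.
// Each output element depends only on its own pair of input rows, so this is
// essentially embarrassingly parallel and maps well onto the GPU.
__global__ void pairwise_sq_dist(const float *pts, float *dist, int n, int dim) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && j < n) {
        float acc = 0.0f;
        for (int k = 0; k < dim; ++k) {
            float d = pts[i * dim + k] - pts[j * dim + k];
            acc += d * d;
        }
        dist[i * n + j] = acc;
    }
}

// Launched with a 2D grid, e.g.:
//   dim3 threads(16, 16);
//   dim3 blocks((n + 15) / 16, (n + 15) / 16);
//   pairwise_sq_dist<<<blocks, threads>>>(d_pts, d_dist, n, dim);
```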
I am a graduate student in CS who has worked a bit with GPGPU. I also know of at least one organization that is currently porting parts of their software to CUDA. Whether doing so is worth it really depends on how important performance is to you.
I think that using CUDA will add a lot of expense to your project. First, the field of GPUs is very fractured. Even among NVIDIA cards you have a pretty wide array of feature sets and some code that works on one GPU might not work on another. Second, the feature set of CUDA, as well as of the video cards, is changing very quickly. It is not unlikely that whatever you write this year will have to be rewritten in 2-3 years to take full advantage of the new graphics cards. Finally, as you point out, writing GPGPU programs is just very difficult, so much so that parallelizing an existing algorithm for GPGPU is typically a publishable research project.
You might want to look into CUDA libraries that are already out there, for example CUBLAS, that you might be able to use for your project and that could help insulate you from these issues.
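For instance, a single CUBLAS call can replace a hand-written matrix-multiply kernel. This is only a minimal sketch (error checking omitted, device pointers assumed to be allocated and filled elsewhere, column-major storage as BLAS expects):

```cuda
#include <cublas_v2.h>

// Minimal sketch: C = A * B on the GPU via CUBLAS.
// d_A (m x k), d_B (k x n) and d_C (m x n) are column-major device pointers
// allocated elsewhere with cudaMalloc; error checking is omitted for brevity.
void gemm_on_gpu(const float *d_A, const float *d_B, float *d_C,
                 int m, int n, int k) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, d_A, m,    // lda = m
                        d_B, k,    // ldb = k
                &beta,  d_C, m);   // ldc = m
    cublasDestroy(handle);
}
```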
CUDA is in use in production code in financial services now, and increasing all the time.
Not only is it "ready for serious use" now, you've practically missed the boat.
Kind of an indirect answer, but I work in the area of nonlinear mixed-effect modeling in pharmacometrics. I've heard second-hand information that CUDA has been tried. There's such a variety of algorithms in use, and new ones coming all the time, that some look more friendly to a SIMD model than others, particularly the ones based on Markov-Chain Monte Carlo. That is where I suspect the financial applications are.
The established modeling algorithms are such large chunks of code, in Fortran, and the innermost loops are such complicated objective functions, that it's hard to see how the translation could be done even if opportunities for SIMD speedup could be found. It is possible to parallelize outer loops, which is what we do.
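To illustrate why the MCMC-style algorithms look SIMD-friendly while ours don't, here is a toy sketch (all names and the objective function are invented): each thread advances its own independent chain with its own random-number state and never talks to the others. Our real objective functions are far too large and branchy for the inner loop to look anything like this.

```cuda
#include <curand_kernel.h>

// Toy sketch: one independent Metropolis-style chain per thread.
// log_target() is a placeholder for the real objective function.
__device__ float log_target(float theta) {
    return -0.5f * theta * theta;        // stand-in: standard normal
}

__global__ void mcmc_chains(float *samples, int n_chains, int n_steps,
                            unsigned long long seed) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= n_chains) return;

    curandState rng;
    curand_init(seed, c, 0, &rng);       // independent stream per chain

    float theta = 0.0f;
    for (int s = 0; s < n_steps; ++s) {
        float prop = theta + 0.5f * (curand_uniform(&rng) - 0.5f);  // symmetric proposal
        float log_alpha = log_target(prop) - log_target(theta);
        if (logf(curand_uniform(&rng)) < log_alpha)
            theta = prop;                // accept
    }
    samples[c] = theta;                  // keep only the final state, for brevity
}
```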
Computational biology algorithms tend to be less regular in structure than many of the financial algorithms successfully ported to GPUs. This means that they require some redesign at the algorithmic level in order to benefit from the huge amount of parallelism found in GPUs. You want to have dense and square data structures, and architect your code around large "for" loops with few "if" statements.
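As a rough sketch of what that restructuring means in CUDA terms (the field names are invented): dense, separate arrays per field so that neighbouring threads read neighbouring memory, and small conditional expressions instead of large divergent if/else bodies.

```cuda
// GPU-unfriendly: an array of structs means neighbouring threads read
// addresses that are far apart, so memory accesses don't coalesce well.
struct AtomAoS { float x, y, z, charge; };

// GPU-friendly: a dense "struct of arrays" layout, so thread i and
// thread i + 1 touch adjacent floats. All pointers refer to device memory.
struct AtomsSoA {
    float *x, *y, *z, *charge;           // each of length n
};

__global__ void scale_positive_charges(AtomsSoA atoms, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float q = atoms.charge[i];
    // A short conditional expression rather than a large if/else body,
    // so threads in the same warp don't take long divergent paths.
    atoms.charge[i] = (q > 0.0f) ? q * factor : q;
}
```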
This requires some thinking, but it is possible, and we're beginning to get interesting performance with a protein-folding code parallelized with Ateji PX.