CUDA or FPGA for special-purpose 3D graphics computations?

残龙傲雪 2024-07-15 18:22:33

I investigated the same question a while back. After chatting with people who have worked on FPGAs, this is what I concluded:

  • FPGAs are great for realtime systems, where even 1 ms of delay might be too long. This does not apply in your case;
  • FPGAs can be very fast, especially for well-defined digital signal processing usages (e.g. radar data), but the good ones are much more expensive and specialised than even professional GPGPUs;
  • FPGAs are quite cumbersome to program. Since compiling involves synthesizing a hardware configuration, a build can take hours. They seem better suited to electronic engineers (who are generally the ones who work on FPGAs) than to software developers.

If you can make CUDA work for you, it's probably the best option at the moment. It will certainly be more flexible than an FPGA.

Other options include Brook from ATI, but until something big happens, it is simply not as well adopted as CUDA. After that, there are still all the traditional HPC options (clusters of x86/PowerPC/Cell), but they are all quite expensive.

Hope that helps.

初心未许 2024-07-15 18:22:33

We did some comparison between FPGA and CUDA. CUDA shines if you can really formulate your problem in a SIMD fashion AND can access memory in a coalesced way. If the memory accesses are not coalesced (1), or if different threads take different control-flow paths, the GPU can lose most of its performance and the FPGA can outperform it. Another such case is when each operation is relatively small but you have a huge number of them, and you cannot (e.g. due to synchronisation) launch them in a loop within one kernel; then the invocation overhead of the GPU kernels exceeds the computation time.

Also, the power efficiency of the FPGA can be better (this depends on your application scenario; i.e. the GPU is only cheaper, in terms of Watts/Flop, when it is computing all the time).

Of course the FPGA also has some drawbacks: IO can be one (we had an application here where we needed 70 GB/s; no problem for a GPU, but getting this amount of data into an FPGA requires, for a conventional design, more pins than are available). Another drawback is time and money. An FPGA is much more expensive than the best GPU, and development times are very long.

(1) Simultaneous accesses from different threads to memory have to be to sequential addresses. This is sometimes really hard to achieve.
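
To make footnote (1) concrete, here is a minimal CUDA sketch (an illustration added here, not code from the original answer; the kernel names and the stride parameter are hypothetical). Adjacent threads reading adjacent addresses let the hardware coalesce a warp's loads into a few wide transactions; a strided pattern scatters them into many:

    #include <cuda_runtime.h>

    // Coalesced: thread i reads element i, so a warp of 32 threads touches
    // one contiguous block of memory per load instruction.
    __global__ void copyCoalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Non-coalesced: thread i reads a strided element, so a warp's loads are
    // scattered and split into many separate memory transactions.
    __global__ void copyStrided(const float* in, float* out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[(long long)i * stride % n];
    }

On hardware of that era the strided version could easily run several times slower than the coalesced one, which is the performance cliff the answer describes.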

万劫不复 2024-07-15 18:22:33

I would go with CUDA.
I work in image processing and have been trying hardware add-ons for years. First we had the i860, then the Transputer, then DSPs, then FPGAs and direct compilation to hardware.
What inevitably happened was that by the time the hardware boards were really debugged and reliable, and the code had been ported to them, regular CPUs had advanced enough to beat them, or the hosting machine architecture changed and we couldn't use the old boards, or the maker of the board went bust.

By sticking to something like CUDA you aren't tied to one small specialist maker of FPGA boards. The performance of GPUs is improving faster than that of CPUs, and it is funded by gamers. It's a mainstream technology, so it will probably merge with multi-core CPUs in the future and thereby protect your investment.

天煞孤星 2024-07-15 18:22:33

FPGAs

  • What you need:
    • Learn VHDL/Verilog (and trust me, you don't want to)
    • Buy hardware for testing and licences for the synthesis tools
    • If you already have the infrastructure and only need to develop your core:
      • Develop the design (and it can take years)
    • If you don't:
      • DMA, hardware drivers, ultra-expensive synthesis tools
      • Tons of knowledge about buses, memory mapping, hardware synthesis
      • Build the hardware, buy the IP cores
      • Develop the design
      • And that is not even mentioning board development
  • For example, an average FPGA PCIe card with a Xilinx ZynqUS+ chip costs more than $3000
  • FPGA cloud instances are also costly, at $2/h and up
  • Result:
    • This requires at least the resources of an established company.

GPGPU (CUDA/OpenCL)

  • You already have hardware to test on.
  • Compared to the FPGA route:
    • Everything is well documented.
    • Everything is cheap.
    • Everything works.
    • Everything is well integrated into programming languages.
  • There is a GPU cloud as well.
  • Result:
    • You just need to download the SDK and you can start - see the minimal sketch below.
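
For a sense of how small that first step is, here is a minimal, self-contained CUDA vector add (a generic quickstart sketch added for illustration, not taken from any particular SDK sample):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        // Host buffers.
        float* ha = (float*)malloc(bytes);
        float* hb = (float*)malloc(bytes);
        float* hc = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

        // Device buffers and transfers.
        float *da, *db, *dc;
        cudaMalloc(&da, bytes);
        cudaMalloc(&db, bytes);
        cudaMalloc(&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        // One thread per element, 256 threads per block.
        vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

        printf("c[0] = %f\n", hc[0]);   // expect 3.000000
        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }
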
断舍离 2024-07-15 18:22:33

Obviously this is a complex question. The question might also include the Cell processor. And there is probably no single answer that is correct for all the related questions.

In my experience, any implementation done in an abstract fashion, i.e. a compiled high-level language vs. a machine-level implementation, will inevitably have a performance cost, especially in a complex algorithm implementation. This is true of both FPGAs and processors of any type. An FPGA designed specifically to implement a complex algorithm will perform better than an FPGA whose processing elements are generic, allowing it a degree of programmability through input control registers, data I/O, etc.

Another general example where an FPGA can deliver much higher performance is in cascaded processes, where one process's outputs become the inputs to another and they cannot be run concurrently. Cascading processes in an FPGA is simple and can dramatically lower memory I/O requirements, whereas on a processor, memory has to be used to cascade two or more processes that have data dependencies.

The same can be said of a GPU and a CPU. Algorithms implemented in C executing on a CPU, developed without regard to the inherent performance characteristics of the cache or main memory system, will not perform as well as an implementation that does take them into account. Granted, ignoring those performance characteristics simplifies implementation, but at a performance cost. Loop ordering over a matrix, sketched below, is a simple example.
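
A minimal sketch of that cache point (an illustration added here, in plain C++ host code; the function names are hypothetical): summing a row-major matrix in row order streams through each cache line, while column order strides past a line on almost every access.

    #include <vector>

    // Row-major matrix: element (i, j) lives at data[i * cols + j].

    // Cache-friendly: the inner loop walks consecutive addresses, so every
    // cache line brought in is fully used before it is evicted.
    double sumRowMajor(const std::vector<double>& data, int rows, int cols) {
        double s = 0.0;
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < cols; ++j)
                s += data[(size_t)i * cols + j];
        return s;
    }

    // Cache-hostile: the inner loop jumps by `cols` doubles each step,
    // touching a new cache line on almost every access for large matrices.
    double sumColMajor(const std::vector<double>& data, int rows, int cols) {
        double s = 0.0;
        for (int j = 0; j < cols; ++j)
            for (int i = 0; i < rows; ++i)
                s += data[(size_t)i * cols + j];
        return s;
    }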

I have no direct experience with a GPU, but knowing its inherent memory-system performance issues, it too will be subject to such effects.

丘比特射中我 2024-07-15 18:22:33

This is an old thread started in 2008, but it is worth recounting what has happened to FPGA programming since then:
1. C-to-gates for FPGAs is now mainstream development at many companies, with HUGE time savings vs. Verilog/SystemVerilog HDL. With C-to-gates, system-level design is the hard part.
2. OpenCL on FPGAs has been around for 4+ years, including floating point and "cloud" deployment by Microsoft (Azure) and Amazon F1 (Ryft API). With OpenCL, system design is relatively easy because of the very well defined memory model and API between the host and the compute devices.

Software folks just need to learn a bit about FPGA architecture to be able to do things that are NOT EVEN POSSIBLE with GPUs and CPUs, for the reason that both are fixed silicon and lack broadband (100Gb+) interfaces to the outside world. Scaling down chip geometry is no longer possible, nor is extracting more heat from a single-chip package without melting it, so this looks like the end of the road for single-package chips. My thesis here is that the future belongs to parallel programming of multi-chip systems, and FPGAs have a great chance to be ahead of the game. Check out http://isfpga.org/ if you have concerns about performance, etc.

橘香 2024-07-15 18:22:33

CUDA has a fairly substantial code base of examples and an SDK, including a BLAS back-end. Try to find some examples similar to what you are doing, perhaps also looking at the GPU Gems series of books, to gauge how well CUDA will fit your applications. I'd say that from a logistical point of view, CUDA is easier to work with and much, much cheaper than any professional FPGA development toolkit.
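
For a flavour of that BLAS back-end, here is a minimal cuBLAS SAXPY sketch (an illustration added here, assuming the cublas_v2 API; error checking is omitted and the vectors are only zero-initialised for brevity):

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main() {
        const int n = 1 << 20;
        const float alpha = 2.0f;

        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));
        cudaMemset(x, 0, n * sizeof(float));   // real code would cudaMemcpy data in
        cudaMemset(y, 0, n * sizeof(float));

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSaxpy(handle, n, &alpha, x, 1, y, 1);   // y = alpha * x + y, on the GPU
        cublasDestroy(handle);

        cudaFree(x);
        cudaFree(y);
        return 0;
    }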

At one point I did look into CUDA for claim reserve simulation modelling. There is quite a good series of lectures linked off the website for learning. On Windows, you need to make sure CUDA is running on a card with no displays attached, as the graphics subsystem has a watchdog timer that will nuke any process running for more than 5 seconds. This does not occur on Linux.
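
You can check which of your cards that watchdog applies to programmatically; a small sketch added here for illustration, using the kernelExecTimeoutEnabled field of cudaDeviceProp:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            printf("device %d: %s, watchdog %s\n", d, prop.name,
                   prop.kernelExecTimeoutEnabled ? "ENABLED" : "disabled");
            // Prefer a card the display watchdog does not apply to, so
            // long-running kernels will not be killed.
            if (!prop.kernelExecTimeoutEnabled) cudaSetDevice(d);
        }
        return 0;
    }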

Any machine with two PCI-e x16 slots should support this. I used an HP XW9300, which you can pick up on eBay quite cheaply. If you do, make sure it has two CPUs (not one dual-core CPU), as the PCI-e slots live on separate HyperTransport buses and you need two CPUs in the machine to have both buses active.

心的憧憬 2024-07-15 18:22:33

An FPGA-based solution is likely to be far more expensive than CUDA.

撩发小公举 2024-07-15 18:22:33

What are you deploying on? Who is your customer? Without even knowing the answers to these questions, I would not use an FPGA unless you are building a real-time system and have electrical/computer engineers on your team with knowledge of hardware description languages such as VHDL and Verilog. There's a lot to it, and it takes a different frame of mind than conventional programming.

血之狂魔 2024-07-15 18:22:33

I'm a CUDA developer with very little experience with FPGAs, but I've been trying to find comparisons between the two.

What I've concluded so far:

  • The GPU has by far the higher (accessible) peak performance.
  • It has a more favorable FLOP/watt ratio.
  • It is cheaper.
  • It is developing faster (quite soon you will literally have a "real" TFLOP available).
  • It is easier to program (read the articles on this; it is not just personal opinion).

Note that I say real/accessible to distinguish these from the numbers you will see in GPGPU marketing.

BUT the GPU is not more favorable when you need random access to data. This will hopefully change with the new Nvidia Fermi architecture, which has an optional L1/L2 cache.

My 2 cents.

看春风乍起 2024-07-15 18:22:33

Others have given good answers; I just want to add a different perspective. Here is my survey paper published in ACM Computing Surveys 2015 (its permalink is here), which compares GPUs with FPGAs and CPUs on an energy-efficiency metric. Most papers report that the FPGA is more energy efficient than the GPU, which, in turn, is more energy efficient than the CPU. Since power budgets are fixed (depending on cooling capability), the energy efficiency of the FPGA means one can do more computation within the same power budget, and thus get better performance with an FPGA than with a GPU. Of course, also account for the FPGA limitations mentioned by others.

软糯酥胸 2024-07-15 18:22:33

  • FPGAs are more parallel than GPUs, by three orders of magnitude. While a good GPU features thousands of cores, an FPGA may have millions of programmable gates.
  • While CUDA cores must do highly similar computations to be productive, FPGA cells are truly independent of each other.
  • FPGAs can be very fast for some groups of tasks and are often used where a millisecond is already seen as a long duration.
  • A GPU core is way more powerful than an FPGA cell, and much easier to program. It is a real core that can divide and multiply, no problem, while an FPGA cell is only capable of rather simple boolean logic.
  • As the GPU core is a core, it is efficient to program it in C++. Even if it is also possible to program an FPGA in C++, it is inefficient (merely "productive"). Specialized languages like VHDL or Verilog must be used, and they are challenging to master.
  • Most of the tried and true instincts of a software engineer are useless with an FPGA. You want a for loop with these gates? Which galaxy are you from? You need to switch to the mindset of an electronics engineer to understand this world.

挽袖吟 2024-07-15 18:22:33

FPGAs will not be favoured by those with a software bias, as they need to learn an HDL or at least understand SystemC.

For those with a hardware bias, the FPGA will be the first option considered.

In reality, a firm grasp of both is required; only then can an objective decision be made.

OpenCL is designed to run on both FPGAs and GPUs, and even CUDA can be ported to FPGAs.

FPGA and GPU accelerators can also be used together.

So it's not a case of one being better than the other. There is also the debate about CUDA vs. OpenCL.

Again, unless you have optimized and benchmarked both for your specific application, you cannot know with 100% certainty.

Many will simply go with CUDA because of its commercial nature and resources. Others will go with OpenCL because of its versatility.

熟人话多 2024-07-15 18:22:33

At the latest GTC'13, many HPC people agreed that CUDA is here to stay. FPGAs are cumbersome, while CUDA is getting much more mature, supporting Python/C/C++/ARM... Either way, this was a dated question.

沉溺在你眼里的海 2024-07-15 18:22:33

Programming a GPU in CUDA is definitely easier. If you don't have any experience with programming FPGAs in an HDL, it will almost surely be too much of a challenge for you, but you can still program them with OpenCL, which is somewhat similar to CUDA. However, it is harder to implement and probably a lot more expensive than programming GPUs.

Which one is faster?

The GPU runs faster, but the FPGA can be more efficient.

The GPU has the potential of running at a speed higher than the FPGA can ever reach, but only for algorithms that are specially suited to it. If the algorithm is not optimal, the GPU loses a lot of performance.

The FPGA, on the other hand, runs much slower, but you can implement problem-specific hardware that will be very efficient and get things done in less time.

It's kinda like eating your soup with a fork very fast vs. eating it with a spoon more slowly.

Both devices base their performance on parallelization, but each in a slightly different way. If the algorithm can be broken into many pieces that execute the same operations (keyword: SIMD), the GPU will be faster. If the algorithm can be implemented as a long pipeline, the FPGA will be faster. Also, if you want to use floating point, the FPGA will not be very happy with it :) The sketch below contrasts the two shapes.
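
A minimal sketch of the two shapes (an illustration added here, not from the original answer; the kernel names are hypothetical): the same data, two dependency structures.

    #include <cuda_runtime.h>

    // SIMD-friendly: every thread applies the same operation to its own
    // element, which is exactly the shape a GPU executes fastest.
    __global__ void mapKernel(const float* x, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = 2.0f * x[i] + 1.0f;
    }

    // Pipeline-shaped: y[i] depends on y[i-1] (a first-order recurrence),
    // so the chain cannot be split across threads without restructuring
    // (e.g. as a parallel scan). An FPGA would instead implement it as one
    // deep, fully pipelined multiply-accumulate fed every clock cycle.
    __global__ void recurrenceKernel(const float* x, float* y, int n, float a) {
        if (blockIdx.x == 0 && threadIdx.x == 0) {   // forced serial on a GPU
            float acc = 0.0f;
            for (int i = 0; i < n; ++i) {
                acc = a * acc + x[i];
                y[i] = acc;
            }
        }
    }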

I have dedicated my whole master's thesis to this topic.
Algorithm Acceleration on FPGA with OpenCL
