Algorithms where FPGAs dominate CPUs

Posted 2024-09-03 10:34:29

For most of my life, I've programmed CPUs; and although for most algorithms, the big-Oh running time remains the same on CPUs / FPGAs, the constants are quite different (for example, lots of CPU power is wasted shuffling data around; whereas for FPGAs it's often compute bound).

I would like to learn more about this -- anyone know of good books / reference papers / tutorials that deal with the issues of:

- what tasks do FPGAs dominate CPUs on (in terms of pure speed)
- what tasks do FPGAs dominate CPUs on (in terms of work per joule)

Note: marked community wiki

Comments (4)

晨与橙与城 2024-09-10 10:34:30

[no links, just my musings]

FPGAs are essentially interpreters for hardware!
The architecture is like that of a dedicated ASIC, but in exchange for rapid development you pay a factor of ~10 in frequency and a [don't know, at least 10?] factor in power efficiency.

So take any task where dedicated HW can massively outperform CPUs, divide by the FPGA 10/[?] factors, and you'll probably still have a winner. Typical qualities of such tasks:

  • Massive opportunities for fine-grained parallelism.
    (Doing 4 operations at once doesn't count; 128 does.)
  • Opportunity for deep pipelining.
    This is also a kind of parallelism, but it's hard to apply it to a
    single task, so it helps if you can get many separate tasks to
    work on in parallel.
  • (Mostly) Fixed data flow paths.
    Some muxes are OK, but massive random accesses are bad, because you
    can't parallelize them. But see below about memories.
  • High total bandwidth to many small memories.
    FPGAs have hundreds of small (O(1KB)) internal memories
    (BlockRAMs in Xilinx parlance), so if you can partition your
    memory usage into many independent buffers, you can enjoy a data
    bandwidth that CPUs never dreamed of.
  • Small external bandwidth (compared to internal work).
    The ideal FPGA task has small inputs and outputs but requires a
    lot of internal work. This way your FPGA won't starve waiting for
    I/O. (CPUs already suffer from starving, and they alleviate it
    with very sophisticated (and big) caches, unmatchable in FPGAs.)
    It's perfectly possible to connect a huge I/O bandwidth to an
    FPGA (~1000 pins nowadays, some with high-rate SERDESes) -
    but doing that requires a custom board architected for such
    bandwidth; in most scenarios, your external I/O will be a
    bottleneck.
  • Simple enough for HW (aka good SW/HW partitioning).
    Many tasks consist of 90% irregular glue logic and only 10%
    hard work ("kernel" in the DSP sense). If you put all that
    onto an FPGA, you'll waste precious area on logic that does no
    work most of the time. Ideally, you want all the muck
    to be handled in SW and fully utilize the HW for the kernel.
    ("Soft-core" CPUs inside FPGAs are a popular way to pack lots of
    slow irregular logic onto medium area, if you can't offload it to
    a real CPU.)
  • Weird bit manipulations are a plus.
    Things that don't map well onto traditional CPU instruction sets,
    such as unaligned access to packed bits, hash functions, coding &
    compression... (see the packed-bit sketch after this list).
    However, don't overestimate the factor this gives you - most data
    formats and algorithms you'll meet have already been designed to go
    easy on CPU instruction sets, and CPUs keep adding specialized
    instructions for multimedia.
    Lots of floating point specifically is a minus, because both
    CPUs and GPUs crunch it on extremely optimized dedicated silicon.
    (So-called "DSP" FPGAs also have lots of dedicated mul/add units,
    but AFAIK these only do integers?)
  • Low latency / real-time requirements are a plus.
    Hardware can really shine under such demands.
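
As a small illustration of the "weird bit manipulations" point above, here is a minimal C sketch (my own, not from the original answer) of pulling an arbitrary unaligned bit field out of a packed byte stream; the helper name and buffer are invented. On a CPU every field costs loads, shifts and masks; on an FPGA the same extraction is essentially just wiring.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: extract `width` bits (1..57) starting at absolute
 * bit offset `bit_off` from a little-endian packed buffer. The caller must
 * guarantee at least 8 readable bytes from the starting byte. */
static uint64_t extract_bits(const uint8_t *buf, size_t bit_off, unsigned width)
{
    size_t first_byte = bit_off / 8;
    unsigned shift = bit_off % 8;
    uint64_t word = 0;

    /* Gather 8 bytes covering the field (an unaligned load, done portably). */
    for (unsigned i = 0; i < 8; ++i)
        word |= (uint64_t)buf[first_byte + i] << (8 * i);

    return (word >> shift) & ((1ULL << width) - 1);
}

int main(void)
{
    uint8_t packed[16] = { 0xB4, 0x3C, 0xA5, 0x01 };   /* example packed stream */

    /* A 13-bit field starting at bit 5: several instructions on a CPU,
     * zero logic (pure routing) on an FPGA. */
    printf("%llu\n", (unsigned long long)extract_bits(packed, 5, 13));
    return 0;
}
```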

EDIT: Several of these conditions — esp. fixed data flows and many separate tasks to work on — also enable bit slicing on CPUs, which somewhat levels the field.
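
To make the bit-slicing remark concrete, here is a minimal C sketch (my own illustration) of a bitsliced ripple-carry adder: 64 independent 8-bit additions run in parallel, with task i occupying bit i of every word. It only pays off under exactly the conditions above - a fixed data flow and many identical independent tasks.

```c
#include <stdint.h>
#include <stdio.h>

#define BITS 8   /* operand width of each lane */

/* Bitsliced ripple-carry adder: a[k] holds bit k of all 64 lanes' first
 * operand, b[k] bit k of the second; sum[k] receives bit k of the result.
 * One pass performs 64 independent additions using only AND/XOR/OR. */
static void bitsliced_add(const uint64_t a[BITS], const uint64_t b[BITS],
                          uint64_t sum[BITS])
{
    uint64_t carry = 0;
    for (int k = 0; k < BITS; ++k) {
        uint64_t half = a[k] ^ b[k];
        sum[k] = half ^ carry;
        carry  = (a[k] & b[k]) | (half & carry);
    }
}

int main(void)
{
    uint64_t a[BITS] = {0}, b[BITS] = {0}, s[BITS];

    /* Load 100 + 55 into lane 0 by spreading each operand's bits across slices. */
    for (int k = 0; k < BITS; ++k) {
        a[k] |= (uint64_t)((100u >> k) & 1u) << 0;
        b[k] |= (uint64_t)((55u  >> k) & 1u) << 0;
    }

    bitsliced_add(a, b, s);

    unsigned result = 0;
    for (int k = 0; k < BITS; ++k)
        result |= ((unsigned)(s[k] >> 0) & 1u) << k;
    printf("lane 0: %u\n", result);   /* prints 155 */
    return 0;
}
```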

雨落星ぅ辰 2024-09-10 10:34:30

Well, the newest generation of Xilinx parts just announced brags 4.7 TMACs and general-purpose logic at 600 MHz. (These are basically Virtex-6s fabbed on a smaller process.)

On a beast like this, if you can implement your algorithms in fixed-point operations (primarily multiplies, adds and subtracts) and take advantage of both wide parallelism and pipelined parallelism, you can eat most PCs alive, in terms of both power and processing.

You can do floating point on these, but there will be a performance hit. The DSP blocks contain a 25x18-bit MACC with a 48-bit accumulator. If you can get away with oddball formats and bypass some of the floating-point normalization that normally occurs, you can still eke out a truckload of performance (e.g. use the 18-bit input as straight fixed point, or as a float with a 17-bit mantissa instead of the normal 24 bits). Double-precision floats are going to eat a lot of resources, so if you need that, you will probably do better on a PC.
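
As a rough software model of that style of DSP-block MAC (an 18-bit operand, a wider second operand, a 48-bit accumulator), here is a C sketch; the Q1.17 scaling and the tiny dot product are invented for illustration.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { int64_t acc; } mac48_t;   /* models a 48-bit accumulator */

/* One multiply-accumulate step. Callers keep `a` within 18 signed bits and
 * `b` within 25 signed bits, mirroring the DSP block's input widths. */
static void mac(mac48_t *m, int32_t a, int32_t b)
{
    m->acc += (int64_t)a * (int64_t)b;
    m->acc &= (1LL << 48) - 1;                        /* wrap to 48 bits ... */
    if (m->acc & (1LL << 47)) m->acc -= (1LL << 48);  /* ... and sign-extend */
}

int main(void)
{
    /* Dot product in Q1.17 fixed point (values scaled by 2^17). */
    int32_t x[4] = { 1 << 16, 1 << 15, -(1 << 14), 1 << 13 };  /* 0.5, 0.25, -0.125, 0.0625 */
    int32_t h[4] = { 1 << 17, 1 << 17,   1 << 17,   1 << 17 }; /* 1.0, 1.0, 1.0, 1.0 */

    mac48_t m = { 0 };
    for (int i = 0; i < 4; ++i)
        mac(&m, x[i], h[i]);

    /* The accumulator now carries 17 + 17 = 34 fractional bits. */
    printf("%f\n", (double)m.acc / (double)(1LL << 34));   /* 0.687500 */
    return 0;
}
```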

If your algorithms can be expressed in terms of add and subtract operations, then the general-purpose logic in these can be used to implement a gazillion adders. Things like Bresenham's line/circle/yadda/yadda/yadda algorithms are VERY good fits for FPGA designs.
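
For reference, the integer-only core of Bresenham's line algorithm mentioned above, in its standard textbook form - nothing in the loop but compares, adds and subtracts (the doubling is a shift), which is exactly what FPGA fabric handles cheaply:

```c
#include <stdio.h>
#include <stdlib.h>

/* Classic integer-only Bresenham line: no multiplies or divides inside the loop. */
static void bresenham(int x0, int y0, int x1, int y1)
{
    int dx = abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;

    for (;;) {
        printf("(%d,%d)\n", x0, y0);      /* "plot" the pixel */
        if (x0 == x1 && y0 == y1)
            break;
        int e2 = 2 * err;                 /* doubling = a free shift in hardware */
        if (e2 >= dy) { err += dy; x0 += sx; }
        if (e2 <= dx) { err += dx; y0 += sy; }
    }
}

int main(void)
{
    bresenham(0, 0, 7, 3);
    return 0;
}
```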

IF you need division... EH... it's painful, and probably going to be relatively slow unless you can implement your divides as multiplies.
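
One common way to turn divides into multiplies, for what it's worth, is fixed-point multiplication by a precomputed reciprocal when the divisor is a known constant. A sketch for dividing by 10 (with this particular 32-bit constant the result is exact only for inputs up to roughly 7e8; a wider reciprocal or a correction step covers the full range):

```c
#include <stdint.h>
#include <stdio.h>

/* x / 10 computed as (x * ceil(2^32 / 10)) >> 32 -- one multiply and a shift,
 * no divider needed. */
static uint32_t div_by_10(uint32_t x)
{
    const uint64_t recip = 0x1999999Aull;   /* ceil(2^32 / 10) */
    return (uint32_t)(((uint64_t)x * recip) >> 32);
}

int main(void)
{
    for (uint32_t x = 0; x < 100; x += 7)
        printf("%2u / 10 = %u\n", x, div_by_10(x));
    return 0;
}
```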

If you need lots of high-precision trig functions, not so much... Again, it CAN be done, but it's not going to be pretty or fast. (Just like it can be done on a 6502.) If you can cope with just using a lookup table over a limited range, then you're golden!
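
In the spirit of "a lookup table over a limited range", a minimal quarter-wave sine table in C - the same structure drops straight into a BlockRAM. The 10-bit phase, 256-entry table and Q1.15 output are arbitrary choices for illustration:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define TABLE_BITS 8
#define TABLE_SIZE (1 << TABLE_BITS)           /* 256-entry quarter-wave table */

static int16_t quarter_sine[TABLE_SIZE + 1];   /* Q1.15 samples of sin on [0, pi/2] */

static void init_table(void)
{
    const double pi = acos(-1.0);
    for (int i = 0; i <= TABLE_SIZE; ++i)
        quarter_sine[i] = (int16_t)lrint(32767.0 * sin(pi / 2.0 * i / TABLE_SIZE));
}

/* Sine of a 10-bit phase (0..1023 = one full turn), using quarter-wave
 * symmetry: only table reads, compares and a negation - no multiplies. */
static int16_t sin_lut(unsigned phase)
{
    unsigned p = phase & 1023u;
    unsigned quadrant = p >> TABLE_BITS;        /* top two bits pick the quadrant */
    unsigned idx = p & (TABLE_SIZE - 1);
    int16_t v = (quadrant & 1) ? quarter_sine[TABLE_SIZE - idx] : quarter_sine[idx];
    return (quadrant & 2) ? (int16_t)-v : v;
}

int main(void)
{
    init_table();
    printf("%d %d %d\n", sin_lut(0), sin_lut(256), sin_lut(768));   /* 0 32767 -32767 */
    return 0;
}
```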

Speaking of the 6502, a 6502 demo coder could make one of these things sing. All the old math tricks that programmers used to use on old-school machines like that still apply. All the tricks that modern programmers are told to "let the library do for you" are exactly the kinds of things you need to know to implement math on these. If you can find a book that talks about doing 3D on a 68000-based Atari or Amiga, it will discuss a lot of how to implement stuff in integer only.

ACTUALLY, any algorithm that can be implemented using lookup tables will be VERY well suited to FPGAs. Not only do you have BlockRAMs distributed throughout the part, but the logic cells themselves can be configured as various-sized LUTs and mini RAMs.

You can view things like fixed bit manipulations as FREE! They're simply handled by routing. Fixed shifts or bit reversals cost nothing. Dynamic bit operations like shifting by a variable amount will cost a minimal amount of logic, and can be done till the cows come home!
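
To make the contrast concrete: a full 32-bit bit reversal on a CPU takes the classic five mask-and-shift stages below (or a table), whereas on an FPGA the same permutation is literally just how the bus is wired.

```c
#include <stdint.h>
#include <stdio.h>

/* Classic O(log n) bit reversal: five swap stages of mask/shift/or.
 * In FPGA fabric this permutation costs zero logic - it is pure routing. */
static uint32_t bit_reverse32(uint32_t x)
{
    x = ((x & 0x55555555u) << 1)  | ((x >> 1)  & 0x55555555u);
    x = ((x & 0x33333333u) << 2)  | ((x >> 2)  & 0x33333333u);
    x = ((x & 0x0F0F0F0Fu) << 4)  | ((x >> 4)  & 0x0F0F0F0Fu);
    x = ((x & 0x00FF00FFu) << 8)  | ((x >> 8)  & 0x00FF00FFu);
    return (x << 16) | (x >> 16);
}

int main(void)
{
    printf("%08X\n", bit_reverse32(0x00000001u));   /* 80000000 */
    printf("%08X\n", bit_reverse32(0x12345678u));   /* 1E6A2C48 */
    return 0;
}
```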

The biggest part has 3960 multipliers! And 142,200 slices, EACH of which can be an 8-bit adder. (Four 6-bit LUTs per slice, or eight 5-bit LUTs per slice, depending on configuration.)

柠檬色的秋千 2024-09-10 10:34:30

Pick a gnarly SW algorithm. Our company does HW acceleration of SW algos for a living.

We've done HW implementations of regular expression engines that will run thousands of rule sets in parallel at speeds up to 10 Gb/sec. The target market for that is routers, where anti-virus and IPS/IDS can run in real time as the data streams by, without slowing down the router.

We've done HD video encoding in HW. It used to take several hours of processing time per second of film to convert it to HD. Now we can do it in near real time... it takes almost 2 seconds of processing to convert 1 second of film. Netflix used our HW almost exclusively for its video-on-demand product.

We've even done simple stuff like RSA, 3DES, and AES encryption and decryption in HW. We've done simple zip/unzip in HW. The target market for that is security video cameras. The government has a massive number of video cameras generating huge streams of real-time data. They zip it down in real time before sending it over their network, and then unzip it in real time on the other end.

Heck, another company I worked for used to do radar receivers using FPGAs. They would sample the digitized enemy radar data directly from several different antennas and, from the time delta of arrival, figure out what direction and how far away the enemy transmitter is. We could even check the unintentional modulation on pulse of the signals in the FPGAs to fingerprint specific transmitters, so we could know that this signal is coming from a specific Russian SAM site that used to be stationed at a different border, and thereby track weapons movements and sales.

Try doing that in software!! :-)
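
For intuition, the direction-finding step described above reduces to simple geometry: with two antennas a baseline distance d apart, a time delta of arrival dt gives a bearing of asin(c*dt/d) off boresight (range needs more antennas and multilateration). A sketch with invented numbers - a real receiver would do this per pulse, in fixed point, inside the FPGA:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double c = 299792458.0;    /* speed of light, m/s */
    const double baseline = 2.0;     /* antenna spacing in metres (made up) */
    const double dt = 3.0e-9;        /* measured arrival delta in seconds (made up) */

    double s = c * dt / baseline;    /* sine of the bearing; must lie in [-1, 1] */
    if (fabs(s) <= 1.0)
        printf("bearing off boresight: %.1f degrees\n",
               asin(s) * 180.0 / acos(-1.0));
    else
        printf("inconsistent measurement\n");
    return 0;
}
```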

画中仙 2024-09-10 10:34:30

For pure speed:
- Parallelizable ones
- DSP, e.g. video filters
- Moving data, e.g. DMA
