FPGA timing question
I am new to FPGA programming and I have a question regarding the performance in terms of overall execution time.
I have read that latency is counted in cycles; hence, overall execution time = latency * cycle time.
I want to optimize the time needed to process the data, so I would be measuring the overall execution time.
Let's say I have a calculation a = b * c * d.
If I make it calculate in two cycles (result1 = b * c, then a = result1 * d), the overall execution time would be a latency of 2 * the cycle time (which is determined by the delay of one multiplication, say some value X) = 2X.
If I make the calculation in one cycle (a = b * c * d), the overall execution time would be a latency of 1 * the cycle time (say 2X, since it has twice the delay because of two multiplications instead of one) = 2X.
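To make the comparison concrete, here is a small sketch of that arithmetic in plain C++ (X is just a placeholder value for the delay of one multiplication):

```cpp
#include <iostream>

int main() {
    const double X = 5.0;  // assumed delay of one multiplication in ns (placeholder)

    // Two-cycle version: latency = 2 cycles, cycle time limited by one multiply.
    double two_cycle = 2 * X;        // 2 * X = 2X

    // One-cycle version: latency = 1 cycle, cycle time covers two chained multiplies.
    double one_cycle = 1 * (2 * X);  // 1 * 2X = 2X

    std::cout << "two cycles: " << two_cycle << " ns, one cycle: " << one_cycle << " ns\n";
}
```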
So it seems that, for optimizing performance in terms of execution time, if I focus only on decreasing the latency, the cycle time would increase, and vice versa. Is there a case where both the latency and the cycle time could be decreased, causing the execution time to decrease? When should I focus on optimizing the latency, and when should I focus on the cycle time?
Also, when I am programming in C++, it seems that when I want to optimize the code, I want to optimize the latency (the cycles needed for the execution). However, for FPGA programming, optimizing the latency alone is not adequate, since the cycle time may increase. Hence, I should focus on optimizing the execution time (latency * cycle time). Am I correct in this, if I would like to increase the speed of the program?
Hope that someone can help me with this. Thanks in advance.
Comments (3)
I tend to think of latency as the time from the first input to the first output. As there is usually a series of data, it is useful to look at the time taken to process multiple inputs, one after another.
With your example, to process 10 items doing a = b x c x d in one cycle (one cycle = 2t) would take 20t. However, doing it in two 1t cycles, pipelined, processing 10 items would take 11t.
Hope that helps.
Edit: added timing.
Timing for the calculation in one 2t cycle, 10 calculations.
Timing for the calculation in two 1t cycles, pipelined, 10 calculations.
The latency for both solutions is 2t: one 2t cycle for the first one, and two 1t cycles for the second. However, the throughput of the second solution is twice as high: once the latency is accounted for, you get a new answer every 1t cycle.
So if you had a complex calculation that required, say, five 1t cycles, then the latency would be 5t, but the throughput would still be one result every 1t.
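To spell out where the 20t and 11t come from, here is a small sketch of the formula I am using; the function name is just for illustration, and it assumes the unit can accept one new item every cycle:

```cpp
#include <iostream>

// Time to process 'items' inputs when a new input can enter every cycle:
// the first result appears after 'latency_cycles', then one result per cycle.
double total_time(int items, int latency_cycles, double cycle_time) {
    return (latency_cycles + items - 1) * cycle_time;
}

int main() {
    const double t = 1.0;  // the time unit "t" from the example above

    // a = b*c*d done in one 2t cycle, 10 items: (1 + 9) * 2t = 20t
    std::cout << total_time(10, 1, 2 * t) << "t\n";

    // Split into two 1t stages, pipelined, 10 items: (2 + 9) * 1t = 11t
    std::cout << total_time(10, 2, t) << "t\n";
}
```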
You need another word in addition to latency and cycle time, and that is throughput. Even if it takes 2 cycles to get an answer, if you can put new data in every cycle and get a result out every cycle, your throughput can be increased by 2x over the "do it all in one cycle" approach.
Say your calculation takes 40 ns in one cycle, so a throughput of 25 million data items/sec.
If you pipeline it (which is the technical term for splitting the calculation across multiple cycles), you can do it in 2 lots of 20 ns + a bit (you lose a bit in the extra registers that have to go in). Let's say that bit is 10 ns (which is a lot, but makes the sums easy). So now it takes 2x20+10 = 50 ns => 20M items/sec. Worse!
But, if you can make the 2 stages independent of each other (in your case, not sharing the multiplier), you can push new data into the pipeline every 25 + a bit ns. This "a bit" will be smaller than the previous one, but even if it's the whole 10 ns, you can push data in at 35 ns intervals, or nearly 30M items/sec, which is better than you started with.
In real life the 10 ns will be much less, often 100s of ps, so the gains are much larger.
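A quick sketch of that arithmetic, using the deliberately pessimistic 10 ns register overhead from above:

```cpp
#include <iostream>

// Throughput in items per second for a given interval between results (in ns).
double items_per_sec(double interval_ns) {
    return 1e9 / interval_ns;
}

int main() {
    std::cout << items_per_sec(40.0) / 1e6 << " M items/s, single 40 ns cycle\n";        // 25
    std::cout << items_per_sec(50.0) / 1e6 << " M items/s, two stages, not overlapped\n"; // 20
    std::cout << items_per_sec(35.0) / 1e6 << " M items/s, two independent stages\n";     // ~28.6
}
```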
George accurately described the meaning of latency (which does not necessarily relate to the computation time). It seems you want to optimize your design for speed. This is very complex and requires a lot of experience. The total runtime is roughly

total_runtime = N * computation_cycles * cycle_time

where N is the number of calculations you want to perform. If you develop for acceleration, you should only compute on large data sets, i.e. N is big. Usually you then don't have requirements for latency (which could be different in real-time applications). The determining factors are then the cycle_time and the computation_cycles, and here it is really hard to optimize, because there is a relation: the cycle_time is determined by the critical path of your design, and that path gets longer the fewer registers you have on it; the longer it gets, the bigger the cycle_time. But the more registers you have, the higher your computation_cycles (each register increases the number of required cycles by one). Maybe I should add that the latency is usually the number of computation_cycles (it is the first computation that determines the latency), but in theory this can be different.
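As a rough sketch of that relation (all numbers here are invented, and this ignores the pipelining/overlap point made in the other answers; in a real design the cycle_time comes out of the timing report):

```cpp
#include <cstdio>

int main() {
    const double N = 1e6;  // number of calculations; assumed large, as above

    // Invented design points: more registers -> more computation_cycles,
    // but a shorter critical path -> smaller cycle_time.
    struct Design { int computation_cycles; double cycle_time_ns; };
    const Design designs[] = {
        {1, 8.0},  // no extra registers, long critical path
        {2, 4.5},  // one register splits the path
        {4, 2.8},  // heavily registered
    };

    for (const Design& d : designs) {
        double runtime_ms = N * d.computation_cycles * d.cycle_time_ns / 1e6;
        std::printf("%d cycles x %.1f ns per cycle -> %.1f ms total\n",
                    d.computation_cycles, d.cycle_time_ns, runtime_ms);
    }
}
```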