FPGA timing question
I am new to FPGA programming and I have a question regarding the performance in terms of overall execution time.
I have read that latency is counted in cycles; hence, overall execution time = latency * cycle time.
I want to optimize the time needed to process the data, so I would be measuring the overall execution time.
Let's say I have a calculation a = b * c * d.
If I make it calculate in two cycles (result1 = b * c, then a = result1 * d), the overall execution time would be a latency of 2 * the cycle time (which is determined by the delay of one multiplication, say some value X) = 2X.
If I make the calculation in one cycle (a = b * c * d), the overall execution time would be a latency of 1 * the cycle time (say 2X, since it has twice the delay because of two multiplications instead of one) = 2X.
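To make the comparison concrete, here is a small sketch of that arithmetic in plain C++ (X is just a placeholder value for the delay of one multiplication):

```cpp
#include <iostream>

int main() {
    const double X = 5.0;  // assumed delay of one multiplication in ns (placeholder)

    // Two-cycle version: latency = 2 cycles, cycle time limited by one multiply.
    double two_cycle = 2 * X;        // 2 * X = 2X

    // One-cycle version: latency = 1 cycle, cycle time covers two chained multiplies.
    double one_cycle = 1 * (2 * X);  // 1 * 2X = 2X

    std::cout << "two cycles: " << two_cycle << " ns, one cycle: " << one_cycle << " ns\n";
}
```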
So it seems that, for optimizing performance in terms of execution time, if I focus only on decreasing the latency, the cycle time would increase, and vice versa. Is there a case where both the latency and the cycle time could be decreased, causing the execution time to decrease? When should I focus on optimizing the latency, and when should I focus on the cycle time?
Also, when I am programming in C++, it seems that when I want to optimize the code, I want to optimize the latency (the cycles needed for the execution). However, for FPGA programming, optimizing the latency alone is not adequate, since the cycle time may increase. Hence, I should focus on optimizing the execution time (latency * cycle time). Am I correct in this, if I would like to increase the speed of the program?
Hope that someone can help me with this. Thanks in advance.
Comments (3)
I tend to think of latency as the time from the first input to the first output. As there is usually a series of data, it is useful to look at the time taken to process multiple inputs, one after another.
With your example, to process 10 items doing a = b x c x d in one cycle (one cycle = 2t) would take 20t. However, doing it in two 1t cycles, pipelined, processing 10 items would take 11t.
Hope that helps.
Edit: added timing.
Timing for the calculation in one 2t cycle, 10 calculations.
Timing for the calculation in two 1t cycles, pipelined, 10 calculations.
The latency for both solutions is 2t: one 2t cycle for the first one, and two 1t cycles for the second. However, the throughput of the second solution is twice as high: once the latency is accounted for, you get a new answer every 1t cycle.
So if you had a complex calculation that required, say, five 1t cycles, then the latency would be 5t, but the throughput would still be one result every 1t.
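To spell out where the 20t and 11t come from, here is a small sketch of the formula I am using; the function name is just for illustration, and it assumes the unit can accept one new item every cycle:

```cpp
#include <iostream>

// Time to process 'items' inputs when a new input can enter every cycle:
// the first result appears after 'latency_cycles', then one result per cycle.
double total_time(int items, int latency_cycles, double cycle_time) {
    return (latency_cycles + items - 1) * cycle_time;
}

int main() {
    const double t = 1.0;  // the time unit "t" from the example above

    // a = b*c*d done in one 2t cycle, 10 items: (1 + 9) * 2t = 20t
    std::cout << total_time(10, 1, 2 * t) << "t\n";

    // Split into two 1t stages, pipelined, 10 items: (2 + 9) * 1t = 11t
    std::cout << total_time(10, 2, t) << "t\n";
}
```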
You need another word in addition to latency and cycle time, and that is throughput. Even if it takes 2 cycles to get an answer, if you can put new data in every cycle and get a result out every cycle, your throughput can be increased by 2x over the "do it all in one cycle" approach.
Say your calculation takes 40 ns in one cycle, so a throughput of 25 million data items/sec.
If you pipeline it (which is the technical term for splitting the calculation across multiple cycles), you can do it in 2 lots of 20 ns + a bit (you lose a bit in the extra registers that have to go in). Let's say that bit is 10 ns (which is a lot, but makes the sums easy). So now it takes 2x20+10 = 50 ns => 20M items/sec. Worse!
But, if you can make the 2 stages independent of each other (in your case, not sharing the multiplier), you can push new data into the pipeline every 25 + a bit ns. This "a bit" will be smaller than the previous one, but even if it's the whole 10 ns, you can push data in at 35 ns intervals, or nearly 30M items/sec, which is better than you started with.
In real life the 10 ns will be much less, often 100s of ps, so the gains are much larger.
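A quick sketch of that arithmetic, using the deliberately pessimistic 10 ns register overhead from above:

```cpp
#include <iostream>

// Throughput in items per second for a given interval between results (in ns).
double items_per_sec(double interval_ns) {
    return 1e9 / interval_ns;
}

int main() {
    std::cout << items_per_sec(40.0) / 1e6 << " M items/s, single 40 ns cycle\n";        // 25
    std::cout << items_per_sec(50.0) / 1e6 << " M items/s, two stages, not overlapped\n"; // 20
    std::cout << items_per_sec(35.0) / 1e6 << " M items/s, two independent stages\n";     // ~28.6
}
```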
George accurately described the meaning of latency (which does not necessarily relate to the computation time). It seems you want to optimize your design for speed. This is very complex and requires a lot of experience. The total runtime is roughly

total_runtime = N * computation_cycles * cycle_time

where N is the number of calculations you want to perform. If you develop for acceleration, you should only compute on large data sets, i.e. N is big. Usually you then don't have requirements for latency (which could be different in real-time applications). The determining factors are then the cycle_time and the computation_cycles, and here it is really hard to optimize, because there is a relation: the cycle_time is determined by the critical path of your design, and that path gets longer the fewer registers you have on it; the longer it gets, the bigger the cycle_time. But the more registers you have, the higher your computation_cycles (each register increases the number of required cycles by one). Maybe I should add that the latency is usually the number of computation_cycles (it is the first computation that determines the latency), but in theory this can be different.
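As a rough sketch of that relation (all numbers here are invented, and this ignores the pipelining/overlap point made in the other answers; in a real design the cycle_time comes out of the timing report):

```cpp
#include <cstdio>

int main() {
    const double N = 1e6;  // number of calculations; assumed large, as above

    // Invented design points: more registers -> more computation_cycles,
    // but a shorter critical path -> smaller cycle_time.
    struct Design { int computation_cycles; double cycle_time_ns; };
    const Design designs[] = {
        {1, 8.0},  // no extra registers, long critical path
        {2, 4.5},  // one register splits the path
        {4, 2.8},  // heavily registered
    };

    for (const Design& d : designs) {
        double runtime_ms = N * d.computation_cycles * d.cycle_time_ns / 1e6;
        std::printf("%d cycles x %.1f ns per cycle -> %.1f ms total\n",
                    d.computation_cycles, d.cycle_time_ns, runtime_ms);
    }
}
```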