Reducing delay by understanding the Xilinx synthesis report
I am implementing the 8051 instruction set in VHDL using the Xilinx tools. After writing the logic and generating the synthesis report, I see a delay of 13.330ns (a frequency of 75.020 MHz) with Levels of Logic = 10.
That frequency is rather low and I need to raise it, but I can't work out from the synthesis report what is causing the delay or where it is.
This is the part of the report that covers timing:
=========================================================================
Timing constraint: Default period analysis for Clock 'clk_div1'
  Clock period: 13.330ns (frequency: 75.020MHz)
  Total number of paths / destination ports: 156134 / 3086
-------------------------------------------------------------------------
Delay:               13.330ns (Levels of Logic = 10)
  Source:            SEQ/alu_op_code_1 (FF)
  Destination:       SEQ/alu_src_2L_7 (FF)
  Source Clock:      clk_div1 rising
  Destination Clock: clk_div1 rising

  Data Path: SEQ/alu_op_code_1 to SEQ/alu_src_2L_7
                        Gate     Net
Cell:in->out  fanout   Delay   Delay  Logical Name (Net Name)
----------------------------------------  ------------
FDE:C->Q          40   0.591   1.345  SEQ/alu_op_code_1 (SEQ/alu_op_code_1)
LUT4:I1->O         2   0.643   0.527  ALU1/ci32_SW0 (N2251)
LUT4:I1->O         1   0.643   0.000  ALU1/adder_comp/C11_F (N1292)
MUXF5:I0->O        3   0.276   0.531  ALU1/adder_comp/C11 (ALU1/adder_comp/C1)
MUXF5:S->O        12   0.756   0.964  ALU1/adder_comp/C21 (ALU1/adder_comp/C2)
LUT4:I3->O         8   0.648   0.760  ALU1/ans_L<5>104 (ALU1/ans_L<5>104)
LUT4:I3->O        17   0.648   1.054  ALU1/ans_L<7>95_SW0 (N264)
LUT4:I3->O         1   0.648   0.000  SEQ/alu_src_2H_and000055_SW3_F (N1304)
MUXF5:I0->O        1   0.276   0.423  SEQ/alu_src_2H_and000055_SW3 (N599)
LUT4_D:I3->O      15   0.648   1.049  SEQ/alu_src_2L_mux0005<7>121228 (N285)
LUT4:I2->O         1   0.648   0.000  SEQ/alu_src_2H_mux0007<6> (SEQ/alu_src_2H_mux0007<6>)
FDE:D                  0.252          SEQ/alu_src_2H_1
----------------------------------------
Total                 13.330ns (6.677ns logic, 6.653ns route)
                               (50.1% logic, 49.9% route)
Can someone explain what is happening?
Look at the names in the report and compare them with your source code.
Basically, the path is pure combinational logic that starts in the "SEQ" instance at the ALU opcode register and ends at the register "alu_src_2L":
Source: SEQ/alu_op_code_1 (FF)
Destination: SEQ/alu_src_2L_7 (FF)
Looking at the details, you can see that in this particular path most of the time (about 7.5ns of the 13.33ns) is spent in your ALU "ALU1", and roughly 3.2ns of that in the adder/comparison logic "adder_comp". If you want less delay in this path, you will have to either optimize the logic or cut the path with another register (and make the rest of the design still work with that change).
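One way to see where the time goes is to total the gate and net delays per level of design hierarchy. The following is a minimal Python sketch, not part of any Xilinx tool: the rows are copied by hand from the data path in the question, and the helper name `delay_by_hierarchy` is made up for illustration.

```python
from collections import defaultdict

# (logical name, gate delay ns, net delay ns), copied by hand from the
# "Data Path" table in the question's synthesis report.
DATA_PATH = [
    ("SEQ/alu_op_code_1", 0.591, 1.345),
    ("ALU1/ci32_SW0", 0.643, 0.527),
    ("ALU1/adder_comp/C11_F", 0.643, 0.000),
    ("ALU1/adder_comp/C11", 0.276, 0.531),
    ("ALU1/adder_comp/C21", 0.756, 0.964),
    ("ALU1/ans_L<5>104", 0.648, 0.760),
    ("ALU1/ans_L<7>95_SW0", 0.648, 1.054),
    ("SEQ/alu_src_2H_and000055_SW3_F", 0.648, 0.000),
    ("SEQ/alu_src_2H_and000055_SW3", 0.276, 0.423),
    ("SEQ/alu_src_2L_mux0005<7>121228", 0.648, 1.049),
    ("SEQ/alu_src_2H_mux0007<6>", 0.648, 0.000),
    ("SEQ/alu_src_2H_1", 0.252, 0.000),  # FDE:D setup time, no net delay listed
]


def delay_by_hierarchy(rows, depth):
    """Sum gate + net delay per hierarchy prefix, keeping `depth` levels."""
    totals = defaultdict(float)
    for name, gate_ns, net_ns in rows:
        hierarchy = name.split("/")[:-1]  # drop the cell's own name
        key = "/".join(hierarchy[:depth])
        totals[key] += gate_ns + net_ns
    return dict(totals)


if __name__ == "__main__":
    path_total = sum(gate + net for _, gate, net in DATA_PATH)
    for depth in (1, 2):
        print(f"-- hierarchy depth {depth} --")
        for key, ns in sorted(delay_by_hierarchy(DATA_PATH, depth).items()):
            print(f"{key:<18} {ns:6.3f} ns  {100 * ns / path_total:5.1f}%")
```

With these numbers, ALU1 accounts for roughly 7.45ns (about 56%) of the path, of which about 3.17ns is inside adder_comp. The SEQ figure includes the source flip-flop's clock-to-Q delay and its high-fanout net, plus the setup time of the destination flip-flop.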
A few definitions:
The 13.33ns is made up of two parts: 6.677ns of gate delay and 6.653ns of net delay.
The main factor in gate delay is how complex a function is contained within the cone of logic. The main factor in net delay is how many things each signal drives.
Each line in the report describes one logic block. The first line is the alu_op_code_1 register and the time it takes to get from the C pin (clock) to the Q pin (output). The fanout column says how many logic blocks the Q pin drives. In this case it's 40, which is why the net delay on that line is quite high. A high fanout is quite understandable for a commonly used register like the opcode of an ALU, though.
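The totals and the fanout effect described above can be checked with a few lines of arithmetic. This is a rough Python sketch with the fanout, gate delay, and net delay columns typed in by hand from the report in the question; the variable names are illustrative only.

```python
# (fanout, gate delay ns, net delay ns) for each row of the data path,
# typed in by hand from the report. The final FDE:D row has no fanout or
# net delay in the report, so it is entered here as fanout 0 and 0.000 ns.
ROWS = [
    (40, 0.591, 1.345),
    (2, 0.643, 0.527),
    (1, 0.643, 0.000),
    (3, 0.276, 0.531),
    (12, 0.756, 0.964),
    (8, 0.648, 0.760),
    (17, 0.648, 1.054),
    (1, 0.648, 0.000),
    (1, 0.276, 0.423),
    (15, 0.648, 1.049),
    (1, 0.648, 0.000),
    (0, 0.252, 0.000),
]

logic_ns = sum(gate for _, gate, _ in ROWS)
route_ns = sum(net for _, _, net in ROWS)
period_ns = logic_ns + route_ns
freq_mhz = 1000.0 / period_ns  # a period in ns gives a frequency in MHz

print(f"logic {logic_ns:.3f} ns ({100 * logic_ns / period_ns:.1f}%)")
print(f"route {route_ns:.3f} ns ({100 * route_ns / period_ns:.1f}%)")
print(f"period {period_ns:.3f} ns, about {freq_mhz:.2f} MHz")

# Net delay listed from highest fanout to lowest.
for fanout, _, net in sorted(ROWS, key=lambda row: row[0], reverse=True):
    print(f"fanout {fanout:2d} -> net delay {net:.3f} ns")
```

The sums reproduce the report's 6.677ns of logic and 6.653ns of route, and in this path the net delay never decreases as fanout goes up. The computed frequency lands a hair under the reported 75.020MHz, presumably because the report rounds the period for display.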
We can also look at the path as a whole and see that it goes from the opcode in SEQ into the ALU, through an adder, back into the SEQ block, and eventually into another register called alu_src_2H_1. What that path actually does, I can't tell you. Only someone with the source can do that, and then it's a case of working out what logic sits between those two registers.
What confuses me a little is that this path looks like it met timing (if 13.33ns is the target), yet you say you need to "beef it up". Why?
First, when writing or adapting HDL for an FPGA, it really pays to understand your particular FPGA's capabilities and limitations. Xilinx does an excellent job of documenting each FPGA family. Judging by the LUT4 and MUXF5 blocks, your FPGA family might be Spartan-3? By studying the datasheets you can see which hardware constructs are very efficient to implement and which require more resources. In general, the more closely a piece of hardware maps to what is actually on the chip, the faster it will run and the less area it will occupy.
For example, a Xilinx LUT can also be used as a shift register, meaning that you don't have to use the flip-flops in a slice. This gives a very noticeable improvement if you make sure that your shift registers are mapped to LUTs. XST tries its best to infer these efficient mappings from your HDL, but small details often prevent them, like the enable signal being checked before the reset signal. Make sure you study the output of the synthesizer and of place and route to spot places where you can improve the mapping onto your FPGA. The Xilinx documentation gives example VHDL and Verilog that XST can use to infer the more efficient components. It is often easier to simply instantiate the component directly. For complicated components, there are UNIMACROs and the COREGEN wizard, which produce very efficient hardware.
For an extreme example, the PicoBlaze microcontroller was written specifically to take advantage of the Xilinx FPGA architectures. It might be helpful to study the PicoBlaze source code to see examples of this efficient mapping.
Second, if your combinational logic path is too long, it limits your maximum clock frequency. Besides rewriting your code to map better onto your FPGA or to eliminate unnecessary hardware, you can insert flip-flops (registers) somewhere in the middle of the combinational logic chain. In computer architecture this is called pipelining, and it increases the number of clock cycles each instruction takes to complete. For example, the PicoBlaze uses two cycles per instruction, and the Intel Pentium 4 had a pipeline about 20 stages deep. If you are clever, you can write your HDL so that it starts processing one instruction while it finishes processing the previous one. Each instruction would then still take 2 clock cycles (latency), but you could retire one instruction per cycle (throughput). Most microcontrollers, like the 8051 and the PicoBlaze, are concerned with latency, and most microprocessors, like the x86 architecture, are concerned with throughput.
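The latency and throughput trade-off can be put into numbers. The sketch below is an idealized Python model, not a description of any real core: both function names are invented for illustration, the instruction count is hypothetical, and it assumes no stalls or hazards and a combinational path that can be cut into perfectly equal pieces.

```python
def total_cycles(instructions: int, cycles_per_instruction: int, overlapped: bool) -> int:
    """Clock cycles needed to finish a run of instructions.

    Idealized: no stalls, branches, or hazards are modelled.
    """
    if instructions <= 0:
        return 0
    if overlapped:
        # The first instruction takes the full latency; after that one
        # instruction completes every cycle.
        return cycles_per_instruction + instructions - 1
    return cycles_per_instruction * instructions


def pipelined_period_ns(path_ns: float, clk_to_q_ns: float, setup_ns: float, stages: int) -> float:
    """Best-case clock period after cutting one path into equal stages.

    Assumes the combinational delay splits perfectly evenly, which real
    logic rarely allows. Every stage still pays clock-to-Q and setup.
    """
    combinational_ns = path_ns - clk_to_q_ns - setup_ns
    return clk_to_q_ns + combinational_ns / stages + setup_ns


if __name__ == "__main__":
    n = 1000  # hypothetical instruction count
    print("sequential:", total_cycles(n, 2, overlapped=False), "cycles")
    print("overlapped:", total_cycles(n, 2, overlapped=True), "cycles")

    # 13.330 ns path, 0.591 ns clock-to-Q and 0.252 ns setup from the report.
    period = pipelined_period_ns(13.330, 0.591, 0.252, stages=2)
    print(f"two-stage estimate: {period:.3f} ns, about {1000.0 / period:.1f} MHz")
```

Under these assumptions a two-stage split of the reported path would allow a period of roughly 7.1ns (around 141MHz), although the next-longest path in the design would then set the real limit.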