Counting leading zeros in a single-cycle datapath
As you may know, the MIPS instruction set supports clz (count leading zeros) as follows:
clz $t0, $t1    count leading zeros: $t0 = number of leading zeros in $t1
I am writing a single-cycle datapath in Verilog and was wondering what the ALU needs to support in order to implement this. Any ideas?
Comments (3)
Here's a possible approach (I'm ignoring the case of an input of 0, which is probably best treated as a special case):
In Verilog, it might look something like this:
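A minimal sketch of the cascaded-mux idea, assuming a binary search over halves: if the upper half of the current window is all zero, add its width to the count and continue in the lower half; otherwise continue in the upper half. The module name `clz32_binsearch`, its ports, and the 6-bit output that returns 32 for a zero input are illustrative assumptions, not the answer's original code.

```verilog
// Hypothetical sketch: count leading zeros by binary search with cascaded muxes.
module clz32_binsearch (
    input  wire [31:0] value,
    output wire [5:0]  count    // 0..32; 32 means value == 0
);
    wire [4:0]  res;
    wire [15:0] val16;
    wire [7:0]  val8;
    wire [3:0]  val4;
    wire [1:0]  val2;

    // Stage 1: is the top 16-bit half all zero? If so, search the bottom half.
    assign res[4] = (value[31:16] == 16'b0);
    assign val16  = res[4] ? value[15:0] : value[31:16];

    // Stage 2: same test on the selected 16 bits.
    assign res[3] = (val16[15:8] == 8'b0);
    assign val8   = res[3] ? val16[7:0] : val16[15:8];

    // Stage 3: same test on the selected 8 bits.
    assign res[2] = (val8[7:4] == 4'b0);
    assign val4   = res[2] ? val8[3:0] : val8[7:4];

    // Stage 4: same test on the selected 4 bits.
    assign res[1] = (val4[3:2] == 2'b0);
    assign val2   = res[1] ? val4[1:0] : val4[3:2];

    // Final stage: one more leading zero if the top bit of the last pair is 0.
    assign res[0] = ~val2[1];

    // Zero input handled as a special case (the cascade alone would yield 31).
    assign count = (value == 32'b0) ? 6'd32 : {1'b0, res};
endmodule
```

For example, `value = 32'h0000_0001` drives every stage's "upper half is zero" test true, giving `res = 5'b11111`, i.e. 31 leading zeros.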
The simplest implementation I can think of (not very optimized) is to check the word against 32 masks (for a 32-bit word), longest first, and return the number of the first mask that matches.
Something like (pseudocode):
etc.
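One way the mask check could be written in Verilog is sketched below. It is an assumption about what the pseudocode intended, with a hypothetical module name and ports. Mask n tests whether the top n bits are all zero; the loop runs from the shortest mask upward so that the longest matching mask is assigned last and wins, which has the same effect as testing longest first and stopping at the first hit.

```verilog
// Hypothetical sketch: count leading zeros by testing every leading-zero mask.
module clz32_masks (
    input  wire [31:0] value,
    output reg  [5:0]  count    // 0..32; 32 means value == 0
);
    integer n;

    always @* begin
        count = 6'd0;           // default: bit 31 is set, no leading zeros
        for (n = 1; n <= 32; n = n + 1) begin
            // Mask n matches when the top n bits of value are all zero.
            if ((value >> (32 - n)) == 32'b0)
                count = n;      // later (longer) matches overwrite shorter ones
        end
    end
endmodule
```

A synthesizer will typically unroll the loop into the 32 comparisons plus a priority chain, which is why this form is simple but not especially small or fast.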
Build a clz16 unit that looks at 16 bits and has a 4-bit result (0..15) plus an 'allzero' output. Put two of these together to make clz32: you need a mux to select which unit supplies the 4 lower bits, and a bit of logic for the upper 2 output bits.
The clz16 is made of two clz8 units in the same way, and the clz8 is made of two clz4 units.
The clz4 is just three boolean functions of at most 4 inputs, so it doesn't matter much how you write it; synthesis will boil it down to a few gates.
This hierarchical approach is larger than Matthew Slattery's solution with the cascaded muxes, but probably not by much (it doesn't need the wide gates to switch the muxes), and I believe it allows a lower propagation delay. Both approaches scale well to larger sizes (e.g. 64 or 128 bits), with delay proportional to log2(n).
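The hierarchy described above could be sketched roughly as follows. All module and port names are hypothetical, and the shared `clz_merge` helper is an assumption introduced to avoid repeating the combine logic at each level. At every level, `count` is only meaningful when `allzero` is 0.

```verilog
// Leaf cell: three boolean functions of 4 inputs.
module clz4 (
    input  wire [3:0] in,
    output wire [1:0] count,     // valid only when allzero == 0
    output wire       allzero
);
    assign allzero  = ~|in;
    assign count[1] = ~(in[3] | in[2]);
    assign count[0] = ~in[3] & (in[2] | ~in[1]);
endmodule

// Combine two half-width results; W is the width of each half's count.
module clz_merge #(parameter W = 2) (
    input  wire [W-1:0] hi_count,
    input  wire         hi_allzero,
    input  wire [W-1:0] lo_count,
    input  wire         lo_allzero,
    output wire [W:0]   count,
    output wire         allzero
);
    assign allzero = hi_allzero & lo_allzero;
    // New top bit: upper half is all zero. Mux: lower bits come from
    // whichever half contains the first 1.
    assign count   = {hi_allzero, hi_allzero ? lo_count : hi_count};
endmodule

module clz8 (input wire [7:0] in, output wire [2:0] count, output wire allzero);
    wire [1:0] hc, lc;
    wire       hz, lz;
    clz4 hi (.in(in[7:4]), .count(hc), .allzero(hz));
    clz4 lo (.in(in[3:0]), .count(lc), .allzero(lz));
    clz_merge #(.W(2)) m (.hi_count(hc), .hi_allzero(hz),
                          .lo_count(lc), .lo_allzero(lz),
                          .count(count), .allzero(allzero));
endmodule

module clz16 (input wire [15:0] in, output wire [3:0] count, output wire allzero);
    wire [2:0] hc, lc;
    wire       hz, lz;
    clz8 hi (.in(in[15:8]), .count(hc), .allzero(hz));
    clz8 lo (.in(in[7:0]),  .count(lc), .allzero(lz));
    clz_merge #(.W(3)) m (.hi_count(hc), .hi_allzero(hz),
                          .lo_count(lc), .lo_allzero(lz),
                          .count(count), .allzero(allzero));
endmodule

module clz32 (input wire [31:0] in, output wire [5:0] count);
    wire [3:0] hc, lc;
    wire [4:0] c;
    wire       hz, lz, az;
    clz16 hi (.in(in[31:16]), .count(hc), .allzero(hz));
    clz16 lo (.in(in[15:0]),  .count(lc), .allzero(lz));
    clz_merge #(.W(4)) m (.hi_count(hc), .hi_allzero(hz),
                          .lo_count(lc), .lo_allzero(lz),
                          .count(c), .allzero(az));
    // All-zero input reports 32; otherwise the 5-bit merged count.
    assign count = az ? 6'd32 : {1'b0, c};
endmodule
```

In a datapath, the ALU would instantiate `clz32` on one operand and select its zero-extended output through the ALU result mux when the clz operation is decoded.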