IEEE 754 floating point addition/rounding
I don't understand how to add IEEE 754 floating-point numbers (mainly the "re-alignment" of the exponent).
Also, for rounding, how do the guard, round and sticky bits come into play? How is rounding done in general (for base-2 floats, that is)?
E.g. suppose the question is: add the IEEE 754 floats represented in hex as 0x383FFBAD and 0x3FD0ECCD, then give the answers in round toward 0, round toward \$\pm \infty\$, and round to nearest.
So I have
0x383FFBAD 0 | 0111 0000 | 011 1111 1111 1011 1010 1101
0x3FD0ECCD 0 | 0111 1111 | 101 0000 1110 1100 1100 1101
Then how should I continue? Feel free to use other examples if you wish
If I understood your "re-alignment" of the exponent correctly...
Here's an explanation of how the format relates to the actual value.
\$1.b_{22}b_{21}\ldots b_0 \times 2^{e-127}\$ can be interpreted as the binary number \$1.b_{22}b_{21}\ldots b_0\$ shifted left by \$e-127\$ bit positions. Of course, the shift amount can be negative, which is how we get fractions (values between 0 and 1).
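To make the relationship between the stored fields and the value concrete, here is a minimal Python sketch. The function names `decode_float32` and `value_of_normal` are made up for this illustration, and the sketch assumes a normal number (biased exponent between 1 and 254); zeros, subnormals, infinities and NaNs are not handled.

```python
import struct
from fractions import Fraction

def decode_float32(bits: int):
    """Split a 32-bit pattern into (sign, biased exponent, 23-bit fraction)."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction

def value_of_normal(bits: int) -> Fraction:
    """Exact value of a normal float32 (hypothetical helper, normals only)."""
    sign, exponent, fraction = decode_float32(bits)
    significand = (1 << 23) | fraction        # restore the implicit leading 1
    # significand is the integer 1b22...b0, so the binary point sits 23
    # places further right than in 1.b22...b0; hence the extra -23.
    value = Fraction(significand) * Fraction(2) ** (exponent - 127 - 23)
    return -value if sign else value

# The two operands from the question.
print(decode_float32(0x383FFBAD))   # (0, 112, 4193197), fraction = 0x3FFBAD
print(decode_float32(0x3FD0ECCD))   # (0, 127, 5303501), fraction = 0x50ECCD

# Cross-check against the platform's own decoding of the same bit pattern.
native = struct.unpack(">f", struct.pack(">I", 0x3FD0ECCD))[0]
print(value_of_normal(0x3FD0ECCD) == Fraction(native))   # True
```

The biased exponents 112 and 127 differ by 15, which is the distance the smaller operand has to be shifted before the two can be added.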
In order to add 2 floating-point numbers of the same sign you need to first have their exponent part equal, or, in other words, denormalize one of the addends (the one with the smaller exponent).
The reason is very simple. When you add, for example, 1 thousand and 1 you want to add tens with tens, hundreds with hundreds, etc. So, you have \$1.000 \times 10^3 + 1.000 \times 10^0 = 1.000 \times 10^3 + 0.001 \times 10^3\$ (<-- denormalized) \$= 1.001 \times 10^3\$. This can, of course, result in truncation/rounding if the format cannot represent the result exactly (e.g. if it could only have 2 significant digits, you'd end up with the same \$1.0 \times 10^3\$ for the sum).
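The decimal example can be reproduced with Python's standard `decimal` module, which lets you choose the number of significant digits. This is only an illustration of the analogy, not of binary floating point itself; the helper name `sum_with_digits` is made up for the sketch.

```python
from decimal import Context, Decimal

def sum_with_digits(digits: int) -> Decimal:
    """Add 1.000e3 and 1.000e0, keeping only `digits` significant digits."""
    ctx = Context(prec=digits)   # default rounding mode is round-half-even
    return ctx.add(Decimal("1.000E3"), Decimal("1.000E0"))

wide = sum_with_digits(4)     # enough digits: the sum 1001 is kept exactly
narrow = sum_with_digits(2)   # too few digits: the added 1 is rounded away

print(wide == Decimal(1001))    # True
print(narrow == Decimal(1000))  # True
```

With two digits the smaller addend falls entirely below the last kept position, so the sum equals the larger addend.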
So, just like in the above example with 1000 and 1, you may need to shift one of the addends to the right before adding their mantissas. You need to remember that there's an implicit leading 1. bit in the format, which isn't stored in the float and which you have to account for when shifting and adding. After adding the mantissas, you may run into a mantissa overflow (a carry out of the leading bit), in which case you have to shift the sum right by one position and increment the exponent to get rid of the overflow. That's the basics. There are special cases to consider as well.
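As for the guard, round and sticky bits from the question: a common way to handle them is to keep three extra bits below the 24-bit significand while aligning and adding. The first two hold the next two bits that were shifted out, and the sticky bit is the OR of everything shifted out beyond them. The sketch below is one possible way to put this together in Python, not a reference implementation. The names `shift_right_sticky` and `add_positive_float32` and the mode strings are made up for the example, and it assumes both inputs are positive normal numbers with a finite normal result (no zeros, subnormals, infinities, NaNs, negative operands or overflow to infinity).

```python
import struct

def shift_right_sticky(value: int, n: int) -> int:
    """Shift right by n bits, OR-ing any lost 1 bits into the lowest bit."""
    lost = value & ((1 << n) - 1)
    return (value >> n) | (1 if lost else 0)

def add_positive_float32(a_bits: int, b_bits: int, mode: str = "nearest") -> int:
    """Add two positive normal float32 bit patterns; return the result bits.

    mode: "nearest" (ties to even), "zero", "up" (+inf) or "down" (-inf).
    """
    if mode not in ("nearest", "zero", "up", "down"):
        raise ValueError("unknown rounding mode")
    ea, eb = (a_bits >> 23) & 0xFF, (b_bits >> 23) & 0xFF
    sa = (1 << 23) | (a_bits & 0x7FFFFF)   # 24-bit significands, hidden 1 restored
    sb = (1 << 23) | (b_bits & 0x7FFFFF)
    if ea < eb:                            # make (ea, sa) the larger-exponent operand
        ea, eb, sa, sb = eb, ea, sb, sa

    sa <<= 3                               # room for guard, round, sticky
    sb = shift_right_sticky(sb << 3, ea - eb)   # align the smaller operand
    total, exp = sa + sb, ea
    if total >> 27:                        # carry out of the leading bit: sum >= 2.0
        total = shift_right_sticky(total, 1)
        exp += 1

    grs, sig = total & 0b111, total >> 3
    if mode == "nearest":
        round_up = grs > 0b100 or (grs == 0b100 and sig & 1 == 1)
    elif mode == "up":
        round_up = grs != 0                # any discarded bits push a positive value up
    else:
        round_up = False                   # "zero" and "down" truncate positive values
    if round_up:
        sig += 1
        if sig >> 24:                      # rounding itself overflowed the significand
            sig >>= 1
            exp += 1
    return (exp << 23) | (sig & 0x7FFFFF)

def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a float (for display only)."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

for mode in ("nearest", "zero", "up", "down"):
    print(mode, hex(add_positive_float32(0x383FFBAD, 0x3FD0ECCD, mode)))
# nearest 0x3fd0ee4d
# zero 0x3fd0ee4c
# up 0x3fd0ee4d
# down 0x3fd0ee4c
```

For the operands in the question the exponents are 112 and 127, so the smaller significand 0xBFFBAD is shifted right by 15. That leaves 0x17F in the kept positions with guard, round and sticky all set to 1. Adding gives 0xD0ECCD + 0x17F = 0xD0EE4C with no carry out, so the truncated result is 0x3FD0EE4C. Because the discarded part is more than half a unit in the last place, round to nearest and round toward +infinity both go up to 0x3FD0EE4D.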