A few questions about floating point numbers
I'm wondering whether a number that is represented one way in one floating point type will be represented the same way in a larger type. That is, if a number has a particular representation as a float, will it have the same representation if that float is cast to a double, and still the same when cast to a long double?
I'm wondering because I'm writing a BigInteger implementation, and any floating point number that is passed in I send to a function that accepts a long double to convert it. Which leads me to my next question. Obviously floating point numbers do not always have exact representations, so in my BigInteger class what should I be attempting to represent when given a float? Is it reasonable to try to represent the same number as given by std::cout << std::fixed << someFloat; even if that is not the same as the number passed in? Is that the most accurate representation I will be able to get? If so, ...
What's the best way to extract that value (in a base that is some power of 10)? At the moment I'm just grabbing it as a string and passing it to my string constructor. This will work, but I can't help but feel there's a better way; certainly taking the remainder when dividing by my base is not accurate with floats.
Finally, I wonder if there is a floating point equivalent of uintmax_t, that is, a typename that will always be the largest floating point type on a system, or is there no point because long double will always be the largest (even if it's the same as a double)?
Thanks, T.
Comments (3)
If by "same representation" you mean "exactly the same binary representation in memory except for padding", then no. Double-precision has more bits of both exponent and mantissa, and also has a different exponent bias. But any single-precision value is exactly representable in double-precision, denormalised values included (they simply become normalised doubles).
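A quick way to convince yourself of the value-preservation claim is a round trip: widen a float and narrow it again. The sketch below is only an illustration; the function name survives_widening is made up for this example, and NaN is excluded because it never compares equal to itself.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: returns true if widening f to double and long double
// and then narrowing back gives exactly the same float value.
bool survives_widening(float f) {
    assert(!std::isnan(f));  // NaN != NaN, so equality is meaningless for it
    const double d = f;        // float -> double, value-preserving
    const long double ld = d;  // double -> long double, value-preserving
    return static_cast<float>(d) == f && static_cast<float>(ld) == f;
}
```

Calling it with std::numeric_limits<float>::denorm_min() exercises the denormalised case mentioned above.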
I'm not sure what you mean when you say "floating points do not always have exact representations". Certainly, not all decimal floating-point values have exact binary floating-point values (and vice versa), but I'm not sure that's a problem here. So long as your floating-point input has no fractional part, then a suitably large "BigInteger" format should be able to represent it exactly.
Conversion via a base-10 representation is not the way to go. In theory, all you need is a bit-array of length ~1024 (enough for a double), initialised to zero, into which you shift the mantissa bits by the exponent value. But without knowing more about your implementation, there's not a lot more I can suggest!
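The "shift the mantissa bits in by the exponent" idea can be sketched with the standard frexp and ldexp functions, without touching the raw bits. This is a minimal sketch under stated assumptions: the Decomposed struct and the names decompose and to_uint64 are invented here, input is restricted to finite non-negative doubles, and a 64-bit integer stands in for the BigInteger bit-array.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper type: value == mantissa * 2^shift, exactly.
struct Decomposed {
    std::uint64_t mantissa;
    int shift;
};

// Splits a finite, non-negative double into an integer mantissa and a shift.
Decomposed decompose(double value) {
    constexpr int digits = std::numeric_limits<double>::digits;  // 53 for IEEE double
    static_assert(std::numeric_limits<double>::radix == 2, "binary format assumed");
    static_assert(digits <= 64, "mantissa must fit in 64 bits");
    assert(std::isfinite(value) && value >= 0.0);

    int exponent = 0;
    // value == fraction * 2^exponent, with fraction in [0.5, 1) or zero
    const double fraction = std::frexp(value, &exponent);
    // Scaling by 2^digits turns the fraction into an exact integer below 2^digits.
    const auto mantissa = static_cast<std::uint64_t>(std::ldexp(fraction, digits));
    return {mantissa, exponent - digits};
}

// Stand-in for a BigInteger constructor: truncates toward zero.
std::uint64_t to_uint64(double value) {
    const Decomposed d = decompose(value);
    if (d.shift >= 0) {
        assert(d.shift <= 64 - std::numeric_limits<double>::digits);  // must fit
        return d.mantissa << d.shift;
    }
    if (-d.shift >= 64) return 0;   // everything is shifted out
    return d.mantissa >> -d.shift;  // drops the fractional bits
}
```

A real BigInteger would replace the 64-bit shifts with its own left and right shift operations; a negative shift discards the fractional bits, which gives truncation.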
double includes all values of float; long double includes all values of double. So you're not losing any value information by conversion to long double. However, you're losing information about the original type, which is relevant (see below).
In order to follow common C++ semantics, conversion of a floating point value to integer should truncate the value, not round.
The main problem is with large values that are not exact. You can use the frexp function to find the base 2 exponent of the floating point value. You can use std::numeric_limits<T>::digits to check if that's within the integer range that can be exactly represented.
My personal design choice would be an assert that the fp value is within the range that can be exactly represented, i.e. a restriction on the range of any actual argument.
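The frexp plus digits check could look roughly like this. It is a sketch, not a definitive implementation: the name in_exact_integer_range is made up, and a template stands in for separate float and double overloads so that the original argument type is kept. The check is slightly conservative, since it rejects magnitudes at or above 2^digits.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: true if |value| < 2^digits, where digits is the
// significand width of T, so every integer of that magnitude is exact in T.
template <typename T>
bool in_exact_integer_range(T value) {
    static_assert(std::numeric_limits<T>::radix == 2, "binary format assumed");
    assert(std::isfinite(value));  // infinities and NaN have no exponent to test
    int exponent = 0;
    // For non-zero input, |value| lies in [2^(exponent-1), 2^exponent).
    static_cast<void>(std::frexp(value, &exponent));
    return exponent <= std::numeric_limits<T>::digits;
}
```

A BigInteger constructor could then do assert(in_exact_integer_range(x)) before converting, so that a float argument is checked against 24 bits and a double argument against 53.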
To do that properly you need overloads taking float and double arguments, since the range that can be represented exactly depends on the actual argument's type.
When you have an fp value that is within the allowed range, you can use floor and fmod to extract digits in any numeral system you want.
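As an illustration of the floor and fmod approach, the sketch below pulls base-10^9 "digits" (limbs) out of a double. The name to_limbs, the default base and the least-significant-first order are assumptions made for this example. It subtracts the remainder before dividing, so each division is exact as long as the value is within the exactly representable integer range.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: base-`base` limbs of trunc(value), least significant first.
std::vector<std::uint32_t> to_limbs(double value, std::uint32_t base = 1000000000u) {
    assert(std::isfinite(value) && value >= 0.0 && base >= 2);
    // Restrict to the range where every integer is exactly representable.
    assert(value < std::ldexp(1.0, std::numeric_limits<double>::digits));

    const double b = base;
    value = std::floor(value);  // truncate the fractional part
    std::vector<std::uint32_t> limbs;
    while (value > 0.0) {
        const double digit = std::fmod(value, b);  // fmod is exact
        limbs.push_back(static_cast<std::uint32_t>(digit));
        value = (value - digit) / b;  // exact: numerator is a multiple of b
    }
    return limbs;
}
```

With this sketch, zero yields an empty vector; a real BigInteger would pick whatever convention its string constructor already uses.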
Yes, going from IEEE float to double to extended, you will see the bits of the smaller format carried over into the larger format, for example:
The mantissa you left-justify and then pad with zeros.
The exponent is right-justified: the low bits are copied as they are, the msbit is copied into the new msbit, and the new bits in between are filled with the inverse of the msbit.
Take an exponent of -2, for example. Take -2 and subtract 1, which gives -3. -3 in two's complement is 0xFD or 0b11111101, but the exponent bits in the format are 0b01111101, the msbit inverted. For double, an exponent of -2 gives -2-1 = -3, or 0b1111...1101, and that becomes 0b0111...1101, the msbit inverted (exponent bits = twos_complement(exponent-1) with the msbit inverted).
As we see above, for an exponent of 3: 3-1 = 2, 0b000...010, invert the upper bit, 0b100...010.
So yes, you can take the bits from single precision and copy them to the proper locations in the double precision number (for normal values; zeros, denormals, infinities and NaNs have special exponent encodings). I don't have an extended float reference handy, but I'm pretty sure it works the same way.
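To make the bit copying concrete, here is a rough sketch that widens a float to a double purely by moving bits. It assumes IEEE 754 binary32 and binary64 layouts with matching integer and floating point byte order, and it handles normal values only. The function name widen_by_bits is invented for this example, and the exponent rule used is: keep the msbit and the low seven bits, and fill the three new bits with the inverse of the msbit (equivalent to re-biasing from 127 to 1023).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: float -> double by bit manipulation, normal values only.
double widen_by_bits(float f) {
    static_assert(std::numeric_limits<float>::is_iec559 &&
                  std::numeric_limits<double>::is_iec559, "IEEE 754 assumed");
    static_assert(sizeof(float) == 4 && sizeof(double) == 8, "unexpected sizes");
    assert(std::isnormal(f));  // zero, denormal, inf and NaN need special handling

    std::uint32_t in = 0;
    std::memcpy(&in, &f, sizeof in);

    const std::uint64_t sign = in >> 31;
    const std::uint64_t exp8 = (in >> 23) & 0xFFu;
    const std::uint64_t frac = in & 0x7FFFFFu;

    // Exponent: msbit stays on top, low 7 bits stay at the bottom,
    // the 3 new bits in between are the inverse of the msbit.
    const std::uint64_t msb = exp8 >> 7;
    const std::uint64_t fill = msb ? 0x0u : 0x7u;
    const std::uint64_t exp11 = (msb << 10) | (fill << 7) | (exp8 & 0x7Fu);

    // Mantissa: left-justified in the 52-bit field, padded with 29 zero bits.
    const std::uint64_t out = (sign << 63) | (exp11 << 52) | (frac << 29);

    double d = 0.0;
    std::memcpy(&d, &out, sizeof d);
    return d;
}
```

In real code an ordinary static_cast<double>(f) does the same job and also covers the special values; the sketch is only there to show where each bit ends up.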