What is the point of float_t, and when should it be used?
I'm working with a client who is using an old version of GCC (3.2.3, to be precise) but wants to upgrade, and one reason that's been given as a stumbling block to upgrading to a newer version is a difference in the size of the type float_t, which, sure enough, is real:
On GCC 3.2.3:

    sizeof(float_t)  = 12
    sizeof(float)    = 4
    sizeof(double_t) = 12
    sizeof(double)   = 8

On GCC 4.1.2:

    sizeof(float_t)  = 4
    sizeof(float)    = 4
    sizeof(double_t) = 8
    sizeof(double)   = 8
But what's the reason for this difference? Why did the size get smaller, and when should and shouldn't you use float_t or double_t?
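For reference, a minimal program along these lines reproduces those numbers (my reconstruction, not the client's actual code; the cast to unsigned long is just for portability to pre-C99 printf):

    #include <math.h>   /* float_t, double_t */
    #include <stdio.h>

    int main(void)
    {
        printf("sizeof(float_t)  = %lu\n", (unsigned long)sizeof(float_t));
        printf("sizeof(float)    = %lu\n", (unsigned long)sizeof(float));
        printf("sizeof(double_t) = %lu\n", (unsigned long)sizeof(double_t));
        printf("sizeof(double)   = %lu\n", (unsigned long)sizeof(double));
        return 0;
    }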
Comments (3)
The reason for float_t is that for some processors and compilers, using a larger type (e.g. long double in place of float) can be more efficient, so float_t allows the compiler to use the larger type instead of float.

Thus, in the OP's case, the change in size when using float_t is what the standard allows for. If the original code wanted the smaller float size, it should be using float.

There is some rationale in the open-std doc.
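As a hypothetical illustration of that last point: where a layout must stay fixed (say, a binary file format), spell out float; float_t is only appropriate for scratch values, where the compiler may substitute a wider, faster type:

    #include <math.h>   /* float_t */
    #include <stdio.h>

    /* On-disk record: each field must stay 4 bytes on every compiler,
       so it must be float, never float_t. */
    struct record {
        float x, y;
    };

    /* Scratch computation: float_t lets the compiler use its fastest
       type, which may be wider than float (e.g. long double on x87). */
    static float norm2(struct record r)
    {
        float_t fx = r.x, fy = r.y;
        return (float)(fx * fx + fy * fy);
    }

    int main(void)
    {
        struct record r = { 3.0f, 4.0f };
        printf("%f\n", (double)norm2(r));
        return 0;
    }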
The "why" is that some compilers will return floating point values in a floating-point register. These registers have only one size. For example, on X86, it is 80 bits wide. The results of a function that returns a floating point value will be placed into this register regardless of whether the type has been declared as float, double, float_t or double_t. If the size of the return value and the size of the floating-point register differ, then at some point an instruction will be required to round down to the desired size.
The same kind of conversion is necessary for integers as well, but for subsequent additions and subtractions there is no overhead, because there are instructions to pick which bytes to involve in the operation. The rules for converting an integer to a smaller size specify that the most significant bits be tossed away, so downsizing can produce a radically different result (e.g. (short)(2147450880) --> -32768), but for some reason that seems to be OK with the programming community.
In a floating-point downsizing, the result is specified to be rounded to the closest representable number. If integers were subject to the same rule, the above example would instead yield the nearest representable value: (short)(2147450880) -> +32767. Obviously a little more logic is required to perform such an operation than mere truncation of the upper bits. With floating point, the exponent and the significand change sizes between float, double and long double, so it is more complicated. Additionally, there are issues of conversion between infinities, NaNs, normalized numbers, and denormalized numbers that need to be taken into account. Hardware can implement these conversions in the same amount of time as an integer addition, but if the conversion needs to be implemented in software, it may take 20 instructions, which can have a noticeable effect on performance. Since the C programming model assures that the same results be generated regardless of whether the floating point is implemented in hardware or software, the software is obliged to execute these extra instructions in order to comply with the computational model. The float_t and double_t types were designed to expose the most efficient return-value type.
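To make the two narrowing behaviours concrete, a small sketch (note that converting an out-of-range value to short is actually implementation-defined in C; the typical two's-complement result is shown):

    #include <stdio.h>

    int main(void)
    {
        /* Integer narrowing keeps the low bits: 2147450880 is
           0x7FFF8000, so the low 16 bits give -32768 on typical
           two's-complement machines. */
        printf("(short)2147450880 = %d\n", (int)(short)2147450880);

        /* Floating-point narrowing rounds to the nearest representable
           value instead of chopping bits. */
        double d = 0.1;          /* not exactly representable */
        float  f = (float)d;     /* rounded to the nearest float */
        printf("double: %.17g\nfloat : %.17g\n", d, (double)f);
        return 0;
    }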
The compiler defines FLT_EVAL_METHOD, which specifies how much precision is to be used in intermediate computations. With integers, the rule is to do intermediate computations using the highest precision of the operands involved; for floating point this corresponds to FLT_EVAL_METHOD == 0. However, the original K&R specified that all intermediate computations be done in double, yielding FLT_EVAL_METHOD == 1. And with the introduction of the IEEE floating-point standard, it became commonplace on some platforms, notably Macintosh PowerPC and Windows x86, to perform intermediate computations in long double (80 bits), yielding FLT_EVAL_METHOD == 2.
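A sketch of how you might inspect the model (FLT_EVAL_METHOD lives in <float.h> as of C99):

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
    #if FLT_EVAL_METHOD == 0
        puts("intermediates use the operands' own types (float_t == float)");
    #elif FLT_EVAL_METHOD == 1
        puts("float intermediates are evaluated in double");
    #elif FLT_EVAL_METHOD == 2
        puts("intermediates are evaluated in long double (80-bit on x87)");
    #else
        puts("implementation-defined or indeterminable (-1)");
    #endif
        return 0;
    }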
Regression testing will be affected by the FLT_EVAL_METHOD computational model. Thus, your regression code should take this into account. One way is to test FLT_EVAL_METHOD and have different branches for each model. A similar method would be to test sizeof(float_t), and have different branches. A third method would be to use some kind of epsilon that would be used to check whether the results are close enough.
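For instance, the epsilon variant might look roughly like this (the computation and tolerance are illustrative, not from the answer):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* The last bits of this result can differ between evaluation
           models; an absolute tolerance absorbs that difference. */
        float  a   = 1.0f / 3.0f;
        double got = a * 3.0;

        double eps = 1e-6;   /* chosen for this computation's scale */
        printf("%s\n", fabs(got - 1.0) < eps ? "PASS" : "FAIL");
        return 0;
    }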
Unfortunately, some computations make a decision based on the result of a computation, yielding a true or false, and that cannot be resolved by using an epsilon. This occurs in computer graphics, for example, when deciding whether a point is inside or outside a polygon, which determines whether a particular pixel should be filled. If your regression involves one of these, you cannot use the epsilon method and must use different branches depending on the computational model.
Another way to resolve the decision regression between models is to cast the result explicitly to a particular desired precision. This works most of the time on many compilers, but some compilers think they are smarter than you and refuse to do the conversion. This happens when an intermediate result is stored in a register but is used in a subsequent computation. You can cast away precision as much as you want in the intermediate result, but the compiler will do nothing -- unless you declare the intermediate result as volatile. That forces the compiler to downsize and store the intermediate result in a variable of the specified size in memory, then retrieve it when needed for computation.

The IEEE floating-point standard is exact for the elementary operations (+ - * /) and square root. I believe that sin(), cos(), exp(), log(), etc. are specified to be within 2 ULP (units in the last place) of the closest numerically-representable result. The long double (80-bit) format was designed to allow computation of those other transcendental functions exactly to the closest numerically-representable result.
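A sketch of the volatile trick described above (variable names are mine; the printed values depend on the evaluation model):

    #include <stdio.h>

    int main(void)
    {
        double a = 1e16, b = 1.0;

        /* The compiler may keep a + b in an extended-precision register
           and feed it straight into the subtraction, so the 1.0 can
           survive here under FLT_EVAL_METHOD == 2. */
        double kept = (a + b) - a;

        /* Routing the intermediate through a volatile double forces a
           store to 64 bits in memory and a reload, discarding the
           extra bits: the 1.0 is rounded away. */
        volatile double t = a + b;
        double forced = t - a;

        printf("kept   = %g\nforced = %g\n", kept, forced);
        return 0;
    }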
This covers a lot of the issues brought up (and implied) in this thread, but does not answer the question of when you should use the float_t and double_t types. Obviously, you need to do so when interfacing to an API that uses these types, especially when passing the address of one of these types.
If your prime concern is about performance, then you might want to consider using the float_t and double_t types in your computations and APIs. But it is most probable that the performance increase that you get is neither measurable nor noticeable.
However, if you are concerned about regression between different compilers and different machines, you should probably avoid these types as much as possible, and use casting liberally to assure cross-platform compatibility.
The C99 standard says:

    The types float_t and double_t are floating types at least as wide as float and double, respectively, and such that double_t is at least as wide as float_t. If FLT_EVAL_METHOD equals 0, float_t and double_t are float and double, respectively; if FLT_EVAL_METHOD equals 1, they are both double; if FLT_EVAL_METHOD equals 2, they are both long double; and for other values of FLT_EVAL_METHOD, they are otherwise implementation-defined.

And indeed, in previous versions of gcc they were defined as long double by default.
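A C99 <math.h> typically implements the quoted requirement with typedefs along these lines (a paraphrase from memory of glibc's headers, not the exact source):

    /* Sketch: <math.h> choosing float_t/double_t from the model. */
    #include <float.h>

    #if FLT_EVAL_METHOD == 0
    typedef float       float_t;
    typedef double      double_t;
    #elif FLT_EVAL_METHOD == 1
    typedef double      float_t;
    typedef double      double_t;
    #elif FLT_EVAL_METHOD == 2
    typedef long double float_t;   /* what older gcc gave you on x86 */
    typedef long double double_t;
    #endif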