How to make floating-point calculations deterministic?
Floating-point calculation is neither associative nor distributive on processors. So,
(a + b) + c
is not equal to a + (b + c)
and a * (b + c)
is not equal to a * b + a * c
Is there any way to perform deterministic floating-point calculation that does not give different results? It would be deterministic on a uniprocessor, of course, but it would not be deterministic in multithreaded programs if threads add to a sum, for example, as there might be different interleavings of the threads.
So my question is: how can one achieve deterministic results for floating-point calculations in multithreaded programs?
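For concreteness, here is a small self-contained C program showing the non-associativity; the values are chosen for this example so the difference is visible (any operands of sufficiently different magnitude would do):

    #include <stdio.h>

    int main(void) {
        double a = 1e16, b = -1e16, c = 1.0;
        printf("(a + b) + c = %g\n", (a + b) + c);  /* prints 1 */
        printf("a + (b + c) = %g\n", a + (b + c));  /* prints 0: b + c rounds back to -1e16 */
        return 0;
    }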
6 Answers
Floating-point is deterministic. The same floating-point operations, run on the same hardware, always produce the same result. There is no black magic, noise, randomness, fuzzing, or any of the other things that people commonly attribute to floating-point. The tooth fairy does not show up, take the low bits of your result, and leave a quarter under your pillow.
Now, that said, certain blocked algorithms that are commonly used for large-scale parallel computations are non-deterministic in terms of the order in which floating-point computations are performed, which can result in non-bit-exact results across runs.
What can you do about it?
First, make sure that you actually can't live with the situation. Many of the things you might do to enforce ordering in a parallel computation will hurt performance. That's just how it is.
I would also note that although blocked algorithms may introduce some amount of non-determinism, they frequently deliver results with smaller rounding errors than do naive unblocked serial algorithms (surprising but true!). If you can live with the errors produced by a naive serial algorithm, you can probably live with the errors of a parallel blocked algorithm.
Now, if you really, truly, need exact reproducibility across runs, here are a few suggestions that tend not to adversely affect performance too much:
Don't use multithreaded algorithms that can reorder floating-point computations. Problem solved. This doesn't mean you can't use multithreaded algorithms at all, merely that you need to ensure that each individual result is only touched by a single thread between synchronization points. Note that this can actually improve performance on some architectures if done properly, by reducing D$ (data-cache) contention between cores.
In reduction operations, you can have each thread store its result to an indexed location in an array, wait for all threads to finish, then accumulate the elements of the array in order. This adds a small amount of memory overhead, but is generally pretty tolerable, especially when the number of threads is "small".
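For illustration, a minimal pthreads sketch of that pattern; the thread count, input size, and fixed partitioning are choices made for this example (build with -lpthread):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N 1000000

    static double data[N];
    static double partial[NTHREADS];   /* one slot per thread, no locking */

    static void *worker(void *arg) {
        int id = (int)(long)arg;
        double s = 0.0;
        /* Each thread sums a fixed, contiguous slice, so it always
           performs the same additions in the same order. */
        for (int i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
            s += data[i];
        partial[id] = s;               /* store to this thread's own slot */
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            data[i] = 1.0 / (i + 1);

        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);

        /* Accumulate the per-thread results in a fixed order, so the
           final sum is bit-identical on every run. */
        double sum = 0.0;
        for (int t = 0; t < NTHREADS; t++)
            sum += partial[t];
        printf("sum = %.17g\n", sum);
        return 0;
    }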
Find ways to hoist the parallelism. Instead of computing 24 matrix multiplications, each one of which uses parallel algorithms, compute 24 matrix products in parallel, each one of which uses a serial algorithm. This, too, can be beneficial for performance (sometimes enormously so).
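A sketch of that restructuring using OpenMP; the matrix size, count, and serial multiply routine are illustrative (compile with -fopenmp; without it the pragma is ignored and the loop simply runs serially):

    #include <stdio.h>

    #define NMAT 24   /* number of independent products */
    #define DIM  64   /* matrix dimension; arbitrary for the example */

    static double A[NMAT][DIM][DIM], B[NMAT][DIM][DIM], C[NMAT][DIM][DIM];

    /* Plain serial multiply: the order of floating-point operations
       inside one product is fixed, so each result is bit-exact across runs. */
    static void matmul_serial(double a[DIM][DIM], double b[DIM][DIM],
                              double c[DIM][DIM]) {
        for (int i = 0; i < DIM; i++)
            for (int j = 0; j < DIM; j++) {
                double s = 0.0;
                for (int k = 0; k < DIM; k++)
                    s += a[i][k] * b[k][j];
                c[i][j] = s;
            }
    }

    int main(void) {
        /* (fill A and B as needed; they are zero-initialized here) */

        /* The parallelism lives across the 24 independent products,
           not inside any one of them. */
        #pragma omp parallel for
        for (int m = 0; m < NMAT; m++)
            matmul_serial(A[m], B[m], C[m]);

        printf("C[0][0][0] = %g\n", C[0][0][0]);
        return 0;
    }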
There are lots of other ways to handle this. They all require thought and care. Parallel programming usually does.
Edit: I've removed my old answer since I seem to have misunderstood OP's question. If you want to see it you can read the edit history.
I think the ideal solution would be to switch to having a separate accumulator for each thread. This avoids all locking, which should make a drastic difference to performance. You can simply sum the accumulators at the end of the whole operation.
Alternatively, if you insist on using a single accumulator, one solution is to use "fixed-point" rather than floating point. This can be done with floating-point types by including a giant "bias" term in your accumulator to lock the exponent at a fixed value. For example, if you know the accumulator will never exceed 2^32, you can start the accumulator at 0x1p32. This will lock you at 32 bits of precision to the left of the radix point, and 20 bits of fractional precision (assuming double). If that's not enough precision, you could use a smaller bias (assuming the accumulator will not grow too large) or switch to long double. If long double is the 80-bit extended format, a bias of 2^32 would give 31 bits of fractional precision. Then, whenever you want to actually "use" the value of the accumulator, simply subtract out the bias term.
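A minimal sketch of the bias trick in C; the values are illustrative, and the technique assumes the true sum stays below 2^32:

    #include <stdio.h>

    int main(void) {
        const double bias = 0x1p32;   /* 2^32; pins the accumulator in [2^32, 2^33) */
        double acc = bias;            /* start the accumulator at the bias */

        double values[] = { 1.5, 1000.25, 3.0e6, 0.0078125 };
        for (int i = 0; i < 4; i++)
            acc += values[i];         /* every add now rounds at the same bit position */

        double sum = acc - bias;      /* subtract the bias to read the result */
        printf("sum = %.17g\n", sum);
        return 0;
    }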
Even using a high-precision fixed-point datatype would not solve the problem of making the results for said equations deterministic (except in certain cases). As Keith Thompson pointed out in a comment, 1/3 is a trivial counter-example of a value that cannot be stored correctly in either a standard base-10 or base-2 floating-point representation (regardless of precision or memory used).
One solution that, depending upon particular needs, may address this issue (it still has limits) is to use a Rational number data-type (one that stores both a numerator and denominator). Keith suggested GMP as one such library.
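For illustration, a minimal sketch using GMP's rational type mpq_t (assumes GMP is installed; link with -lgmp):

    #include <stdio.h>
    #include <gmp.h>

    int main(void) {
        /* 1/3 stored exactly as a rational, which no binary or decimal
           float can do at any finite precision. */
        mpq_t third, sum;
        mpq_init(third);
        mpq_init(sum);

        mpq_set_ui(third, 1, 3);      /* third = 1/3 (already canonical) */
        mpq_add(sum, third, third);   /* sum = 2/3, exactly */
        mpq_add(sum, sum, third);     /* sum = 1, exactly */

        gmp_printf("1/3 + 1/3 + 1/3 = %Qd\n", sum);  /* prints 1 */

        mpq_clear(third);
        mpq_clear(sum);
        return 0;
    }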
Whether it is suitable (or adequate) for this task is another story...
Happy coding.
Use a decimal type or library supporting such a type.
Try storing each intermediate result in a volatile object:
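    /* Sketch of the idea (names illustrative): the intermediate result
       is forced through a volatile object, so the compiler must round
       it and store it before performing the next addition. */
    double sum3(double a, double b, double c) {
        volatile double t = a + b;   /* rounded and stored; not kept in a wider register */
        return t + c;
    }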
This is likely to have nasty effects on performance. I suggest measuring both versions.
EDIT: The purpose of volatile is to inhibit optimizations that might affect the results even in a single-threaded environment, such as changing the order of operations or storing intermediate results in wider registers. It doesn't address multi-threading issues.
EDIT2: Something else to consider is that a C implementation may contract a floating expression, evaluating it as though it were a single atomic operation (a fused multiply-add, for instance), thereby omitting rounding errors implied by the source code. This can be inhibited by using #pragma STDC FP_CONTRACT OFF. Reference: C99 standard (large PDF), sections 7.12.2 and 6.5 paragraph 8. This is C99-specific; some compilers might not support it.
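A minimal sketch of the pragma in use (the function is illustrative; as noted, not all compilers honor it):

    /* Forbid contraction: a * b must be rounded before the addition,
       so the expression cannot silently become a single fused
       multiply-add with different rounding. */
    #pragma STDC FP_CONTRACT OFF

    double muladd(double a, double b, double c) {
        return a * b + c;
    }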
Use packed decimal.