Are floating-point operations deterministic when running in multiple threads?

Posted 2025-01-29 11:41:25

Suppose I have a function that runs calculations, an example being something like a dot product - I pass in arrays A, B of vectors and a float array C, and the function assigns:
C[i] = dot(A[i], B[i]);

If I create and start two threads that will run this function, and pass the same three arrays to both threads, under what circumstances is this type of action (perhaps using a different non-random mathematical operation, etc.) not guaranteed to give the same result (running the same application without any recompilation, and on the same machine)? I'm only interested in the context of a consumer PC.
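
A minimal sketch of the scenario, for concreteness (the types, the std::thread usage, and the two separate output arrays are my assumptions - the setup described above passes the same C to both threads; separate outputs here just make the comparison observable):

    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    struct Vec2 { float x, y; };

    // The per-element operation described above.
    float dot(const Vec2& a, const Vec2& b) {
        return a.x * b.x + a.y * b.y;
    }

    // Both threads run this function over the same input arrays.
    void compute(const std::vector<Vec2>& A, const std::vector<Vec2>& B,
                 std::vector<float>& C) {
        for (std::size_t i = 0; i < A.size(); ++i)
            C[i] = dot(A[i], B[i]);
    }

    int main() {
        std::vector<Vec2> A(1000, {1.0f, 2.0f}), B(1000, {3.0f, 4.0f});
        std::vector<float> C1(A.size()), C2(A.size());

        std::thread t1(compute, std::cref(A), std::cref(B), std::ref(C1));
        std::thread t2(compute, std::cref(A), std::cref(B), std::ref(C2));
        t1.join();
        t2.join();
        // The question: is C1 guaranteed to equal C2 bit-for-bit?
    }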

I know that float operations are in general deterministic, but I do wonder whether perhaps something weird could happen, and maybe on one thread the calculations will use an intermediate 80-bit register, but not on the other.

I would assume it's pretty much guaranteed that the same binary code should run in both threads (is there some way this could not happen? The function being compiled multiple times for some reason, the compiler somehow figuring out it will run in multiple threads, and compiling it again, for some reason, for the second thread?).
But I'm a bit more worried that CPU cores might not have the same instruction sets, even on consumer-level PCs.

Side question - what about GPUs in a similar scenario?

//

I'm assuming x86_64, Windows, C++, and dot is a.x * b.x + a.y * b.y. I can't give more info than that - this uses Unity IL2CPP, and I don't know how it compiles or with what options.

Motivation for the question: I'm writing a computational geometry procedure that modifies a mesh - I'll call this the "geometric mesh". The issue is that the "rendering mesh" may have multiple vertices at certain geometric positions - this is needed for flat shading, for example, where you have multiple vertices with different normals. However, the actual computational geometry procedure only uses the purely geometric data of the positions in space.

So I see two options:

  1. Create a map from the rendering mesh to the geometric mesh (for example, duplicate vertices being mapped to one unique vertex), run the procedure on the geometric mesh, then somehow modify the rendering mesh based on the result (a sketch of such a map follows this list).
  2. Work with the rendering mesh directly. This is slightly less efficient since the procedure does calculations for all vertices, but it's much easier from a code perspective. Most of all, though, I'm a bit worried that I could get two different values for two vertices that actually have the same position, which shouldn't happen: only the position is used, and the position would be the same for both such vertices.
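
For option 1, a minimal sketch of building that map, assuming duplicate render vertices have bit-identical positions (all names and types here are illustrative, not Unity's API):

    #include <cstdint>
    #include <cstring>
    #include <unordered_map>
    #include <vector>

    struct Vec3 { float x, y, z; };

    // Key positions by their exact bit patterns, since duplicated render
    // vertices are assumed to carry bit-identical positions.
    struct PosKey {
        std::uint32_t bits[3];
        bool operator==(const PosKey& o) const {
            return std::memcmp(bits, o.bits, sizeof bits) == 0;
        }
    };
    struct PosKeyHash {
        std::size_t operator()(const PosKey& k) const {
            std::size_t h = 14695981039346656037ull;               // FNV-1a
            for (std::uint32_t b : k.bits) { h ^= b; h *= 1099511628211ull; }
            return h;
        }
    };

    // For each render vertex, the index of its unique geometric vertex.
    std::vector<std::size_t> buildRenderToGeometric(
            const std::vector<Vec3>& renderPositions,
            std::vector<Vec3>& geometricPositions) {
        std::unordered_map<PosKey, std::size_t, PosKeyHash> unique;
        std::vector<std::size_t> map(renderPositions.size());
        for (std::size_t i = 0; i < renderPositions.size(); ++i) {
            PosKey k;
            std::memcpy(k.bits, &renderPositions[i], sizeof k.bits);
            auto [it, inserted] = unique.try_emplace(k, geometricPositions.size());
            if (inserted) geometricPositions.push_back(renderPositions[i]);
            map[i] = it->second;
        }
        return map;
    }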

Comments (2)

墨落画卷 2025-02-05 11:41:25

Floating-point (FP) operations are not associative (though they are commutative). As a result, (x+y)+z can give different results than x+(y+z). For example, (1e-13 + (1 - 1e-13)) == ((1e-13 + 1) - 1e-13) is false with 64-bit IEEE-754 floats. The C++ standard is not very restrictive about floating-point numbers. However, the widely-used IEEE-754 standard is: it specifies the precision of 32-bit and 64-bit number operations, including rounding modes. x86-64 processors are IEEE-754 compliant, and mainstream compilers (e.g. GCC, Clang and MSVC) are also IEEE-754 compliant by default. ICC is not compliant by default, since it assumes that FP operations are associative for the sake of performance. Mainstream compilers have compilation flags to make such an assumption so as to speed up code. It is generally combined with other assumptions, such as that no FP value is NaN (e.g. -ffast-math). Such flags break IEEE-754 compliance, but they are often used in the 3D and video game industries to speed up code. IEEE-754 is not required by the C++ standard, but you can check for it with std::numeric_limits<T>::is_iec559.
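
A minimal demonstration of both points - the grouping-dependent result from the example above, and the IEC 559 conformance check:

    #include <iostream>
    #include <limits>

    int main() {
        // Same addends, different grouping: the comparison prints false.
        double a = 1e-13, b = 1.0;
        std::cout << std::boolalpha
                  << ((a + (b - a)) == ((a + b) - a)) << '\n';   // false

        // True if this implementation's double conforms to IEEE-754 / IEC 559.
        std::cout << std::numeric_limits<double>::is_iec559 << '\n';
    }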

Threads can have different rounding modes by default. However, you can set the rounding mode using the C code provided in this answer. Also, please note that denormal numbers are sometimes disabled on some platforms because of their very high overhead (see this for more information).
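
I cannot reproduce the linked answer here, but a sketch of the same idea with the standard <cfenv> facilities (strict conformance needs #pragma STDC FENV_ACCESS ON, or /fp:strict on MSVC, which compilers handle loosely; this is an illustration, not the linked code):

    #include <cfenv>
    #include <cstdio>
    #include <thread>

    // The FP environment is per-thread, so each thread can pick its own
    // rounding mode without affecting the other one.
    void work(int mode, const char* name) {
        std::fesetround(mode);
        volatile double x = 1.0, y = 10.0;          // volatile defeats constant folding
        std::printf("%s: %.20f\n", name, x / y);    // 1/10 is inexact in binary
    }

    int main() {
        std::thread up(work, FE_UPWARD, "FE_UPWARD");
        std::thread down(work, FE_DOWNWARD, "FE_DOWNWARD");
        up.join();
        down.join();
    }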

Assuming IEEE-754 compliance is not broken, the rounding mode is the same, and the threads do the operations in the same order, then the results should be identical to at least 1 ULP. In practice, if both are compiled using the same mainstream compiler, the results should be exactly the same.

The thing is, using multiple threads often results in a non-deterministic order of the applied FP operations, which causes non-deterministic results. More specifically, atomic operations on FP variables often cause such issues because the order of the operations frequently changes at runtime. If you want deterministic results, you need to use static partitioning and avoid atomic operations on FP variables - or, more generally, any atomic operations that could result in a different ordering. The same applies to locks or any other synchronization mechanism.
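
A sketch of such static partitioning - each thread owns a fixed contiguous slice, so there are no atomics and no ordering that depends on scheduling (an elementwise product stands in for the actual work):

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Each thread writes only its own fixed [begin, end) slice of c, so the
    // per-element FP operations happen in the same order on the same data,
    // no matter how the threads are scheduled.
    void partitionedWork(const std::vector<float>& a, const std::vector<float>& b,
                         std::vector<float>& c, std::size_t nThreads) {
        std::vector<std::thread> threads;
        const std::size_t chunk = (a.size() + nThreads - 1) / nThreads;
        for (std::size_t t = 0; t < nThreads; ++t) {
            const std::size_t begin = std::min(t * chunk, a.size());
            const std::size_t end = std::min(begin + chunk, a.size());
            threads.emplace_back([&, begin, end] {
                for (std::size_t i = begin; i < end; ++i)
                    c[i] = a[i] * b[i];   // no shared accumulator, no atomics
            });
        }
        for (auto& th : threads) th.join();
    }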

The same is true for GPUs. In fact, such problems are very frequent when developers use atomic FP operations, for example to sum values. They often do that because implementing fast reductions is complex (though more deterministic), and atomic operations are pretty fast on modern GPUs (since they use dedicated, efficient units).

┾廆蒐ゝ 2025-02-05 11:41:25

According to the accepted answer to floating point processor non-determinism?, C++ floating point is not non-deterministic. The same sequence of instructions will give the same results.

There are a few things to take into account though:

Firstly, the behavior (i.e. the result) of a particular piece of C++ source code doing an FP calculation may depend on the compiler and the chosen compiler options. For example, it may depend on whether the compiler chooses to emit 64-bit or 80-bit FP instructions. But this is deterministic.

Secondly, similar C++ source code may give different results; e.g. due to non-associative behavior of certain FP instructions. This also is deterministic.

Determinism won't be affected by multi-threading by default. The C++ compiler will probably be unaware of whether the code is multi-threaded or not. And it definitely has no reason to emit different FP code.

Admittedly, FP behavior depends on the rounding mode selected, and that can be set on a per-thread basis. However, for this to happen, something (application code) would have to explicitly set different rounding modes for different threads. Once again, that is deterministic. (And a pretty daft thing for the application code to do, IMO.)


The idea that a PC would use different FP hardware with different behavior for different threads seems far-fetched to me. Sure, a PC could have (say) an Intel chipset and an ARM chipset, but it is not plausible that different threads of the same C++ application (executable) would simultaneously run on both chipsets.

Likewise for GPUs. Indeed, given that you need to program GPUs in a way that is radically different to ordinary (or threaded) C++, I would doubt that they could even share the same source code.


In short, I think that you are worrying about a hypothetical problem that you are unlikely to encounter in reality ... given the current state of the art in hardware and C++ compilers.
