Multithreading on a multi-core architecture
When you have a situation where Thread A reads some global variable and Thread B writes to the same variable: as long as the read/write is atomic, you can do this on a single core without synchronizing. But what happens when running on a multi-core machine?
7 Answers
No one has mentioned the pros and cons of implicit synchronization.
The main "pro" is of course that the programmer can write anything at all and not have to bother about synchronization.
The main "con" is that this takes A LOT of time. The implicit synchronization needs to wind its way down through the caches to at least (you might think) the first cache that is common to both cores. Wrong! There may be several physical processors installed in the computer so synchronization can't stop at a cache, it needs to go all the way down to RAM. If you want to synchronize there you also need to synchronize with other devices that need to synchronize with memory i e any bus-mastering device. Bus-mastering devices may be cards on the classic PCI-bus and may be running at 33 MHz so the implicit synchronization would need to wait for them too to acknowledge that it's ok to write to or read from a specific RAM location. We're talking a 100X difference just in clock speed between the core and the slowest bus and the slowest bus needs several of its own bus cycles to react in a reliable manner. Because synchronization MUST be reliable, it is of no use otherwise.
So in the choice between implementing electronics for implicit synchronization (which is better left to the programmer to handle explicitly anyway) and a faster system which can synchronize when necessary the answer is obvious.
The explicit keys to synchronization are the LOCK prefix and the XCHG mem,reg instruction.
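As a rough sketch of what those explicit keys buy you from C++ (assuming a C++11 compiler; on x86, the exchange() below typically lowers to the implicitly LOCKed XCHG mem,reg the answer refers to):

```cpp
#include <atomic>

// Minimal spinlock sketch built on an atomic exchange.
// On x86 this exchange typically compiles to XCHG mem,reg,
// which carries an implicit LOCK.
class Spinlock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Atomically swap in 'true'; if the old value was already
        // 'true', another thread holds the lock, so keep spinning.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // busy-wait
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};
```

The same idea is available ready-made as std::atomic_flag, which is guaranteed lock-free.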
You could say that implicit synchronization is like training wheels: you won't fall to the ground but you can't go especially fast or turn especially quickly. Soon you'll tire and want to move on to the real stuff. Sure, you'll get hurt but in the process you'll either learn or quit.
As far as the (new) C++ standard is concerned, if a program contains a data race, the behavior of the program is undefined. A program has a data race if there is an interleaving of threads such that it contains two neighboring conflicting memory accesses from different threads (which is just a very formal way of saying "a program has a data race if two conflicting accesses can occur concurrently").
Note that it doesn't matter how many cores you're running on: the behavior of your program is undefined (notably, the optimizer can reorder instructions as it sees fit).
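A minimal sketch of the question's scenario under that definition (illustrative globals; making the variable std::atomic is one way to remove the race):

```cpp
#include <atomic>
#include <thread>

int plain = 0;                // ordinary global: racy to share
std::atomic<int> guarded{0};  // atomic global: race-free to share

int main() {
    // Data race: Thread A reads 'plain' while Thread B writes it.
    // Per the standard the whole program's behavior is undefined,
    // no matter how many cores it runs on.
    std::thread a([] { int x = plain; (void)x; });
    std::thread b([] { plain = 42; });
    a.join(); b.join();

    // No data race: concurrent atomic accesses do not constitute
    // a data race, so this version is well-defined.
    std::thread c([] { int x = guarded.load(); (void)x; });
    std::thread d([] { guarded.store(42); });
    c.join(); d.join();
}
```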
Depending on your situation, the following may be relevant. While it won't make your program run incorrectly, it can make a big difference in speed. Even if you aren't accessing the same memory location, you may take a performance hit due to cache effects if two cores are thrashing over the same cache line (though not the same location, because you carefully synchronized your data structures).
There is a good overview of "false sharing" here:
http://www.drdobbs.com/go-parallel/article/showArticle.jhtml;jsessionid=LIHTU4QIPKADTQE1GHRSKH4ATMY32JVN?articleID=217500206
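A hedged sketch of the effect described there (the 64-byte cache-line size is an assumption; it varies by platform):

```cpp
#include <atomic>
#include <thread>

// False sharing: these two "independent" counters very likely sit
// in the same cache line, so two cores still contend over it.
struct Shared {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Mitigation: give each counter its own (assumed 64-byte) cache line.
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename T>
void hammer(T& s) {
    std::thread t1([&] { for (int i = 0; i < 1000000; ++i) s.a.fetch_add(1); });
    std::thread t2([&] { for (int i = 0; i < 1000000; ++i) s.b.fetch_add(1); });
    t1.join(); t2.join();
}

int main() {
    Shared shared;  // typically slower: the line ping-pongs between cores
    Padded padded;  // typically faster: no shared line, no ping-pong
    hammer(shared);
    hammer(padded);
}
```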
For a non-atomic operation on a multi-core machine, you need to use a system-provided mutex in order to synchronize the accesses.
For C++, the boost mutex library offers several mutex types that provide a consistent interface over the OS-supplied mutex types.
If you choose to look at boost as your syncing / multithreading library, you should read up on the Synchronization concepts.
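A minimal usage sketch, assuming Boost.Thread is installed and linked (the catch-all header is used here for brevity):

```cpp
#include <boost/thread.hpp>

int shared_value = 0;
boost::mutex shared_mutex;  // guards shared_value

void writer() {
    // RAII guard: locks on construction, unlocks on destruction.
    boost::lock_guard<boost::mutex> guard(shared_mutex);
    shared_value = 42;
}

void reader() {
    boost::lock_guard<boost::mutex> guard(shared_mutex);
    int x = shared_value;  // safe: the mutex orders this with the write
    (void)x;
}

int main() {
    boost::thread a(writer);
    boost::thread b(reader);
    a.join();
    b.join();
}
```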
Even on a single-core machine, there is absolutely no guarantee that this will work without explicit synchronization.
There are several reasons for this: among other things, the compiler is free to reorder or cache memory accesses, and a plain read or write may not compile down to a single atomic instruction.
If you want correct communication between two threads, you need some kind of synchronization. Always, with no exception.
That synchronization may be a mutex provided by the OS or the threading API, or it may be CPU-specific atomic instructions, or just a plain memory barrier.
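As one concrete form of that synchronization, here is a sketch using C++11 atomics, where the compiler emits whatever barrier or instruction the target CPU requires: a release store paired with an acquire load hands data from one thread to another.

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // ordinary data being handed over
std::atomic<bool> ready{false};  // the synchronization point

void producer() {
    payload = 42;                                  // write the data first
    ready.store(true, std::memory_order_release);  // then publish it
}

void consumer() {
    // The acquire load pairs with the release store: once we observe
    // ready == true, the write to 'payload' is guaranteed to be visible.
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    int x = payload;
    (void)x;
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
}
```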
It will have the same pitfalls as with a single core, but with additional latency due to the L1 cache synchronization that must take place between the cores.
Note - "you can do it without synchronizing" is not always a true statement.
Even on a single core, you cannot assume that an operation will be atomic. That may be the case when you're coding in assembler, but if you are coding in C++, as per your question, you do not know what it will compile down to.
You should rely on the synchronisation primitives at the level of abstraction that you're coding to. In your case, that's the threading calls for C++, whether they be pthreads, Windows threads, or something else entirely.
It's the same reasoning that I gave in another answer to do with whether i++ was thread-safe. The bottom line is, you don't know since you're not coding to that level (if you're doing inline assembler and/or you understand and can control what's going on under the covers, you're no longer coding at the C++ level and you can ignore my advice).
The operating system and/or OS-type libraries know a great deal about the environment they're running in, far more so than the C++ compiler would. Use of proper synchronisation primitives will save you a great deal of angst.
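To illustrate the i++ point, a sketch of the broken version next to an atomic fix (iteration counts are illustrative):

```cpp
#include <atomic>
#include <thread>

int counter = 0;                  // plain int: ++ is load, add, store
std::atomic<int> safe_counter{0};

void work() {
    for (int i = 0; i < 100000; ++i) {
        ++counter;                 // racy: concurrent increments can be lost
        safe_counter.fetch_add(1); // atomic read-modify-write
    }
}

int main() {
    std::thread a(work), b(work);
    a.join(); b.join();
    // counter frequently ends up below 200000;
    // safe_counter is always exactly 200000.
}
```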