Thread cooperation on a dual-CPU machine
I remember in a course I took in college, one of my favorite examples of a race condition was one in which a simple main()
method started two threads, one of which incremented a shared (global) variable by one, the other decrementing it. Pseudo code:
static int i = 10;

main() {
    new Thread(thread_run1).start();
    new Thread(thread_run2).start();
    waitForThreads();
    print("The value of i: " + i);
}

thread_run1 {
    i++;
}

thread_run2 {
    i--;
}
The professor then asked what the value of i is after a million billion zillion runs. (If it would ever be anything other than 10, essentially.) Students unfamiliar with multithreading systems responded that 100% of the time, the print() statement would always report i as 10.
This was in fact incorrect, as our professor demonstrated that each increment/decrement statement was actually compiled (to assembly) as 3 statements:
1: move value of 'i' into register x
2: add 1 to value in register x
3: move value of register x into 'i'
Thus, the value of i could be 9, 10, or 11. (I won't go into specifics.)
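Translated into runnable Java (the class name RaceDemo, the per-thread iteration count, and the lambda-based thread bodies are my additions; the pseudocode above leaves these unspecified), the lost-update effect can be observed directly:

```java
// A runnable sketch of the professor's example. The shared field is a plain
// int, so each i++ / i-- compiles to the 3-step load/modify/store sequence
// described above, and updates from the two threads can be lost.
public class RaceDemo {
    static int i = 10;

    // Runs `rounds` increments against `rounds` decrements and returns the
    // final value of i. With a single ++/-- per thread the race is rarely
    // visible, so we loop to make lost updates likely.
    static int runOnce(int rounds) {
        i = 10;
        Thread up = new Thread(() -> { for (int n = 0; n < rounds; n++) i++; });
        Thread down = new Thread(() -> { for (int n = 0; n < rounds; n++) i--; });
        up.start();
        down.start();
        try {                       // waitForThreads() from the pseudocode
            up.join();
            down.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println("The value of i: " + runOnce(1_000_000));
    }
}
```

On a multi-core machine this typically prints something other than 10, since each lost update silently swallows one increment or decrement.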
My Question:
It was (is?) my understanding that the set of physical registers is processor-specific. When working with dual-CPU machines (note the difference between dual-core and dual-CPU), does each CPU have its own set of physical registers? I had assumed the answer is yes.
On a single-CPU (multithreaded) machine, context switching allows each thread to have its own virtual set of registers. Since there are two physical sets of registers on a dual-CPU machine, couldn't this result in even more potential for race conditions, since you can literally have two threads operating simultaneously, as opposed to 'virtual' simultaneous operation on a single-CPU machine? (Virtual simultaneous operation in reference to the fact that register states are saved/restored each context switch.)
To be more specific - if you were running this on an 8-CPU machine, each CPU with one thread, are race conditions eliminated? If you expand this example to use 8 threads, on a dual-CPU machine, each CPU having 4 cores, would the potential for race conditions increase or decrease? How does the operating system prevent step 3
of the assembly instructions from being run simultaneously on two different CPUs?
Comments (3)
Yes, the introduction of dual-core CPUs made a significant number of programs with latent threading races fail quickly. Single-core CPUs multitask by the scheduler rapidly switching the thread context between threads, which eliminates a class of threading bugs associated with a stale CPU cache.
The example you give can fail on a single core as well, though: the thread scheduler can interrupt a thread just after it has loaded the value of the variable into a register in order to increment it. It just won't fail nearly as often, because the odds of the scheduler interrupting the thread at exactly that point aren't that great.
There's an operating system feature that lets such programs limp along anyway instead of crashing within minutes, called 'processor affinity'; it is available as the AFFINITY command-line option for start.exe on Windows, and as SetProcessAffinityMask() in the winapi. Review the Interlocked class for helper methods that atomically increment and decrement variables.
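Interlocked is the .NET API; the closest Java equivalent is java.util.concurrent.atomic.AtomicInteger, whose incrementAndGet()/decrementAndGet() compile down to a single atomic read-modify-write instruction (a LOCK-prefixed add or a compare-and-swap on x86). A sketch of the example rewritten that way (class name and iteration count are my own):

```java
import java.util.concurrent.atomic.AtomicInteger;

// The 3-step load/modify/store sequence is replaced by one atomic hardware
// operation, so no interleaving can lose an update - even with the two
// threads running simultaneously on two different physical CPUs.
public class AtomicDemo {
    static final AtomicInteger i = new AtomicInteger(10);

    static int runOnce(int rounds) {
        i.set(10);
        Thread up = new Thread(() -> {
            for (int n = 0; n < rounds; n++) i.incrementAndGet();
        });
        Thread down = new Thread(() -> {
            for (int n = 0; n < rounds; n++) i.decrementAndGet();
        });
        up.start();
        down.start();
        try {
            up.join();
            down.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return i.get();
    }

    public static void main(String[] args) {
        System.out.println("The value of i: " + runOnce(1_000_000)); // always 10
    }
}
```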
You'd still have a race condition - it doesn't change that at all. Imagine two cores both performing an increment at the same time - they'd both load the same value, increment to the same value, and then store the same value... so the overall increment from the two operations would be one instead of two.
There are additional causes of potential problems where memory models are concerned - where step 1 may not really retrieve the latest value of i, and step 3 may not immediately write the new value of i in a way which other threads can see. Basically, it all becomes very tricky - which is why it's generally a good idea to either use synchronization when accessing shared data, or to use lock-free higher-level abstractions which have been written by experts who really know what they're doing.
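The synchronization this answer recommends can be sketched with Java's synchronized keyword (class and method names here are illustrative): holding a common lock makes each 3-step read-modify-write sequence indivisible, and the Java memory model guarantees that a thread entering a synchronized block sees the writes made by the thread that last left it.

```java
// Both threads must acquire the same monitor before touching i, so their
// load/modify/store sequences can never interleave, and each thread is
// guaranteed to see the other's completed writes.
public class SyncDemo {
    static int i = 10;
    static final Object lock = new Object();

    static void increment() { synchronized (lock) { i++; } }
    static void decrement() { synchronized (lock) { i--; } }

    static int runOnce(int rounds) {
        i = 10;
        Thread up = new Thread(() -> {
            for (int n = 0; n < rounds; n++) increment();
        });
        Thread down = new Thread(() -> {
            for (int n = 0; n < rounds; n++) decrement();
        });
        up.start();
        down.start();
        try {
            up.join();
            down.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println("The value of i: " + runOnce(1_000_000)); // always 10
    }
}
```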
First, dual-processor versus dual-core has no real effect. A dual-core processor still has two completely separate processors on the chip. They may share some cache, and do share a common bus to memory/peripherals, but the processors themselves are entirely separate. A dual-threaded single core (such as with Hyperthreading) is a third variation - but it too has a set of registers per virtual processor. The two virtual processors share a single set of execution resources, but they retain completely separate register sets.
Second, there are really only two cases that are interesting: a single thread of execution, and everything else. Once you have more than one thread (even if all of them run on a single processor), you have the same potential problems as if you were running on some huge machine with thousands of processors. Now, it's certainly true that you're likely to see the problems manifest sooner when the code runs on more processors (up to as many as you've created threads), but the problems themselves don't change at all.
From a practical standpoint, having more cores is useful for testing. Given the granularity of task switching on a typical OS, it's pretty easy to write code that will run for years without showing problems on a single processor, yet will crash and burn in a matter of hours or even minutes when you run it on two or more physical processors. The problem hasn't really changed, though - it's just a lot more likely to show up quickly when you have more processors.
Ultimately, a race condition (or deadlock, livelock, etc.) is about the design of the code, not about the hardware it runs on. The hardware can make a difference in what steps you need to take to enforce the conditions involved, but the relevant differences have little to do with the simple number of processors. Rather, they're about concessions made when you have not just a single machine with multiple processors, but multiple machines with completely separate address spaces, so you may have to take extra steps to ensure that when you write a value to memory, it becomes visible to CPUs on other machines that can't see that memory directly.
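The visibility point - a write made on one CPU not yet being seen by another - can be sketched in Java with volatile, which forces a store to be published and a load to fetch the latest value (the field and class names here are mine, not from the answers above):

```java
// Publishes `payload` from the main thread to a reader thread. The volatile
// write to `done` happens-before the reader's volatile read of it, so once
// the reader observes done == true it is guaranteed to also see the earlier
// ordinary write payload = 42, even if the threads run on different CPUs.
// Without volatile, the reader could legally spin forever on a stale value.
public class VisibilityDemo {
    static volatile boolean done = false;
    static int payload = 0;
    static int seen = -1;

    static int publishAndRead() {
        done = false;
        payload = 0;
        seen = -1;
        Thread reader = new Thread(() -> {
            while (!done) { }   // spin until the writer's store becomes visible
            seen = payload;
        });
        reader.start();
        payload = 42;           // ordinary write...
        done = true;            // ...published by this volatile write
        try {
            reader.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(publishAndRead()); // prints 42
    }
}
```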