Critical sections with multicore processors
With a single-core processor, where all your threads are run from the one single CPU, the idea of implementing a critical section using an atomic test-and-set operation on some mutex (or semaphore, etc.) in memory seems straightforward enough; because your processor is executing a test-and-set from one spot in your program, it necessarily can't be doing one from another spot in your program disguised as some other thread.
But what happens when you actually have more than one physical processor? It seems that simple instruction-level atomicity wouldn't be sufficient, because with two processors potentially executing their test-and-set operations at the same time, what you really need to maintain atomicity on is access to the shared memory location of the mutex. (And if the shared memory location is loaded into cache, there's the whole cache-consistency thing to deal with, too.)
This seems like it would incur far more overhead than the single core case, so here's the meat of the question: How much worse is it? Is it worse? Do we just live with it? Or sidestep it by enforcing a policy that all threads within a process group have to live on the same physical core?
Memory accesses are handled by the memory controller, which should take care of multi-core issues, i.e. it shouldn't allow simultaneous access to the same addresses (probably handled on either a memory-page or memory-line basis). So you can use a flag to indicate whether another processor is updating the memory contents of some block (this is to avoid a type of dirty read where part of the record is updated, but not all of it).
A more elegant solution is to use a hardware semaphore block, if the processor has such a feature. A hardware semaphore is a simple queue, which could be of size no_of_cores -1. This is how it is in TI's 6487/8 processor. You can either query the semaphore directly (and loop until it is released) or do an indirect query, which will result in an interrupt once your core gets the resource. Requests are queued and served in the order they were made. A semaphore query is an atomic operation.
Cache coherency is another issue, and you might need to do cache write-backs and refreshes in some cases. But this is a very cache-implementation-specific thing. With the 6487/8 we needed to do that on a few operations.
Well, depending on what type of computers you have lying around the house, do the following: write a simple multithreaded application. Run this application on a single core (Pentium 4 or Core Solo) and then run it on a multicore processor (Core 2 Duo or similar) and see how big the speed-up is.
Granted, these are unfair comparisons, since a Pentium 4 and a Core Solo are much slower than a Core 2 Duo regardless of core count. Maybe compare a Core 2 Duo against a Core 2 Quad with an application that can use 4 or more threads.
You raise a number of valid points. Multiple processors introduce a lot of headache and overhead. However, we just have to live with them, because the speed boost of parallelism can far outweigh them, as long as the critical sections are kept short compared to the parallel work.
As for your final suggestion about having all threads on the same physical core: that completely defeats the point of a multi-core computer!
Multi-core/SMP systems are not just several CPUs glued together. There's explicit support for doing things in parallel. All the synchronization primitives are implemented with the help of hardware, along the lines of an atomic CAS. The instruction either locks the bus shared by the CPUs and the memory controller (and by devices that do DMA) and updates the memory, or just updates the memory and relies on cache snooping. This in turn causes the cache-coherency algorithm to kick in, forcing all involved parties to flush their caches.
Disclaimer - this is a very basic description; there are more interesting things here, like virtual vs. physical caches, cache write-back policies, memory models, fences, etc.
If you want to know more about how an OS might use these hardware facilities, here's an excellent book on the subject.
Vendors of multi-core CPUs have to take care that the different cores coordinate themselves when executing instructions that guarantee atomic memory access.
On Intel chips, for instance, you have the 'cmpxchg' instruction. It compares the value stored at a memory location to an expected value and exchanges it for the new value if the two match. If you precede it with the 'lock' prefix, it is guaranteed to be atomic with respect to all cores.
You would need a test-and-set that forces the processor to notify all the other cores of the operation so that they are aware of it. Yes, that introduces overhead, and you have to live with it. That's a reason to design multithreaded applications in such a way that they don't wait on synchronization primitives too often.
That would cancel out the whole point of multithreading. When you are using a lock, semaphore, or other synchronization technique, you are relying on the OS to make sure these operations are interlocked, no matter how many cores you are using.
The time to switch to a different thread after a lock has been released is mostly determined by the cost of a context switch. This SO thread deals with the context switching overhead, so you might want to check that.
You should also read this MSDN article: Understanding the Impact of Low-Lock Techniques in Multithreaded Apps.