How to access shared memory in multi-core systems
In multicore systems, such as 2, 4, or 8 cores, we typically use mutexes and semaphores to access shared memory. However, I can foresee that these methods will incur high overhead on future systems with many cores. Are there any alternative methods that would be better suited for future many-core systems accessing shared memory?
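For concreteness, here is a minimal C++ sketch of the conventional approach the question describes: a single mutex serializing every access to a shared variable. The thread count and loop bounds are arbitrary; the point is that every core contends for the same lock, which is where the overhead grows with core count.

```cpp
#include <mutex>
#include <thread>
#include <vector>

std::mutex m;
long shared_counter = 0;

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 8; ++i)                      // e.g. one thread per core
        workers.emplace_back([] {
            for (int j = 0; j < 100000; ++j) {
                std::lock_guard<std::mutex> lock(m); // all threads serialize here
                ++shared_counter;
            }
        });
    for (auto &t : workers) t.join();
    return 0;
}
```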
4 Answers
Transactional memory is one such method.
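For illustration, here is a minimal sketch of what this can look like today using GCC's experimental transactional memory extension (compile with g++ -fgnu-tm). The Account type and transfer() function are made-up examples, not part of any particular API.

```cpp
// A sketch using GCC's __transaction_atomic language extension.
struct Account { long balance; };

void transfer(Account &from, Account &to, long amount) {
    // The block executes atomically: the TM runtime (or hardware TM,
    // where available) detects conflicting accesses and retries, so no
    // per-account mutex is needed.
    __transaction_atomic {
        from.balance -= amount;
        to.balance   += amount;
    }
}
```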
I'm not sure how far into the future you want to go. But in the very long run, shared memory as we know it right now (a single address space accessible by any core) is not scalable. So the programming model will have to change at some point and make programmers' lives harder, as it did when we went multi-core.
But for now (perhaps for another 10 years) you can get away with transactional memory and other hardware/software tricks.
The reason I say shared memory is not scalable in the long run is simply physics (similar to how single-core/high-frequency designs hit a barrier).
In short, transistors can't shrink to less than the size of an atom (barring new technology), and signals can't propagate faster than the speed of light. Therefore, memory will get slower and slower (with respect to the processor) and at some point, it becomes infeasible to share memory.
We can already see this effect right now with NUMA on multi-socket systems. Large-scale supercomputers are neither shared-memory nor cache-coherent.
1) Lock only the part of the memory you are accessing, not the entire table! This is done with the help of a big hash table: the bigger the table, the finer the locking mechanism.
2) If you can, lock only on writing, not on reading (this requires that reading the "previous value" while it is being updated is not a problem, which is very often a valid case). See the sketch after this list.
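A minimal C++17 sketch combining both ideas: lock striping over a big array of locks indexed by a hash of the key (point 1), and reader-writer locks so readers never block each other (an approximation of point 2; truly lock-free reads would need atomics or RCU). All names here (StripedMap, kStripes, ...) are illustrative.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <shared_mutex>
#include <string>
#include <unordered_map>

class StripedMap {
    static constexpr std::size_t kStripes = 64;  // bigger table -> finer locking
    mutable std::array<std::shared_mutex, kStripes> locks_;
    std::array<std::unordered_map<std::string, int>, kStripes> buckets_;

    std::size_t stripe(const std::string &key) const {
        return std::hash<std::string>{}(key) % kStripes;
    }

public:
    void put(const std::string &key, int value) {
        const std::size_t s = stripe(key);
        std::unique_lock<std::shared_mutex> lock(locks_[s]); // exclusive: writers only
        buckets_[s][key] = value;
    }

    bool get(const std::string &key, int &out) const {
        const std::size_t s = stripe(key);
        std::shared_lock<std::shared_mutex> lock(locks_[s]); // many readers in parallel
        auto it = buckets_[s].find(key);
        if (it == buckets_[s].end()) return false;
        out = it->second;
        return true;
    }
};
```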
Access to shared memory at the lowest level in any multi-processor/core/threaded application's synchronization depends on the bus lock. Such a lock may incur hundreds of (CPU) wait states, since it also encompasses locking those I/O buses that have bus-mastering devices, including DMA. Theoretically, it is possible to envision a medium-level lock that could be invoked when the programmer is certain that the memory area being locked won't be affected by any I/O bus. Such a lock would be much faster because it only needs to synchronize the CPU caches with main memory, which is fast, at least compared to the latency of the slowest I/O buses. Whether programmers in general would be competent to determine when to use which bus lock adds worrying implications for its mainstream feasibility. Such a lock could also require its own dedicated external pins for synchronization with other processors.
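As a point of reference, here is what the lowest-level locked operation looks like from portable C++. On x86 this compiles to a LOCK-prefixed instruction, and for an aligned variable already held in cache, modern CPUs lock only the cache line through the coherence protocol rather than asserting a full bus lock, which is close in spirit to the "medium-level lock" envisioned above.

```cpp
#include <atomic>

std::atomic<long> counter{0};

void increment() {
    // Compiles to a LOCK-prefixed read-modify-write (e.g. LOCK XADD on x86).
    // For an aligned, cached variable, modern CPUs lock only the cache line
    // via the coherence protocol instead of asserting a full bus lock.
    counter.fetch_add(1, std::memory_order_relaxed);
}
```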
In multi-processor Opteron systems each processor has its own memory which becomes part of the entire memory that all installed processors can "see". A processor trying to access memory which turns out to be attached to another processor will transparently complete the access - albeit more slowly - through a high-speed interconnect bus (called HyperTransport) to the processor in charge of that memory (the NUMA concept). As long as a processor and its cores are working with the memory physically connected to it processing will be fast. In addition, many processors are equipped with several external memory buses to multiply their overall memory bandwidth.
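A minimal sketch of exploiting this locality explicitly on Linux with libnuma (link with -lnuma); the buffer size is arbitrary and the error handling is skeletal.

```cpp
#include <numa.h>    // libnuma; link with -lnuma
#include <sched.h>   // sched_getcpu (glibc)
#include <cstddef>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::puts("no NUMA support on this system");
        return 1;
    }
    const int node = numa_node_of_cpu(sched_getcpu()); // node this thread runs on
    const std::size_t size = 1 << 20;                  // 1 MiB, arbitrary
    void *buf = numa_alloc_onnode(size, node);         // node-local allocation
    // ... work on buf: accesses stay on the locally attached memory,
    // avoiding the slower cross-socket interconnect hop ...
    numa_free(buf, size);
    return 0;
}
```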
A theoretical medium-level lock could, on Opteron systems, be implemented using the HyperTransport interconnections.
For any foreseeable future, the classic approach still holds: lock as seldom as possible and for as short a time as possible, by implementing efficient algorithms (and associated data structures) for the work done while the locks are in place.
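A short sketch of that classic advice in C++: do the expensive work outside the critical section and hold the lock only long enough to publish the result. The squaring loop is a stand-in for any expensive computation.

```cpp
#include <mutex>
#include <vector>

std::mutex m;
std::vector<int> shared_data;

void update(const std::vector<int> &input) {
    std::vector<int> result;
    result.reserve(input.size());
    for (int x : input) result.push_back(x * x); // expensive work: no lock held
    std::lock_guard<std::mutex> lock(m);         // lock only to publish the result
    shared_data.swap(result);                    // O(1), so the lock is held briefly
}
```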