Do malloc/memcpy run independently on NUMA machines?
While trying to increase the speed of my applications on non-NUMA / standard PCs, I always found that the bottleneck was the call to malloc(), because even on multi-core machines it is shared/synchronized between all the cores.

I have a PC with a NUMA architecture, running Linux and using C, and I have two questions:

- In a NUMA machine, since each core is provided with its own memory, will malloc() execute independently on each core/memory without blocking the other cores?
- In these architectures, how are the calls to memcpy() made? Can it be called independently on each core, or will calling it on one core block the others? I may be wrong, but I remember that memcpy() had the same problem as malloc(), i.e. when one core is using it the others have to wait.
A NUMA machine is a shared-memory system, so memory accesses from any processor can reach the memory without blocking. If the memory model were message-based, then accessing remote memory would require the executing processor to ask the local processor to perform the desired operation on its behalf. Even in a NUMA system, however, a remote processor may still affect the performance of the nearby processor by contending for the memory links, though this depends on the specific architectural configuration.
As for 1, this depends entirely on the OS and the malloc library. The OS is responsible for presenting the per-core / per-processor memory either as a unified space or as NUMA, and malloc may or may not be NUMA-aware. More fundamentally, the malloc implementation may or may not be able to execute concurrently with other requests. The answer from Al (and the associated discussion) addresses this point in greater detail.
As for 2, since memcpy consists of a series of loads and stores, the only impact would again be the potential architectural effect of using the other processors' memory controllers, etc.