Do malloc/memcpy run independently on NUMA machines?

Published 2024-10-27 06:28:32

While trying to increase the speed of my applications on non-NUMA / standard PCs, I always found that the bottleneck was the call to malloc(), because even on multi-core machines it is shared/synchronized between all the cores.

I have available a PC with NUMA architecture using Linux and C and I have two questions:

  1. In a NUMA machine, since each core is provided with its own memory, will malloc() execute independently on each core/memory without blocking the other cores?
  2. In these architectures, how are the calls to memcpy() made? Can it be called independently on each core, or will calling it on one core block the others? I may be wrong, but I remember that memcpy() had the same problem as malloc(), i.e. when one core is using it the others have to wait.
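Question 1 can be probed empirically. The following is a minimal sketch (all names such as `churn` and `run_churn` are illustrative, not any real API): POSIX guarantees that malloc() is thread-safe, but whether concurrent calls actually contend on one global lock depends on the allocator; glibc, for instance, normally serves each thread from its own arena.

```c
/* Sketch: malloc() called concurrently from several threads.
 * Thread-safety is guaranteed by POSIX; whether the calls also run
 * without contending on a single global lock depends on the allocator
 * (glibc normally gives each thread its own arena). */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

enum { CHURN_ITERS = 100000, MAX_THREADS = 64 };

/* Thread body: allocate and free repeatedly, counting successes. */
static void *churn(void *arg)
{
    (void)arg;
    size_t ok = 0;
    for (size_t i = 0; i < CHURN_ITERS; i++) {
        void *p = malloc(64);
        if (p) { ok++; free(p); }
    }
    return (void *)(uintptr_t)ok;
}

/* Run nthreads churn loops concurrently; return total allocations. */
size_t run_churn(int nthreads)
{
    pthread_t tid[MAX_THREADS];
    size_t total = 0;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, churn, NULL);
    for (int i = 0; i < nthreads; i++) {
        void *ret;
        pthread_join(tid[i], &ret);
        total += (size_t)(uintptr_t)ret;
    }
    return total;
}
```

Compile with `-pthread`. Comparing the wall-clock time of `run_churn(1)` against `run_churn(N)` gives a rough indication of whether your allocator serializes concurrent allocations.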

Comments (2)

菩提树下叶撕阳。 2024-11-03 06:28:32

A NUMA machine is a shared-memory system, so memory accesses from any processor can reach the memory without blocking. If the memory model were message-based, accessing remote memory would require the executing processor to ask the local processor to perform the desired operation. Even so, in a NUMA system a remote processor may still impact the performance of a nearby processor by using its memory links, though this depends on the specific architectural configuration.

As for 1, this depends entirely on the OS and the malloc library. The OS is responsible for presenting the per-core / per-processor memory either as a unified space or as NUMA. malloc may or may not be NUMA-aware. But fundamentally, the malloc implementation may or may not be able to execute concurrently with other requests. The answer from Al (and the associated discussion) addresses this point in greater detail.

As for 2, since memcpy consists of a series of loads and stores, the only impact would again be the potential architectural effects of using other processors' memory controllers, etc.
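The point about memcpy can be illustrated with a small sketch (the names `copy_and_check` and `parallel_copies` are illustrative): memcpy holds no lock and compiles to plain loads and stores, so copies of private buffers on different cores proceed independently; the only cross-core effect is contention for memory bandwidth.

```c
/* Sketch: memcpy() called concurrently on private per-thread buffers.
 * memcpy takes no lock -- it is plain loads and stores -- so the
 * copies do not serialize; only memory bandwidth is shared. */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { COPY_BYTES = 1 << 20 };   /* 1 MiB per thread */

/* Copy a private source buffer and verify it; return 1 on success. */
int copy_and_check(void)
{
    unsigned char *src = malloc(COPY_BYTES);
    unsigned char *dst = malloc(COPY_BYTES);
    int ok = 0;
    if (src && dst) {
        memset(src, 0xAB, COPY_BYTES);
        memcpy(dst, src, COPY_BYTES);          /* no synchronization here */
        ok = (memcmp(dst, src, COPY_BYTES) == 0);
    }
    free(src);
    free(dst);
    return ok;
}

static void *copy_worker(void *arg)
{
    (void)arg;
    return (void *)(uintptr_t)copy_and_check();
}

/* Run n concurrent copies; return how many verified correctly. */
int parallel_copies(int n)
{
    pthread_t tid[16];
    int good = 0;
    if (n > 16) n = 16;
    for (int i = 0; i < n; i++)
        pthread_create(&tid[i], NULL, copy_worker, NULL);
    for (int i = 0; i < n; i++) {
        void *r;
        pthread_join(tid[i], &r);
        good += (int)(uintptr_t)r;
    }
    return good;
}
```

On a NUMA box, pinning each thread to a different node (e.g. with `numactl` or `pthread_setaffinity_np`) and timing this would expose the bandwidth effects the answer describes.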

来世叙缘 2024-11-03 06:28:32
  1. Calls to malloc in separate processes will execute independently regardless of whether you are on a NUMA architecture. Calls to malloc in different threads of the same process cannot execute independently, because the memory returned is equally accessible to all threads within the process. If you want memory that is local to a particular thread, read up on Thread Local Storage. I have not been able to find any clear documentation on whether the Linux VM and scheduler are able to optimize the affinity between cores, threads, local memory and thread-local storage.
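The Thread Local Storage mentioned above can be sketched with C11's `_Thread_local` (names like `bump` and `tls_is_independent` are illustrative): each thread gets its own copy of the variable, so concurrent updates never interfere.

```c
/* Sketch: thread-local storage with C11 _Thread_local.
 * Each thread has its own private copy of `counter`. */
#include <pthread.h>
#include <stdint.h>

static _Thread_local long counter = 0;

/* Increment this thread's private counter n times; return its value. */
long bump(long n)
{
    for (long i = 0; i < n; i++)
        counter++;
    return counter;
}

static void *tls_worker(void *arg)
{
    (void)arg;
    return (void *)(uintptr_t)bump(1000);
}

/* Run two threads concurrently; return 1 iff each saw its own counter
 * reach exactly 1000 (a shared counter could have reached 2000). */
int tls_is_independent(void)
{
    pthread_t a, b;
    void *ra, *rb;
    pthread_create(&a, NULL, tls_worker, NULL);
    pthread_create(&b, NULL, tls_worker, NULL);
    pthread_join(a, &ra);
    pthread_join(b, &rb);
    return (uintptr_t)ra == 1000 && (uintptr_t)rb == 1000;
}
```

Note that TLS gives each thread a private *variable*, not NUMA-local *pages*; on Linux, page placement is typically governed by the first-touch policy or explicit tools like numactl.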