How to count memory accesses to remote NUMA memory nodes?
In a multi-threaded application running on a recent Linux distributed shared memory system, is there a straightforward way to count the number of requests per thread to remote (non-local) NUMA memory nodes?
I am thinking of using PAPI to count interconnect traffic. Is this the way to go?
In my application, threads are bound to a particular core or processor for their entire lifetime. When the application begins, memory is allocated page-wise and spread in a round-robin manner across all available NUMA memory nodes.
Thank you for your answers.
If you have access to VTune, local and remote NUMA node accesses are counted by the hardware counters OFFCORE_RESPONSE.ANY_DATA.OTHER_LOCAL_DRAM_0 (fast local NUMA node accesses) and OFFCORE_RESPONSE.ANY_DATA.REMOTE_DRAM_0 (slower remote NUMA node accesses).
How the counters appear in VTune:
How the counters look in two scenarios:
NUMA unhappy code: core 0 (NUMA node 0) increments 50 MB residing on NUMA node 1:

NUMA happy code: core 0 (NUMA node 0) increments 50 MB residing on NUMA node 0:

I found the pcm-numa.x tool that comes with Intel PCM to be quite useful. It tells you the number of times each core has accessed the local or remote NUMA nodes.
I'm not sure this qualifies as straightforward, and I don't know what a "distributed shared memory system" is, but on plain Linux, if you have access to the source, you may be able to count the requests yourself. You could use the answer to my "Can I get the NUMA node from a pointer address?" question here to figure out which node the requested memory is on and, knowing which node your thread runs on, tally up the remote requests. This will only tell you how often you use remote memory, not when that memory is absent from your local cache and has to be fetched, so it may not be exactly what you want.
If you want to know about cache misses on remote memory, try adding the profiling tag to your question - it might attract more readers. If there's a profiler that will distinguish local memory misses from remote memory misses I'd be interested to find out too.