NUMA-aware cache-aligned memory allocation
On Linux systems, the pthreads library provides a function, posix_memalign(), for cache-aligned allocation to prevent false sharing. And to choose a specific NUMA node of the architecture we can use the libnuma library. What I want is something that does both. I bind certain threads to certain processors, and I want to allocate local data structures for each thread from the corresponding NUMA node in order to decrease the latency of memory operations for the threads. How can I do this?
The numa_alloc_*() functions in libnuma allocate whole pages of memory, typically 4096 bytes. Cache lines are typically 64 bytes. Since 4096 is a multiple of 64, anything that comes back from numa_alloc_*() will already be memaligned at the cache level.
Beware the numa_alloc_*() functions, however. It says on the man page that they are slower than a corresponding malloc(), which I'm sure is true, but the much bigger problem I've found is that simultaneous allocations from numa_alloc_*() running on lots of cores at once suffer massive contention problems. In my case replacing malloc() with numa_alloc_onnode() was a wash (everything I gained by using local memory was offset by increased allocation/free time); tcmalloc was faster than either. I was performing thousands of 12-16kb mallocs on 32 threads/cores at once. Timing experiments showed that it wasn't numa_alloc_onnode()'s single-threaded speed that was responsible for the large amount of time my process spent performing the allocations, which left lock/contention issues as the likely cause.

The solution I've adopted is to numa_alloc_onnode() large chunks of memory once, and then distribute it to the threads running on each node as needed. I use gcc atomic builtins to allow each thread (I pin threads to cpus) to grab from the memory allocated on each node. You can cache-line-size align the distributions as they are made, if you want: I do. This approach beats the pants off of even tcmalloc (which is thread aware but not NUMA aware - at least the Debian Squeeze version doesn't seem to be). The downside of this approach is that you can't free the individual distributions (well, not without a lot more work, anyway), you can only free the whole underlying on-node allocations. However, if this is temporary on-node scratch space for a function call, or you can otherwise specify exactly when that memory is no longer needed, then this approach works very well. It helps if you can predict how much memory you need to allocate on each node too, obviously.
@nandu : I wont post the full source - it's long and in places tied to something else I do which makes it less than perfectly transparent. What I will post is a slightly shortened version of my new malloc() function to illustrate the core idea :
where struct node_memory is
and nm->ptr[node] is set up using the libnuma function numa_alloc_onnode().
I usually store allowable node information in the structure too, so my_malloc() can check node requests are sensible without making function calls; I also check nm exists and that size is sensible. The function __sync_fetch_and_add() is a gcc builtin atomic function; if you're not compiling with gcc you'll need something else. I use atomics because in my limited experience they are much faster than mutexes in high thread/core count conditions (as on 4P NUMA machines).
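The answerer's actual code was not posted with this page, but the description above is enough to sketch the idea. The following is a minimal illustration, not the original implementation: the struct node_memory field names, MAX_NODES, CACHE_LINE, and the my_malloc() signature are all assumptions, and the error checks the answer mentions (validating the node number, nm, and size) are reduced to a bounds check.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES  8    /* assumed upper bound on NUMA nodes */
#define CACHE_LINE 64   /* typical cache-line size in bytes */

/* Hypothetical layout of the per-node pool described above. Each
   ptr[node] is a large chunk obtained once via numa_alloc_onnode(),
   which returns page-aligned (hence cache-line-aligned) memory. */
struct node_memory {
    char  *ptr[MAX_NODES];   /* base of each node's chunk */
    size_t used[MAX_NODES];  /* bump offset into each chunk */
    size_t size[MAX_NODES];  /* total bytes reserved on each node */
};

/* Lock-free bump allocator: atomically claim `size` bytes (rounded up
   to a whole cache line, so every distribution stays aligned) from the
   chunk on `node`. Individual allocations cannot be freed; only the
   whole chunk can be released with numa_free(). */
void *my_malloc(struct node_memory *nm, int node, size_t size)
{
    size_t rounded = (size + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
    size_t offset  = __sync_fetch_and_add(&nm->used[node], rounded);

    if (offset + rounded > nm->size[node])
        return NULL;  /* this node's chunk is exhausted */
    return nm->ptr[node] + offset;
}
```

Because __sync_fetch_and_add() returns the offset before the addition, each concurrent caller receives a disjoint, cache-line-aligned slice of the node's chunk without any mutex.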
If you're just looking to get the alignment functionality around a NUMA allocator, you can easily build your own.
The idea is to call the unaligned malloc() with a little bit more space, then return the first aligned address. To be able to free it, you need to store the base address at a known location. Here's an example. Just substitute the names with whatever is appropriate:
Then when you use this, you need to call my_NUMA_free for anything allocated with my_NUMA_malloc.