NUMA-aware cache-aligned memory allocation
On Linux systems, the pthreads library provides a function, posix_memalign(), for cache-aligned allocation to prevent false sharing. And to choose a specific NUMA node of the architecture we can use the libnuma library. What I want is something that does both. I bind certain threads to certain processors, and I want to allocate local data structures for each thread from the corresponding NUMA node in order to decrease the latency of memory operations for the threads. How can I do this?
The numa_alloc_*() functions in libnuma allocate whole pages of memory, typically 4096 bytes. Cache lines are typically 64 bytes. Since 4096 is a multiple of 64, anything that comes back from numa_alloc_*() will already be memaligned at the cache level.
Beware the numa_alloc_*() functions, however. It says on the man page that they are slower than a corresponding malloc(), which I'm sure is true, but the much bigger problem I've found is that simultaneous allocations from numa_alloc_*() running on lots of cores at once suffer massive contention problems. In my case replacing malloc() with numa_alloc_onnode() was a wash (everything I gained by using local memory was offset by increased allocation/free time); tcmalloc was faster than either. I was performing thousands of 12-16kb mallocs on 32 threads/cores at once. Timing experiments showed that it wasn't numa_alloc_onnode()'s single-threaded speed that was responsible for the large amount of time my process spent performing the allocations, which left lock/contention issues as the likely cause.

The solution I've adopted is to numa_alloc_onnode() large chunks of memory once, and then distribute it to the threads running on each node as needed. I use gcc atomic builtins to allow each thread (I pin threads to cpus) to grab from the memory allocated on each node. You can cache-line-size align the distributions as they are made, if you want: I do. This approach beats the pants off of even tcmalloc (which is thread aware but not NUMA aware - at least the Debian Squeeze version doesn't seem to be). The downside of this approach is that you can't free the individual distributions (well, not without a lot more work, anyway), you can only free the whole underlying on-node allocations. However, if this is temporary on-node scratch space for a function call, or you can otherwise specify exactly when that memory is no longer needed, then this approach works very well. It helps if you can predict how much memory you need to allocate on each node too, obviously.
@nandu : I wont post the full source - it's long and in places tied to something else I do which makes it less than perfectly transparent. What I will post is a slightly shortened version of my new malloc() function to illustrate the core idea :
where struct node_memory is
and nm->ptr[node] is set up using the libnuma function numa_alloc_onnode().
I usually store allowable node information in the structure too, so my_malloc() can check node requests are sensible without making function calls; I also check nm exists and that size is sensible. The function __sync_fetch_and_add() is a gcc builtin atomic function; if you're not compiling with gcc you'll need something else. I use atomics because in my limited experience they are much faster than mutexes in high thread/core count conditions (as on 4P NUMA machines).
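The answerer's actual code was not posted with this page, but the description above is enough to sketch the idea. The following is a minimal illustration, not the original implementation: the struct node_memory field names, MAX_NODES, CACHE_LINE, and the my_malloc() signature are all assumptions, and the error checks the answer mentions (validating the node number, nm, and size) are reduced to a bounds check.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES  8    /* assumed upper bound on NUMA nodes */
#define CACHE_LINE 64   /* typical cache-line size in bytes */

/* Hypothetical layout of the per-node pool described above. Each
   ptr[node] is a large chunk obtained once via numa_alloc_onnode(),
   which returns page-aligned (hence cache-line-aligned) memory. */
struct node_memory {
    char  *ptr[MAX_NODES];   /* base of each node's chunk */
    size_t used[MAX_NODES];  /* bump offset into each chunk */
    size_t size[MAX_NODES];  /* total bytes reserved on each node */
};

/* Lock-free bump allocator: atomically claim `size` bytes (rounded up
   to a whole cache line, so every distribution stays aligned) from the
   chunk on `node`. Individual allocations cannot be freed; only the
   whole chunk can be released with numa_free(). */
void *my_malloc(struct node_memory *nm, int node, size_t size)
{
    size_t rounded = (size + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
    size_t offset  = __sync_fetch_and_add(&nm->used[node], rounded);

    if (offset + rounded > nm->size[node])
        return NULL;  /* this node's chunk is exhausted */
    return nm->ptr[node] + offset;
}
```

Because __sync_fetch_and_add() returns the offset before the addition, each concurrent caller receives a disjoint, cache-line-aligned slice of the node's chunk without any mutex.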
If you're just looking to get the alignment functionality around a NUMA allocator, you can easily build your own.
The idea is to call the unaligned malloc() with a little bit more space, then return the first aligned address. To be able to free it, you need to store the base address at a known location. Here's an example. Just substitute the names with whatever is appropriate:
Then when you use this, you need to call my_NUMA_free for anything allocated with my_NUMA_malloc.