Malloc performance in a multithreaded environment
I've been running some experiments with the OpenMP framework and found some odd results I'm not sure how to explain.
My goal is to create a huge matrix and then fill it with values. I made some parts of my code, such as this loop, parallel in order to gain performance from my multithreaded environment. I'm running this on a machine with 2 quad-core Xeon processors, so I can safely put up to 8 concurrent threads in there.
Everything works as expected, but for some reason the for loop that actually allocates the rows of my matrix has an odd performance peak when running with only 3 threads. From there on, adding more threads just makes my loop take longer, with 8 threads actually taking more time than it would need with only one.
This is my parallel loop:
int width = 11;
int height = 39916800;
int i;
int chunk = 1024; /* chunk size for dynamic scheduling; the value used is not shown in the original snippet */
vector<vector<int> > matrix;
matrix.resize(height);
#pragma omp parallel shared(matrix,width,height) private(i) num_threads(3)
{
    #pragma omp for schedule(dynamic,chunk)
    for(i = 0; i < height; i++){
        matrix[i].resize(width);
    }
} /* End of parallel block */
This made me wonder: is there a known performance problem with calling malloc (which I suppose is what the resize method of the vector template class actually calls) in a multithreaded environment? I found some articles mentioning performance loss when freeing heap space in a multithreaded environment, but nothing specific about allocating new space, as in this case.
Just to give you an example, I'm placing below a graph of the time it takes for the loop to finish as a function of the number of threads, both for the allocation loop and for a normal loop that just reads data from this huge matrix later on.
Both times were measured using the gettimeofday function and seem to return very similar and accurate results across different execution instances. So, does anyone have a good explanation?
评论(3)
You are right about vector::resize() internally calling malloc. Implementation-wise, malloc is fairly complicated. I can see multiple places where malloc can lead to contention in a multithreaded environment.
malloc probably keeps a global data structure in userspace to manage the user's heap address space. This global data structure would need to be protected against concurrent access and modification. Some allocators have optimizations to reduce the number of times this global data structure is accessed... I don't know how far Ubuntu has come along on that front.
malloc allocates address space. So when you actually begin to touch the allocated memory, you go through a "soft page fault", a page fault which allows the OS kernel to allocate the backing RAM for the allocated address space. This can be expensive because of the trip to the kernel, and it requires the kernel to take some global locks to access its own global RAM resource data structures.
The userspace allocator probably keeps some allocated space on hand to serve new allocations from. However, once that space runs out, the allocator needs to go back to the kernel and allocate more address space. This is also expensive: it requires a trip to the kernel, and the kernel takes some global locks to access its global address-space management data structures.
Bottom line, these interactions can be fairly complicated. If you are running into these bottlenecks, I would suggest that you simply "pre-allocate" your memory. This would involve allocating it and then touching all of it (all from a single thread) so that you can use that memory later from all your threads without running into lock contention at the user or kernel level.
Memory allocators are definitely a possible contention point for multiple threads.
Fundamentally, the heap is a shared data structure, since it is possible to allocate memory on one thread and de-allocate it on another. In fact, your example does exactly that - the "resize" will free memory on each of the worker threads, memory which was initially allocated elsewhere.
Typical implementations of malloc included with gcc and other compilers use a shared global lock and work reasonably well across threads if memory allocation pressure is relatively low. Above a certain allocation level, however, threads will begin to serialize on the lock, you'll get excessive context switching and cache thrashing, and performance will degrade. Your program is an example of something allocation-heavy, with an alloc + dealloc in the inner loop.
I'm surprised that an OpenMP-compatible compiler doesn't ship a better threaded malloc implementation. They certainly exist - take a look at this question for a list.
Technically, the STL vector uses std::allocator, which eventually calls new. new in turn calls the libc's malloc (for your Linux system). This malloc implementation is quite efficient as a general-purpose allocator and is thread-safe, but it is not scalable (the GNU libc's malloc derives from Doug Lea's dlmalloc). There are numerous allocators and papers that improve upon dlmalloc to provide scalable allocation.
I would suggest that you take a look at Hoard from Dr. Emery Berger, tcmalloc from Google, and the Intel Threading Building Blocks scalable allocator.