Malloc performance in a multithreaded environment
I've been running some experiments with the OpenMP framework and found some odd results I'm not sure how to explain.
My goal is to create a huge matrix and then fill it with values. I made some parts of my code, such as this loop, parallel in order to gain performance from my multithreaded environment. I'm running this on a machine with 2 quad-core Xeon processors, so I can safely put up to 8 concurrent threads in there.
Everything works as expected, but for some reason the for loop that actually allocates the rows of my matrix has an odd performance peak when running with only 3 threads. From there on, adding more threads just makes my loop take longer, with 8 threads actually taking more time than it would need with only one.
This is my parallel loop:
int width = 11;
int height = 39916800;
int i;
int chunk = 1024; /* chunk size for dynamic scheduling; the value used is not shown in the original snippet */
vector<vector<int> > matrix;
matrix.resize(height);
#pragma omp parallel shared(matrix,width,height) private(i) num_threads(3)
{
    #pragma omp for schedule(dynamic,chunk)
    for(i = 0; i < height; i++){
        matrix[i].resize(width);
    }
} /* End of parallel block */
This made me wonder: is there a known performance problem with calling malloc (which I suppose is what the resize method of the vector template class actually calls) in a multithreaded environment? I found some articles mentioning performance loss when freeing heap space in a multithreaded environment, but nothing specific about allocating new space, as in this case.
Just to give you an example, I'm placing below a graph of the time it takes for the loop to finish as a function of the number of threads, both for the allocation loop and for a normal loop that just reads data from this huge matrix later on.
Both times were measured using the gettimeofday function and seem to return very similar and accurate results across different execution instances. So, does anyone have a good explanation?
评论(3)
You are right about vector::resize() internally calling malloc. Implementation-wise, malloc is fairly complicated. I can see multiple places where malloc can lead to contention in a multithreaded environment.
malloc probably keeps a global data structure in userspace to manage the user's heap address space. This global data structure would need to be protected against concurrent access and modification. Some allocators have optimizations to reduce the number of times this global data structure is accessed... I don't know how far Ubuntu has come along on that front.
malloc allocates address space. So when you actually begin to touch the allocated memory, you go through a "soft page fault", a page fault which allows the OS kernel to allocate the backing RAM for the allocated address space. This can be expensive because of the trip to the kernel, and it requires the kernel to take some global locks to access its own global RAM resource data structures.
The userspace allocator probably keeps some allocated space on hand to serve new allocations from. However, once that space runs out, the allocator needs to go back to the kernel and allocate more address space. This is also expensive: it requires a trip to the kernel, and the kernel takes some global locks to access its global address-space management data structures.
Bottom line, these interactions can be fairly complicated. If you are running into these bottlenecks, I would suggest that you simply "pre-allocate" your memory. This would involve allocating it and then touching all of it (all from a single thread) so that you can use that memory later from all your threads without running into lock contention at the user or kernel level.
Memory allocators are definitely a possible contention point for multiple threads.
Fundamentally, the heap is a shared data structure, since it is possible to allocate memory on one thread and de-allocate it on another. In fact, your example does exactly that - the "resize" will free memory on each of the worker threads, memory which was initially allocated elsewhere.
Typical implementations of malloc included with gcc and other compilers use a shared global lock and work reasonably well across threads if memory allocation pressure is relatively low. Above a certain allocation level, however, threads will begin to serialize on the lock, you'll get excessive context switching and cache thrashing, and performance will degrade. Your program is an example of something allocation-heavy, with an alloc + dealloc in the inner loop.
I'm surprised that an OpenMP-compatible compiler doesn't ship a better threaded malloc implementation. They certainly exist - take a look at this question for a list.
Technically, the STL vector uses std::allocator, which eventually calls new. new in turn calls the libc's malloc (for your Linux system). This malloc implementation is quite efficient as a general-purpose allocator and is thread-safe, but it is not scalable (the GNU libc's malloc derives from Doug Lea's dlmalloc). There are numerous allocators and papers that improve upon dlmalloc to provide scalable allocation.
I would suggest that you take a look at Hoard from Dr. Emery Berger, tcmalloc from Google, and the Intel Threading Building Blocks scalable allocator.