Does multithreading emphasize memory fragmentation?
Description
When allocating and deallocating randomly sized memory chunks with 4 or more threads using openmp's parallel construct, the program seems to start leaking memory over the course of the test program's run. As a result, its consumed memory grows from 1050 MB to 1500 MB or more without the extra memory actually being used.
As valgrind shows no issues, I must assume that what appears to be a memory leak is actually an emphasized effect of memory fragmentation.
Interestingly, the effect does not show yet if 2 threads make 10000 allocations each, but it shows strongly if 4 threads make 5000 allocations each. Also, the effect weakens if the maximum size of the allocated chunks is reduced from 1 MB to 256 KB.
Can heavy concurrency emphasize fragmentation that much? Or is this more likely to be a bug in the heap?
Test program description
The demo program is built to obtain a total of 256 MB of randomly sized memory chunks from the heap, doing 5000 allocations. If the memory limit is hit, the chunks allocated first are freed until memory consumption falls below the limit. Once 5000 allocations have been performed, all memory is released and the loop ends. All this work is done for every thread spawned by openmp.
This memory allocation scheme lets us expect a memory consumption of about 260 MB per thread (including some bookkeeping data).
Demo program
As this is really something you might want to test, you can download the sample program from Dropbox.
When running the program as is, you should have at least 1400 MB of RAM available. Feel free to adjust the constants in the code to suit your needs.
For completeness, the actual code follows:
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <vector>
#include <deque>
#include <omp.h>
#include <math.h>

typedef unsigned long long uint64_t;

void runParallelAllocTest()
{
    // constants
    const int NUM_ALLOCATIONS = 5000;   // alloc's per thread
    const int NUM_THREADS = 4;          // how many threads?
    const int NUM_ITERS = NUM_THREADS;  // how many overall repetitions
    const bool USE_NEW = true;          // use new or malloc? seems to make no difference (as it should)
    const bool DEBUG_ALLOCS = false;    // debug output

    // pre-store allocation sizes
    const int NUM_PRE_ALLOCS = 20000;
    const uint64_t MEM_LIMIT = (1024 * 1024) * 256;  // x MB per process
    const size_t MAX_CHUNK_SIZE = 1024 * 1024 * 1;

    srand(1);
    std::vector<size_t> allocations;
    allocations.resize(NUM_PRE_ALLOCS);
    for (int i = 0; i < NUM_PRE_ALLOCS; i++) {
        allocations[i] = rand() % MAX_CHUNK_SIZE;  // use up to x MB chunks
    }

    #pragma omp parallel num_threads(NUM_THREADS)
    #pragma omp for
    for (int i = 0; i < NUM_ITERS; ++i) {
        uint64_t totalAllocBytes = 0;
        uint64_t currAllocBytes = 0;
        std::deque< std::pair<char*, uint64_t> > pointers;
        const int myId = omp_get_thread_num();

        for (int j = 0; j < NUM_ALLOCATIONS; ++j) {
            // new allocation
            const size_t allocSize = allocations[(myId * 100 + j) % NUM_PRE_ALLOCS];

            char* pnt = NULL;
            if (USE_NEW) {
                pnt = new char[allocSize];
            } else {
                pnt = (char*) malloc(allocSize);
            }
            pointers.push_back(std::make_pair(pnt, allocSize));
            totalAllocBytes += allocSize;
            currAllocBytes += allocSize;

            // fill with values to add "delay"
            for (int fill = 0; fill < (int) allocSize; ++fill) {
                pnt[fill] = (char)(j % 255);
            }

            if (DEBUG_ALLOCS) {
                std::cout << "Id " << myId << " New alloc " << pointers.size()
                          << ", bytes:" << allocSize << " at " << (uint64_t) pnt << "\n";
            }

            // free all or just a bit
            if (((j % 5) == 0) || (j == (NUM_ALLOCATIONS - 1))) {
                int frees = 0;

                // keep this much allocated
                // last check, free all
                uint64_t memLimit = MEM_LIMIT;
                if (j == NUM_ALLOCATIONS - 1) {
                    std::cout << "Id " << myId << " about to release all memory: "
                              << (currAllocBytes / (double)(1024 * 1024)) << " MB" << std::endl;
                    memLimit = 0;
                }
                //MEM_LIMIT = 0; // DEBUG

                while (pointers.size() > 0 && (currAllocBytes > memLimit)) {
                    // free one of the first entries to allow previously obtained resources to 'live' longer
                    currAllocBytes -= pointers.front().second;
                    char* pnt = pointers.front().first;

                    // free memory
                    if (USE_NEW) {
                        delete[] pnt;
                    } else {
                        free(pnt);
                    }

                    // update array
                    pointers.pop_front();

                    if (DEBUG_ALLOCS) {
                        std::cout << "Id " << myId << " Free'd " << pointers.size()
                                  << " at " << (uint64_t) pnt << "\n";
                    }
                    frees++;
                }
                if (DEBUG_ALLOCS) {
                    std::cout << "Frees " << frees << ", " << currAllocBytes << "/"
                              << MEM_LIMIT << ", " << totalAllocBytes << "\n";
                }
            }
        } // for each allocation

        if (currAllocBytes != 0) {
            std::cerr << "Not all free'd!\n";
        }
        std::cout << "Id " << myId << " done, total alloc'ed "
                  << ((double) totalAllocBytes / (double)(1024 * 1024)) << "MB \n";
    } // for each iteration

    exit(1);
}

int main(int argc, char** argv)
{
    runParallelAllocTest();
    return 0;
}
Test system
From what I have seen so far, the hardware matters a lot. The test might need adjustments if run on a faster machine.
Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz
Ubuntu 10.04 LTS 64 bit
gcc 4.3, 4.4, 4.6
3988.62 Bogomips
Test
Once you have run the makefile, you should get a binary named ompmemtest. To query its memory usage over time, I used the following commands:
./ompmemtest &
top -b | grep ompmemtest
Which yields quite impressive fragmentation or leaking behaviour. The expected memory consumption for 4 threads is 1090 MB, which turns into 1500 MB over time:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11626 byron 20 0 204m 99m 1000 R 27 2.5 0:00.81 ompmemtest
11626 byron 20 0 992m 832m 1004 R 195 21.0 0:06.69 ompmemtest
11626 byron 20 0 1118m 1.0g 1004 R 189 26.1 0:12.40 ompmemtest
11626 byron 20 0 1218m 1.0g 1004 R 190 27.1 0:18.13 ompmemtest
11626 byron 20 0 1282m 1.1g 1004 R 195 29.6 0:24.06 ompmemtest
11626 byron 20 0 1471m 1.3g 1004 R 195 33.5 0:29.96 ompmemtest
11626 byron 20 0 1469m 1.3g 1004 R 194 33.5 0:35.85 ompmemtest
11626 byron 20 0 1469m 1.3g 1004 R 195 33.6 0:41.75 ompmemtest
11626 byron 20 0 1636m 1.5g 1004 R 194 37.8 0:47.62 ompmemtest
11626 byron 20 0 1660m 1.5g 1004 R 195 38.0 0:53.54 ompmemtest
11626 byron 20 0 1669m 1.5g 1004 R 195 38.2 0:59.45 ompmemtest
11626 byron 20 0 1664m 1.5g 1004 R 194 38.1 1:05.32 ompmemtest
11626 byron 20 0 1724m 1.5g 1004 R 195 40.0 1:11.21 ompmemtest
11626 byron 20 0 1724m 1.6g 1140 S 193 40.1 1:17.07 ompmemtest
Please note: I can reproduce this issue when compiling with gcc 4.3, 4.4 and 4.6 (trunk).
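In case the makefile is not at hand, a compile line along these lines should be enough to rebuild the binary (the source file name and the exact flags are assumptions, since the makefile itself is not included in this copy):

g++ -O2 -fopenmp ompmemtest.cpp -o ompmemtest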
Ok, picked up the bait.
This is on a system with
Naive run
I just ran it. Nothing spectacular. Here is the simultaneous output of vmstat -S M 1:
Vmstat raw data
Does that information mean anything to you?
Google Thread Caching Malloc
Now for real fun, add a little spice
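(The exact command used here was lost in this copy; presumably tcmalloc was pulled in by preloading it, along the lines of the following, with the library path depending on the distribution:)

LD_PRELOAD=/usr/lib/libtcmalloc.so ./ompmemtest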
Looks faster, not?
In case you wanted to compare vmstat outputs
Valgrind --tool massif
This is the head of the output from ms_print after valgrind --tool=massif ./ompmemtest (default malloc):

Google HEAPPROFILE
Unfortunately, vanilla valgrind doesn't work with tcmalloc, so I switched horses midrace to heap profiling with google-perftools.
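(The exact invocation is not preserved here either; with google-perftools, heap profiling is typically enabled by running the tcmalloc-linked or tcmalloc-preloaded binary with the HEAPPROFILE environment variable set, and inspecting the dumps with pprof. The paths below are only an example:)

HEAPPROFILE=/tmp/ompmemtest.hprof LD_PRELOAD=/usr/lib/libtcmalloc.so ./ompmemtest
pprof --text ./ompmemtest /tmp/ompmemtest.hprof.0001.heap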
Contact me for full logs/details
Update
To the comments: I updated the program
It ran for over 5m3s. Close to the end, a screenshot of htop teaches that indeed, the reserved set is slightly higher, going towards 2.3g:
1  [ 96.7%]   Tasks: 125 total, 2 running
2  [ 96.7%]   Load average: 8.09 5.24 2.37
3  [ 97.4%]   Uptime: 01:54:22
4  [ 96.1%]
Mem[| 3055/7936MB]
Swp[      0/0MB]

  PID USER     NLWP PRI NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
 4330 sehe        8  20  0 2635M 2286M   908 R 368. 28.8 15:35.01 ./ompmemtest
Comparing results with a tcmalloc run: 4m12s, similar top stats with minor differences; the big difference is in the VIRT set (but that isn't particularly useful unless you have a very limited address space per process?). The RES set is quite similar, if you ask me. The more important thing to note is that parallelism is increased; all cores are now maxed out. This is obviously due to the reduced need to lock for heap operations when using tcmalloc:

1  [|||100.0%]   Tasks: 172 total, 2 running
2  [|||100.0%]   Load average: 7.39 2.92 1.11
3  [|||100.0%]   Uptime: 11:12:25
4  [|||100.0%]
Mem[|||| 3278/7936MB]
Swp[         0/0MB]

  PID USER     NLWP PRI NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
14391 sehe        8  20  0 2251M 2179M  1148 R 379. 27.5  8:08.92 ./ompmemtest
When linking the test program with google's tcmalloc library, the executable not only runs ~10% faster, but also shows greatly reduced or insignificant memory fragmentation:
From the data I have, the answer appears to be:
Multithreaded access to the heap can emphasize fragmentation if the employed heap library does not deal well with concurrent access and if the processor fails to execute the threads truly concurrently.
The tcmalloc library shows no significant memory fragmentation running the same program that previously caused ~400MB to be lost in fragmentation.
But why does that happen?
The best idea I have to offer here is some sort of locking artifact within the heap.
The test program will allocate randomly sized blocks of memory, freeing up blocks allocated early in the program to stay within its memory limit. When one thread is in the process of releasing old memory which is in a heap block on the 'left', it might actually be halted as another thread is scheduled to run, leaving a (soft) lock on that heap block. The newly scheduled thread wants to allocate memory, but may not even read that heap block on the 'left' side to check for free memory as it is currently being changed. Hence it might end up using a new heap block unnecessarily from the 'right'.
This process could look like heap-block shifting, where the first blocks (on the left) remain only sparsely used and fragmented, forcing new blocks to be used on the right.
Let's restate that this fragmentation issue only occurs for me if I use 4 or more threads on a dual core system, which can only handle two threads more or less concurrently. When only two threads are used, the (soft) locks on the heap will be held short enough not to block the other thread that wants to allocate memory.
Also, as a disclaimer, I didn't check the actual code of the glibc heap implementation, nor am I anything more than a novice in the field of memory allocators - all I wrote is just how it appears to me, which makes it pure speculation.
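(As a suggestion that goes beyond the original test: short of reading the allocator source, glibc malloc's arena behaviour can at least be observed by dumping its own statistics at the end of a run, using the glibc-specific malloc_stats() or malloc_info() calls. A minimal sketch, not part of the program above:)

#include <malloc.h>   // glibc-specific: malloc_stats(), malloc_info()
#include <stdio.h>

// Call this once all threads are done, e.g. right before exit():
// it prints a per-arena summary of allocated/free space to stderr.
void dumpHeapStats()
{
    malloc_stats();
    // malloc_info(0, stdout);  // alternative: XML dump, glibc >= 2.10
}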
Another interesting read might be the tcmalloc documentation, which states common problems with heaps and multi-threaded access, some of which may have played their role in the test program too.
It's worth noting that tcmalloc will never return memory to the system (see the Caveats paragraph in the tcmalloc documentation).
Yes, the default malloc (depending on the Linux version) does some crazy stuff which fails massively in some multithreaded applications. Specifically, it keeps almost per-thread heaps (arenas) to avoid locking. This is much faster than a single heap for all threads, but massively memory-inefficient (sometimes). You can tune this by using code like the sketch below, which turns off the multiple arenas (this kills performance, so don't do it if you have lots of small allocations!).
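(The code referred to above is not included in this copy. A common way to achieve this with glibc is mallopt() with M_ARENA_MAX, or the MALLOC_ARENA_MAX environment variable. The snippet below is an assumed reconstruction, not necessarily what the answer originally showed:)

#include <malloc.h>   // glibc-specific mallopt()

#ifndef M_ARENA_MAX
#define M_ARENA_MAX -8   // value used by glibc if the header doesn't define it
#endif

int main()
{
    // Restrict glibc malloc to a single arena; must run before threads start.
    mallopt(M_ARENA_MAX, 1);
    // ... run the rest of the program from here ...
    return 0;
}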
Or, as others suggested, use one of the various replacements for malloc.
Basically it's impossible for a general purpose malloc to always be efficient as it doesn't know how it's going to be used.
ChrisP.