Does multithreading emphasize memory fragmentation?
Description
When allocating and deallocating randomly sized memory chunks with 4 or more threads using openmp's parallel construct, the program seems to start leaking memory over the course of the test program's run. As a result, its consumed memory grows from 1050 MB to 1500 MB or more without the extra memory actually being used.
As valgrind shows no issues, I must assume that what appears to be a memory leak is actually an emphasized effect of memory fragmentation.
Interestingly, the effect does not show yet if 2 threads make 10000 allocations each, but it shows strongly if 4 threads make 5000 allocations each. Also, the effect weakens if the maximum size of the allocated chunks is reduced from 1 MB to 256 KB.
Can heavy concurrency emphasize fragmentation that much? Or is this more likely to be a bug in the heap?
Test program description
The demo program is built to obtain a total of 256 MB of randomly sized memory chunks from the heap, doing 5000 allocations. If the memory limit is hit, the chunks allocated first are freed until memory consumption falls below the limit. Once 5000 allocations have been performed, all memory is released and the loop ends. All this work is done for every thread spawned by openmp.
This memory allocation scheme lets us expect a memory consumption of about 260 MB per thread (including some bookkeeping data).
Demo program
As this is really something you might want to test, you can download the sample program from Dropbox.
When running the program as is, you should have at least 1400 MB of RAM available. Feel free to adjust the constants in the code to suit your needs.
For completeness, the actual code follows:
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <vector>
#include <deque>
#include <omp.h>
#include <math.h>

typedef unsigned long long uint64_t;

void runParallelAllocTest()
{
    // constants
    const int NUM_ALLOCATIONS = 5000;   // alloc's per thread
    const int NUM_THREADS = 4;          // how many threads?
    const int NUM_ITERS = NUM_THREADS;  // how many overall repetitions
    const bool USE_NEW = true;          // use new or malloc? seems to make no difference (as it should)
    const bool DEBUG_ALLOCS = false;    // debug output

    // pre-store allocation sizes
    const int NUM_PRE_ALLOCS = 20000;
    const uint64_t MEM_LIMIT = (1024 * 1024) * 256;  // x MB per process
    const size_t MAX_CHUNK_SIZE = 1024 * 1024 * 1;

    srand(1);
    std::vector<size_t> allocations;
    allocations.resize(NUM_PRE_ALLOCS);
    for (int i = 0; i < NUM_PRE_ALLOCS; i++) {
        allocations[i] = rand() % MAX_CHUNK_SIZE;  // use up to x MB chunks
    }

    #pragma omp parallel num_threads(NUM_THREADS)
    #pragma omp for
    for (int i = 0; i < NUM_ITERS; ++i) {
        uint64_t totalAllocBytes = 0;
        uint64_t currAllocBytes = 0;
        std::deque< std::pair<char*, uint64_t> > pointers;
        const int myId = omp_get_thread_num();

        for (int j = 0; j < NUM_ALLOCATIONS; ++j) {
            // new allocation
            const size_t allocSize = allocations[(myId * 100 + j) % NUM_PRE_ALLOCS];

            char* pnt = NULL;
            if (USE_NEW) {
                pnt = new char[allocSize];
            } else {
                pnt = (char*) malloc(allocSize);
            }
            pointers.push_back(std::make_pair(pnt, allocSize));
            totalAllocBytes += allocSize;
            currAllocBytes += allocSize;

            // fill with values to add "delay"
            for (int fill = 0; fill < (int) allocSize; ++fill) {
                pnt[fill] = (char)(j % 255);
            }

            if (DEBUG_ALLOCS) {
                std::cout << "Id " << myId << " New alloc " << pointers.size()
                          << ", bytes:" << allocSize << " at " << (uint64_t) pnt << "\n";
            }

            // free all or just a bit
            if (((j % 5) == 0) || (j == (NUM_ALLOCATIONS - 1))) {
                int frees = 0;

                // keep this much allocated
                // last check, free all
                uint64_t memLimit = MEM_LIMIT;
                if (j == NUM_ALLOCATIONS - 1) {
                    std::cout << "Id " << myId << " about to release all memory: "
                              << (currAllocBytes / (double)(1024 * 1024)) << " MB" << std::endl;
                    memLimit = 0;
                }
                //MEM_LIMIT = 0; // DEBUG

                while (pointers.size() > 0 && (currAllocBytes > memLimit)) {
                    // free one of the first entries to allow previously obtained resources to 'live' longer
                    currAllocBytes -= pointers.front().second;
                    char* pnt = pointers.front().first;

                    // free memory
                    if (USE_NEW) {
                        delete[] pnt;
                    } else {
                        free(pnt);
                    }

                    // update array
                    pointers.pop_front();

                    if (DEBUG_ALLOCS) {
                        std::cout << "Id " << myId << " Free'd " << pointers.size()
                                  << " at " << (uint64_t) pnt << "\n";
                    }
                    frees++;
                }
                if (DEBUG_ALLOCS) {
                    std::cout << "Frees " << frees << ", " << currAllocBytes << "/"
                              << MEM_LIMIT << ", " << totalAllocBytes << "\n";
                }
            }
        } // for each allocation

        if (currAllocBytes != 0) {
            std::cerr << "Not all free'd!\n";
        }
        std::cout << "Id " << myId << " done, total alloc'ed "
                  << ((double) totalAllocBytes / (double)(1024 * 1024)) << "MB \n";
    } // for each iteration

    exit(1);
}

int main(int argc, char** argv)
{
    runParallelAllocTest();
    return 0;
}
Test system
From what I have seen so far, the hardware matters a lot. The test might need adjustments if run on a faster machine.
Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz
Ubuntu 10.04 LTS 64 bit
gcc 4.3, 4.4, 4.6
3988.62 Bogomips
Test
Once you have run the makefile, you should get a binary named ompmemtest. To query its memory usage over time, I used the following commands:
./ompmemtest &
top -b | grep ompmemtest
Which yields quite impressive fragmentation or leaking behaviour. The expected memory consumption for 4 threads is 1090 MB, which turns into 1500 MB over time:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11626 byron 20 0 204m 99m 1000 R 27 2.5 0:00.81 ompmemtest
11626 byron 20 0 992m 832m 1004 R 195 21.0 0:06.69 ompmemtest
11626 byron 20 0 1118m 1.0g 1004 R 189 26.1 0:12.40 ompmemtest
11626 byron 20 0 1218m 1.0g 1004 R 190 27.1 0:18.13 ompmemtest
11626 byron 20 0 1282m 1.1g 1004 R 195 29.6 0:24.06 ompmemtest
11626 byron 20 0 1471m 1.3g 1004 R 195 33.5 0:29.96 ompmemtest
11626 byron 20 0 1469m 1.3g 1004 R 194 33.5 0:35.85 ompmemtest
11626 byron 20 0 1469m 1.3g 1004 R 195 33.6 0:41.75 ompmemtest
11626 byron 20 0 1636m 1.5g 1004 R 194 37.8 0:47.62 ompmemtest
11626 byron 20 0 1660m 1.5g 1004 R 195 38.0 0:53.54 ompmemtest
11626 byron 20 0 1669m 1.5g 1004 R 195 38.2 0:59.45 ompmemtest
11626 byron 20 0 1664m 1.5g 1004 R 194 38.1 1:05.32 ompmemtest
11626 byron 20 0 1724m 1.5g 1004 R 195 40.0 1:11.21 ompmemtest
11626 byron 20 0 1724m 1.6g 1140 S 193 40.1 1:17.07 ompmemtest
Please note: I can reproduce this issue when compiling with gcc 4.3, 4.4 and 4.6 (trunk).
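In case the makefile is not at hand, a compile line along these lines should be enough to rebuild the binary (the source file name and the exact flags are assumptions, since the makefile itself is not included in this copy):

g++ -O2 -fopenmp ompmemtest.cpp -o ompmemtest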
Ok, picked up the bait.
This is on a system with
Naive run
I just ran it. Nothing spectacular. Here is the simultaneous output of vmstat -S M 1:
Vmstat raw data
Does that information mean anything to you?
Google Thread Caching Malloc
Now for real fun, add a little spice
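(The exact command used here was lost in this copy; presumably tcmalloc was pulled in by preloading it, along the lines of the following, with the library path depending on the distribution:)

LD_PRELOAD=/usr/lib/libtcmalloc.so ./ompmemtest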
Looks faster, not?
In case you wanted to compare vmstat outputs
Valgrind --tool massif
This is the head of the output from ms_print after valgrind --tool=massif ./ompmemtest (default malloc):

Google HEAPPROFILE
Unfortunately, vanilla valgrind doesn't work with tcmalloc, so I switched horses midrace to heap profiling with google-perftools.
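(The exact invocation is not preserved here either; with google-perftools, heap profiling is typically enabled by running the tcmalloc-linked or tcmalloc-preloaded binary with the HEAPPROFILE environment variable set, and inspecting the dumps with pprof. The paths below are only an example:)

HEAPPROFILE=/tmp/ompmemtest.hprof LD_PRELOAD=/usr/lib/libtcmalloc.so ./ompmemtest
pprof --text ./ompmemtest /tmp/ompmemtest.hprof.0001.heap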
Contact me for full logs/details
Update
To the comments: I updated the program
It ran for over 5m3s. Close to the end, a screenshot of htop teaches that indeed, the reserved set is slightly higher, going towards 2.3g:
1  [ 96.7%]   Tasks: 125 total, 2 running
2  [ 96.7%]   Load average: 8.09 5.24 2.37
3  [ 97.4%]   Uptime: 01:54:22
4  [ 96.1%]
Mem[| 3055/7936MB]
Swp[      0/0MB]

  PID USER     NLWP PRI NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
 4330 sehe        8  20  0 2635M 2286M   908 R 368. 28.8 15:35.01 ./ompmemtest
Comparing results with a tcmalloc run: 4m12s, similar top stats with minor differences; the big difference is in the VIRT set (but that isn't particularly useful unless you have a very limited address space per process?). The RES set is quite similar, if you ask me. The more important thing to note is that parallelism is increased; all cores are now maxed out. This is obviously due to the reduced need to lock for heap operations when using tcmalloc:

1  [|||100.0%]   Tasks: 172 total, 2 running
2  [|||100.0%]   Load average: 7.39 2.92 1.11
3  [|||100.0%]   Uptime: 11:12:25
4  [|||100.0%]
Mem[|||| 3278/7936MB]
Swp[         0/0MB]

  PID USER     NLWP PRI NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
14391 sehe        8  20  0 2251M 2179M  1148 R 379. 27.5  8:08.92 ./ompmemtest
When linking the test program with google's tcmalloc library, the executable not only runs ~10% faster, but also shows greatly reduced or insignificant memory fragmentation:
From the data I have, the answer appears to be:
Multithreaded access to the heap can emphasize fragmentation if the employed heap library does not deal well with concurrent access and if the processor fails to execute the threads truly concurrently.
The tcmalloc library shows no significant memory fragmentation running the same program that previously caused ~400MB to be lost in fragmentation.
But why does that happen?
The best idea I have to offer here is some sort of locking artifact within the heap.
The test program will allocate randomly sized blocks of memory, freeing up blocks allocated early in the program to stay within its memory limit. When one thread is in the process of releasing old memory which is in a heap block on the 'left', it might actually be halted as another thread is scheduled to run, leaving a (soft) lock on that heap block. The newly scheduled thread wants to allocate memory, but may not even read that heap block on the 'left' side to check for free memory as it is currently being changed. Hence it might end up using a new heap block unnecessarily from the 'right'.
This process could look like heap-block shifting, where the first blocks (on the left) remain only sparsely used and fragmented, forcing new blocks to be used on the right.
Let's restate that this fragmentation issue only occurs for me if I use 4 or more threads on a dual core system, which can only handle two threads more or less concurrently. When only two threads are used, the (soft) locks on the heap will be held short enough not to block the other thread that wants to allocate memory.
Also, as a disclaimer, I didn't check the actual code of the glibc heap implementation, nor am I anything more than a novice in the field of memory allocators - all I wrote is just how it appears to me, which makes it pure speculation.
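(As a suggestion that goes beyond the original test: short of reading the allocator source, glibc malloc's arena behaviour can at least be observed by dumping its own statistics at the end of a run, using the glibc-specific malloc_stats() or malloc_info() calls. A minimal sketch, not part of the program above:)

#include <malloc.h>   // glibc-specific: malloc_stats(), malloc_info()
#include <stdio.h>

// Call this once all threads are done, e.g. right before exit():
// it prints a per-arena summary of allocated/free space to stderr.
void dumpHeapStats()
{
    malloc_stats();
    // malloc_info(0, stdout);  // alternative: XML dump, glibc >= 2.10
}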
Another interesting read might be the tcmalloc documentation, which states common problems with heaps and multi-threaded access, some of which may have played their role in the test program too.
It's worth noting that tcmalloc will never return memory to the system (see the Caveats paragraph in the tcmalloc documentation).
Yes, the default malloc (depending on the Linux version) does some crazy stuff which fails massively in some multithreaded applications. Specifically, it keeps almost per-thread heaps (arenas) to avoid locking. This is much faster than a single heap for all threads, but massively memory-inefficient (sometimes). You can tune this by using code like the sketch below, which turns off the multiple arenas (this kills performance, so don't do it if you have lots of small allocations!).
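(The code referred to above is not included in this copy. A common way to achieve this with glibc is mallopt() with M_ARENA_MAX, or the MALLOC_ARENA_MAX environment variable. The snippet below is an assumed reconstruction, not necessarily what the answer originally showed:)

#include <malloc.h>   // glibc-specific mallopt()

#ifndef M_ARENA_MAX
#define M_ARENA_MAX -8   // value used by glibc if the header doesn't define it
#endif

int main()
{
    // Restrict glibc malloc to a single arena; must run before threads start.
    mallopt(M_ARENA_MAX, 1);
    // ... run the rest of the program from here ...
    return 0;
}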
Or, as others suggested, use one of the various replacements for malloc.
Basically it's impossible for a general purpose malloc to always be efficient as it doesn't know how it's going to be used.
ChrisP.