Is "malloc" maybe my program's bottleneck?


My program calls malloc 10'000 times a second. I have absolutely no idea how long a malloc call takes.

  • As long as an uncontested mutex lock? (10-100 ns)
  • As long as compressing 1 KB of data? (1-10 µs)
  • As long as an SSD random read? (100-1000 µs)
  • As long as a network transfer to Amsterdam? (10-100 ms)

Instead of spending two hours investigating this, only to find out that it is absolutely dwarfed by something else my program does, I would like a rough idea of what to expect. Ballpark. Not precise. Off by a factor of 10 does not matter at all.
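
(Back-of-envelope, using the numbers above: at 10'000 calls per second, a 100 ns malloc adds up to 10'000 × 100 ns = 1 ms of CPU time per second, i.e. 0.1% of a core, while a 100 µs malloc would add up to a full second per second, i.e. an entire core.)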

The following picture was upvoted 200 times here:

[image: latency comparison chart, not preserved in this copy]

1 Answer

鹤仙姿, 2025-01-17 16:32:27


To state the obvious first: profiling for specific use cases is always required. However, this question asked for a rough, ballpark, order-of-magnitude estimate, which is what we reach for when we don't yet know whether a problem is even worth thinking about. Do I need to worry about my data being in cache when it is then sent to Amsterdam? Looking at the picture in the question, the answer is a resounding no. Yes, it could be a problem, but only if you messed up badly. We assume that case is ruled out and instead discuss the problem in general, probabilistic terms.

Ironically, this question arose while I was working on a program that cares very much about small details, where a performance difference of a few percent translates into millions of CPU hours. Profiling suggested malloc was not an issue, but before dismissing it outright I wanted to sanity-check: is it theoretically plausible for malloc to be a bottleneck?

As repeatedly suggested in an earlier, closed version of the question, there are large differences between environments.
I tried various machines (Intel: i7 8700K, i5 5670, some early-gen mobile i7 in a laptop; AMD: Ryzen 4300G, Ryzen 3900X), various OSes (Windows 10, Debian, Ubuntu) and compilers (gcc, clang-14, cygwin-g++, msvc; no debug builds).

I've used this to get an idea about the characteristics(*), using just 1 thread:

#include <stddef.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>

int main(void) {
    const size_t allocs = 10;
    const size_t repeats = 10000;
    printf("chunk\tms\tM1/s\tGB/s\tcheck\n");

    /* Sweep chunk sizes from 16 B to ~10 MB, doubling each step. */
    for (size_t size = 16; size < 10 * 1000 * 1000; size *= 2) {
        float t0 = (float)clock() / CLOCKS_PER_SEC;
        size_t check = 0;

        for (size_t repeat = 0; repeat < repeats; ++repeat) {
            char* ps[allocs];
            for (size_t i = 0; i < allocs; i++) {
                ps[i] = malloc(size);
                if (!ps[i]) {
                    exit(1);
                }
                /* Touch one byte per 512 B so the memory is actually
                   used; otherwise the allocator could hand out
                   untouched virtual pages at little cost. */
                for (size_t touch = 0; touch < size; touch += 512) {
                    ps[i][touch] = 1;
                }
            }
            for (size_t i = 0; i < allocs; i++) {
                check += ps[i][0];  /* keeps the optimizer honest */
                free(ps[i]);
            }
        }

        float dt = (float)clock() / CLOCKS_PER_SEC - t0;
        /* chunk size, ms per malloc, million mallocs/s, touched GB/s, checksum */
        printf("%zu\t%1.5f\t%7.3f\t%7.1f\t%zu\n",
            size,
            dt / allocs / repeats * 1000,
            allocs / dt * repeats / 1000 / 1000,
            allocs / dt * repeats * size / 1024 / 1024 / 1024,
            check);
    }
}
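
To reproduce it, use an optimized build, for example (the file name is my choice, not from the original runs):

gcc -O2 -o mallocbench mallocbench.c
./mallocbench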

The variance is stark but, as expected, the values still land in the same ballpark.
The following table is representative; the others were off by less than a factor of 10. Columns: chunk size in bytes, milliseconds per malloc, million mallocs per second, and touched-memory throughput in GB/s (check is just the optimizer-defeating checksum).

chunk   ms      M1/s    GB/s    check
16      0.00003  38.052     0.6 100000
32      0.00003  37.736     1.1 100000
64      0.00003  37.651     2.2 100000
128     0.00004  24.931     3.0 100000
256     0.00004  26.991     6.4 100000
512     0.00004  26.427    12.6 100000
1024    0.00004  24.814    23.7 100000
2048    0.00007  15.256    29.1 100000
4096    0.00007  14.633    55.8 100000
8192    0.00008  12.940    98.7 100000
16384   0.00066   1.511    23.1 100000
32768   0.00271   0.369    11.3 100000
65536   0.00707   0.141     8.6 100000
131072  0.01594   0.063     7.7 100000
262144  0.04401   0.023     5.5 100000
524288  0.11226   0.009     4.3 100000
1048576 0.25546   0.004     3.8 100000
2097152 0.52395   0.002     3.7 100000
4194304 0.80179   0.001     4.9 100000
8388608 1.78242   0.001     4.4 100000

Here's one from a 3900X on cygwin-g++. You can clearly see the larger CPU cache, and after that, the higher memory throughput.

chunk   ms      M1/s    GB/s    check
16      0.00004  25.000     0.4 100000
32      0.00005  20.000     0.6 100000
64      0.00004  25.000     1.5 100000
128     0.00004  25.000     3.0 100000
256     0.00004  25.000     6.0 100000
512     0.00005  20.000     9.5 100000
1024    0.00004  25.000    23.8 100000
2048    0.00005  20.000    38.1 100000
4096    0.00005  20.000    76.3 100000
8192    0.00010  10.000    76.3 100000
16384   0.00015   6.667   101.7 100000
32768   0.00077   1.299    39.6 100000
65536   0.00039   2.564   156.5 100000
131072  0.00067   1.493   182.2 100000
262144  0.00093   1.075   262.5 100000
524288  0.02679   0.037    18.2 100000
1048576 0.14183   0.007     6.9 100000
2097152 0.26805   0.004     7.3 100000
4194304 0.51644   0.002     7.6 100000
8388608 1.01604   0.001     7.7 100000

So what gives?
With small chunk sizes, >= 10 million calls per second are possible even on old commodity hardware.
Once sizes go beyond the CPU cache, i.e. somewhere around 1 to 100 MB, RAM access quickly dominates (I did not test malloc without actually using the chunks).
Depending on the sizes you malloc, one or the other will be the (ballpark) limit.
With something like 10k allocations per second, however, this is something you can likely ignore for the time being.
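
Since allocation without touching the memory was not measured, here is a minimal sketch of that variant (my addition, untested on the machines above). Note that allocators may defer most of the real work, such as page faults, until first touch, and that optimizers can elide unused malloc/free pairs, which the pointer checksum below tries to prevent:

#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>

int main(void) {
    const size_t allocs = 10;
    const size_t repeats = 10000;
    printf("chunk\tms\n");
    for (size_t size = 16; size < 10 * 1000 * 1000; size *= 2) {
        float t0 = (float)clock() / CLOCKS_PER_SEC;
        size_t check = 0;
        for (size_t repeat = 0; repeat < repeats; ++repeat) {
            char* ps[allocs];
            for (size_t i = 0; i < allocs; i++) {
                ps[i] = malloc(size);
                if (!ps[i]) {
                    exit(1);
                }
                check += (uintptr_t)ps[i];  /* use the pointer, not the memory */
            }
            for (size_t i = 0; i < allocs; i++) {
                free(ps[i]);
            }
        }
        float dt = (float)clock() / CLOCKS_PER_SEC - t0;
        printf("%zu\t%1.5f\t(check %zu)\n", size, dt / allocs / repeats * 1000, check);
    }
}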
