Could malloc be a bottleneck in my program?
My program calls malloc
10'000 times a second. I have absolutely no idea how long a malloc call takes.
- As long as an uncontested mutex lock? (10-100 ns)
- As long as compressing 1kb of data? (1-10 us)
- As long as an SSD random read? (100-1000 us)
- As long as a network transfer to Amsterdam? (10-100ms)
Instead of spending two hours to investigate this, only to find out that it is absolutely dwarfed by some other thing my program does, I would like to get a rough idea of what to expect. Ballpark. Not precise. Off by factor 10 does not matter at all.
The following picture was updooted 200 times here:
To state the obvious first: profiling for specific use cases is always required. However, this question asked for a rough general ballpark approximation guesstimate of the order of magnitude. That's something we do when we don't know if we should even think about a problem. Do I need to worry about my data being in cache when it is then sent to Amsterdam? Looking at the picture in the question, the answer is a resounding No. Yes, it could be a problem, but only if you messed up big. We assume that case to be ruled out and instead discuss the problem in probabilistic generality.
It may be ironic that the question arose when I was working on a program that cares very much about small details, where a performance difference of a few percent translates into millions of CPU hours. Profiling suggested malloc was not an issue, but before dismissing it outright, I wanted to sanity check: Is it theoretically plausible that malloc is a bottleneck?
As was repeatedly suggested in an earlier, closed version of this question, there are large differences between environments.
I tried various machines (Intel: i7 8700K, i5 5670, some early-gen mobile i7 in a laptop; AMD: Ryzen 4300G, Ryzen 3900X), various OSes (Windows 10, Debian, Ubuntu) and compilers (gcc, clang-14, cygwin-g++, msvc; no debug builds).
I've used this to get an idea about the characteristics(*), using just 1 thread:
The variance is stark, but, as expected, the values still belong to the same ballpark.
The following table is representative; the others were off by less than a factor of 10. This one is from a 3900X on cygwin-g++. You can clearly see the effect of the larger CPU cache, and after that, the higher memory throughput.
So what gives?
With small chunk sizes, >= 10 million calls per second are possible even on old commodity hardware.
Once sizes go beyond the CPU cache, i.e. around 1 to 100 MB, RAM access quickly dominates the cost (I did not test malloc without actually using the chunks).
Depending on what sizes you malloc, one or the other will be the (ballpark) limit.
However, with something like 10k allocs per second, this is something you can likely ignore for the time being.