Windows memory allocation questions
I am currently looking into the malloc() implementation under Windows, and in my research I have stumbled upon a few things that puzzle me:
First, I know that at the API level, Windows mostly uses the HeapAlloc() and VirtualAlloc() calls to allocate memory. I gather from what I have read that the Microsoft implementation of malloc() (the one included in the CRT, the C runtime) basically calls HeapAlloc() for blocks larger than 480 bytes and otherwise manages a special area allocated with VirtualAlloc() for small allocations, in order to prevent fragmentation.
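To make the dispatch described above concrete, here is a minimal sketch in portable C. It only illustrates the size-threshold idea and is not the actual CRT code. The 480-byte threshold comes from the description above; sketch_malloc, sketch_free, small_alloc, is_small_block and the 64 KiB arena size are hypothetical names and values. Standard malloc() stands in for HeapAlloc(), and a static array stands in for a region obtained with VirtualAlloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Threshold taken from the description above; everything else is hypothetical. */
#define SMALL_BLOCK_THRESHOLD 480u
#define SMALL_ARENA_SIZE (64u * 1024u)

/* Stand-in for a region a real runtime would obtain with VirtualAlloc().
 * The extra union members only force a suitable base alignment. */
static union {
    unsigned char bytes[SMALL_ARENA_SIZE];
    long double align_ld;
    long long align_ll;
    void *align_ptr;
} small_arena;
static size_t small_arena_used = 0;

/* Bump-allocate from the small arena; returns NULL once it is exhausted. */
static void *small_alloc(size_t size)
{
    size_t rounded = (size + 15u) & ~(size_t)15u; /* round up to a multiple of 16 */
    if (rounded == 0)
        rounded = 16;                             /* give size 0 a distinct block */
    assert(small_arena_used % 16 == 0);
    if (rounded > SMALL_ARENA_SIZE - small_arena_used)
        return NULL;
    void *p = small_arena.bytes + small_arena_used;
    small_arena_used += rounded;
    return p;
}

/* Returns nonzero if p points into the small arena. */
static int is_small_block(const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t base = (uintptr_t)small_arena.bytes;
    return addr >= base && addr < base + SMALL_ARENA_SIZE;
}

/* Route a request by size: large blocks go to the general-purpose heap,
 * small blocks are carved out of the dedicated arena. */
void *sketch_malloc(size_t size)
{
    if (size > SMALL_BLOCK_THRESHOLD)
        return malloc(size);         /* stand-in for HeapAlloc() */
    void *p = small_alloc(size);
    return p ? p : malloc(size);     /* fall back when the arena is full */
}

/* Small blocks are never reused in this sketch; only heap blocks are released. */
void sketch_free(void *p)
{
    if (p != NULL && !is_small_block(p))
        free(p);
}
```

With this sketch, sketch_malloc(16) returns a pointer inside the arena, while sketch_malloc(1000) is forwarded to the general-purpose heap. A real small-block allocator would also recycle freed small blocks, which this sketch deliberately leaves out.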
That is all well and good. But then there are other implementations of malloc(), for instance nedmalloc, which claims to be up to 125% faster than Microsoft's malloc.
All this makes me wonder a few things:
Why can't we just call HeapAlloc() for small blocks? Does it perform poorly with regard to fragmentation (for example by doing "first-fit" instead of "best-fit")?
- Actually, is there any way to know what is going on under the hood of the various API allocation calls? That would be quite helpful.
What makes nedmalloc so much faster than Microsoft's malloc?
- From the above, I got the impression that HeapAlloc()/VirtualAlloc() are so slow that it is much faster for malloc() to call them only once in a while and then manage the allocated memory itself. Is that assumption true? Or is the malloc() "wrapper" only needed because of fragmentation? One would think that system calls like this would be quick, or at least that some thought would have been put into making them efficient.
- If it is true, why is it so?
On average, how many memory reads/writes (to an order of magnitude) does a typical malloc call perform (probably as a function of the number of already allocated segments)? Intuitively I would say it is in the tens for an average program; am I right?
Comments (2)
The OS-level system calls are designed and optimized for managing the entire memory space of processes. Using them to allocate 4 bytes for an integer is indeed suboptimal - you get overall better performance and memory usage by managing tiny allocations in library code, and letting the OS optimize for larger allocations. At least as far as I understand it.
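As a rough illustration of managing tiny allocations in library code, here is a minimal fixed-size pool sketch in C. It is only a sketch under stated assumptions, not how any particular runtime does it: pool_alloc, pool_free, refill, the 32-byte block size and the 1024 blocks per chunk are all hypothetical, and standard malloc() stands in for the lower-level allocation call. The backend_calls counter shows how rarely the lower layer is reached.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_SIZE 32u            /* hypothetical fixed block size */
#define BLOCKS_PER_CHUNK 1024u    /* hypothetical chunk granularity */

/* A free block stores the link to the next free block inside itself. */
typedef union block {
    union block *next;                  /* meaningful only while the block is free */
    unsigned char payload[BLOCK_SIZE];  /* user data while the block is allocated */
} block_t;

static block_t *free_list = NULL;
static unsigned long backend_calls = 0; /* successful requests to the lower layer */

/* Fetch one large chunk from the lower layer and thread it onto the free list. */
static int refill(void)
{
    /* malloc() is a stand-in for a call such as HeapAlloc() or VirtualAlloc(). */
    block_t *chunk = malloc(sizeof(block_t) * BLOCKS_PER_CHUNK);
    if (chunk == NULL)
        return 0;
    backend_calls++;
    for (size_t i = 0; i < BLOCKS_PER_CHUNK; i++) {
        chunk[i].next = free_list;
        free_list = &chunk[i];
    }
    return 1;
}

/* Fast path: pop the head of the free list, with no call into the lower layer. */
void *pool_alloc(void)
{
    if (free_list == NULL && !refill())
        return NULL;
    block_t *b = free_list;
    assert(b != NULL);
    free_list = b->next;
    return b;
}

/* Push the block back onto the free list. Chunks are never returned
 * to the lower layer in this sketch. */
void pool_free(void *p)
{
    block_t *b = p;
    if (b == NULL)
        return;
    b->next = free_list;
    free_list = b;
}
```

In this sketch, 2000 calls to pool_alloc() reach the lower layer only twice, and the common path touches just a couple of words of memory. A general-purpose allocator has to handle many sizes, threads and coalescing, so its real cost per call is higher than this.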