I'm working on a program that will be processing files that could potentially be 100GB or more in size. The files contain sets of variable length records. I've got a first implementation up and running and am now looking towards improving performance, particularly at doing I/O more efficiently since the input file gets scanned many times.
Is there a rule of thumb for using mmap() versus reading in blocks via C++'s fstream library? What I'd like to do is read large blocks from disk into a buffer, process complete records from the buffer, and then read more.
The mmap() code could potentially get very messy since mmap'd blocks need to lie on page-sized boundaries (my understanding) and records could potentially lie across page boundaries. With fstreams, I can just seek to the start of a record and begin reading again, since we're not limited to reading blocks that lie on page-sized boundaries.
How can I decide between these two options without actually writing up a complete implementation first? Any rules of thumb (e.g., mmap() is 2x faster) or simple tests?
I have conducted tests comparing the access speed of "map vs read" 25 years ago (Windows only) and again today in 2023 (on Windows 11 AMD Ryzen Threadripper 3970X and macOS with an M1-Max chip, all with fast SSD NVMe). In all cases, I was solely interested in sequential access, which was the focus of my C++ benchmarks (OS API calls).
In every test, I took great care to thoroughly flush the system cache to ensure accurate results. On a Mac, I used the command "sudo purge" and on Windows, I utilized the RAMMap64.exe application with the "Empty Standby List" option before running each benchmark.
My findings remain consistent: utilizing file memory mapping is significantly slower, much to my dismay. It is 5 times slower on Windows and 7 times slower on macOS.
Moreover, on macOS, the reading speed is 4 times faster than on Windows, and memory mapping is 3 times faster. This is unfortunate for me as I spend most of my time on Windows.
Interestingly, when I don’t flush the system cache and rerun the benchmarks, the disparity between mapping and reading is considerably reduced, though reading still remains faster by approximately 30%.
In conclusion, it is imperative to conduct benchmarks that accurately reflect your specific requirements on the operating system of your choice. Additionally, do not overlook the importance of flushing the system cache prior to each test. Based on these benchmarks, draw your own conclusions regarding the best method for your needs.
I think the greatest thing about mmap is the potential for asynchronous reading.
The problem is that I can't find the right MAP_FLAGS to give a hint that this memory should be synced from the file asap.
I hope that MAP_POPULATE gives the right hint for mmap (i.e., it will not try to load all contents before returning from the call, but will do that asynchronously with feed_data). At least it gives better results with this flag, even though the manual states that it does nothing without MAP_PRIVATE since 2.6.23.
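For illustration, a minimal sketch of the kind of mapping being discussed, assuming Linux (MAP_POPULATE and the madvise hints are documented in mmap(2)/madvise(2); the file name is a placeholder):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int fd = open("data.bin", O_RDONLY);  // placeholder input file
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        // MAP_POPULATE (Linux-specific) asks the kernel to prefault the pages
        // up front; how asynchronously that happens is up to the kernel, as
        // the comment above notes.
        void* p = mmap(nullptr, st.st_size, PROT_READ,
                       MAP_PRIVATE | MAP_POPULATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        // madvise is the other common way to hint the access pattern:
        madvise(p, st.st_size, MADV_SEQUENTIAL);  // or MADV_WILLNEED

        // ... process the bytes at p ...

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }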
This sounds like a good use-case for multi-threading... I'd think you could pretty easily setup one thread to be reading data while the other(s) process it. That may be a way to dramatically increase the perceived performance. Just a thought.
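As a rough sketch of that idea (the chunk size, queue, and file name below are all illustrative, not from the question), one thread reads fixed-size chunks while a worker consumes them:

    #include <condition_variable>
    #include <fstream>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    using Chunk = std::vector<char>;

    std::queue<Chunk> chunks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void reader(const char* path) {
        std::ifstream in(path, std::ios::binary);
        while (in) {
            Chunk c(1 << 20);  // 1 MiB per chunk (arbitrary)
            in.read(c.data(), c.size());
            c.resize(static_cast<size_t>(in.gcount()));
            if (c.empty()) break;
            { std::lock_guard<std::mutex> lk(m); chunks.push(std::move(c)); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
    }

    void worker() {
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [] { return !chunks.empty() || done; });
            if (chunks.empty()) return;  // reader finished and queue drained
            Chunk c = std::move(chunks.front());
            chunks.pop();
            lk.unlock();
            // ... process the complete records found in c; records spanning
            // a chunk boundary would need extra stitching logic ...
        }
    }

    int main() {
        std::thread r(reader, "data.bin");  // placeholder file name
        std::thread w(worker);
        r.join();
        w.join();
        return 0;
    }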
To my mind, using mmap() "just" unburdens the developer from having to write their own caching code. In a simple "read through file exactly once" case, this isn't going to be hard (although as mlbrock points out you still save the memory copy into process space), but if you're going back and forth in the file or skipping bits and so forth, I believe the kernel developers have probably done a better job implementing caching than I can...
I remember mapping a huge file containing a tree structure into memory years ago. I was amazed by the speed compared to normal de-serialization which involves lot of work in memory, like allocating tree nodes and setting pointers.
So in fact I was comparing a single call to mmap (or its counterpart on Windows) against many (MANY) calls to operator new and to constructors.
For such a task, mmap is unbeatable compared to de-serialization.
Of course one should look into Boost's relocatable pointers for this.
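A hedged illustration of the Boost facility being alluded to, boost::interprocess::offset_ptr: it stores a self-relative offset rather than an absolute address, so a tree laid out inside a mapped file stays valid even when a later run maps the file at a different base address. The node layout here is illustrative:

    #include <boost/interprocess/offset_ptr.hpp>

    // A node written directly into the mapped region. offset_ptr records
    // "where the target lives relative to me", so no pointer fix-up (and
    // no per-node operator new) is needed after the file is mapped again.
    struct Node {
        int key;
        boost::interprocess::offset_ptr<Node> left;
        boost::interprocess::offset_ptr<Node> right;
    };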
I agree that mmap'd file I/O is going to be faster, but while you're benchmarking the code, shouldn't the counter-example be somewhat optimized?
Ben Collins wrote:
I would suggest also trying:
And beyond that, you might also try making the buffer size the same size as one page of virtual memory, in case 0x1000 is not the size of one page of virtual memory on your machine... IMHO mmap'd file I/O still wins, but this should make things closer.
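As a sketch of that suggestion (the file name is a placeholder; sysconf(_SC_PAGESIZE) is the portable POSIX way to query the page size instead of hard-coding 0x1000):

    #include <fcntl.h>
    #include <unistd.h>
    #include <vector>

    int main() {
        int fd = open("data.bin", O_RDONLY);      // placeholder file name
        if (fd < 0) return 1;
        const long page = sysconf(_SC_PAGESIZE);  // often 4096, but not always
        std::vector<char> buf(static_cast<size_t>(page));
        ssize_t n;
        while ((n = read(fd, buf.data(), buf.size())) > 0) {
            // ... scan the n bytes just read for complete records ...
        }
        close(fd);
        return 0;
    }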
Perhaps you should pre-process the files, so that each record is in a separate file (or at least so that each file is of an mmap-able size).
Also could you do all of the processing steps for each record, before moving onto the next one? Maybe that would avoid some of the IO overhead?
mmap should be faster, but I don't know how much. It very much depends on your code. If you use mmap it's best to mmap the whole file at once; that will make your life a lot easier. One potential problem is that if your file is bigger than 4GB (or in practice the limit is lower, often 2GB) you will need a 64-bit architecture. So if you're using a 32-bit environment, you probably don't want to use it.
Having said that, there may be a better route to improving performance. You said the input file gets scanned many times; if you can read it out in one pass and then be done with it, that could potentially be much faster.
I'm sorry Ben Collins lost his sliding windows mmap source code. That'd be nice to have in Boost.
Yes, mapping the file is much faster. You're essentially using the OS virtual memory subsystem to associate memory-to-disk and vice versa. Think about it this way: if the OS kernel developers could make it faster, they would. Because doing so makes just about everything faster: databases, boot times, program load times, et cetera.
The sliding window approach really isn't that difficult, as multiple contiguous pages can be mapped at once. So the size of the record doesn't matter, so long as the largest of any single record will fit into memory. The important thing is managing the book-keeping.
If a record doesn't begin on a getpagesize() boundary, your mapping has to begin on the previous page. The region mapped extends from the first byte of the record (rounded down if necessary to the nearest multiple of getpagesize()) to the last byte of the record (rounded up to the nearest multiple of getpagesize()). When you're finished processing a record, you can munmap() it and move on to the next, as in the sketch below.
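A sketch of that bookkeeping, assuming you already know each record's byte offset and length (map_record and its parameters are illustrative names, not from the answer):

    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstddef>

    // Map just the pages covering [offset, offset + length) of fd. Returns a
    // pointer to the record itself; *map_base and *map_len are what must be
    // handed to munmap() afterwards.
    const char* map_record(int fd, off_t offset, size_t length,
                           void** map_base, size_t* map_len) {
        const long page = sysconf(_SC_PAGESIZE);   // or getpagesize()
        off_t start = offset - (offset % page);    // round down to a page
        off_t end   = offset + static_cast<off_t>(length);
        end += (page - end % page) % page;         // round up to a page
        *map_len  = static_cast<size_t>(end - start);
        *map_base = mmap(nullptr, *map_len, PROT_READ, MAP_PRIVATE, fd, start);
        if (*map_base == MAP_FAILED) return nullptr;
        return static_cast<const char*>(*map_base) + (offset - start);
    }
    // After processing: munmap(*map_base, *map_len); then map the next record.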
This all works just fine under Windows too using CreateFileMapping() and MapViewOfFile() (and GetSystemInfo() to get SYSTEM_INFO.dwAllocationGranularity --- not SYSTEM_INFO.dwPageSize).
mmap is way faster. You might write a simple benchmark to prove it to yourself:
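A loop along these lines fits the description (file.bin is a placeholder; the 0x1000 buffer size is the one discussed in the comments on this answer):

    #include <fstream>

    int main() {
        char data[0x1000];
        std::ifstream in("file.bin", std::ios::binary);  // placeholder name
        while (in) {
            in.read(data, sizeof(data));
            // ... do something with the in.gcount() bytes in data ...
        }
        return 0;
    }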
versus:
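For the mmap side, a page-at-a-time loop such as the following (again a sketch; the flags and file name are assumptions):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main() {
        const off_t page_size = 0x1000;       // assumed page size
        int fd = open("file.bin", O_RDONLY);  // placeholder name
        if (fd < 0) return 1;
        struct stat st;
        if (fstat(fd, &st) != 0) return 1;
        for (off_t off = 0; off < st.st_size; off += page_size) {
            void* data = mmap(nullptr, page_size, PROT_READ, MAP_PRIVATE, fd, off);
            if (data == MAP_FAILED) return 1;
            // ... do something with the (up to) page_size bytes at data ...
            munmap(data, page_size);
        }
        close(fd);
        return 0;
    }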
Clearly, I'm leaving out details (like how to determine when you reach the end of the file in the event that your file isn't a multiple of page_size, for instance), but it really shouldn't be much more complicated than this.
If you can, you might try to break up your data into multiple files that can be mmap()-ed in whole instead of in part (much simpler).
A couple of months ago I had a half-baked implementation of a sliding-window mmap()-ed stream class for boost_iostreams, but nobody cared and I got busy with other stuff. Most unfortunately, I deleted an archive of old unfinished projects a few weeks ago, and that was one of the victims :-(
Update: I should also add the caveat that this benchmark would look quite different in Windows because Microsoft implemented a nifty file cache that does most of what you would do with mmap in the first place. I.e., for frequently-accessed files, you could just do std::ifstream.read() and it would be as fast as mmap, because the file cache would have already done a memory-mapping for you, and it's transparent.
Final Update: Look, people: across a lot of different platform combinations of OS and standard libraries and disks and memory hierarchies, I can't say for certain that the system call mmap, viewed as a black box, will always always always be substantially faster than read. That wasn't exactly my intent, even if my words could be construed that way. Ultimately, my point was that memory-mapped i/o is generally faster than byte-based i/o; this is still true. If you find experimentally that there's no difference between the two, then the only explanation that seems reasonable to me is that your platform implements memory-mapping under the covers in a way that is advantageous to the performance of calls to read. The only way to be absolutely certain that you're using memory-mapped i/o in a portable way is to use mmap. If you don't care about portability and you can rely on the particular characteristics of your target platforms, then using read may be suitable without measurably sacrificing any performance.
Edit to clean up answer list:
@jbl:
Sure - I was writing a C++ library for Git (a libgit++, if you will), and I ran into a similar problem to this: I needed to be able to open large (very large) files and not have performance be a total dog (as it would be with std::fstream).
Boost::Iostreams already has a mapped_file Source, but the problem was that it was mmapping whole files, which limits you to 2^(wordsize). On 32-bit machines, 4GB isn't big enough. It's not unreasonable to expect to have .pack files in Git that become much larger than that, so I needed to read the file in chunks without resorting to regular file i/o. Under the covers of Boost::Iostreams, I implemented a Source, which is more or less another view of the interaction between std::streambuf and std::istream. You could also try a similar approach by just inheriting std::filebuf into a mapped_filebuf and, similarly, inheriting std::fstream into a mapped_fstream. It's the interaction between the two that's difficult to get right. Boost::Iostreams has some of the work done for you, and it also provides hooks for filters and chains, so I thought it would be more useful to implement it that way.
The main performance cost is going to be disk i/o. "mmap()" is certainly quicker than istream, but the difference might not be noticeable because the disk i/o will dominate your run-times.
I tried Ben Collins's code fragment (see above/below) to test his assertion that "mmap() is way faster" and found no measurable difference. See my comments on his answer.
I would certainly not recommend separately mmap'ing each record in turn unless your "records" are huge - that would be horribly slow, requiring 2 system calls for each record and possibly losing the page out of the disk-memory cache.....
In your case I think mmap(), istream and the low-level open()/read() calls will all be about the same. I would recommend mmap() in these cases:
(btw - I love mmap()/MapViewOfFile()).
There are lots of good answers here already that cover many of the salient points, so I'll just add a couple of issues I didn't see addressed directly above. That is, this answer shouldn't be considered a comprehensive treatment of the pros and cons, but rather an addendum to other answers here.
mmap seems like magic
Taking the case where the file is already fully cached1 as the baseline2, mmap might seem pretty much like magic:
- mmap only requires 1 system call to (potentially) map the entire file, after which no more system calls are needed.
- mmap doesn't require a copy of the file data from kernel to user-space.
- mmap allows you to access the file "as memory", including processing it with whatever advanced tricks you can do against memory, such as compiler auto-vectorization, SIMD intrinsics, prefetching, optimized in-memory parsing routines, OpenMP, etc.
In the case that the file is already in the cache, it seems impossible to beat: you just directly access the kernel page cache as memory, and it can't get faster than that.
Well, it can.
mmap is not actually magic because...
mmap still does per-page work
A primary hidden cost of mmap vs read(2) (which is really the comparable OS-level syscall for reading blocks) is that with mmap you'll need to do "some work" for every 4K page accessed in a new mapping, even though it might be hidden by the page-fault mechanism.
For example, a typical implementation that just mmaps the entire file will need to fault in 100 GB / 4K = 25 million faults to read a 100 GB file. Now, these will be minor faults, but 25 million page faults is still not going to be super fast. The cost of a minor fault is probably in the 100s of nanos in the best case.
mmap relies heavily on TLB performance
Now, you can pass MAP_POPULATE to mmap to tell it to set up all the page tables before returning, so there should be no page faults while accessing it. Now, this has the little problem that it also reads the entire file into RAM, which is going to blow up if you try to map a 100GB file - but let's ignore that for now3. The kernel needs to do per-page work to set up these page tables (shows up as kernel time). This ends up being a major cost in the mmap approach, and it's proportional to the file size (i.e., it doesn't get relatively less important as the file size grows)4.
Finally, even in user-space, accessing such a mapping isn't exactly free (compared to large memory buffers not originating from a file-based mmap) - even once the page tables are set up, each access to a new page is going to, conceptually, incur a TLB miss. Since mmapping a file means using the page cache and its 4K pages, you again incur this cost 25 million times for a 100GB file.
Now, the actual cost of these TLB misses depends heavily on at least the following aspects of your hardware: (a) how many 4K TLB entries you have and how well the rest of the translation caching performs, (b) how well hardware prefetch deals with the TLB - e.g., can a prefetch trigger a page walk? - and (c) how fast and how parallel the page-walking hardware is. On modern high-end x86 Intel processors, the page-walking hardware is in general very strong: there are at least 2 parallel page walkers, a page walk can occur concurrently with continued execution, and hardware prefetching can trigger a page walk. So the TLB impact on a streaming read load is fairly low - and such a load will often perform similarly regardless of the page size. Other hardware is usually much worse, however!
read() avoids these pitfalls
The read() syscall, which is what generally underlies the "block read" type calls offered, e.g., in C, C++ and other languages, has one primary disadvantage that everyone is well aware of: every read() call of N bytes must copy N bytes from kernel to user space.
On the other hand, it avoids most of the costs above - you don't need to map 25 million 4K pages into user space. You can usually malloc a single small buffer in user space, and re-use that repeatedly for all your read calls. On the kernel side, there is almost no issue with 4K pages or TLB misses because all of RAM is usually linearly mapped using a few very large pages (e.g., 1 GB pages on x86), so the underlying pages in the page cache are covered very efficiently in kernel space.
So basically you have the following comparison to determine which is faster for a single read of a large file:
Is the extra per-page work implied by the mmap approach more costly than the per-byte work of copying file contents from kernel to user space implied by using read()?
On many systems, they are actually approximately balanced. Note that each one scales with completely different attributes of the hardware and OS stack.
In particular, the mmap approach becomes relatively faster when the OS has a MAP_POPULATE implementation which can efficiently process large maps in cases where, for example, the underlying pages are contiguous in physical memory.
... while the read() approach becomes relatively faster when the read() syscall has good copy performance, e.g., good copy_to_user performance on the kernel side.
The hardware factors above vary wildly across different platforms, even within the same family (e.g., within x86 generations and especially market segments) and definitely across architectures (e.g., ARM vs x86 vs PPC).
The OS factors keep changing as well, with various improvements on both sides causing a large jump in the relative speed for one approach or the other. A recent list includes:
- Addition of fault-around in Linux, which really helps the mmap case without MAP_POPULATE.
- Addition of fast-path copy_to_user methods in arch/x86/lib/copy_user_64.S, e.g., using REP MOVQ when it is fast, which really helps the read() case.
Update after Spectre and Meltdown
The mitigations for the Spectre and Meltdown vulnerabilities considerably increased the cost of a system call. On the systems I've measured, the cost of a "do nothing" system call (which is an estimate of the pure overhead of the system call, apart from any actual work done by the call) went from about 100 ns on a typical modern Linux system to about 700 ns. Furthermore, depending on your system, the page-table isolation fix specifically for Meltdown can have additional downstream effects apart from the direct system call cost due to the need to reload TLB entries.
All of this is a relative disadvantage for read()-based methods as compared to mmap-based methods, since read() methods must make one system call for each "buffer size" worth of data. You can't arbitrarily increase the buffer size to amortize this cost, since using large buffers usually performs worse: you exceed the L1 size and hence are constantly suffering cache misses.
On the other hand, with mmap, you can map in a large region of memory with MAP_POPULATE and access it efficiently, at the cost of only a single system call.
1 This more-or-less also includes the case where the file wasn't fully cached to start with, but where the OS read-ahead is good enough to make it appear so (i.e., the page is usually cached by the time you want it). This is a subtle issue though, because the way read-ahead works is often quite different between mmap and read calls, and can be further adjusted by "advise" calls, as described in 2.
2 ... because if the file is not cached, your behavior is going to be completely dominated by IO concerns, including how sympathetic your access pattern is to the underlying hardware - and all your effort should be in ensuring such access is as sympathetic as possible, e.g. via use of madvise or fadvise calls (and whatever application-level changes you can make to improve access patterns).
3 You could get around that, for example, by sequentially mmaping in windows of a smaller size, say 100 MB.
4 In fact, it turns out the MAP_POPULATE approach is (at least on some hardware/OS combinations) only slightly faster than not using it, probably because the kernel is using faultaround - so the actual number of minor faults is reduced by a factor of 16 or so.
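To make footnote 3 concrete, a hedged sketch of the windowed approach (the file name and 100 MB window are placeholders; MAP_POPULATE is Linux-specific):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main() {
        const off_t window = 100LL << 20;      // 100 MB; a multiple of the page size
        int fd = open("big.file", O_RDONLY);   // placeholder file name
        if (fd < 0) return 1;
        struct stat st;
        if (fstat(fd, &st) != 0) return 1;
        for (off_t off = 0; off < st.st_size; off += window) {
            size_t len = static_cast<size_t>(
                (st.st_size - off < window) ? (st.st_size - off) : window);
            void* p = mmap(nullptr, len, PROT_READ,
                           MAP_PRIVATE | MAP_POPULATE, fd, off);
            if (p == MAP_FAILED) return 1;
            // ... process this window; records spanning a window boundary
            // need extra care ...
            munmap(p, len);
        }
        close(fd);
        return 0;
    }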
I was trying to find the final word on mmap / read performance on Linux and I came across a nice post (link) on the Linux kernel mailing list. It's from 2000, so there have been many improvements to IO and virtual memory in the kernel since then, but it nicely explains the reason why mmap or read might be faster or slower.
mmap has more overhead than read (just like epoll has more overhead than poll, which has more overhead than read). Changing virtual memory mappings is a quite expensive operation on some processors, for the same reasons that switching between different processes is expensive.
However, with read, your file may have been flushed from the cache ages ago. This does not apply if you use a file and immediately discard it. (If you try to mlock pages just to keep them in cache, you are trying to outsmart the disk cache, and this kind of foolery rarely helps system performance.)
The discussion of mmap/read reminds me of two other performance discussions:
Some Java programmers were shocked to discover that nonblocking I/O is often slower than blocking I/O, which made perfect sense if you know that nonblocking I/O requires making more syscalls.
Some other network programmers were shocked to learn that epoll is often slower than poll, which makes perfect sense if you know that managing epoll requires making more syscalls.
Conclusion: Use memory maps if you access data randomly, keep it around for a long time, or if you know you can share it with other processes (MAP_SHARED isn't very interesting if there is no actual sharing). Read files normally if you access data sequentially or discard it after reading. And if either method makes your program less complex, do that. For many real world cases there's no sure way to show one is faster without testing your actual application and NOT a benchmark.
(Sorry for necro'ing this question, but I was looking for an answer and this question kept coming up at the top of Google results.)