mmap() vs. reading blocks

Posted 2024-07-04 00:18:12

I'm working on a program that will be processing files that could potentially be 100GB or more in size. The files contain sets of variable length records. I've got a first implementation up and running and am now looking towards improving performance, particularly at doing I/O more efficiently since the input file gets scanned many times.

Is there a rule of thumb for using mmap() versus reading in blocks via C++'s fstream library? What I'd like to do is read large blocks from disk into a buffer, process complete records from the buffer, and then read more.

The mmap() code could potentially get very messy since mmap'd blocks need to lie on page sized boundaries (my understanding) and records could potentially lie across page boundaries. With fstreams, I can just seek to the start of a record and begin reading again, since we're not limited to reading blocks that lie on page sized boundaries.

How can I decide between these two options without actually writing up a complete implementation first? Any rules of thumb (e.g., mmap() is 2x faster) or simple tests?

13 Answers

生寂 2024-07-11 00:18:12

I have conducted tests comparing the access speed of "map vs read" 25 years ago (Windows only) and again today in 2023 (on Windows 11 AMD Ryzen Threadripper 3970X and macOS with an M1-Max chip, all with fast SSD NVMe). In all cases, I was solely interested in sequential access, which was the focus of my C++ benchmarks (OS API calls).

In every test, I took great care to thoroughly flush the system cache to ensure accurate results. On a Mac, I used the command "sudo purge" and on Windows, I utilized the RAMMap64.exe application with the "Empty Standby List" option before running each benchmark.

My findings remain consistent: utilizing file memory mapping is significantly slower, much to my dismay. It is 5 times slower on Windows and 7 times slower on macOS.
Moreover, on macOS, the reading speed is 4 times faster than on Windows, and memory mapping is 3 times faster. This is unfortunate for me as I spend most of my time on Windows.

Interestingly, when I don’t flush the system cache and rerun the benchmarks, the disparity between mapping and reading is considerably reduced, though reading still remains faster by approximately 30%.

In conclusion, it is imperative to conduct benchmarks that accurately reflect your specific requirements on the operating system of your choice. Additionally, do not overlook the importance of flushing the system cache prior to each test. Based on these benchmarks, draw your own conclusions regarding the best method for your needs.
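
For reference, a minimal sketch of the read() side of such a benchmark (POSIX + C++; the file path and chunk size are placeholders, and the cache still has to be flushed externally as described above):

    #include <chrono>
    #include <cstdio>
    #include <fcntl.h>
    #include <unistd.h>
    #include <vector>

    // Sequential read() throughput for one file; flush the OS cache
    // before each run (e.g. "sudo purge" / RAMMap64) as noted above.
    int main() {
        const char* path = "testfile.bin";   // placeholder
        std::vector<char> buf(1 << 20);      // 1 MiB per read() call

        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        size_t total = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (ssize_t n; (n = read(fd, buf.data(), buf.size())) > 0; )
            total += static_cast<size_t>(n);
        auto t1 = std::chrono::steady_clock::now();
        close(fd);

        double secs = std::chrono::duration<double>(t1 - t0).count();
        std::printf("%zu bytes in %.3f s (%.1f MiB/s)\n",
                    total, secs, total / secs / (1 << 20));
    }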

£烟消云散 2024-07-11 00:18:12

I think the greatest thing about mmap is potential for asynchronous reading with:

    /* Double-buffered sequential mmap: map window N+1, then process and
       unmap window N. fd is the open file, pos the current file offset. */
    addr1 = NULL;
    while (size_left > 0) {
        r = min(MMAP_SIZE, size_left);
        addr2 = mmap(NULL, r,
            PROT_READ, MAP_FLAGS,
            fd, pos);           /* fd, then the file offset */
        if (addr1 != NULL)
        {
            /* process mmap from prev cycle */
            feed_data(ctx, addr1, MMAP_SIZE);
            munmap(addr1, MMAP_SIZE);
        }
        addr1 = addr2;
        size_left -= r;
        pos += r;
    }
    feed_data(ctx, addr1, r);   /* final, possibly short, window */
    munmap(addr1, r);

The problem is that I can't find the right MAP_FLAGS to give a hint that this memory should be synced from the file as soon as possible.
I hope that MAP_POPULATE gives the right hint for mmap (i.e., it will not try to load all the contents before returning from the call, but will do so asynchronously, overlapping with feed_data). At least it gives better results with this flag, even though the manual states that it does nothing without MAP_PRIVATE since 2.6.23.
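
A sketch of the madvise() route, which is the usual way to request background readahead on a mapping rather than an mmap flag (assumes Linux; error handling elided):

    #include <sys/mman.h>
    #include <sys/types.h>

    /* Map a read-only window and ask the kernel to start fetching it in
       the background; fd, pos and len as in the loop above (pos must be
       page-aligned). */
    static void *map_with_readahead(int fd, off_t pos, size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, pos);
        if (p == MAP_FAILED)
            return NULL;
        madvise(p, len, MADV_SEQUENTIAL);  /* aggressive readahead, drop-behind */
        madvise(p, len, MADV_WILLNEED);    /* begin paging this range in now */
        return p;
    }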

浅忆流年 2024-07-11 00:18:12

This sounds like a good use-case for multi-threading... I'd think you could pretty easily set up one thread to be reading data while the other(s) process it. That may be a way to dramatically increase the perceived performance. Just a thought.
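
A sketch of that idea with simple double buffering (C++11 threads; process_chunk stands in for the real record handling):

    #include <cstddef>
    #include <fstream>
    #include <future>
    #include <vector>

    // Double buffering: one thread reads chunk N+1 while the caller
    // processes chunk N. process_chunk is a placeholder for real work.
    void pipeline(const char* path, std::size_t chunk,
                  void (*process_chunk)(const char*, std::size_t)) {
        std::ifstream in(path, std::ios::binary);
        std::vector<char> buf[2] = {std::vector<char>(chunk),
                                    std::vector<char>(chunk)};
        auto read_into = [&](int i) {
            in.read(buf[i].data(), chunk);
            return static_cast<std::size_t>(in.gcount());
        };

        std::size_t n = read_into(0);
        for (int cur = 0; n > 0; cur ^= 1) {
            // Start the next read, then work on the chunk we already have.
            auto next = std::async(std::launch::async, read_into, cur ^ 1);
            process_chunk(buf[cur].data(), n);
            n = next.get();
        }
    }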

人心善变 2024-07-11 00:18:12

To my mind, using mmap() "just" unburdens the developer from having to write their own caching code. In a simple "read through the file exactly once" case, this isn't going to be hard (although as mlbrock points out you still save the memory copy into process space), but if you're going back and forth in the file or skipping bits and so forth, I believe the kernel developers have probably done a better job implementing caching than I can...

聚集的泪 2024-07-11 00:18:12

I remember mapping a huge file containing a tree structure into memory years ago. I was amazed by the speed compared to normal de-serialization, which involves a lot of work in memory, like allocating tree nodes and setting pointers.
So in fact I was comparing a single call to mmap (or its counterpart on Windows) against many (MANY) calls to operator new and constructors.
For that kind of task, mmap is unbeatable compared to de-serialization.
Of course, one should look into Boost's relocatable pointers for this.
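
The trick behind such relocatable pointers (Boost.Interprocess's offset_ptr is the production version) is storing each target as a distance from the pointer itself, so the structure stays valid wherever the file maps; a hand-rolled sketch:

    #include <cstddef>

    // A self-relative ("relocatable") pointer: stores the distance from
    // this object to its target, so a tree laid out inside a mapped file
    // stays valid at whatever address the file gets mapped.
    template <typename T>
    struct offset_ptr {
        std::ptrdiff_t off = 0;   // 0 means null (a node never points at itself)

        void set(T* p) {
            off = p ? reinterpret_cast<char*>(p) -
                      reinterpret_cast<char*>(this)
                    : 0;
        }
        T* get() const {
            return off ? reinterpret_cast<T*>(const_cast<char*>(
                             reinterpret_cast<const char*>(this)) + off)
                       : nullptr;
        }
    };

    // A node stored directly in the mapped image:
    struct Node {
        int key;
        offset_ptr<Node> left, right;
    };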

盗琴音 2024-07-11 00:18:12

I agree that mmap'd file I/O is going to be faster, but while you're benchmarking the code, shouldn't the counter example be somewhat optimized?

Ben Collins wrote:

char data[0x1000];
std::ifstream in("file.bin");

while (in)
{
    in.read(data, 0x1000);
    // do something with data 
}

I would suggest also trying:

char data[0x1000];
std::ifstream ifile( "file.bin");
std::istream  in( ifile.rdbuf() );

while( in )
{
    in.read( data, 0x1000);
    // do something with data
}

And beyond that, you might also try making the buffer size the same size as one page of virtual memory, in case 0x1000 is not the size of one page of virtual memory on your machine... IMHO mmap'd file I/O still wins, but this should make things closer.
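
On POSIX systems the page size can be queried at runtime instead of guessed (GetSystemInfo() is the Windows counterpart); a sketch:

    #include <cstddef>
    #include <fstream>
    #include <unistd.h>
    #include <vector>

    int main() {
        // Size the buffer to exactly one VM page rather than assuming 0x1000.
        const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
        std::vector<char> data(page);

        std::ifstream ifile("file.bin", std::ios::binary);
        std::istream in(ifile.rdbuf());

        while (in) {
            in.read(data.data(), data.size());
            // do something with data.data(); in.gcount() bytes are valid
        }
    }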

忆离笙 2024-07-11 00:18:12

Perhaps you should pre-process the files, so each record is in a separate file (or at least so that each file is of an mmap-able size).

Also could you do all of the processing steps for each record, before moving onto the next one? Maybe that would avoid some of the IO overhead?

梦情居士 2024-07-11 00:18:12

mmap should be faster, but I don't know how much. It very much depends on your code. If you use mmap it's best to mmap the whole file at once, which will make your life a lot easier. One potential problem is that if your file is bigger than 4GB (or in practice the limit is lower, often 2GB) you will need a 64-bit architecture. So if you're using a 32-bit environment, you probably don't want to use it.

Having said that, there may be a better route to improving performance. You said the input file gets scanned many times, if you can read it out in one pass and then be done with it, that could potentially be much faster.

七色彩虹 2024-07-11 00:18:12

I'm sorry Ben Collins lost his sliding windows mmap source code. That'd be nice to have in Boost.

Yes, mapping the file is much faster. You're essentially using the OS virtual memory subsystem to associate memory-to-disk and vice versa. Think about it this way: if the OS kernel developers could make it faster, they would. Because doing so makes just about everything faster: databases, boot times, program load times, et cetera.

The sliding window approach really isn't that difficult as multiple contiguous pages can be mapped at once. So the size of the record doesn't matter so long as the largest of any single record will fit into memory. The important thing is managing the book-keeping.

If a record doesn't begin on a getpagesize() boundary, your mapping has to begin on the previous page. The length of the region mapped extends from the first byte of the record (rounded down if necessary to the nearest multiple of getpagesize()) to the last byte of the record (rounded up to the nearest multiple of getpagesize()). When you're finished processing a record, you can munmap() it, and move on to the next.
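
Assuming the page size is a power of two (true on common platforms), that book-keeping reduces to a pair of mask operations; a sketch with error handling elided:

    #include <sys/mman.h>
    #include <unistd.h>

    /* Map the pages covering bytes [rec_off, rec_off + rec_len) of fd and
       return a pointer to the record's first byte inside the mapping.
       Caller munmap()s *map_base / *map_len when done. */
    static char *map_record(int fd, off_t rec_off, size_t rec_len,
                            void **map_base, size_t *map_len)
    {
        const off_t page  = (off_t)getpagesize();
        const off_t start = rec_off & ~(page - 1);                 /* round down */
        const off_t end   = (rec_off + (off_t)rec_len + page - 1)
                            & ~(page - 1);                         /* round up   */

        *map_len  = (size_t)(end - start);
        *map_base = mmap(NULL, *map_len, PROT_READ, MAP_PRIVATE, fd, start);
        return (char *)*map_base + (rec_off - start);
    }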

This all works just fine under Windows too using CreateFileMapping() and MapViewOfFile() (and GetSystemInfo() to get SYSTEM_INFO.dwAllocationGranularity --- not SYSTEM_INFO.dwPageSize).

情丝乱 2024-07-11 00:18:12

mmap is way faster. You might write a simple benchmark to prove it to yourself:

#include <fstream>

char data[0x1000];
std::ifstream in("file.bin");

while (in)
{
  in.read(data, 0x1000);
  // do something with data
}

versus:

#include <fcntl.h>
#include <sys/mman.h>

const int file_size=something;
const int page_size=0x1000;
int off=0;
void *data;

int fd = open("filename.bin", O_RDONLY);

while (off < file_size)
{
  data = mmap(NULL, page_size, PROT_READ, MAP_PRIVATE, fd, off);
  // do stuff with data
  munmap(data, page_size);
  off += page_size;
}

Clearly, I'm leaving out details (like how to determine when you reach the end of the file in the event that your file isn't a multiple of page_size, for instance), but it really shouldn't be much more complicated than this.
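
One way to fill in the end-of-file detail, as a sketch: fstat() the descriptor once and clamp the final mapping's length (mmap accepts a length that is not a page multiple; the tail of the last page reads back as zeros):

    #include <algorithm>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main() {
        const off_t page_size = 0x1000;
        int fd = open("filename.bin", O_RDONLY);

        struct stat st;
        fstat(fd, &st);                    // exact file size up front
        const off_t file_size = st.st_size;

        for (off_t off = 0; off < file_size; off += page_size) {
            // Clamp the last chunk so we never map past end-of-file.
            size_t len = (size_t)std::min(page_size, file_size - off);
            void* data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
            // do stuff with data (len valid bytes)
            munmap(data, len);
        }
        close(fd);
    }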

If you can, you might try to break up your data into multiple files that can be mmap()-ed in whole instead of in part (much simpler).

A couple of months ago I had a half-baked implementation of a sliding-window mmap()-ed stream class for boost_iostreams, but nobody cared and I got busy with other stuff. Most unfortunately, I deleted an archive of old unfinished projects a few weeks ago, and that was one of the victims :-(

Update: I should also add the caveat that this benchmark would look quite different in Windows because Microsoft implemented a nifty file cache that does most of what you would do with mmap in the first place. I.e., for frequently-accessed files, you could just do std::ifstream.read() and it would be as fast as mmap, because the file cache would have already done a memory-mapping for you, and it's transparent.

Final Update: Look, people: across a lot of different platform combinations of OS and standard libraries and disks and memory hierarchies, I can't say for certain that the system call mmap, viewed as a black box, will always always always be substantially faster than read. That wasn't exactly my intent, even if my words could be construed that way. Ultimately, my point was that memory-mapped i/o is generally faster than byte-based i/o; this is still true. If you find experimentally that there's no difference between the two, then the only explanation that seems reasonable to me is that your platform implements memory-mapping under the covers in a way that is advantageous to the performance of calls to read. The only way to be absolutely certain that you're using memory-mapped i/o in a portable way is to use mmap. If you don't care about portability and you can rely on the particular characteristics of your target platforms, then using read may be suitable without sacrificing measurably any performance.

Edit to clean up answer list:
@jbl:

the sliding window mmap sounds
interesting. Can you say a little more
about it?

Sure - I was writing a C++ library for Git (a libgit++, if you will), and I ran into a similar problem to this: I needed to be able to open large (very large) files and not have performance be a total dog (as it would be with std::fstream).

Boost::Iostreams already has a mapped_file Source, but the problem was that it was mmapping whole files, which limits you to 2^(wordsize). On 32-bit machines, 4GB isn't big enough. It's not unreasonable to expect to have .pack files in Git that become much larger than that, so I needed to read the file in chunks without resorting to regular file i/o. Under the covers of Boost::Iostreams, I implemented a Source, which is more or less another view of the interaction between std::streambuf and std::istream. You could also try a similar approach by just inheriting std::filebuf into a mapped_filebuf and similarly, inheriting std::fstream into a mapped_fstream. It's the interaction between the two that's difficult to get right. Boost::Iostreams has some of the work done for you, and it also provides hooks for filters and chains, so I thought it would be more useful to implement it that way.

○愚か者の日 2024-07-11 00:18:12

The main performance cost is going to be disk i/o. "mmap()" is certainly quicker than istream, but the difference might not be noticeable because the disk i/o will dominate your run-times.

I tried Ben Collins's code fragment (see above/below) to test his assertion that "mmap() is way faster" and found no measurable difference. See my comments on his answer.

I would certainly not recommend separately mmap'ing each record in turn unless your "records" are huge - that would be horribly slow, requiring 2 system calls for each record and possibly losing the page out of the disk-memory cache.....

In your case I think mmap(), istream and the low-level open()/read() calls will all be about the same. I would recommend mmap() in these cases:

  1. There is random access (not sequential) within the file, AND
  2. the whole thing fits comfortably in memory OR there is locality-of-reference within the file so that certain pages can be mapped in and other pages mapped out. That way the operating system uses the available RAM to maximum benefit.
  3. OR if multiple processes are reading/working on the same file, then mmap() is fantastic because the processes all share the same physical pages.

(btw - I love mmap()/MapViewOfFile()).

盗琴音 2024-07-11 00:18:12

There are lots of good answers here already that cover many of the salient points, so I'll just add a couple of issues I didn't see addressed directly above. That is, this answer shouldn't be considered a comprehensive treatment of the pros and cons, but rather an addendum to other answers here.

mmap seems like magic

Taking the case where the file is already fully cached[1] as the baseline[2], mmap might seem pretty much like magic:

  1. mmap only requires 1 system call to (potentially) map the entire file, after which no more system calls are needed.
  2. mmap doesn't require a copy of the file data from kernel to user-space.
  3. mmap allows you to access the file "as memory", including processing it with whatever advanced tricks you can do against memory, such as compiler auto-vectorization, SIMD intrinsics, prefetching, optimized in-memory parsing routines, OpenMP, etc.

In the case that the file is already in the cache, it seems impossible to beat: you just directly access the kernel page cache as memory and it can't get faster than that.

Well, it can.

mmap is not actually magic because...

mmap still does per-page work

A primary hidden cost of mmap vs read(2) (which is really the comparable OS-level syscall for reading blocks) is that with mmap you'll need to do "some work" for every 4K page accessed in a new mapping, even though it might be hidden by the page-fault mechanism.

For example, a typical implementation that just mmaps the entire file will need to fault in some 100 GB / 4K = 25 million pages to read a 100 GB file. Now, these will be minor faults, but 25 million page faults is still not going to be super fast. The cost of a minor fault is probably in the 100s of nanoseconds in the best case.

mmap relies heavily on TLB performance

Now, you can pass MAP_POPULATE to mmap to tell it to set up all the page tables before returning, so there should be no page faults while accessing it. Now, this has the little problem that it also reads the entire file into RAM, which is going to blow up if you try to map a 100GB file - but let's ignore that for now[3]. The kernel needs to do per-page work to set up these page tables (shows up as kernel time). This ends up being a major cost in the mmap approach, and it's proportional to the file size (i.e., it doesn't get relatively less important as the file size grows)[4].

Finally, even in user-space accessing such a mapping isn't exactly free (compared to large memory buffers not originating from a file-based mmap) - even once the page tables are set up, each access to a new page is going to, conceptually, incur a TLB miss. Since mmaping a file means using the page cache and its 4K pages, you again incur this cost 25 million times for a 100GB file.

Now, the actual cost of these TLB misses depends heavily on at least the following aspects of your hardware: (a) how many 4K TLB entries you have and how the rest of the translation caching hierarchy performs (b) how well hardware prefetch deals with the TLB - e.g., can prefetch trigger a page walk? (c) how fast and how parallel the page walking hardware is. On modern high-end x86 Intel processors, the page walking hardware is in general very strong: there are at least 2 parallel page walkers, a page walk can occur concurrently with continued execution, and hardware prefetching can trigger a page walk. So the TLB impact on a streaming read load is fairly low - and such a load will often perform similarly regardless of the page size. Other hardware is usually much worse, however!

read() avoids these pitfalls

The read() syscall, which is what generally underlies the "block read" type calls offered e.g., in C, C++ and other languages has one primary disadvantage that everyone is well-aware of:

  • Every read() call of N bytes must copy N bytes from kernel to user space.

On the other hand, it avoids most of the costs above - you don't need to map 25 million 4K pages into user space. You can usually malloc a single small buffer in user space, and re-use that repeatedly for all your read calls. On the kernel side, there is almost no issue with 4K pages or TLB misses because all of RAM is usually linearly mapped using a few very large pages (e.g., 1 GB pages on x86), so the underlying pages in the page cache are covered very efficiently in kernel space.
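
That pattern, as a minimal sketch (the buffer size is arbitrary; the point is that the same small region is reused for every copy):

    #include <fcntl.h>
    #include <unistd.h>
    #include <vector>

    // One small user-space buffer reused for every read(): the kernel
    // copies into the same region each time, so its pages stay hot in
    // the cache and the TLB, at the price of one copy per chunk.
    void scan(const char* path, void (*consume)(const char*, size_t)) {
        std::vector<char> buf(256 * 1024);   // arbitrary size
        int fd = open(path, O_RDONLY);
        for (ssize_t n; (n = read(fd, buf.data(), buf.size())) > 0; )
            consume(buf.data(), static_cast<size_t>(n));
        close(fd);
    }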

So basically you have the following comparison to determine which is faster for a single read of a large file:

Is the extra per-page work implied by the mmap approach more costly than the per-byte work of copying file contents from kernel to user space implied by using read()?

On many systems, they are actually approximately balanced. Note that each one scales with completely different attributes of the hardware and OS stack.

In particular, the mmap approach becomes relatively faster when:

  • The OS has fast minor-fault handling and especially minor-fault bulking optimizations such as fault-around.
  • The OS has a good MAP_POPULATE implementation which can efficiently process large maps in cases where, for example, the underlying pages are contiguous in physical memory.
  • The hardware has strong page translation performance, such as large TLBs, fast second level TLBs, fast and parallel page-walkers, good prefetch interaction with translation and so on.

... while the read() approach becomes relatively faster when:

  • The read() syscall has good copy performance. E.g., good copy_to_user performance on the kernel side.
  • The kernel has an efficient (relative to userland) way to map memory, e.g., using only a few large pages with hardware support.
  • The kernel has fast syscalls and a way to keep kernel TLB entries around across syscalls.

The hardware factors above vary wildly across different platforms, even within the same family (e.g., within x86 generations and especially market segments) and definitely across architectures (e.g., ARM vs x86 vs PPC).

The OS factors keep changing as well, with various improvements on both sides causing a large jump in the relative speed for one approach or the other. A recent list includes:

  • Addition of fault-around, described above, which really helps the mmap case without MAP_POPULATE.
  • Addition of fast-path copy_to_user methods in arch/x86/lib/copy_user_64.S, e.g., using REP MOVQ when it is fast, which really help the read() case.

Update after Spectre and Meltdown

The mitigations for the Spectre and Meltdown vulnerabilities considerably increased the cost of a system call. On the systems I've measured, the cost of a "do nothing" system call (which is an estimate of the pure overhead of the system call, apart from any actual work done by the call) went from about 100 ns on a typical modern Linux system to about 700 ns. Furthermore, depending on your system, the page-table isolation fix specifically for Meltdown can have additional downstream effects apart from the direct system call cost due to the need to reload TLB entries.

All of this is a relative disadvantage for read() based methods as compared to mmap based methods, since read() methods must make one system call for each "buffer size" worth of data. You can't arbitrarily increase the buffer size to amortize this cost, because using large buffers usually performs worse: you exceed the L1 size and hence constantly suffer cache misses.

On the other hand, with mmap, you can map in a large region of memory with MAP_POPULATE and then access it efficiently, at the cost of only a single system call.
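
As a sketch (MAP_POPULATE is Linux-specific, and the offset must be page-aligned):

    #include <sys/mman.h>
    #include <sys/types.h>

    // Map and pre-fault 'len' bytes of 'fd' at page-aligned 'off' in one
    // syscall; subsequent accesses to the region take no syscalls at all.
    static void *map_populated(int fd, off_t off, size_t len)
    {
        return mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, off);
    }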


[1] This more-or-less also includes the case where the file wasn't fully cached to start with, but where the OS read-ahead is good enough to make it appear so (i.e., the page is usually cached by the time you want it). This is a subtle issue though because the way read-ahead works is often quite different between mmap and read calls, and can be further adjusted by "advise" calls as described in [2].

[2] ... because if the file is not cached, your behavior is going to be completely dominated by IO concerns, including how sympathetic your access pattern is to the underlying hardware - and all your effort should be in ensuring such access is as sympathetic as possible, e.g. via use of madvise or fadvise calls (and whatever application level changes you can make to improve access patterns).

[3] You could get around that, for example, by sequentially mmaping in windows of a smaller size, say 100 MB.

[4] In fact, it turns out the MAP_POPULATE approach is (at least on some hardware/OS combinations) only slightly faster than not using it, probably because the kernel is using fault-around - so the actual number of minor faults is reduced by a factor of 16 or so.

一笑百媚生 2024-07-11 00:18:12

I was trying to find the final word on mmap / read performance on Linux and I came across a nice post (link) on the Linux kernel mailing list. It's from 2000, so there have been many improvements to IO and virtual memory in the kernel since then, but it nicely explains the reason why mmap or read might be faster or slower.

  • A call to mmap has more overhead than read (just like epoll has more overhead than poll, which has more overhead than read). Changing virtual memory mappings is a quite expensive operation on some processors for the same reasons that switching between different processes is expensive.
  • The IO system can already use the disk cache, so if you read a file, you'll hit the cache or miss it no matter what method you use.

However,

  • Memory maps are generally faster for random access, especially if your access patterns are sparse and unpredictable.
  • Memory maps allow you to keep using pages from the cache until you are done. This means that if you use a file heavily for a long period of time, then close it and reopen it, the pages will still be cached. With read, your file may have been flushed from the cache ages ago. This does not apply if you use a file and immediately discard it. (If you try to mlock pages just to keep them in cache, you are trying to outsmart the disk cache and this kind of foolery rarely helps system performance).
  • Reading a file directly is very simple and fast.

The discussion of mmap/read reminds me of two other performance discussions:

  • Some Java programmers were shocked to discover that nonblocking I/O is often slower than blocking I/O, which made perfect sense if you know that nonblocking I/O requires making more syscalls.

  • Some other network programmers were shocked to learn that epoll is often slower than poll, which makes perfect sense if you know that managing epoll requires making more syscalls.

Conclusion: Use memory maps if you access data randomly, keep it around for a long time, or if you know you can share it with other processes (MAP_SHARED isn't very interesting if there is no actual sharing). Read files normally if you access data sequentially or discard it after reading. And if either method makes your program less complex, do that. For many real world cases there's no sure way to show one is faster without testing your actual application and NOT a benchmark.

(Sorry for necro'ing this question, but I was looking for an answer and this question kept coming up at the top of Google results.)
