Why is sequential, line-by-line reading of a large file with mmap and madvise slower than fgets?

Posted 2024-11-07 21:16:01

Overview

I have a program that is heavily IO-bound and am trying to speed it up.
Using mmap seemed like a good idea, but it actually degrades performance relative to just using a series of fgets calls.

Some demo code

I've squeezed the demos down to just the essentials, testing against an 800 MB file with about 3.5 million lines:

With fgets:

char buf[4096];
FILE * fp = fopen(argv[1], "r");
if (fp == NULL) return 0;

while (fgets(buf, 4096, fp) != NULL) {
    // do stuff
}
fclose(fp);
return 0;

Runtime for the 800 MB file:

[juhani@xtest tests]$ time ./readfile /r/40/13479/14960 

real    0m25.614s
user    0m0.192s
sys 0m0.124s

The mmap version:

struct stat finfo;
int fh;
char *mem, *row, *end;
size_t left;
if (stat(argv[1], &finfo) == -1) return 0;
if ((fh = open(argv[1], O_RDONLY)) == -1) return 0;

mem = (char *)mmap(NULL, finfo.st_size, PROT_READ, MAP_SHARED, fh, 0);
if (mem == MAP_FAILED) return 0;  /* not (char *)-1 */
/* POSIX_MADV_SEQUENTIAL is the posix_madvise() constant;
 * with plain madvise() the equivalent is MADV_SEQUENTIAL */
posix_madvise(mem, finfo.st_size, POSIX_MADV_SEQUENTIAL);
row = mem;
left = finfo.st_size;
/* memchr, not strchr: the mapping is not NUL-terminated */
while ((end = memchr(row, '\n', left)) != NULL) {
    // do stuff
    left -= (size_t)(end - row) + 1;
    row = end + 1;
}
munmap(mem, finfo.st_size);
close(fh);

Runtime varies quite a bit, but is never faster than fgets:

[juhani@xtest tests]$ time ./readfile_map /r/40/13479/14960

real    0m28.891s
user    0m0.252s
sys 0m0.732s
[juhani@xtest tests]$ time ./readfile_map /r/40/13479/14960

real    0m42.605s
user    0m0.144s
sys 0m0.472s

Other notes

  • Watching the process run in top, the mmapped version generated a few thousand page faults along the way.
  • CPU and memory usage are both very low for the fgets version.

Questions

  • Why is this the case? Is it just because the buffered file access implemented by fopen/fgets is better than the aggressive prefetching requested by mmap with madvise(POSIX_MADV_SEQUENTIAL)?
  • Is there an alternative method that might make this faster (other than on-the-fly compression/decompression to shift the IO load onto the processor)? Looking at the runtime of 'wc -l' on the same file, I'm guessing this might not be possible.


Comments (3)

一张白纸 2024-11-14 21:16:01

POSIX_MADV_SEQUENTIAL is only a hint to the system and may be completely ignored by a particular POSIX implementation.

The difference between your two solutions is that mmap requires the file to be mapped into the virtual address space entirely, whereas with fgets the IO is done entirely in kernel space and the kernel just copies the data into a buffer that doesn't change.

This also has more potential for overlap, since the IO is done by some kernel thread.

You could perhaps increase the perceived performance of the mmap implementation by having one (or more) independent threads read the first byte of each page. This thread (or these threads) would then take all the page faults, so by the time your application thread reaches a particular page it would already be loaded.
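The prefetch-thread idea above might be sketched like this (the struct and function names are mine, and real code would also check pthread_create's return value):

```c
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

/* Argument for the prefetch thread: the mapped region to touch. */
struct prefetch_arg {
    const char *mem;
    size_t len;
};

/* Touch the first byte of every page so the kernel faults the pages in
 * ahead of the parsing thread; the volatile read keeps the compiler
 * from optimizing the access away. */
void *prefetch(void *p) {
    const struct prefetch_arg *a = p;
    long page = sysconf(_SC_PAGESIZE);
    for (size_t off = 0; off < a->len; off += (size_t)page) {
        volatile char c = a->mem[off];
        (void)c;
    }
    return NULL;
}
```

Start it with `pthread_create(&tid, NULL, prefetch, &arg)` right after mmap and `pthread_join` after the parsing loop; if the prefetcher runs too far ahead on a memory-constrained machine, it can evict pages the parser has not reached yet.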

若水般的淡然安静女子 2024-11-14 21:16:01

Reading the man pages of mmap reveals that the page faults could be prevented by adding MAP_POPULATE to mmap's flags:

MAP_POPULATE (since Linux 2.5.46): Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.

This way a page faulting pre-load thread (as suggested by Jens) will become obsolete.

Edit:
First of all, the benchmarks you run should be done with the page cache flushed to get meaningful results:

    echo 3 | sudo tee /proc/sys/vm/drop_caches

Additionally: the MADV_WILLNEED advice with madvise will pre-fault the required pages in (much like POSIX_FADV_WILLNEED with posix_fadvise). Currently, unfortunately, these calls block until the requested pages are faulted in, even if the documentation says otherwise. But there are kernel patches underway which queue the pre-fault requests into a kernel work queue to make these calls asynchronous, as one would expect, making a separate read-ahead user-space thread obsolete.
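A sketch of mapping with MAP_POPULATE as described in the quoted man page (the helper name is mine; the flag is Linux-specific, and error handling is abbreviated):

```c
#define _GNU_SOURCE   /* MAP_POPULATE is a Linux extension */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire file read-only with MAP_POPULATE so the kernel
 * prefaults the page tables (and triggers read-ahead) up front.
 * Returns NULL on failure; on success *len_out gets the file size. */
char *map_whole_file(const char *path, size_t *len_out) {
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return NULL;
    if (fstat(fd, &st) == -1 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    char *mem = mmap(NULL, (size_t)st.st_size, PROT_READ,
                     MAP_SHARED | MAP_POPULATE, fd, 0);
    close(fd);  /* the mapping stays valid after close */
    if (mem == MAP_FAILED)
        return NULL;
    *len_out = (size_t)st.st_size;
    return mem;
}
```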

橘味果▽酱 2024-11-14 21:16:01

What you're doing (reading through the entire mmapped space) is supposed to trigger a series of page faults. With mmap, the OS only lazily loads pages of the mapped data into memory, loading them when you access them. So this approach is an optimization. Although you interact with the mmap as if the entire thing were in RAM, it is not all in RAM; it is just a chunk set aside in virtual memory.

In contrast, when you read a file into a buffer the OS pulls the entire structure into RAM (into your buffer). This can apply a lot of memory pressure, crowding out other pages and forcing them to be written back to disk. It can lead to thrashing if you're low on memory.

A common optimization technique when using mmap is to page-walk the data into memory: loop through the mmap space, incrementing your pointer by the page size, accessing a single byte per page and triggering the OS to pull all of the mapping's pages into memory by taking all these page faults up front. This is a technique to "prime the RAM", pulling the mmap in and readying it for future use. With this approach, the OS won't need to do as much lazy loading. You can do this on a separate thread to load the pages in prior to your main thread's access; just make sure you don't run out of RAM or get too far ahead of the main thread, or you'll actually begin to degrade performance.

What is the difference between page walking w/ mmap and read() into a large buffer? That's kind of complicated.

Older versions of UNIX, and some current versions, don't always use demand-paging (where the memory is divided up into chunks and swapped in / out as needed). Instead, in some cases, the OS uses traditional swapping - it treats data structures in memory as monolithic, and the entire structure is swapped in / out as needed. This may be more efficient when dealing with large files, where demand-paging requires copying pages into the universal buffer cache, and may lead to frequent swapping or even thrashing. Swapping may avoid use of the universal buffer cache - reducing memory consumption, avoiding an extra copy operation and avoiding frequent writes. Downside is you can't benefit from demand-paging.
With mmap, you're guaranteed demand-paging; with read() you are not.

Also bear in mind that page-walking through a full mmap memory space is always about 60% slower than a flat-out read (not counting cases where you use MADV_SEQUENTIAL or other optimizations).

One note when using mmap with MADV_SEQUENTIAL: when you use this, you must be absolutely sure your data IS stored sequentially, otherwise it will actually slow down the paging-in of the file by about 10x. Usually your data is not mapped to a contiguous section of the disk; it's written to blocks that are spread around the disk. So I suggest you be careful and look closely into this.

Remember, too much data in RAM pollutes the RAM, making page faults a lot more common elsewhere. One common misconception about performance is that CPU optimization is more important than memory footprint. Not true: the time it takes to travel to disk exceeds the time of CPU operations by something like 8 orders of magnitude, even with today's SSDs. Therefore, when program execution speed is a concern, memory footprint and utilization are far more important.

A nice thing about read() is the data can be stored on the stack (assuming the stack is large enough), which will further speed up processing.

Using read() with a streaming approach is a good alternative to mmap, if it fits your use case. This is kind of what you're doing with fgets/fputs (fgets/fputs is internally implemented with read). Here what you do is, in a loop, read into a buffer, process the data, & then read in the next section / overwrite the old data. Streaming like this can keep your memory consumption very low, and can be the most efficient way of doing I/O. The only downside is that you never have the entire file in memory at once, and it doesn't persist in memory. So it's a one-off approach. If you can use it - great, do it. If not... use mmap.

So whether read or mmap is faster... it depends on many factors. Testing is probably what you need to do. Generally speaking, mmap is nice if you plan on using the data for an extended period, where you will benefit from demand-paging; or if you just can't handle that amount of data in memory at once. Read() is better if you are using a streaming approach - the data doesn't have to persist, or the data can fit in memory so memory pressure isn't a concern. Also if the data won't be in memory for very long, read() may be preferable.

Now, with your current implementation, which is a sort of streaming approach, you are using fgets() and stopping on \n. Large, bulk reads are more efficient than issuing many small read() calls (stdio does buffer internally, but its default buffer is fairly small). You don't have to use a giant buffer, since you don't want excess memory pressure (which can pollute your cache and other things), and the system also does some internal buffering of its own. But you do want to be reading into a buffer of, let's say, 64k in size. You definitely don't want to be issuing a read call per line.
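One low-effort way to get large underlying reads while keeping the fgets interface is to enlarge stdio's buffer with setvbuf; a sketch (the 64 KB size follows the suggestion above, and the function name is mine):

```c
#include <stdio.h>
#include <string.h>

/* Count lines with fgets, but give stdio a 64 KB internal buffer so
 * the underlying read() calls are large even though parsing stays
 * line-by-line. setvbuf must be called before any other operation
 * on the stream. The static buffer makes this non-reentrant; fine
 * for a single-threaded sketch. */
int count_lines(const char *path) {
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    static char iobuf[64 * 1024];
    setvbuf(fp, iobuf, _IOFBF, sizeof iobuf);
    char line[4096];
    int n = 0;
    while (fgets(line, sizeof line, fp) != NULL) {
        if (strchr(line, '\n') != NULL)  /* only count complete lines */
            n++;
    }
    fclose(fp);
    return n;
}
```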

You could multithread the parsing of that buffer. Just make sure the threads access data in different cache blocks: find the size of a cache block, and have the threads work on portions of the buffer spaced apart by at least that size.

Some more specific suggestions for your particular problem:
You might try reformatting the data into some binary format. For example, try changing the file encoding to a custom format instead of UTF-8 or whatever it is. That could reduce its size. 3.5 million lines is quite a lot of characters to loop through... it's probably ~150 million character comparisons that you're doing.
If you can sort the file by line length prior to the program running... you can write an algorithm to parse the lines much more quickly: just increment a pointer and test the character you arrive at, making sure it's '\n'. Then do whatever processing you need to do.
You'll need to find a way to maintain the sorted file by inserting new data into appropriate places with this approach.

You can take this a step further: after sorting your file, maintain a list of how many lines of a given length are in the file. Use that to guide your parsing of lines, jumping right to the end of each line without having to do character comparisons.
If you can't sort the file, just create a list of all the offsets from the start of each line to its terminating newline. 3.5 million offsets.
Write algorithms to update that list on insertion/deletion of lines from the file
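The offset-list idea might be sketched like this for data already in memory (the function name and growth strategy are mine):

```c
#include <stdlib.h>
#include <string.h>

/* Build an index of line-start offsets for an in-memory buffer so
 * later passes can jump straight to line i without scanning.
 * Returns a malloc'd array (caller frees); *count_out gets the
 * number of lines. A final line without a trailing '\n' counts. */
size_t *build_line_index(const char *data, size_t len, size_t *count_out) {
    size_t cap = 16, n = 0;
    size_t *idx = malloc(cap * sizeof *idx);
    if (idx == NULL)
        return NULL;
    size_t pos = 0;
    while (pos < len) {
        if (n == cap) {                      /* grow the array */
            cap *= 2;
            size_t *tmp = realloc(idx, cap * sizeof *idx);
            if (tmp == NULL) { free(idx); return NULL; }
            idx = tmp;
        }
        idx[n++] = pos;                      /* this line starts here */
        const char *nl = memchr(data + pos, '\n', len - pos);
        pos = nl ? (size_t)(nl - data) + 1 : len;
    }
    *count_out = n;
    return idx;
}
```

Insertion or deletion of a line then only needs to shift the offsets after the edit point, rather than rescanning the whole file.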

When you get into file processing algorithms such as this... it begins to resemble the implementation of a noSQL database. An alternative might just be to insert all this data into a noSQL database. Depends on what you need to do: believe it or not, sometimes just raw custom file manipulation & maintenance described above is faster than any database implementation, even noSQL databases.

A few more things:
When you use this streaming approach with read(), you must take care to handle the edge cases appropriately: reaching the end of one buffer and starting a new one while a line straddles the boundary. That's called buffer-stitching.
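The read()-based streaming loop with buffer-stitching might look like this (names are mine; it assumes no single line exceeds the chunk size):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (64 * 1024)

/* Per-line callback; a real program would do its work here. */
void handle_line(const char *line, size_t len) {
    (void)line;
    (void)len;
}

/* Stream a file with large read() calls, stitching lines that span a
 * chunk boundary: the unfinished tail of each chunk is moved to the
 * front of the buffer before the next read(). Assumes no single line
 * is longer than CHUNK. Returns the line count, or -1 on error. */
long stream_lines(const char *path) {
    char buf[CHUNK];
    size_t keep = 0;            /* bytes carried over from the last chunk */
    long lines = 0;
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    ssize_t n;
    while ((n = read(fd, buf + keep, CHUNK - keep)) > 0) {
        size_t avail = keep + (size_t)n;
        char *p = buf, *nl;
        while ((nl = memchr(p, '\n', avail - (size_t)(p - buf))) != NULL) {
            handle_line(p, (size_t)(nl - p));
            lines++;
            p = nl + 1;
        }
        keep = avail - (size_t)(p - buf);   /* partial line at the end */
        memmove(buf, p, keep);              /* stitch it to the front */
    }
    close(fd);
    if (keep > 0) {                         /* final line without '\n' */
        handle_line(buf, keep);
        lines++;
    }
    return lines;
}
```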

Lastly, on most modern systems, when you use read() the data still gets stored in the universal buffer cache and then copied into your process. That's an extra copy operation. In certain cases where you're handling big files, you can bypass the buffer cache (e.g. with O_DIRECT on Linux) to speed up the IO. Beware, this disables the cache's benefits. But if the data is only in memory for a brief time, this doesn't matter.
The buffer cache is important, so find a way to re-enable it after the IO is finished. Maybe disable it just for the particular process, do your IO in a separate process, or something... I'm not sure about the details, but this is something that can be done.
I don't think that's actually your problem, though; honestly I think it's the character comparisons. Once you fix that, it should be fine.

That's the best I've got, maybe the experts will have other ideas.
Carry onward!
