Linux/perl mmap performance
I'm trying to optimize handling of large datasets using mmap. A dataset is in the gigabyte range. The idea was to mmap the whole file into memory, allowing multiple processes to work on the dataset concurrently (read-only). It isn't working as expected though.
As a simple test I simply mmap the file (using perl's Sys::Mmap module, using the "mmap" sub which I believe maps directly to the underlying C function) and have the process sleep. When doing this, the code spends more than a minute before it returns from the mmap call, despite this test doing nothing - not even a read - from the mmap'ed file.
Guessing, I thought maybe linux required the whole file to be read when first mmap'ed, so after the file had been mapped in the first process (while it was sleeping), I invoked a simple test in another process which tried to read the first few megabytes of the file.

Surprisingly, it seems the second process also spends a lot of time before returning from the mmap call, about the same time as mmap'ing the file the first time.
I've made sure that MAP_SHARED is being used and that the process that mapped the file the first time is still active (that it has not terminated, and that the mmap hasn't been unmapped).
I expected a mmapped file would allow me to give multiple worker processes effective random access to the large file, but if every mmap call requires reading the whole file first, it's a bit harder. I haven't tested using long-running processes to see if access is fast after the first delay, but I expected using MAP_SHARED and another separate process would be sufficient.
My theory was that mmap would return more or less immediately, and that linux would load the blocks more or less on-demand, but the behaviour I am seeing is the opposite, indicating it requires reading through the whole file on each call to mmap.
Any idea what I'm doing wrong, or if I've completely misunderstood how mmap is supposed to work?
9 Answers
Ok, found the problem. As suspected, neither linux nor perl was to blame. To open and access the file I do something like this:
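(The original snippet was not preserved here; this is a minimal sketch of that test, using Sys::Mmap's documented five-argument mmap, with bigfile.bin standing in for the real file name.)

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Sys::Mmap;

open my $fh, '<', 'bigfile.bin' or die $!;
my $mh;                                              # the mapping ends up in this scalar
mmap($mh, 0, PROT_READ, MAP_SHARED, $fh) or die $!;  # length 0 = map the whole file
sleep(60);                                           # hold the mapping so other processes can share it
munmap($mh);
```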
If you test that code, there are no delays like those I found in my original code, and after creating the minimal sample (always do that, right!) the reason suddenly became obvious.

The error was that in my code I treated the $mh scalar as a handle, something which is lightweight and can be moved around easily (read: passed by value). It turns out it's actually a gigabyte-long string, definitely not something you want to move around without creating an explicit reference (perl lingo for a "pointer"/handle value). So if you need to store it in a hash or similar, make sure you store \$mh, and deref it when you need to use it, like ${$hash->{mh}}, typically as the first parameter to substr or similar.
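A minimal illustration of that fix (assuming $mh holds the mapping from the snippet above):

```perl
# Store a *reference* to the mapped scalar; copying $mh itself would
# copy the whole gigabyte-sized string.
my $hash = { mh => \$mh };

# Dereference in place when reading, e.g. as the first argument to substr:
my $header = substr(${ $hash->{mh} }, 0, 1024);
```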
If you have a relatively recent version of Perl, you shouldn't be using Sys::Mmap. You should be using PerlIO's mmap layer.
Can you post the code you are using?
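A minimal sketch of what the :mmap layer looks like in use (bigfile.bin is a placeholder name):

```perl
use strict;
use warnings;

# With the :mmap layer, perl mmaps the file behind the filehandle and
# ordinary reads are served from the mapping.
open my $fh, '<:mmap', 'bigfile.bin' or die $!;
my $n = read($fh, my $buf, 4096);
die "read failed: $!" unless defined $n;
printf "read %d bytes\n", $n;
close $fh;
```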
On 32-bit systems the address space for mmap()s is rather limited (and varies from OS to OS). Be aware of that if you're using multi-gigabyte files and you are only testing on a 64-bit system. (I would have preferred to write this in a comment, but I don't have enough reputation points yet.)
One thing that can help performance is the use of madvise(2), probably most easily done via Inline::C. madvise lets you tell the kernel what your access pattern will be like (e.g. sequential, random, etc).
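A rough sketch of that idea (advise_random is a made-up helper name, and this assumes Sys::Mmap exposes the mapping directly as the scalar's string buffer, so its address is page-aligned as madvise(2) requires):

```perl
use strict;
use warnings;
use Sys::Mmap;
use Inline C => <<'END_C';
#include <sys/mman.h>

/* Advise the kernel that the buffer behind this scalar (here, the
   mmap'ed region) will be accessed in a random pattern. */
int advise_random(SV *sv) {
    STRLEN len;
    char *addr = SvPV(sv, len);
    return madvise(addr, len, MADV_RANDOM);
}
END_C

open my $fh, '<', 'bigfile.bin' or die $!;   # placeholder file name
my $mh;
mmap($mh, 0, PROT_READ, MAP_SHARED, $fh) or die $!;
advise_random($mh) == 0 or warn "madvise failed: $!";
```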
If I may plug my own module: I'd advise using File::Map instead of Sys::Mmap. It's much easier to use, and is less crash-prone than Sys::Mmap.
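For example, something like this (bigfile.bin is a placeholder; map_file and advise are File::Map exports):

```perl
use strict;
use warnings;
use File::Map qw(map_file advise);

# Maps the file read-only into $map; the mapping goes away when $map
# goes out of scope, so there is no handle to leak.
map_file my $map, 'bigfile.bin', '<';
advise $map, 'random';                    # optional access-pattern hint
print substr($map, 0, 16), "\n";          # behaves like an ordinary string
```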
That does sound surprising. Why not try a pure C version?
Or try your code on a different OS/perl version.
See Wide Finder for perl performance with mmap. But there is one big pitfall: if your dataset is on a classical HD and you read from multiple processes, you can easily fall into random access, and your IO will drop to unacceptable levels (20~40 times slower).
Ok, here's another update. Using Sys::Mmap or PerlIO's ":mmap" attribute both work fine in perl, but only up to 2 GB files (the magic 32-bit limit). Once the file is more than 2 GB, the following problems appear:

- Using Sys::Mmap and substr for accessing the file, it seems that substr only accepts a 32-bit int for the position parameter, even on systems where perl supports 64 bit. There's at least one bug posted about it: #62646: Maximum string length with substr.

- Using open(my $fh, "<:mmap", "bigfile.bin"), once the file is larger than 2 GB, it seems perl will either hang or insist on reading the whole file on the first read (not sure which; I never ran it long enough to see whether it completed), leading to dead slow performance.

I haven't found a workaround for either of these, and I'm currently stuck with slow file (non-mmap'ed) operations for working on these files. Unless I find a workaround I may have to implement the processing in C or another higher-level language that supports mmap'ing huge files better.
Your access to that file had better be well random to justify a full mmap. If your usage isn't evenly distributed, you're probably better off with a seek, a read into a freshly malloc'ed area, processing that, freeing it, rinse and repeat. And work with chunks in multiples of 4k, say 64k or so.

I once benchmarked a lot of string pattern matching algorithms. mmap'ing the entire file was slow and pointless. Reading into a static 32k-ish buffer was better, but still not particularly good. Reading into a freshly malloc'ed chunk, processing it, and then letting it go lets the kernel work wonders under the hood. The difference in speed was enormous, but then again pattern matching is very fast complexity-wise, so more emphasis must be put on handling efficiency than is perhaps usually needed.
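A minimal sketch of that read-a-chunk-at-a-time pattern in perl (process_chunk and bigfile.bin are placeholders):

```perl
use strict;
use warnings;

my $chunk_size = 64 * 1024;    # a multiple of the 4k page size
open my $fh, '<:raw', 'bigfile.bin' or die $!;
while (1) {
    my $n = read($fh, my $buf, $chunk_size);   # fresh buffer every pass
    die "read failed: $!" unless defined $n;
    last if $n == 0;                           # EOF
    process_chunk($buf);                       # hypothetical per-chunk worker
}
close $fh;
```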