mmap slower than getline?

Posted on 2024-11-19 16:05:44

I face the challenge of reading and writing files (gigabytes in size) line by line.

In many forum entries and sites (including a number of Stack Overflow answers), mmap was suggested as the fastest way to read and write files. However, when I implemented my code with both getline and mmap, mmap was the slower of the two. This is true for both reading and writing. I have been testing with files of about 600 MB.

My implementations parse line by line and then tokenize the line. I will present file input only.

Here is the getline implementation:

void two(char* path) {

    std::ios::sync_with_stdio(false);
    ifstream pFile(path);
    string mystring;

    if (pFile.is_open()) {
        while (getline(pFile,mystring)) {
            // c style tokenizing
        }
    }
    else perror("error opening file");
    pFile.close();
}

And here is the mmap implementation:

void four(char* path) {

    int fd;
    char *map;
    char *FILEPATH = path;
    unsigned long FILESIZE;

    // find file size
    FILE* fp = fopen(FILEPATH, "r");
    fseek(fp, 0, SEEK_END);
    FILESIZE = ftell(fp);
    fseek(fp, 0, SEEK_SET);
    fclose(fp);

    fd = open(FILEPATH, O_RDONLY);

    map = (char *) mmap(0, FILESIZE, PROT_READ, MAP_SHARED, fd, 0);

    /* Read the file char-by-char from the mmap
     */
    char c;
    stringstream ss;

    for (unsigned long i = 0; i < FILESIZE; ++i) {  // i < FILESIZE: map[FILESIZE] is past the end of the mapping
        c = map[i];
        if (c != '\n') {
            ss << c;
        }
        else {
            // c style tokenizing
            ss.str("");
        }

    }

    if (munmap(map, FILESIZE) == -1) perror("Error un-mmapping the file");

    close(fd);

}

I omitted much error checking in the interest of brevity.

Is my mmap implementation incorrect, and is that what hurts performance? Or is mmap simply not ideal for my application?

Thanks for any comments or help!

Comments (4)

浅笑依然 2024-11-26 16:05:44

The real power of mmap is being able to freely seek in a file, use its contents directly as a pointer, and avoid the overhead of copying data from kernel cache memory to userspace. However, your code sample is not taking advantage of this.

In your loop, you scan the buffer one character at a time, appending to a stringstream. The stringstream doesn't know how long the string is, and so has to reallocate several times in the process. At this point you've killed off any performance increase from using mmap - even the standard getline implementation avoids multiple reallocations (by using a 128-byte on-stack buffer, in the GNU C++ implementation).

If you want to use mmap to its fullest power:

  • Don't copy your strings. At all. Instead, pass around pointers that point directly into the mmap buffer.
  • Use library functions such as memchr to find newlines; these use hand-tuned assembly and other optimizations to run faster than most open-coded search loops.
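
The two points above can be sketched roughly as follows. This is a minimal illustration, not the asker's code: the names for_each_line and for_each_line_mmap are made up for this example, fstat is used to get the file size (instead of the fseek/ftell pair in the question), and std::string_view (C++17) stands in for "a pointer plus a length" into the mapping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: calls on_line(std::string_view) once per line in
// [data, data + size). The views point into the caller's buffer, so no
// line is ever copied. A final line without '\n' is still reported.
template <typename F>
void for_each_line(const char* data, std::size_t size, F&& on_line) {
    assert(data != nullptr || size == 0);
    const char* cur = data;
    const char* end = data + size;
    while (cur < end) {
        // memchr does the newline search instead of a hand-written loop.
        const void* hit = std::memchr(cur, '\n', static_cast<std::size_t>(end - cur));
        const char* nl = hit ? static_cast<const char*>(hit) : end;
        on_line(std::string_view(cur, static_cast<std::size_t>(nl - cur)));
        cur = (nl == end) ? end : nl + 1;
    }
}

// Hypothetical helper: maps the file read-only and feeds each line to on_line.
// Returns false if the file cannot be opened, stat'ed, or mapped.
template <typename F>
bool for_each_line_mmap(const char* path, F&& on_line) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) return false;
    struct stat st;
    if (fstat(fd, &st) == -1) { close(fd); return false; }
    if (st.st_size == 0) { close(fd); return true; }  // mmap rejects a zero length
    std::size_t size = static_cast<std::size_t>(st.st_size);
    void* map = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (map == MAP_FAILED) return false;
    for_each_line(static_cast<const char*>(map), size, on_line);
    munmap(map, size);
    return true;
}
```

A caller would pass a lambda that tokenizes each std::string_view in place, for example for_each_line_mmap(path, [](std::string_view line) { /* tokenize */ });. The views are only valid until munmap runs, so anything that must outlive the call has to be copied out explicitly.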

遗失的美好 2024-11-26 16:05:44

Whoever told you to use mmap does not know very much about modern machines.

The performance advantages of mmap are a total myth. In the words of Linus Torvalds:

Yes, memory is "slow", but dammit, so is mmap().

The problem with mmap is that every time you touch a page in the mapped region for the first time, it traps into the kernel and actually maps the page into your address space, playing havoc with the TLB.

Try a simple benchmark: read a big file 8K at a time using read, and then again with mmap. (Use the same 8K buffer over and over.) You will almost certainly find that read is actually faster.

Your problem was never with getting data out of the kernel; it was with how you handle the data after that. Minimize the work you are doing character-at-a-time; just scan to find the newline and then do a single operation on the block. Personally, I would go back to the read implementation, using (and re-using) a buffer that fits in the L1 cache (8K or so).
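
As a rough sketch of that approach: the function below reads through one reused 8K stack buffer, finds newlines with memchr, and only copies when a line straddles two chunks. The name for_each_line_read and the std::string carry buffer are choices made for this example, and EINTR handling is left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: reads fd in 8K chunks, reusing one buffer, and calls
// on_line(std::string_view) for every line. Returns false on a read error.
template <typename F>
bool for_each_line_read(int fd, F&& on_line) {
    assert(fd >= 0);
    char buf[8192];     // reused for every read() call
    std::string carry;  // holds the start of a line that spans two chunks
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n < 0) return false;
        if (n == 0) break;  // end of file
        const char* cur = buf;
        const char* end = buf + n;
        while (cur < end) {
            const void* hit = std::memchr(cur, '\n', static_cast<std::size_t>(end - cur));
            if (!hit) {
                carry.append(cur, end);  // partial line: finish it with the next chunk
                break;
            }
            const char* nl = static_cast<const char*>(hit);
            if (carry.empty()) {
                // Common case: the whole line sits in buf, so no copy is needed.
                on_line(std::string_view(cur, static_cast<std::size_t>(nl - cur)));
            } else {
                carry.append(cur, nl);
                on_line(std::string_view(carry));
                carry.clear();
            }
            cur = nl + 1;
        }
    }
    if (!carry.empty()) on_line(std::string_view(carry));  // last line without '\n'
    return true;
}
```

Each view passed to on_line is only valid during that call, because the buffer is overwritten by the next read. That is the trade-off against the mmap version: less address-space bookkeeping, but lines cannot be kept around as bare pointers.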

Or at least, I would try a simple read vs. mmap benchmark to see which is actually faster on your platform.
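
One possible shape for such a benchmark is sketched below. Both variants do the same trivial work (count newlines with memchr) so that the difference comes from how the bytes reach userspace. The function names count_with_read, count_with_mmap and time_ms are invented for this example, and the 8K buffer size simply follows the suggestion above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Counts '\n' bytes in [p, p + n) using memchr.
static std::size_t count_newlines(const char* p, std::size_t n) {
    assert(p != nullptr || n == 0);
    std::size_t count = 0;
    const char* end = p + n;
    while (p < end) {
        const void* hit = std::memchr(p, '\n', static_cast<std::size_t>(end - p));
        if (!hit) break;
        ++count;
        p = static_cast<const char*>(hit) + 1;
    }
    return count;
}

// Variant 1: read() into one reused 8K buffer. Returns -1 on error.
long long count_with_read(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;
    char buf[8192];
    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += static_cast<long long>(count_newlines(buf, static_cast<std::size_t>(n)));
    close(fd);
    return n < 0 ? -1 : total;
}

// Variant 2: map the whole file and scan it in place. Returns -1 on error.
long long count_with_mmap(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;
    struct stat st;
    if (fstat(fd, &st) == -1) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  // mmap rejects a zero length
    std::size_t size = static_cast<std::size_t>(st.st_size);
    void* map = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (map == MAP_FAILED) return -1;
    long long total = static_cast<long long>(count_newlines(static_cast<const char*>(map), size));
    munmap(map, size);
    return total;
}

// Wall-clock milliseconds taken by one call of fn(path).
template <typename Fn>
double time_ms(Fn fn, const char* path) {
    auto start = std::chrono::steady_clock::now();
    volatile long long result = fn(path);  // volatile keeps the call from being optimized away
    (void)result;
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_ms(count_with_read, path) and time_ms(count_with_mmap, path) several times each, in alternating order, gives a fairer picture than a single run, since the first pass over a file may include disk I/O that later passes get from the page cache.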

[Update]

I found a couple more sets of commentary from Mr. Torvalds:

http://lkml.iu.edu/hypermail/linux/kernel/0004.0/0728.html
http://lkml.iu.edu/hypermail/linux/kernel/0004.0/0775.html

The summary:

And on top of that you still have the actual CPU TLB miss costs etc.
Which can often be avoided if you just re-read into the same area
instead of being excessively clever with memory management just to
avoid a copy.

memcpy() (ie "read()" in this case) is always going to be faster in
many cases, just because it avoids all the extra complexity. While
mmap() is going to be faster in other cases.

In my experience, reading and processing a large file sequentially is one of the "many cases" where using (and re-using) a modest-sized buffer with read/write performs significantly better than mmap.

偷得浮生 2024-11-26 16:05:44

You can use memchr to find line endings. It will be much faster than adding to a stringstream one character at a time.

你如我软肋 2024-11-26 16:05:44

You're using a stringstream to store the lines you identify. This is not comparable with the getline implementation, because the stringstream itself adds overhead. As others suggested, you can store the beginning of the string as a char*, along with the length of the line (or a pointer to the end of the line). The body of the read would be something like:

char* str_start = map;
char* str_end;
// i < FILESIZE: map[FILESIZE] lies past the end of the mapping
for (unsigned long i = 0; i < FILESIZE; ++i) {
    if (map[i] == '\n') {
        str_end = map + i;
        {
            // C style tokenizing of the string str_start to str_end
            // If you want, you can build a std::string like:
            // std::string line(str_start, str_end);
            // but note that this implies a memory copy.
        }
        str_start = map + i + 1;
    }
}
// If the file does not end in '\n', the final line is [str_start, map + FILESIZE).

Note also that this is much more efficient because the only per-character work is the comparison against '\n' (in your version, every character was also appended to the stringstream).
