Fastest way to find the number of lines in a text file (C++)



I need to read the number of lines in a file before doing some operations on that file. When I try to read the file and increment a line_count variable at each iteration until I reach EOF, it is not that fast in my case. I used both ifstream and fgets, and they were both slow. Is there a hacky way to do this, which is also used by, for instance, BSD, the Linux kernel or Berkeley DB (maybe by using bitwise operations)?

The file has millions of lines and it keeps getting larger; each line is about 40 or 50 characters. I'm using Linux.

Note:
I'm sure some people will say "use a DB, idiot", but briefly, in my case I can't use a database.


小瓶盖 2024-07-27 04:41:39


The only way to find the line count is to read the whole file and count the number of line-end characters. The fastest way to do this is probably to read the whole file into a large buffer with one read operation and then go through the buffer counting the '\n' characters.

As your current file size appears to be about 60Mb, this is not an attractive option. You can get some of the speed by not reading the whole file, but reading it in chunks, say of size 1Mb. You also say that a database is out of the question, but it really does look to be the best long-term solution.

Edit: I just ran a small benchmark on this and using the buffered approach (buffer size 1024K) seems to be a bit more than twice as fast as reading a line at a time with getline(). Here's the code - my tests were done with g++ using -O2 optimisation level:

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <ctime>
using namespace std;

// Read up to buff.size() bytes into buff and return how many were actually read.
unsigned int FileRead( istream & is, vector <char> & buff ) {
    is.read( &buff[0], buff.size() );
    return is.gcount();
}

// Count the '\n' characters in the first sz bytes of buff.
unsigned int CountLines( const vector <char> & buff, int sz ) {
    int newlines = 0;
    const char * p = &buff[0];
    for ( int i = 0; i < sz; i++ ) {
        if ( p[i] == '\n' ) {
            newlines++;
        }
    }
    return newlines;
}

int main( int argc, char * argv[] ) {
    time_t now = time(0);
    if ( argc == 1 ) {
        // No arguments: read line by line with getline().
        cout << "lines\n";
        ifstream ifs( "lines.dat" );
        int n = 0;
        string s;
        while( getline( ifs, s ) ) {
            n++;
        }
        cout << n << endl;
    }
    else {
        // Any argument: read 1 MB chunks and count the newlines in each chunk.
        cout << "buffer\n";
        const int SZ = 1024 * 1024;
        std::vector <char> buff( SZ );
        ifstream ifs( "lines.dat" );
        int n = 0;
        while( int cc = FileRead( ifs, buff ) ) {
            n += CountLines( buff, cc );
        }
        cout << n << endl;
    }
    cout << time(0) - now << endl;  // elapsed seconds
}

自由范儿 2024-07-27 04:41:39


Don't use C++ STL strings and getline() (or C's fgets()); use C-style raw pointers and either read blocks in page-size chunks or mmap the file.

Then scan the block at the native word size of your system (i.e. either uint32_t or uint64_t) using one of the magic "SIMD Within A Register" (SWAR) algorithms for testing the bytes within the word. An example is here; the loop with the 0x0a0a0a0a0a0a0a0aLL in it scans for line breaks. (That code gets to around 5 cycles per input byte while matching a regex on each line of a file.)

If the file is only a few tens or a hundred or so megabytes, and it keeps growing (i.e. something keeps writing to it), then there's a good likelihood that Linux has it cached in memory, so it won't be disk-IO limited, but memory-bandwidth limited.

If the file is only ever being appended to, you could also remember the number of lines
and previous length, and start from there.


It has been pointed out that you could use mmap with C++ STL algorithms and create a functor to pass to std::for_each. I suggest you shouldn't do it, not because you can't do it that way, but because there is no gain in writing the extra code to do so. Or you can use Boost's mmapped iterator, which handles it all for you; but for the problem the code I linked to was written for, this was much, much slower, and the question was about speed, not style.
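
Since the linked SWAR example is not reproduced here, below is a minimal sketch of the idea, assuming 64-bit words and plain buffered reads rather than mmap; the helper names, the 1 MB block size, and the overall structure are illustrative and not taken from the linked code.

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Classic SWAR helper (see the "Bit Twiddling Hacks" collection): the result is
// non-zero iff some byte of v is zero, so haszero(v ^ LF_MASK) detects a '\n' byte.
static inline uint64_t haszero(uint64_t v) {
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

// Count '\n' bytes in buf[0..len). Words containing no newline are skipped eight
// bytes at a time; only words that contain one fall back to a per-byte count.
static size_t count_newlines(const char* buf, size_t len) {
    const uint64_t LF_MASK = 0x0A0A0A0A0A0A0A0AULL;   // '\n' replicated in every byte
    size_t count = 0, i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t word;
        std::memcpy(&word, buf + i, 8);                // safe unaligned load
        if (haszero(word ^ LF_MASK)) {                 // at least one '\n' in this word
            for (int j = 0; j < 8; ++j)
                count += (buf[i + j] == '\n');
        }
    }
    for (; i < len; ++i)                               // trailing bytes
        count += (buf[i] == '\n');
    return count;
}

int main(int argc, char* argv[]) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    std::FILE* f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }
    std::vector<char> buf(1 << 20);                    // 1 MB blocks
    size_t lines = 0, n;
    while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0)
        lines += count_newlines(buf.data(), n);
    std::fclose(f);
    std::printf("%zu\n", lines);
}

On a file that is already in the page cache this is typically limited by memory bandwidth, as noted above.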

败给现实 2024-07-27 04:41:39


You wrote that it keeps getting larger.

This sounds like it is a log file or something similar where new lines are appended but existing lines are not changed. If this is the case you could try an incremental approach (sketched after the list below):

  • Parse to the end of file.
  • Remember the line count and the offset of EOF.
  • When the file grows fseek to the offset, parse to EOF and update the line count and the offset.
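
A minimal sketch of that incremental approach, assuming the file is only ever appended to and never truncated; the struct and function names are purely illustrative.

#include <fstream>
#include <string>

// State carried between scans: how many lines have been counted so far and the
// byte offset up to which they were counted.
struct LineCountState {
    long long lines  = 0;
    long long offset = 0;
};

// Count only the part of the file appended since the last call.
void update_line_count(const std::string& path, LineCountState& st) {
    std::ifstream in(path, std::ios::binary);
    in.seekg(st.offset);                     // skip everything already counted
    char buf[1 << 16];
    while (true) {
        in.read(buf, sizeof buf);
        std::streamsize got = in.gcount();
        if (got == 0) break;                 // nothing new to read
        for (std::streamsize i = 0; i < got; ++i)
            if (buf[i] == '\n') ++st.lines;
        st.offset += got;
        if (got < static_cast<std::streamsize>(sizeof buf))
            break;                           // short read means we hit EOF
    }
}

Call update_line_count() again whenever the file has grown; only the newly appended tail is scanned, so each update is cheap.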
亢潮 2024-07-27 04:41:39


There's a difference between counting lines and counting line separators. Some common gotchas to watch out for if getting an exact line count is important:

  1. What's the file encoding? The byte-by-byte solutions will work for ASCII and UTF-8, but watch out if you have UTF-16 or some multibyte encoding that doesn't guarantee that a byte with the value of a line feed necessarily encodes a line feed.

  2. Many text files don't have a line separator at the end of the last line. So if your file says "Hello, World!", you could end up with a count of 0 instead of 1. Rather than just counting the line separators, you'll need a simple state machine to keep track.

  3. Some very obscure files use Unicode U+2028 LINE SEPARATOR (or even U+2029 PARAGRAPH SEPARATOR) as line separators instead of the more common carriage return and/or line feed. You might also want to watch out for U+0085 NEXT LINE (NEL).

  4. You'll have to consider whether you want to count some other control characters as line breakers. For example, should a U+000C FORM FEED or U+000B LINE TABULATION (a.k.a. vertical tab) be considered going to a new line?

  5. Text files from older versions of Mac OS (before OS X) use carriage returns (U+000D) rather than line feeds (U+000A) to separate lines. If you're reading the raw bytes into a buffer (e.g., with your stream in binary mode) and scanning them, you'll come up with a count of 0 on these files. You can't count both carriage returns and line feeds, because PC files generally end a line with both. Again, you'll need a simple state machine. (Alternatively, you can read the file in text mode rather than binary mode. The text interfaces will normalize line separators to '\n' for files that conform to the convention used on your platform. If you're reading files from other platforms, you'll be back to binary mode with a state machine.)

  6. If you ever have a super long line in the file, the getline() approach can throw an exception causing your simple line counter to fail on a small number of files. (This is particularly true if you're reading an old Mac file on a non-Mac platform, causing getline() to see the entire file as one gigantic line.) By reading chunks into a fixed-size buffer and using a state machine, you can make it bullet proof.

The code in the accepted answer suffers from most of these traps. Make it right before you make it fast.
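
For the byte-oriented pitfalls above (bare CR, bare LF, CRLF, and a final line with no terminator), a small state machine is enough. A minimal sketch follows; it deliberately ignores the Unicode separators from point 3, and the names are illustrative.

#include <cstddef>

// Incremental line counter that treats "\n", "\r", and "\r\n" each as one line
// ending and also counts a final line that lacks a terminator. Feed it the file
// in chunks, in order, then call finish() once.
struct LineCounter {
    std::size_t lines      = 0;
    bool        pending_cr = false;   // last byte of the previous chunk was '\r'
    bool        in_line    = false;   // bytes seen since the last line ending

    void feed(const char* data, std::size_t len) {
        for (std::size_t i = 0; i < len; ++i) {
            char c = data[i];
            if (c == '\n') {
                if (!pending_cr) ++lines;  // bare LF; the CR of a CRLF was already counted
                pending_cr = false;
                in_line    = false;
            } else if (c == '\r') {
                ++lines;                   // CR ends a line (it may turn out to be CRLF)
                pending_cr = true;
                in_line    = false;
            } else {
                pending_cr = false;
                in_line    = true;
            }
        }
    }

    std::size_t finish() const {           // call after the last chunk
        return lines + (in_line ? 1 : 0);  // count an unterminated final line
    }
};

The two flags carry state across chunk boundaries, so a CRLF pair split between two reads is still counted once, and a file containing only "Hello, World!" with no newline counts as one line.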

简单爱 2024-07-27 04:41:39


Remember that all fstreams are buffered. So they do, in effect, read in chunks, and you do not have to recreate this functionality yourself. All you need to do is scan the buffer. Don't use getline(), though, as this forces you to size a string. So I would just use the STL std::count_if and stream iterators.

#include <iostream>
#include <fstream>
#include <iterator>
#include <algorithm>
#include <functional>


struct TestEOL
{
    bool operator()(char c)
    {
        last    = c;
        return last == '\n';
    }
    char    last = '\n';    // start as '\n' so an empty file counts as 0 lines
};

int main()
{
    std::ifstream file("Plop.txt");

    TestEOL       test;
    // Pass the functor by reference (std::ref); count_if takes its predicate
    // by value, so passing `test` directly would leave test.last unchanged here.
    std::size_t   count   = std::count_if(std::istreambuf_iterator<char>(file),
                                          std::istreambuf_iterator<char>(),
                                          std::ref(test));

    if (test.last != '\n')  // If the last character checked is not '\n'
    {                       // then the last line in the file has not been
        ++count;            // counted. So increment the count so we count
    }                       // the last line even if it is not '\n' terminated.

    std::cout << count << '\n';
}

潇烟暮雨 2024-07-27 04:41:39


It isn't slow because of your algorithm; it is slow because IO operations are slow. I suppose you are using a simple O(n) algorithm that simply goes over the file sequentially. In that case, there is no faster algorithm that can optimize your program.

However, while there is no faster algorithm, there is a faster mechanism called a "memory-mapped file". Mapped files have some drawbacks and might not be appropriate for your case, so you'll have to read about them and figure that out yourself.

Memory-mapped files won't let you implement an algorithm better than O(n), but they may reduce IO access time.
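
A minimal sketch of the memory-mapped mechanism on Linux, using POSIX mmap and std::count; error handling is kept deliberately short, and mapping the whole file in one go is just one way to do it.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <algorithm>
#include <cstdio>

int main(int argc, char* argv[]) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return 1; }
    if (st.st_size == 0) { std::printf("0\n"); close(fd); return 0; }  // mmap rejects empty files

    // Map the whole file read-only; the kernel pages it in as we scan.
    const char* data = static_cast<const char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    long lines = std::count(data, data + st.st_size, '\n');
    std::printf("%ld\n", lines);

    munmap(const_cast<char*>(data), st.st_size);
    close(fd);
}

The scan itself is still O(n), as noted above; the mapping only avoids copying the data through a separate userspace read buffer.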

难如初 2024-07-27 04:41:39


You can only get a definitive answer by scanning the entire file looking for newline characters. There's no way around that.

However, there are a couple of possibilities which you may want to consider.

1/ If you're using a simplistic loop, reading one character at a time checking for newlines, don't. Even though the I/O may be buffered, function calls themselves are expensive, time-wise.

A better option is to read large chunks of the file (say 5M) into memory with a single I/O operation, then process that. You probably don't need to worry too much about special assembly instructions since the C runtime library will be optimized anyway - a simple strchr() should do it.
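
A sketch of that chunked approach, reading roughly 5 MB at a time and letting the C runtime do the scanning. It uses memchr() rather than strchr() because the chunk is raw bytes rather than a NUL-terminated string; the block size and file handling are illustrative.

#include <cstdio>
#include <cstring>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    std::FILE* f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }

    std::vector<char> chunk(5 * 1024 * 1024);    // ~5 MB per read, as suggested above
    long lines = 0;
    std::size_t got;
    while ((got = std::fread(chunk.data(), 1, chunk.size(), f)) > 0) {
        // The C library's memchr is typically vectorised, so it does the fast scanning.
        const char* p   = chunk.data();
        const char* end = p + got;
        while ((p = static_cast<const char*>(std::memchr(p, '\n', end - p))) != nullptr) {
            ++lines;
            ++p;                                 // continue just past the newline found
        }
    }
    std::fclose(f);
    std::printf("%ld\n", lines);
}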

2/ If you're saying that the general line length is about 40-50 characters and you don't need an exact line count, just grab the file size and divide by 45 (or whatever average you deem to use).
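
A rough sketch of that estimate, assuming C++17's std::filesystem and the ~45-byte average line length mentioned above; it gives an approximation, not an exact count.

#include <cstdint>
#include <cstdio>
#include <filesystem>

int main(int argc, char* argv[]) {
    if (argc < 2) return 1;
    // Approximation only: divide the file size by an assumed average line length.
    std::uintmax_t size = std::filesystem::file_size(argv[1]);
    std::printf("~%ju lines\n", size / 45);
}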

3/ If this is something like a log file and you don't have to keep it in one file (may require rework on other parts of the system), consider splitting the file periodically.

For example, when it gets to 5M, move it (e.g., x.log) to a dated file name (e.g., x_20090101_1022.log), work out how many lines there are at that point (storing it in x_20090101_1022.count), then start a new x.log log file. Characteristics of log files mean that this dated section that was created will never change, so you will never have to recalculate its number of lines.

To process the log "file", you'd just cat x_*.log through some process pipe rather than cat x.log. To get the line count of the "file", do a wc -l on the current x.log (relatively fast) and add it to the sum of all the values in the x_*.count files.
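
A minimal sketch of combining the counts in code rather than with cat and wc -l, assuming the x.log / x_<date>.count naming scheme described above and C++17's std::filesystem; paths and names are illustrative.

#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

namespace fs = std::filesystem;

int main() {
    // Directory and file names follow the scheme above; adjust to your own layout.
    const fs::path dir = ".";
    long long total = 0;

    // Sum the pre-computed counts of the rotated, immutable sections.
    for (const auto& entry : fs::directory_iterator(dir)) {
        const std::string name = entry.path().filename().string();
        if (name.rfind("x_", 0) == 0 && entry.path().extension() == ".count") {
            long long n = 0;
            std::ifstream(entry.path()) >> n;
            total += n;
        }
    }

    // Count only the current, still-growing x.log (the equivalent of `wc -l x.log`).
    std::ifstream live(dir / "x.log", std::ios::binary);
    for (std::string line; std::getline(live, line); )
        ++total;

    std::cout << total << '\n';
}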

回首观望 2024-07-27 04:41:39


The thing that takes time is loading 40+ MB into memory. The fastest way to do that is to either memory-map it, or load it in one go into a big buffer. Once you have it in memory, one way or another, a loop traversing the data looking for \n characters is almost instantaneous, no matter how it is implemented.

So really, the most important trick is to load the file into memory as fast as possible. And the fastest way to do that is to do it as a single operation.

Otherwise, plenty of tricks may exist to speed up the algorithm. If lines are only added, never modified or removed, and if you're reading the file repeatedly, you can cache the lines read previously, and the next time you have to read the file, only read the newly added lines.

Or perhaps you can maintain a separate index file showing the location of known '\n' characters, so those parts of the file can be skipped over.

Reading large amounts of data from the harddrive is slow. There's no way around that.

十年九夏 2024-07-27 04:41:39


If your file only grows, then Ludwig Weinzierl's solution (the incremental approach described in an earlier answer) is best if you do not have control of the writers. Otherwise, you can make it even faster: increment a counter by one each time a line is written to the file. If multiple writers may try to write to the file simultaneously, make sure to use a lock; locking your existing file is enough. The counter can be 4 or 8 bytes, written in binary to a file under /run/<your-prog-name>/counter (which is backed by RAM, so it is very fast).

Ludwig Algorithm

  • Initialize offset to 0
  • Read file from offset to EOF counting '\n' (as mentioned by others, make sure to use buffered I/O and count the '\n' inside that buffer)
  • Update offset with position at EOF
  • Save counter & offset to a file or in a variable if you only need it in your software
  • Repeat from "Read file ..." on a change

This is actually how various pieces of software that process log files work (fail2ban comes to mind).

The first time, it has to process a huge file. Afterward, it is very small and thus goes very fast.

Proactive Algorithm

When creating the files, reset counter to 0.

Then each time you receive a new line to add to the file:

  • Lock file
  • Write one line
  • Load counter
  • Add one to counter
  • Save counter
  • Unlock file

This is very close to what database systems do, so a SELECT COUNT(*) FROM table on a table with millions of rows returns instantly. Databases also do that per index, so if you add a WHERE clause which matches a specific index, you also get the total instantly. Same principle as above; a concrete sketch of the file-based variant follows.
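
A minimal sketch of the proactive idea on Linux, using flock() on the log file as the lock and an 8-byte binary counter kept under /run as suggested above; the paths, the helper's name, and the choice of flock() are illustrative rather than prescribed by the answer.

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>
#include <cstdint>
#include <string>

// Append one line to the log and bump a binary counter kept in a small side file.
// The paths and this helper's name are made up for illustration; adapt to your setup.
bool append_line_and_count(const std::string& line,
                           const char* log_path     = "/var/log/myapp/app.log",
                           const char* counter_path = "/run/myapp/counter") {
    int log_fd = open(log_path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (log_fd < 0) return false;
    flock(log_fd, LOCK_EX);                        // lock file: serialise writers

    std::string rec = line + "\n";
    bool ok = write(log_fd, rec.data(), rec.size())        // write one line
              == static_cast<ssize_t>(rec.size());

    int cnt_fd = open(counter_path, O_RDWR | O_CREAT, 0644);
    if (cnt_fd >= 0) {
        uint64_t count = 0;
        if (read(cnt_fd, &count, sizeof count) != static_cast<ssize_t>(sizeof count))
            count = 0;                             // new or truncated counter file
        ++count;                                   // add one to the counter
        lseek(cnt_fd, 0, SEEK_SET);
        ok = ok && write(cnt_fd, &count, sizeof count)      // save the counter (8 bytes)
                   == static_cast<ssize_t>(sizeof count);
        close(cnt_fd);
    }

    flock(log_fd, LOCK_UN);                        // unlock file
    close(log_fd);
    return ok;
}

Reading the current line count is then a single 8-byte read of the counter file, with no scan of the log at all.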


Personal note: I see a huge amount of Internet software that works backward. A watchdog makes sense for various things in a software environment. However, in most cases, when something of importance happens, you should send a message at the time it happens, not rely on the backward concept of checking logs to detect that something bad just happened.

For example, you detect that a user tried to access a website and entered the wrong password 5 times in a row. You want to send an instant message to the admin to make sure there wasn't a 6th attempt that succeeded, with the hacker now able to see all your users' data... If you use logs, the "instant message" is going to be late by seconds if not minutes.

Don't do processing backward.
