How to read part of a large text file in C++

Posted 2024-08-02 01:40:10

I have a big text file with more than 200,000 lines, and I only need to read a few of them. For instance: lines 10,000 to 20,000.

Important: I don't want to open and search the full file to extract these lines, because of performance issues.

Is this possible?

Comments (5)

御弟哥哥 2024-08-09 01:40:11

If the lines are fixed length, then it would be possible to seek to a specific byte position and load just the lines you want. If lines are variable length, the only way to find the lines you're looking for is to parse the file and count the number of end-of-line markers. If the file changes infrequently, you might be able to get sufficient performance by performing this parsing once and then keeping an index of the byte positions of each line to speed future accesses (perhaps writing that index to disk so it doesn't need to be done every time your program is run).
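
A minimal sketch of that indexing idea, assuming Unix-style '\n' line endings; the file path and the helper names (build_line_index, read_lines) are made up for illustration, and error handling is omitted:

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Build an index of the byte offset at which each line starts.
// The scan is done once; the index could also be written to disk
// so it does not have to be rebuilt on every run.
std::vector<std::uint64_t> build_line_index(const std::string& path) {
    std::vector<std::uint64_t> offsets;
    offsets.push_back(0);                        // line 0 starts at byte 0
    std::ifstream in(path, std::ios::binary);
    std::uint64_t pos = 0;
    std::string line;
    while (std::getline(in, line)) {
        pos += line.size() + 1;                  // +1 for the '\n' consumed by getline
        offsets.push_back(pos);                  // start offset of the next line
    }
    return offsets;
}

// Read lines [first, last) by seeking straight to the start of `first`,
// so nothing before it has to be parsed again.
std::vector<std::string> read_lines(const std::string& path,
                                    const std::vector<std::uint64_t>& index,
                                    std::size_t first, std::size_t last) {
    std::vector<std::string> result;
    std::ifstream in(path, std::ios::binary);
    in.seekg(static_cast<std::streamoff>(index.at(first)));
    std::string line;
    for (std::size_t i = first; i < last && std::getline(in, line); ++i)
        result.push_back(line);
    return result;
}
```

Serializing the offsets vector to disk, as suggested above, turns the one-time O(size) scan into an O(1) lookup on subsequent runs.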

腻橙味 2024-08-09 01:40:11

You will have to search through the file to count the newlines, unless you know that all lines are the same length (in which case you could seek to the offset = line_number * line_size_in_bytes, where line_number counts from zero and line_size_in_bytes includes all characters in the line).

If the lines are variable / unknown length then while reading through it once you could index the beginning offset of each line so that subsequent reads could seek to the start of a given line.
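
For the fixed-length case, a rough sketch of the offset calculation described above; the record width, target line number, and file name are placeholder values:

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main() {
    const std::size_t line_size_in_bytes = 80;   // assumed fixed record width, newline included
    const std::size_t line_number = 10000;       // zero-based line to read

    std::ifstream in("big.txt", std::ios::binary);                    // placeholder path
    in.seekg(static_cast<std::streamoff>(line_number * line_size_in_bytes));

    std::string line;
    if (std::getline(in, line))
        std::cout << line << '\n';
}
```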

兮子 2024-08-09 01:40:11

If these lines are all the same length you could compute an offset for a given line and read just those bytes.

If the lines are varying length then you really have to read the entire file to count how many lines there are. Line terminating characters are just arbitrary bytes in the file.

纵性 2024-08-09 01:40:11

If the lines are fixed length then you just compute the offset, no problem.

If they're not (i.e. a regular CSV file) then you'll need to go through the file, either to build an index or to just read the lines you need. To make the file reading a little faster a good idea would be to use memory mapped files (see the implementation that's part of the Boost iostreams: http://www.boost.org/doc/libs/1_39_0/libs/iostreams/doc/classes/mapped_file.html).
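
A short sketch of the memory-mapped approach with Boost's mapped_file_source, assuming the whole file can be mapped and lines end with '\n'; the path and the line range are placeholders:

```cpp
#include <boost/iostreams/device/mapped_file.hpp>
#include <cstring>
#include <iostream>
#include <string>

int main() {
    // Map the whole file; the OS pages it in lazily, so scanning for
    // newlines tends to be faster than reading through an ifstream.
    const std::string path = "big.txt";                  // placeholder path
    boost::iostreams::mapped_file_source file(path);
    const char* p   = file.data();
    const char* end = p + file.size();

    const std::size_t first = 10000, last = 20000;       // zero-based, half-open range

    // Skip the first `first` lines by counting '\n' bytes.
    for (std::size_t line = 0; line < first && p != nullptr && p < end; ++line) {
        p = static_cast<const char*>(std::memchr(p, '\n', end - p));
        if (p) ++p;                                       // step past the newline
    }
    if (!p) return 1;                                     // file has fewer than `first` lines

    // Emit lines [first, last).
    for (std::size_t line = first; line < last && p < end; ++line) {
        const char* nl   = static_cast<const char*>(std::memchr(p, '\n', end - p));
        const char* stop = nl ? nl : end;
        std::cout.write(p, stop - p) << '\n';
        p = stop + 1;
    }
}
```

Note that this still walks the mapped bytes from the start of the file to count newlines; the win is that the scan is a plain memory sweep rather than buffered stream reads.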

鲜血染红嫁衣 2024-08-09 01:40:11

As others noted, if the lines are not of fixed width, it is impossible to do this without building an index. However, if you are in control of the format of the file, you can get roughly O(log(size)) instead of O(size) performance when finding the start line, provided you store the number of the line itself on each line, i.e. have the file contents look something like this:

1: val1, val2, val3
2: val4
3: val5, val6
4: val7, val8, val9, val10

With this format of the file, you can quickly find the needed line by binary search: start by seeking to the middle of the file. Read till the next newline, then read the following line and parse its number. If the number is bigger than the target, repeat the algorithm on the first half of the file; if it is smaller than the target line number, repeat it on the second half of the file.

You'd need to be careful about the corner cases (e.g. the "beginning" and the "end" of the range falling on the same line), but for me this approach worked excellently in the past for parsing log files that contained a date on each line (I needed to find the lines between certain timestamps).

Of course, this still does not beat the performance of the explicitly built index or the fixed-size records.
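
A rough sketch of that binary search, assuming every line starts with its monotonically increasing number followed by a colon (as in the sample above); the function name is made up, and error handling (bad lines, missing file) is omitted:

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Returns the byte offset at which the first line whose leading number
// is >= target begins, or the file size if there is no such line.
std::uint64_t find_first_line(const std::string& path, long target)
{
    std::ifstream in(path, std::ios::binary);
    in.seekg(0, std::ios::end);
    const std::uint64_t file_size = static_cast<std::uint64_t>(in.tellg());

    std::uint64_t lo = 0, hi = file_size;                 // search window, in bytes
    std::string line;

    while (hi - lo > 1) {
        const std::uint64_t mid = lo + (hi - lo) / 2;
        in.clear();
        in.seekg(static_cast<std::streamoff>(mid));
        std::getline(in, line);                           // throw away the partial line we landed in
        if (!std::getline(in, line)) {                    // no complete line left after mid
            hi = mid;
            continue;
        }
        if (std::stol(line) < target)                     // parse the leading number (stops at ':')
            lo = mid;                                     // target line starts after mid
        else
            hi = mid;                                     // target line starts at or before here
    }

    // The window is now roughly one line wide; scan forward from lo
    // to the exact start of the first matching line.
    in.clear();
    in.seekg(static_cast<std::streamoff>(lo));
    if (lo != 0) std::getline(in, line);                  // align to the next line start
    std::uint64_t pos = static_cast<std::uint64_t>(in.tellg());
    while (pos < file_size && std::getline(in, line)) {
        if (std::stol(line) >= target) return pos;
        pos = static_cast<std::uint64_t>(in.tellg());
    }
    return file_size;
}
```

Calling find_first_line once for the start of the range and once for the line after its end would give the byte span to read, without touching the rest of the file.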
