C# code to perform a binary search in a very big text file

Posted 2024-08-28 16:45:08

Is there a library that I can use to perform a binary search in a very big text file (it can be 10 GB)?

The file is a sort of log file - every row starts with a date and time, therefore the rows are ordered.

Comments (5)

因为看清所以看轻 2024-09-04 16:45:08

As the lines are not guaranteed to be the same length, you're going to need some form of recognisable line delimiter, e.g. a carriage return or line feed.

The binary search pattern can then be pretty much your traditional algorithm. Seek to the 'middle' of the file (by length), seek backwards (byte by byte) to the start of the line you happen to land in, as identified by the line delimiter sequence, read that record and make your comparison. Depending on the comparison, seek halfway up or down (in bytes) and repeat.

When you identify the start index of a record, check whether it was the same as the last seek. You may find that, as you dial in on your target record, moving halfway won't get you to a different record. e.g. you have adjacent records of 100 bytes and 50 bytes respectively, so jumping in at 75 bytes always takes you back to the start of the first record. If that happens, read on to the next record before making your comparison.

You should find that you will reach your target pretty quickly.
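
A minimal C# sketch of this seek-and-scan approach, assuming '\n'-terminated lines that each begin with a fixed-width "yyyy-MM-dd HH:mm:ss" timestamp (the format and the 19-character prefix are assumptions for illustration, not something the answer specifies):

    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.IO;
    using System.Text;

    static class LogBinarySearch
    {
        // Returns the byte offset of the first line whose timestamp is >= target,
        // or -1 if every line in the file is earlier than the target.
        public static long FindFirstAtOrAfter(string path, DateTime target)
        {
            using var fs = new FileStream(path, FileMode.Open, FileAccess.Read);
            long lo = 0, hi = fs.Length, result = -1;
            while (lo < hi)
            {
                long mid = lo + (hi - lo) / 2;
                long lineStart = SeekToLineStart(fs, mid);
                string line = ReadLine(fs, lineStart, out long nextLine);
                // Assumed format: a 19-character "yyyy-MM-dd HH:mm:ss" prefix.
                DateTime stamp = DateTime.ParseExact(line.Substring(0, 19),
                    "yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture);
                if (stamp >= target)
                {
                    result = lineStart;
                    hi = lineStart;   // keep looking in the earlier half
                }
                else
                {
                    // Jump past the probed line so the range always shrinks;
                    // this avoids the "stuck between two records" case above.
                    lo = nextLine;
                }
            }
            return result;
        }

        // Walk backwards, byte by byte, to the '\n' before pos (or file start).
        static long SeekToLineStart(FileStream fs, long pos)
        {
            while (pos > 0)
            {
                fs.Seek(pos - 1, SeekOrigin.Begin);
                if (fs.ReadByte() == '\n') break;
                pos--;
            }
            return pos;
        }

        // Read one line from 'start'; 'next' receives the next line's offset.
        static string ReadLine(FileStream fs, long start, out long next)
        {
            fs.Seek(start, SeekOrigin.Begin);
            var bytes = new List<byte>();
            int b;
            while ((b = fs.ReadByte()) != -1 && b != '\n')
                bytes.Add((byte)b);
            next = fs.Position;
            return Encoding.UTF8.GetString(bytes.ToArray()).TrimEnd('\r');
        }
    }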

人事已非 2024-09-04 16:45:08

I started to write pseudo-code on how to do it, but I gave up since it might seem condescending. You probably know how to write a binary search; it's really not complicated.

You won't find it in a library, for two reasons:

  1. It's not really a "binary search" - the line sizes differ, so you need to adapt the algorithm (e.g. seek to the middle of the file, then look for the next "newline" and treat that as the "middle").
  2. Your datetime log format is most likely non-standard (OK, it may look "standard", but think a bit... you probably use '[]' or something to separate the date from the log message, e.g. [10/02/2001 10:35:02] My message).

In summary - I think your need is too specific, and too simple to implement in custom code, for someone to bother writing a library :)
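
That said, the comparison step for point 2 only needs a tiny parser for whatever prefix convention the log uses. A hedged sketch for the bracketed example format above (the bracket convention and the format string are taken from the example and are assumptions; adjust both to the real layout):

    using System;
    using System.Globalization;

    static class LogLineParser
    {
        // Extracts the timestamp from a line like
        // "[10/02/2001 10:35:02] My message".
        public static bool TryGetTimestamp(string line, out DateTime stamp)
        {
            stamp = default;
            int close = line.IndexOf(']');
            if (!line.StartsWith("[") || close < 2)
                return false;
            return DateTime.TryParseExact(
                line.Substring(1, close - 1), "dd/MM/yyyy HH:mm:ss",
                CultureInfo.InvariantCulture, DateTimeStyles.None, out stamp);
        }
    }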

笑,眼淚并存 2024-09-04 16:45:08

This shouldn't be too bad under the constraint that you hold an Int64 in memory for every line feed in the file. How well that scales really depends on how long the lines of text are on average: given 1000 bytes per line and 8 bytes per Int64, you'd be looking at around (10,000,000,000 / 1000 × 8) = 80 MB. Very big, but possible.

So try this:

  1. Scan the file and store the ordinal offset of each line feed in a List.
  2. Binary search the List with a custom comparer that seeks to the file offset and reads the data (sketched below).
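
A sketch of the two passes, assuming '\n'-terminated lines and timestamps whose plain string ordering matches chronological order (true for prefixes like "yyyy-MM-dd HH:mm:ss"). Rather than List<T>.BinarySearch with an IComparer, the comparison loop is written out by hand here, since each probe has to read from the file:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    static class IndexedLogSearch
    {
        // Pass 1: scan once and record the start offset of every line.
        public static List<long> BuildLineOffsets(string path)
        {
            var offsets = new List<long> { 0 };
            using var fs = new FileStream(path, FileMode.Open, FileAccess.Read);
            int b;
            long pos = 0;
            while ((b = fs.ReadByte()) != -1)
            {
                pos++;
                if (b == '\n' && pos < fs.Length)
                    offsets.Add(pos);   // the next line starts just past the '\n'
            }
            return offsets;
        }

        // Pass 2: binary-search the offsets; each probe seeks and reads one line.
        // Returns the index of a line starting with targetPrefix, or -1.
        public static int FindLine(string path, List<long> offsets, string targetPrefix)
        {
            using var fs = new FileStream(path, FileMode.Open, FileAccess.Read);
            int lo = 0, hi = offsets.Count - 1;
            while (lo <= hi)
            {
                int mid = lo + (hi - lo) / 2;
                string line = ReadLineAt(fs, offsets[mid]);
                string probe = line.Substring(0, Math.Min(targetPrefix.Length, line.Length));
                int cmp = string.CompareOrdinal(probe, targetPrefix);
                if (cmp == 0) return mid;
                if (cmp < 0) lo = mid + 1; else hi = mid - 1;
            }
            return -1;
        }

        static string ReadLineAt(FileStream fs, long offset)
        {
            fs.Seek(offset, SeekOrigin.Begin);
            var bytes = new List<byte>();
            int b;
            while ((b = fs.ReadByte()) != -1 && b != '\n')
                bytes.Add((byte)b);
            return Encoding.UTF8.GetString(bytes.ToArray()).TrimEnd('\r');
        }
    }
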
会发光的星星闪亮亮i 2024-09-04 16:45:08

If your file is static (or changes rarely) and you have to run "enough" queries against it, I believe the best approach will be creating an "index" file:

  1. Scan the initial file and take the datetime part of each line plus its position in the original file (this is why it has to be pretty static), and encode them somehow, for example: Unix time (full 10 digits) + nanoseconds (zero-filled 4 digits) + line position (zero-filled 10 digits). This way you will have a file with consistent "lines".

  2. Perform a binary search on that file (you may need to be a bit creative in order to achieve range search) and get the relevant location(s) in the original file.

  3. Read directly from the original file starting from the given location / read the given range.

You've got range search with O(log(n)) run-time :) (and you've created primitive DB functionality)

Needless to say, if the data file is updated "too" frequently, or you don't run "enough" queries against the index file, you may end up spending more time creating the index file than you save when querying the data file.

Btw, working with this index file doesn't require the data file to be sorted. As log files tend to be append-only, and already sorted, you can speed up the whole thing by simply creating an index file that only holds the locations of the EOL marks (zero-filled 10 digits) in the data file - this way you can perform the binary search directly on the data file (using the index file to determine the seek positions in the original file), and if lines are appended to the log file, you can simply add (append) their EOL positions to the index file.
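
A hedged sketch of that append-friendly variant (the file names are illustrative, and it stores each line's start offset rather than the EOL mark itself, which is equivalent for seeking): every index record is a zero-filled 10-digit offset plus '\n', so record i lives at byte i × 11 and the index itself has fixed-width lines addressable by arithmetic. Ten digits cover offsets up to 9,999,999,999 bytes, i.e. roughly the 10 GB in the question.

    using System.IO;
    using System.Text;

    static class EolIndex
    {
        const int RecordWidth = 11;   // 10 zero-filled digits + '\n'

        // Write one fixed-width record per data line, holding its start offset.
        public static void Build(string dataPath, string indexPath)
        {
            using var data = new FileStream(dataPath, FileMode.Open, FileAccess.Read);
            using var index = new StreamWriter(indexPath, append: false);
            if (data.Length > 0)
                index.Write(0L.ToString("D10") + "\n");   // first line starts at 0
            int b;
            long pos = 0;
            while ((b = data.ReadByte()) != -1)
            {
                pos++;
                if (b == '\n' && pos < data.Length)
                    index.Write(pos.ToString("D10") + "\n");
            }
        }

        public static long LineCount(FileStream index) => index.Length / RecordWidth;

        // Fetch the start offset of data line i straight from the index file.
        public static long OffsetOf(FileStream index, long i)
        {
            var buf = new byte[10];
            index.Seek(i * RecordWidth, SeekOrigin.Begin);
            index.Read(buf, 0, 10);
            return long.Parse(Encoding.ASCII.GetString(buf));
        }
    }

With OffsetOf in place, the binary search from the previous answer runs over line numbers 0 to LineCount - 1 without holding any offsets in memory, and when a line is appended to the log you just append one more 11-byte record to the index.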
