Fastest way to run a binary search on a file in C?
For example, let's say I want to find a particular word or number in a file. The contents are in sorted order (obviously). Since I want to run a binary search on the file, it seems like a real waste of time to copy the entire file into an array and then run binary search... I've effectively made it a linear time algorithm, because I'll have to spend O(n) time copying the darn file before I can run my search.
Is there a faster way to do this? Is there maybe something like lseek which works with lines instead of bytes?
If there isn't, am I better off just doing a linear search instead (assuming I'm only running the search once for the entire duration of my program)?
7 Answers
You cannot seek by line. It's pretty obvious once you think about it.
But you can do a sort-of binary search on a text file.
What you do is:

- Find out how long the file is (stat it, or seek to the end).
- Memory map the file. (This is best, I think, but you can use lseek and read if you must.)
- Jump to the middle of the range you're searching, scan forward to the next newline so you're at the start of a complete line, and compare that line with what you're looking for.
- Keep the half of the range that can still contain it and repeat, just like an ordinary binary search, until you're down to a line or two.
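A minimal sketch of these steps, in the spirit of the BSD look(1) utility; the names cmp_line, skip_line, and contains are made up for illustration, and it assumes the lines are sorted in plain byte (strcmp) order:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Compare target against the newline-terminated line starting at p.
       Returns <0, 0, >0 like strcmp. */
    static int cmp_line(const char *target, const char *p, const char *end)
    {
        while (p < end && *p != '\n' && *target != '\0') {
            if (*target != *p)
                return (unsigned char)*target - (unsigned char)*p;
            target++;
            p++;
        }
        if (*target == '\0' && (p == end || *p == '\n'))
            return 0;                       /* exact match */
        return *target == '\0' ? -1 : 1;    /* one is a prefix of the other */
    }

    /* Return the start of the line after the one containing p. */
    static const char *skip_line(const char *p, const char *end)
    {
        while (p < end && *p++ != '\n')
            ;
        return p;
    }

    /* Binary-search the sorted, newline-separated lines in buf[0..len). */
    static int contains(const char *buf, size_t len, const char *target)
    {
        const char *end = buf + len;
        const char *front = buf, *back = end;

        /* Bisect on byte offsets: jump to the midpoint, skip to the next
           line start, and compare that whole line with the target. */
        for (;;) {
            const char *p = skip_line(front + (back - front) / 2, back);
            if (p >= back)
                break;
            if (cmp_line(target, p, end) > 0)
                front = p;              /* target sorts after the line at p */
            else
                back = p;               /* target sorts at or before it */
        }

        /* Finish with a short linear scan; sortedness bounds how far we go. */
        for (const char *p = front; p < end; p = skip_line(p, end)) {
            int c = cmp_line(target, p, end);
            if (c == 0)
                return 1;               /* found it */
            if (c < 0)
                return 0;               /* we're past where it would be */
        }
        return 0;
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s sorted-file word\n", argv[0]);
            return 2;
        }

        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0) {
            perror(argv[1]);
            return 2;
        }
        if (st.st_size == 0) {
            puts("not found");
            return 0;
        }

        char *buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            return 2;
        }

        puts(contains(buf, st.st_size, argv[2]) ? "found" : "not found");
        munmap(buf, st.st_size);
        close(fd);
        return 0;
    }

Each probe only faults in the pages around one midpoint, so even on a file much larger than RAM the I/O stays logarithmic.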
A disk-based binary search needs to be, at least initially, "block-aware", i.e. aware of the fact that whether you read a single byte or a whole bunch, the I/O cost is the same. The other thing it needs to be aware of is the relatively higher cost of a seek operation as compared to a sequential read operation.
It can use this awareness of the characteristics of disk I/O in several ways: for example, reading a whole block at a time and comparing against every line in it before seeking again, or switching to a plain sequential scan once the remaining range fits within a block or two. A sketch of the block-read part follows.
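As a rough illustration of the block-aware read, here is a sketch that fetches the whole disk block containing a probe offset; read_block and the 4 KiB block size are assumptions, not anything from the answer:

    #include <unistd.h>

    #define BLK 4096    /* assumed I/O block size */

    /* Reading one byte costs the same as reading the whole block it
       lives in, so fetch the entire aligned block around `off` and let
       the caller examine every line inside it before seeking again. */
    static ssize_t read_block(int fd, off_t off, char buf[BLK])
    {
        return pread(fd, buf, BLK, off - (off % BLK));
    }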
If the file is small, like under a few hundred kilobytes, it's almost certainly faster to read (or memory-map) the entire file into memory. This is because the overhead of doing several I/O operations to seek and transfer is much worse than just reading the whole file, which is what most programs do and what most operating systems assume is done.
Unless all the lines are the same length, or have a very predictable length, there's no easy way to seek to line #n. But to perform a binary search, I'd work with byte offsets in the binary search and read, say, 100 bytes (if the words are all less than 100 characters long) before and after the offset, for a total of 200 bytes. Then scan for the newlines before and after the middle of it to extract the word.
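A sketch of that windowed read, assuming every word/line really is shorter than 100 bytes; word_at_offset is a made-up helper that the bisection loop would call at each midpoint:

    #include <string.h>
    #include <unistd.h>

    #define WINDOW 100  /* assumed upper bound on line length */

    /* Read up to 2*WINDOW bytes centered on `mid` and copy out the line
       containing `mid`, NUL-terminated. Returns 0 on success. */
    static int word_at_offset(int fd, off_t mid, char *out, size_t outsz)
    {
        char win[2 * WINDOW];
        off_t start = mid > WINDOW ? mid - WINDOW : 0;
        ssize_t n = pread(fd, win, sizeof win, start);

        if (n <= 0 || mid - start >= n)
            return -1;

        size_t c = (size_t)(mid - start);   /* position of `mid` in the window */
        size_t b = c, e = c;
        while (b > 0 && win[b - 1] != '\n')      /* back up to the previous newline */
            b--;
        while (e < (size_t)n && win[e] != '\n')  /* forward to the next one */
            e++;

        if (e - b >= outsz)
            return -1;                      /* line longer than promised */
        memcpy(out, win + b, e - b);
        out[e - b] = '\0';
        return 0;
    }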
Yes, you can lseek, but it helps if the size of each word/number per line is fixed. If that is not the case, which is more likely, then you have to lseek by fractions of the file size and scan to the nearest word beginning in order to still achieve something close to the typical O(log n) time complexity of a binary search.
There wouldn't be an "lseek"-like function for lines, because the file commands do not have the concept of a "line". This concept exists in a different layer of abstraction than the raw file commands.
As to whether it's faster or not, the answer will depend upon a number of factors, including the size of the file, the disk drive speed, and the amount of RAM available. If it isn't a large file, my guess is it would be faster to load the entire file into memory.
If it is a large file, I would use the binary search algorithm to narrow it down to a smaller range (say, a couple of megabytes), then load up that entire block.
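A sketch of that hand-off, with an arbitrary 2 MB threshold; load_range is a made-up helper that the bisection loop would call once the candidate range [lo, hi) has shrunk enough:

    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK (2 * 1024 * 1024)     /* arbitrary 2 MB threshold */

    /* Once the search has narrowed the range to CHUNK bytes or fewer,
       read the whole range in one sequential I/O and let the caller
       finish the search in memory. Returns NULL to keep bisecting. */
    static char *load_range(int fd, off_t lo, off_t hi, size_t *len)
    {
        if (hi - lo > CHUNK)
            return NULL;

        char *buf = malloc((size_t)(hi - lo));
        ssize_t n = buf ? pread(fd, buf, (size_t)(hi - lo), lo) : -1;
        if (n < 0) {
            free(buf);
            return NULL;
        }
        *len = (size_t)n;
        return buf;
    }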
As mentioned above, since the file is a text file, predicting the byte at which a given line begins within the file can't be done reliably. The ersatz binary search idea is a pretty good one. But it really won't save you a ton unless the file is huge, given how fast sequential I/O is nowadays and how slow random I/O is.
As you mention, if you are going to read it in, you might as well linearly search it as you go. So do that: use a modified Boyer-Moore search as you read it in, and you'll do pretty well.
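For reference, here is a sketch of Boyer-Moore-Horspool, the commonly used simplification of Boyer-Moore, over a buffer already in memory; carrying a match across read-chunk boundaries in a truly streaming version is left out:

    #include <stddef.h>
    #include <string.h>

    /* Return the offset of the first occurrence of needle[0..nlen) in
       hay[0..hlen), or -1 if absent. Sub-linear on average: the bad-
       character table lets us skip ahead by up to nlen bytes at a time. */
    static long bmh_search(const unsigned char *hay, size_t hlen,
                           const unsigned char *needle, size_t nlen)
    {
        size_t skip[256];

        if (nlen == 0 || hlen < nlen)
            return -1;

        for (size_t i = 0; i < 256; i++)
            skip[i] = nlen;                 /* default: shift past the window */
        for (size_t i = 0; i + 1 < nlen; i++)
            skip[needle[i]] = nlen - 1 - i; /* chars in the pattern shift less */

        for (size_t pos = 0; pos + nlen <= hlen;
             pos += skip[hay[pos + nlen - 1]]) {
            if (memcmp(hay + pos, needle, nlen) == 0)
                return (long)pos;
        }
        return -1;
    }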
There are so many performance tradeoffs here that it's impossible to know what makes sense until you have measurements on typical data.
If you're going to maintain this code, it needs to be simple. If searches are rare or the file is small, go with linear search. If the cost actually matters, you'll have to do some experiments.
The second thing I would try after linear search would be to mmap the file and scan through it for newlines. This does take linear time, but strchr can be very fast. It helps if you can guarantee the file ends in a newline. Once you have the lines demarcated, you can keep the number of comparisons small by doing a binary search.
Another option you should consider is Boyer-Moore string search. This is a sub-linear time search and, depending on the size of the search pattern, it may be faster than the logarithmic binary search. Boyer-Moore is especially good with long search strings.
Finally, if you determine binary search is really good, but that identifying the lines is a performance bottleneck, you could precompute the start location of each line and store these precomputed locations in binary format in an auxiliary file.
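A sketch of that one-time preprocessing pass; build_line_index is a made-up name, and the offsets are written in native byte order, so the auxiliary file is only meaningful to the machine that built it:

    #include <stdint.h>
    #include <stdio.h>

    /* Write the byte offset of every line start in `text` to `index`
       as fixed-size 8-byte records. A later run can binary-search the
       fixed-width records directly, then use the chosen offset to read
       the corresponding line from the text file. */
    static int build_line_index(FILE *text, FILE *index)
    {
        uint64_t off = 0;
        int at_line_start = 1;

        for (int c; (c = fgetc(text)) != EOF; off++) {
            if (at_line_start &&
                fwrite(&off, sizeof off, 1, index) != 1)
                return -1;
            at_line_start = (c == '\n');
        }
        return ferror(text) ? -1 : 0;
    }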
I feel comfortable making only one prediction: it is almost certainly worth avoiding reading in one line at a time with something like
readline()
orfgets()
, because this strategy invariably involves callingmalloc()
to hold the contents of the line. The cost of callingmalloc()
on every line is likely to swamp any cost of search or comparison.