在打开的文件的排序行中进行二分搜索(未加载到内存中)

发布于 2025-01-09 02:35:36 字数 543 浏览 0 评论 0原文

我有一个数千兆字节的文本文件,并且对数百万行进行了排序:

aaaaa
bcijkjf
dfsdf
gdfgdfhqiuzhf
zzdiszfhj

如何在不将整个文件加载到内存中的情况下,通过二分搜索来搜索一行是否存在? (行数可能是 O(log n))

文件的行中是否有类似 bisect.bisect_left 的函数 f = open('file.txt', 'r') 在 Python 库中?

窗口最初为[a, b] = [0, file_size]。然后它会在文件中的位置 m=(a+b)/2 处查找,查找下一个 \n,并读取以下行 l< /代码>。如果要搜索的模式小于或大于 l(按字典顺序),则我们继续搜索 [m, b][a, m]< /代码>。在我自己推出之前,Python 中是否存在这个?

I have a text file of multiple gigabytes, and the millions of lines are sorted:

aaaaa
bcijkjf
dfsdf
gdfgdfhqiuzhf
zzdiszfhj

How is it possible, without loading the whole file in memory, to search if a line is existent, with a bisection search? (Possibly in O(log n) in the number of lines)

Is there a function like bisect.bisect_left among the lines of a file f = open('file.txt', 'r') in the Python library?

The window would initially be [a, b] = [0, file_size]. Then it would seek in the file at position m=(a+b)/2, look for the next \n, and read the following line l. If the pattern to search is smaller or greater than l (with lexicographic order), then we continue on [m, b] or [a, m]. Before rolling my own, does this exist in Python?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

七堇年 2025-01-16 02:35:36

您可以使用 mmap 内置模块。它提供对文件的随机访问(即,文件的行为类似于存储在文件系统中的大字节数组)。您可以在此处找到更多信息。

import mmap

def bisect_search(file_path, line):
    line = line.encode()
    with open(file_path, 'r+b') as f:
        mm = mmap.mmap(f.fileno(), 0)
        lo = 0
        hi = mm.size()
        while lo < hi:
            mid = (lo + hi) // 2
            left_endl_idx = mm.rfind(b'\n', lo, mid)
            right_endl_idx = mm.find(b'\n', mid, hi)
            if left_endl_idx == -1:
                left_endl_idx = lo - 1
            if right_endl_idx == -1:
                right_endl_idx = hi
            mid_line = mm[left_endl_idx + 1: right_endl_idx]
            if mid_line == line:
                return True
            if mid_line < line:
                lo = right_endl_idx + 1
            else:
                hi = left_endl_idx
    return False

如果文件中存在 line,则该函数返回 True,否则返回 False。让我们使用以下 myfile.txt 文件来运行一些示例:

aaaaa
bcijkjf
dfsdf
gdfgdfhqiuzhf
zzdiszfhj
>>> bisect_search('myfile.txt', 'hello')
False
>>> bisect_search('myfile.txt', 'aaaaa')
True
>>> bisect_search('myfile.txt', 'aaaa')
False
>>> bisect_search('myfile.txt', 'dfsdf')
True
>>> bisect_search('myfile.txt', 'zzdiszfhjj')
False

此函数应该比大文件上的线性搜索快得多。

注意:此代码适用于 \n 结尾,目前不适用于 \r\n Windows 风格的结尾(OP 不需要)。

You can use the mmap built-in module. It provides random access to files (i.e., a file behaves like a large array of bytes stored in the file system). You can find more info here.

import mmap

def bisect_search(file_path, line):
    line = line.encode()
    with open(file_path, 'r+b') as f:
        mm = mmap.mmap(f.fileno(), 0)
        lo = 0
        hi = mm.size()
        while lo < hi:
            mid = (lo + hi) // 2
            left_endl_idx = mm.rfind(b'\n', lo, mid)
            right_endl_idx = mm.find(b'\n', mid, hi)
            if left_endl_idx == -1:
                left_endl_idx = lo - 1
            if right_endl_idx == -1:
                right_endl_idx = hi
            mid_line = mm[left_endl_idx + 1: right_endl_idx]
            if mid_line == line:
                return True
            if mid_line < line:
                lo = right_endl_idx + 1
            else:
                hi = left_endl_idx
    return False

The function returns True if line exists within the file, False otherwise. Let's use the following myfile.txt file to run a few examples:

aaaaa
bcijkjf
dfsdf
gdfgdfhqiuzhf
zzdiszfhj
>>> bisect_search('myfile.txt', 'hello')
False
>>> bisect_search('myfile.txt', 'aaaaa')
True
>>> bisect_search('myfile.txt', 'aaaa')
False
>>> bisect_search('myfile.txt', 'dfsdf')
True
>>> bisect_search('myfile.txt', 'zzdiszfhjj')
False

This function should be much faster than a linear search on a big file.

Note: this code works with \n endings, and not currently with \r\n Windows-style endings (not necessary for OP).

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文