Python random access file

Posted on 2024-10-17 13:21:04

Is there a Python file type for accessing random lines without traversing the whole file? I need to search within a large file, and reading the whole thing into memory isn't possible.

Any types or methods would be appreciated.

Comments (7)

傲世九天 2024-10-24 13:21:04

This seems like just the sort of thing mmap was designed for. An mmap object gives you a bytes-like interface to a file:

>>> from mmap import mmap
>>> f = open("bonnie.txt", "wb")
>>> f.write(b"My Bonnie lies over the ocean.")
30
>>> f.close()
>>> f = open("bonnie.txt", "r+b")
>>> mm = mmap(f.fileno(), 0)
>>> print(mm[3:9])
b'Bonnie'

In case you were wondering, mmap objects can also be assigned to, as long as the replacement has the same length as the slice:

>>> print(mm[24:])
b'ocean.'
>>> mm[24:] = b"sea.  "
>>> print(mm[:])
b'My Bonnie lies over the sea.  '

酒儿 2024-10-24 13:21:04

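You can use linecache:

import linecache
print(linecache.getline("your_file.txt", randomLineNumber))  # Note: first line is 1, not 0

Note that linecache reads and caches the whole file the first time you ask for a line, so it may not suit a file that is too large for memory.
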
溺渁∝ 2024-10-24 13:21:04

Since lines can be of arbitrary length, you really can't get at a random line (whether you mean "a line whose number is actually random" or "a line with an arbitrary number, selected by me") without traversing the whole file.

If kinda-sorta-random is enough, you can seek to a random place in the file and then read forward until you hit a line terminator. But that's useless if you want to find (say) line number 1234, and will sample lines non-uniformly if you actually want a randomly chosen line.
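
The seek-then-read-forward idea can be sketched roughly as follows. The function name approx_random_line is made up for illustration, the file is assumed to be non-empty, and, as noted above, longer lines are more likely to be skipped over, so the sample is not uniform.

```python
import os
import random

def approx_random_line(path, rng=random):
    """Return the line that follows a random byte offset (non-uniform sample)."""
    size = os.path.getsize(path)        # assumes the file is not empty
    with open(path, "rb") as f:
        f.seek(rng.randrange(size))     # jump to a random byte
        f.readline()                    # discard the partial line we landed in
        line = f.readline()             # the next complete line
        if not line:                    # we landed in the last line: wrap around
            f.seek(0)
            line = f.readline()
    return line
```

Only two lines are ever read, so the cost does not depend on the file size.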

谜兔 2024-10-24 13:21:04

File objects have a seek method that takes a byte offset within the file.
To traverse a large file, iterate over it and check each line for the value you want. Iterating over a file object does not load the whole file into memory.
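
A minimal sketch of that approach, assuming the search term is a byte string. The name find_lines is hypothetical. It streams the file one line at a time and records each match's byte offset so you can seek straight back to it later.

```python
def find_lines(path, needle):
    """Yield (byte_offset, line) for every line that contains needle."""
    offset = 0
    with open(path, "rb") as f:
        for line in f:              # one line at a time, not the whole file
            if needle in line:
                yield offset, line
            offset += len(line)     # byte offset of the next line
```

A saved offset can be passed to f.seek() on a file opened in "rb" mode to jump directly to that line without rescanning.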

新一帅帅 2024-10-24 13:21:04

Yes, you can easily get a random line. Just seek to a random position in the file, then seek towards the beginning until you hit a \n or the beginning of the file, then read a line.

Code:

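import random
import sys

with open(sys.argv[1], "rb") as f:   # binary mode, so offsets are plain byte counts
    f.seek(0, 2)                     # seek to end of file
    size = f.tell()
    pos = random.randrange(size)     # random byte offset (assumes a non-empty file)

    # Now step backward until the beginning of the file or just past a \n
    while pos > 0:
        f.seek(pos - 1)
        if f.read(1) == b"\n":
            break
        pos -= 1

    # Now get the line (decode assumes UTF-8 text)
    f.seek(pos)
    print(f.readline().decode(), end="")
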
胡渣熟男 2024-10-24 13:21:04

File objects support seek, but make sure you open the file in binary mode, i.e. "rb".

You may also wish to use the mmap module for random access, particularly if the data is in an internal format already.

暮年慕年 2024-10-24 13:21:04

Does the file have fixed-length records? If so, and the records are sorted by the key you are searching on, you can implement a binary search using seek.
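
As a rough sketch of what that could look like: the function find_record and its parameters are hypothetical, and it assumes every record is exactly record_size bytes, that the key is the first key_size bytes of each record, and that records are sorted by that key as raw bytes.

```python
import os

def find_record(path, key, record_size, key_size):
    """Binary-search a file of sorted fixed-length records; return the record or None."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // record_size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * record_size)       # record number -> byte offset
            record = f.read(record_size)
            record_key = record[:key_size]
            if record_key == key:
                return record
            if record_key < key:
                lo = mid + 1
            else:
                hi = mid
    return None
```

Each lookup touches about log2(N) records, so even a file with millions of records needs only a couple of dozen seeks.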

Otherwise, load your file into an SQLite database and query that.
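
One possible way to do that with the standard sqlite3 module is sketched below. The table layout and the helper names build_line_db and get_line are assumptions for illustration, and the input is assumed to be UTF-8 text.

```python
import sqlite3

def build_line_db(text_path, db_path):
    """Load each line of text_path into a SQLite table keyed by line number."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lines (lineno INTEGER PRIMARY KEY, content TEXT)"
    )
    with open(text_path, "r", encoding="utf-8") as f:
        # The generator streams the file, so it is never fully in memory.
        rows = ((n, line.rstrip("\n")) for n, line in enumerate(f, start=1))
        conn.executemany("INSERT OR REPLACE INTO lines VALUES (?, ?)", rows)
    conn.commit()
    return conn

def get_line(conn, lineno):
    """Return the text of the given 1-based line number, or None if absent."""
    row = conn.execute(
        "SELECT content FROM lines WHERE lineno = ?", (lineno,)
    ).fetchone()
    return row[0] if row else None
```

The load is a one-time pass over the file. After that, lookups by line number use the primary key, and content searches can be written as ordinary SQL queries.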
