Python random access file
Is there a Python file type for accessing random lines without traversing the whole file? I need to search within a large file, and reading the whole thing into memory isn't possible.
Any types or methods would be appreciated.
Comments (7)
This seems like just the sort of thing mmap was designed for. An mmap object creates a string-like interface to a file.
In case you were wondering, mmap objects can also be assigned to.
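As a rough illustration of the mmap approach described above, the sketch below maps a small file, slices it, searches it, and assigns to a slice. The sample file and its contents are made up for the example, and it assumes Python 3, where the mapped data behaves like bytes rather than str.

```python
import mmap
import os
import tempfile

# Hypothetical sample file, created only so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"alpha\nbravo\ncharlie\n")

# Open read/write so that slices of the map can be assigned to.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mm:   # length 0 maps the whole file
        first_five = mm[0:5]               # slicing works as on a bytes object
        bravo_at = mm.find(b"bravo")       # search directly on the mapped file
        mm[0:5] = b"ALPHA"                 # the replacement must have the same length
        mm.flush()                         # push the change back to the file

with open(path, "rb") as f:
    after = f.read()
```

Slice assignment cannot grow or shrink the file, so this suits in-place patches of equal length rather than general editing.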
You can use linecache:
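A minimal sketch of the linecache suggestion, using a made-up sample file. One caveat worth keeping in mind: linecache reads and caches the whole file the first time a line is requested, so it may not suit a file that does not fit in memory.

```python
import linecache
import os
import tempfile

# Hypothetical sample file, created only so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"first\nsecond\nthird\n")

# Line numbers are 1-based; the returned line keeps its trailing newline.
second = linecache.getline(path, 2)

# A line number past the end of the file yields an empty string, not an error.
missing = linecache.getline(path, 99)
```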
Since lines can be of arbitrary length, you really can't get at a random line (whether you mean "a line whose number is actually random" or "a line with an arbitrary number, selected by me") without traversing the whole file.
If kinda-sorta-random is enough, you can seek to a random place in the file and then read forward until you hit a line terminator. But that's useless if you want to find (say) line number 1234, and will sample lines non-uniformly if you actually want a randomly chosen line.
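The seek-and-read-forward idea can be sketched roughly as follows. The function name and the wrap-around behaviour at the end of the file are choices made for this example, and it assumes a non-empty file. As the answer says, the sampling is not uniform: a line that follows a long line is picked more often.

```python
import os
import random
import tempfile

def roughly_random_line(path, rng=random):
    """Return the first complete line after a random byte offset (non-uniform)."""
    size = os.path.getsize(path)          # assumes the file is not empty
    with open(path, "rb") as f:
        f.seek(rng.randrange(size))
        f.readline()                      # discard the rest of the line we landed in
        line = f.readline()               # the next complete line
        if not line:                      # ran off the end: wrap to the first line
            f.seek(0)
            line = f.readline()
    return line

# Hypothetical sample file, created only so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"one\ntwo\nthree\n")

picked = roughly_random_line(path)
```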
File objects have a seek method, which takes a byte offset and moves to that position in the file.
To traverse a large file, iterate over it and check for the value in each line. Iterating over a file object does not load the whole file into memory.
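Both points can be sketched briefly. The helper name `find_lines` and the sample data are invented for this example; the search reads one line at a time, and the final lines show `seek` jumping straight to a byte offset.

```python
import os
import tempfile

def find_lines(path, needle):
    """Yield (line_number, line) for every line that contains needle."""
    with open(path, "r", encoding="utf-8") as f:
        # Iterating the file object reads it line by line, not all at once.
        for number, line in enumerate(f, start=1):
            if needle in line:
                yield number, line.rstrip("\n")

# Hypothetical sample file, created only so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"apple\nbanana\ncherry\n")

matches = list(find_lines(path, "an"))

# seek() moves to an absolute byte offset; binary mode keeps offsets exact.
with open(path, "rb") as f:
    f.seek(6)
    chunk = f.read(6)
```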
Yes, you can easily get a random line. Just seek to a random position in the file, then seek towards the beginning until you hit a \n or the beginning of the file, then read a line.
Code:
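A minimal sketch of that approach, written for this page rather than taken from the original answer. The function names are made up, and it assumes a non-empty file with `\n` line endings. Lines are chosen in proportion to their length, so the sample is not uniform, and stepping back one byte at a time is simple rather than fast.

```python
import os
import random
import tempfile

def line_containing(path, pos):
    """Return the full line that contains byte offset pos."""
    with open(path, "rb") as f:
        # Walk backwards until the previous byte is a newline
        # or we reach the beginning of the file.
        while pos > 0:
            f.seek(pos - 1)
            if f.read(1) == b"\n":
                break
            pos -= 1
        f.seek(pos)
        return f.readline()

def random_line(path, rng=random):
    """Pick a random byte offset and return the line it falls in."""
    return line_containing(path, rng.randrange(os.path.getsize(path)))

# Hypothetical sample file, created only so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"one\ntwo\nthree\n")

picked = random_line(path)
```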
File objects support seek, but make sure that you open the file in binary mode, i.e. "rb".
You may also wish to use the mmap module for random access, particularly if the data is in an internal format already.
Does the file have fixed-length records? If so, yes, you can implement a binary search algorithm using seeking.
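A sketch of what such a binary search might look like. It assumes the records are sorted in ascending byte order and are all exactly `RECORD_SIZE` bytes; the record layout (a word padded to 7 bytes plus a newline) and the helper names are hypothetical.

```python
import os
import tempfile

RECORD_SIZE = 8  # hypothetical: 7 bytes of padded text plus a newline

def make_record(word):
    """Pad a word to the fixed record width used in this example."""
    return word.ljust(RECORD_SIZE - 1).encode("ascii") + b"\n"

def find_record(path, key):
    """Binary-search a sorted file of fixed-length records; return the index or -1."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // RECORD_SIZE
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD_SIZE)     # jump straight to record number mid
            record = f.read(RECORD_SIZE)
            if record < key:
                lo = mid + 1
            elif record > key:
                hi = mid
            else:
                return mid
    return -1

# Hypothetical sorted sample data, created only so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "records.dat")
with open(path, "wb") as f:
    for word in ["apple", "banana", "cherry", "date", "fig"]:
        f.write(make_record(word))

cherry_index = find_record(path, make_record("cherry"))
zebra_index = find_record(path, make_record("zebra"))
```

Each lookup reads only about log2(n) records, so the file never has to be loaded into memory.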
Otherwise, load your file into an SQLite database and query that.
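One way this could look with the standard-library sqlite3 module. The table name, column names, and `load_lines` helper are invented for this sketch. The file is streamed into the database line by line, and the integer primary key gives indexed lookup by line number.

```python
import os
import sqlite3
import tempfile

def load_lines(db_path, text_path):
    """Copy each line of text_path into a table keyed by its line number."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE lines (num INTEGER PRIMARY KEY, content TEXT)")
    with open(text_path, "r", encoding="utf-8") as f:
        # The generator feeds rows one at a time, so the file is never fully in memory.
        conn.executemany(
            "INSERT INTO lines (num, content) VALUES (?, ?)",
            ((n, line.rstrip("\n")) for n, line in enumerate(f, start=1)),
        )
    conn.commit()
    return conn

# Hypothetical sample file, created only so the sketch runs on its own.
workdir = tempfile.mkdtemp()
text_path = os.path.join(workdir, "sample.txt")
with open(text_path, "wb") as f:
    f.write(b"first\nsecond\nthird\n")

conn = load_lines(os.path.join(workdir, "lines.db"), text_path)
row = conn.execute("SELECT content FROM lines WHERE num = ?", (2,)).fetchone()
hits = conn.execute(
    "SELECT num FROM lines WHERE content LIKE ? ORDER BY num", ("%ir%",)
).fetchall()
conn.close()
```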