ungetc in Python
Some file-reading functions in Python, such as readlines(),
copy the whole file contents into memory (as a list).
I need to process a file that's too large to
be copied into memory, so I need to use
a file pointer (to access the file one byte
at a time), as with C's getc().
The additional requirement I have is that
I'd like to rewind the file pointer to previous
bytes, as with C's ungetc().
Is there a way to do this in Python?
Also, in Python, I can read one line at a
time with readline().
Is there a way to read the previous line,
going backward?
Comments (6)
You do not need file pointers, which Python does not have or want.
To go through a file line by line without reading the whole thing into memory, just iterate over the file object itself, i.e. use the file object directly in a for loop.
Using readlines() is generally to be avoided.
Going back a line isn't something you can do very easily. If you never need to go back more than one line, check out the pairwise recipe in the itertools documentation.
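As a rough illustration of the two points in that answer (lazy iteration over the file object, and the pairwise recipe for keeping the previous line at hand), here is a minimal sketch. The `pairwise` helper follows the recipe from the itertools documentation; `io.StringIO` is only a stand-in for a real file opened with `open()`, and the sample text is made up.

```python
import io
from itertools import tee


def pairwise(iterable):
    """Yield overlapping pairs (s0, s1), (s1, s2), ... from any iterable."""
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator by one item
    return zip(a, b)


# io.StringIO stands in for a real file object here (hypothetical content).
sample = io.StringIO("first\nsecond\nthird\n")

# Iterating the file object yields one line at a time; only the previous
# and current lines are held in memory at any moment.
line_pairs = [(prev.rstrip("\n"), cur.rstrip("\n")) for prev, cur in pairwise(sample)]
```

On Python 3.10 and later, `itertools.pairwise` can be imported directly instead of defining the helper.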
OK, here's what I came up with. Thanks, Brenda, for the idea of building a class.
Thanks, Josh, for the idea of using the C-like functions seek() and read().
If you do want to use a file pointer directly (I think Mike Graham's suggestion is better, though), you can use the file object's seek() method, which lets you set the internal position, combined with the read() method, which supports an optional argument specifying how many bytes you'd like to read.
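A minimal sketch of the seek()/read() approach, assuming the file is opened in binary mode (text-mode files do not support nonzero seeks relative to the current position). The helper names `getc` and `ungetc` are hypothetical, chosen to mirror the C functions, and `io.BytesIO` stands in for `open(path, "rb")`. Unlike C's ungetc(), this only moves the position back; it cannot push back a different byte.

```python
import io
import os


def getc(f):
    """Read one byte; returns b"" at end of file."""
    return f.read(1)


def ungetc(f, count=1):
    """Move the file position back by `count` bytes (binary mode only)."""
    f.seek(-count, os.SEEK_CUR)


f = io.BytesIO(b"abc")  # stand-in for open(path, "rb")
first = getc(f)
second = getc(f)
ungetc(f)               # step back over the byte just read
again = getc(f)
position = f.tell()     # current offset, handy for debugging
```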
Write a class that reads and buffers input for you, and implement ungetc on it, something like this perhaps (warning: untested, written while compiling):
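What follows is not the answerer's original code, just a minimal sketch of the idea under some assumptions: the class name `PushbackReader`, its `chunk_size` parameter, and the choice to return length-1 bytes objects are all hypothetical. It reads the underlying file in chunks and keeps a small stack of pushed-back bytes.

```python
import io


class PushbackReader:
    """Buffered byte reader with getc()/ungetc(), loosely modelled on C stdio."""

    def __init__(self, fileobj, chunk_size=64 * 1024):
        self._file = fileobj          # must be opened in binary mode
        self._chunk_size = chunk_size
        self._buffer = b""
        self._index = 0
        self._pushed = []             # pushed-back bytes, served last-in first-out

    def getc(self):
        """Return the next byte as a length-1 bytes object, or b"" at end of file."""
        if self._pushed:
            return self._pushed.pop()
        if self._index >= len(self._buffer):
            # Buffer exhausted: fetch the next chunk from the file.
            self._buffer = self._file.read(self._chunk_size)
            self._index = 0
            if not self._buffer:
                return b""
        byte = self._buffer[self._index:self._index + 1]
        self._index += 1
        return byte

    def ungetc(self, byte):
        """Push one byte back so that the next getc() returns it."""
        self._pushed.append(byte)


reader = PushbackReader(io.BytesIO(b"xyz"), chunk_size=2)
seen = [reader.getc(), reader.getc()]
reader.ungetc(seen[-1])               # put the last byte back
seen.append(reader.getc())            # reads the same byte again
```

Because the pushback stack is separate from the read buffer, ungetc() works even right after a chunk boundary.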
I didn't want to do billions of unbuffered single-character file reads, and I wanted a way
to debug the position of the file pointer. Hence, I decided to return the file position
along with each char or line, and to use mmap to map the file into memory (and let mmap
handle paging). I think this will be a bit of a problem if the file is really, really big
(as in larger than the amount of physical memory): that's when mmap would start relying on
virtual memory, and things could get really slow. For now, it processes a 50 MB file in about 4 min.
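A minimal sketch of that mmap-based design, assuming a read-only mapping of a non-empty file (mmap raises ValueError for an empty one). The class name `MappedReader` and its methods are hypothetical, not the poster's actual code; each getc() returns the position together with the byte, as described above.

```python
import mmap
import os
import tempfile


class MappedReader:
    """Byte reader over a memory-mapped file that also reports positions."""

    def __init__(self, path):
        self._file = open(path, "rb")
        # Length 0 maps the whole file; the OS pages data in on demand.
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._pos = 0

    def getc(self):
        """Return (position, byte); byte is b"" once the end is reached."""
        pos = self._pos
        if pos >= len(self._map):
            return pos, b""
        self._pos += 1
        return pos, self._map[pos:pos + 1]

    def ungetc(self):
        """Step back one byte, never before the start of the file."""
        if self._pos > 0:
            self._pos -= 1

    def close(self):
        self._map.close()
        self._file.close()


# Demonstration on a small temporary file (made-up content).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hi")
reader = MappedReader(tmp.name)
first = reader.getc()
reader.ungetc()
again = reader.getc()
last = reader.getc()
at_end = reader.getc()
reader.close()
os.remove(tmp.name)
```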
The question was initially prompted by my need to build a lexical analyzer.
getc() and ungetc() are useful at first (to get the read bugs out of the way and
to build the state machine). After the state machine is done,
getc() and ungetc() become a liability, as they take too long when reading
directly from storage.
Once the state machine was complete (IO problems debugged,
states finalized), I optimized the lexical analyzer.
Reading the source file in chunks (or pages) into memory and running
the state machine on each chunk gave the best timing results.
I found that considerable time is saved if getc() and ungetc() are not used
to read from the file directly.
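To make the chunked approach concrete, here is a minimal sketch, not the poster's lexer: a two-state machine (inside a word or outside one) that runs over each chunk in turn and carries its state across chunk boundaries. The function name `tokenize_words`, the `WORD_BYTES` set, and the chunk size are assumptions made for illustration.

```python
import io
import string

# Bytes that may appear inside a token (assumed: ASCII letters, digits, underscore).
WORD_BYTES = frozenset((string.ascii_letters + string.digits + "_").encode("ascii"))


def tokenize_words(fileobj, chunk_size=64 * 1024):
    """Yield word tokens from a binary file, reading one chunk at a time."""
    token = bytearray()               # state carried across chunk boundaries
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        for value in chunk:           # iterating bytes yields integers
            if value in WORD_BYTES:
                token.append(value)   # state: inside a word
            elif token:
                yield bytes(token)    # transition: the word just ended
                token.clear()
    if token:
        yield bytes(token)            # flush a word that ends at end of file


# A tiny chunk size forces tokens to straddle chunk boundaries.
tokens = list(tokenize_words(io.BytesIO(b"foo bar_1  baz"), chunk_size=4))
```

Since the partial token lives in the state machine rather than in the file position, no ungetc() is needed when a token spans two chunks.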