Python: itertools.islice not working in a loop
I have code like this:
# opened file f
goto_line = num_lines  # total number of lines
while not found:
    line_str = next(itertools.islice(f, goto_line - 1, goto_line))
    goto_line = goto_line // 2
    # checks for data, sets found to True if needed
line_str is correct on the first pass, but every pass after that reads a different line than it should.
For example, goto_line starts off as 1000. It reads line 1000 just fine. On the next loop, goto_line is 500, but it doesn't read line 500. It reads some line closer to 1000.
I'm trying to read specific lines in a large file without reading more than necessary. Sometimes it jumps backwards to a line and sometimes forward.
I did try linecache, but I typically don't run this code more than once on the same file.
Comments (2)
Python iterators can be consumed only once. Repeated islice() calls on the same iterator do not restart from the beginning: the slicing always starts where we stopped last time.
The easiest way to make your code work is to use f.readlines() to get a list of the lines in the file and then use normal Python list slicing [i:j]. If you really want to use islice(), you could start reading the file from the beginning each time by using f.seek(0), but this will be very inefficient.
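To make the point about consumed iterators concrete, here is a minimal sketch. The helper names repeated_slices and read_line_restarting are made up for illustration, and an in-memory io.StringIO stands in for the real file, which is an assumption of this example rather than something from the question.

```python
import io
from itertools import islice

def repeated_slices(n_items=10, start=1, stop=3, times=4):
    """Apply the same islice() several times to one shared iterator."""
    it = iter(range(n_items))
    return [list(islice(it, start, stop)) for _ in range(times)]

def read_line_restarting(f, line_no):
    """Return the 1-based line `line_no`, rewinding the file first.

    Hypothetical helper: it re-reads from the start on every call, so it
    is correct but does up to `line_no` lines of work each time.
    """
    f.seek(0)
    return next(islice(f, line_no - 1, line_no))

# Each slice continues from where the previous one stopped.
print(repeated_slices())  # [[1, 2], [4, 5], [7, 8], []]

# Stand-in for a real file with 1000 lines.
f = io.StringIO("".join(f"line {i}\n" for i in range(1, 1001)))
print(read_line_restarting(f, 1000), end="")  # line 1000
print(read_line_restarting(f, 500), end="")   # line 500
```

Without the f.seek(0) call, the second lookup would continue from the position left by the first one, which is the behaviour described in the question.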
You cannot (this way - perhaps there is some way depending on how the file is opened) go back in the file. The standard file iterator (in fact, most iterators - Python's iterator protocol only supports forward iterators) moves only forward. So after reading k lines, reading another k/2 lines actually gives the (k + k/2)th line.
You could try reading the whole file into memory, but you have a lot of data, so memory consumption will probably become an issue. You could use file.seek to scroll through the file. But that's still a lot of work - perhaps you could use a memory-mapped file? That's only possible if lines are fixed-size though.
If it's necessary, you could pre-calculate the line numbers you'd like to check and save all those lines (shouldn't be too many, roughly int(log_2(line_count)) + 1 if I'm not mistaken) in one iteration, so you don't have to scroll back after reading the whole file.
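The pre-calculation idea can be sketched roughly as follows. This assumes the line numbers to check are simply the repeated halvings of the total line count, as in the question's loop; the function names halving_line_numbers and collect_lines are hypothetical, and io.StringIO again stands in for the real file.

```python
import io

def halving_line_numbers(total_lines):
    """Line numbers visited by repeated halving: n, n//2, n//4, ..., 1."""
    numbers = []
    n = total_lines
    while n >= 1:
        numbers.append(n)
        n //= 2
    return numbers

def collect_lines(f, wanted):
    """Make one forward pass over f, keeping only the 1-based lines in `wanted`."""
    remaining = set(wanted)
    if not remaining:
        return {}
    last = max(remaining)
    found = {}
    for line_no, line in enumerate(f, start=1):
        if line_no in remaining:
            found[line_no] = line
        if line_no >= last:
            break  # nothing further is needed, stop reading
    return found

# Stand-in for a real file with 1000 lines.
f = io.StringIO("".join(f"line {i}\n" for i in range(1, 1001)))
wanted = halving_line_numbers(1000)
print(wanted)       # [1000, 500, 250, 125, 62, 31, 15, 7, 3, 1]
lines = collect_lines(f, wanted)
print(lines[500], end="")  # line 500
```

For 1000 lines this keeps 10 lines in memory, which agrees with the int(log_2(line_count)) + 1 estimate. The trade-off is that the file is still read once up to the largest wanted line, even if the search would have stopped early.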