Why does takewhile() skip the first line?
I have a file like this:

    1
    2
    3
    TAB
    1
    2
    3
    TAB

I want to read the lines between TAB as blocks, where TAB is a line containing only a tab character.

    import itertools

    def block_generator(file):
        with open(file) as lines:
            for line in lines:
                block = list(itertools.takewhile(lambda x: x.rstrip('\n') != '\t',
                                                 lines))
                yield block

I want to use it like this:

    blocks = block_generator(myfile)
    for block in blocks:
        do_something(block)

The blocks I get all start with the second line, like [2,3] [2,3]. Why?

Comments (4)
Here is another approach using groupby
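
A minimal sketch of how a groupby-based approach might look; this is not necessarily the code the answer's author posted. The helper name is_separator and the in-memory sample built with io.StringIO are assumptions made for illustration, and the generator takes an already opened iterable of lines rather than a filename.

```python
import io
from itertools import groupby


def is_separator(line):
    # Hypothetical helper: a separator line holds only a tab (plus the newline).
    return line.rstrip('\n') == '\t'


def block_generator(lines):
    # groupby splits the stream into alternating runs of separator and
    # non-separator lines; only the non-separator runs are yielded as blocks.
    for separator, group in groupby(lines, key=is_separator):
        if not separator:
            yield list(group)


# Stand-in for the file from the question; TAB lines are written as '\t\n'.
sample = io.StringIO("1\n2\n3\n\t\n1\n2\n3\n\t\n")
blocks = list(block_generator(sample))
print(blocks)
```

Because consecutive separator lines fall into a single run, this version does not stop early when two tab-only lines appear in a row.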

Here you go, tested code. It uses while True: to loop, and lets itertools.takewhile() do everything with lines. When itertools.takewhile() reaches the end of input, it returns an iterator that does nothing except raise StopIteration, which list() simply turns into an empty list, so a simple if not block: test detects the empty list and breaks out of the loop.

As noted in a comment below, this has one flaw: if the input text has two lines in a row with just the tab character, this loop will stop processing without reading all the input text. I cannot think of any way to handle this cleanly; it's really unfortunate that the iterator you get back from itertools.takewhile() uses StopIteration both as the marker for the end of a group and as what you get at end-of-file. To make it worse, I cannot find any way to ask a file iterator object whether it has reached end-of-file or not. And to make it even worse, itertools.takewhile() seems to advance the file iterator to end-of-file instantly; when I tried to rewrite the above to check on our progress using lines.tell(), it was already at end-of-file after the first group.

I suggest using the itertools.groupby() solution. It's cleaner.
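
The loop described in this answer might look roughly like the sketch below. It is a reconstruction from the prose, not the answer's original code; the in-memory samples built with io.StringIO are assumptions for illustration, and the generator takes an open iterable of lines instead of a filename.

```python
import io
import itertools


def block_generator(lines):
    while True:
        # takewhile pulls lines until it meets a tab-only line (which it consumes).
        block = list(itertools.takewhile(lambda x: x.rstrip('\n') != '\t', lines))
        if not block:
            # Empty list: either end of input or two separator lines in a row.
            break
        yield block


sample = io.StringIO("1\n2\n3\n\t\n1\n2\n3\n\t\n")
blocks = list(block_generator(sample))
print(blocks)

# The flaw mentioned above: two consecutive tab-only lines end the loop early,
# so the line "2" is never read.
truncated = list(block_generator(io.StringIO("1\n\t\n\t\n2\n\t\n")))
print(truncated)
```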

I think the problem is that you are taking lines in your lambda function rather than line. What is your expected output?

itertools.takewhile implicitly iterates over the lines of the file in order to grab chunks, but so does for line in lines:. Each time through the loop, a line is grabbed, thrown away (since there is no code that uses line), and then some more are grouped together into block.
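
The effect described above can be reproduced with a small sketch, together with one possible fix that keeps the line taken by the for loop instead of discarding it. The names not_separator, buggy_blocks, and fixed_blocks are hypothetical, and io.StringIO stands in for the real file.

```python
import io
import itertools

DATA = "1\n2\n3\n\t\n1\n2\n3\n\t\n"


def not_separator(line):
    # True for ordinary lines, False for a line holding only a tab.
    return line.rstrip('\n') != '\t'


def buggy_blocks(lines):
    # Same shape as the question's code: the for loop takes one line,
    # then takewhile takes the following lines from the same iterator.
    for line in lines:
        yield list(itertools.takewhile(not_separator, lines))


def fixed_blocks(lines):
    for line in lines:
        if not not_separator(line):
            # A separator right at the start of a block: nothing to yield.
            continue
        # Put the line taken by the for loop at the front of the block.
        yield [line] + list(itertools.takewhile(not_separator, lines))


buggy = list(buggy_blocks(io.StringIO(DATA)))
fixed = list(fixed_blocks(io.StringIO(DATA)))
print(buggy)
print(fixed)
```

In the buggy version each block is missing its first line, which matches the [2,3] [2,3] result reported in the question.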