Why does takewhile() skip the first line?

Posted 2024-12-02 17:10:39

I have a file like this:

1
2
3
TAB
1
2
3
TAB

I want to read the lines between TAB as blocks.

import itertools

def block_generator(file):
    with open(file) as lines:
        for line in lines:
            block = list(itertools.takewhile(lambda x: x.rstrip('\n') != '\t',
                                             lines))
            yield block

I want to use it as such:

blocks = block_generator(myfile)
for block in blocks:
    do_something(block)

The blocks I get all start with the second line, like [2,3] [2,3]. Why?

但可醉心 2024-12-09 17:10:39

Here is another approach using groupby

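from itertools import groupby

def block_generator(filename):
    with open(filename) as lines:
        # The key is True for ordinary lines and False for a tab-only line,
        # so groupby yields alternating runs of content lines and separators.
        for is_content, block in groupby(lines, "\t\n".__ne__):
            if is_content:
                # Copy the group into a list: a group is no longer available
                # once groupby advances to the next one.
                yield list(block)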
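
To try the groupby approach without a file on disk, here is a minimal sketch that runs it over in-memory lines. The helper name blocks_from_lines and the io.StringIO sample are assumptions made for illustration, not part of the answer above. It compares stripped lines, so a final separator without a trailing newline is still recognized.

```python
import io
from itertools import groupby

def blocks_from_lines(lines):
    """Yield lists of lines found between tab-only separator lines.

    `lines` is any iterable of strings (an open file, a list, io.StringIO).
    The function name is hypothetical, chosen for this sketch.
    """
    def is_content(line):
        # A separator is a line holding only a tab (trailing newline optional).
        return line.rstrip("\n") != "\t"

    for content, group in groupby(lines, is_content):
        if content:
            # Materialize the group before groupby advances past it.
            yield list(group)

# Same shape as the file in the question; the last separator has no newline.
sample = io.StringIO("1\n2\n3\n\t\n1\n2\n3\n\t")
for block in blocks_from_lines(sample):
    print(block)
```

Because runs of adjacent separators fall into a single False group, this version silently skips empty blocks instead of stopping early.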


甲如呢乙后呢 2024-12-09 17:10:39

Here you go, tested code. In your version, for line in lines pulls one line off the file iterator on every pass, and takewhile() then continues from the next line, which is why each block is missing its first line. This version loops with while True: instead and lets itertools.takewhile() do all the reading from lines. Once the input is exhausted, takewhile() yields nothing, list() turns that into an empty list, and a simple if not block: test detects it and breaks out of the loop.

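import itertools

def not_tabline(line):
    # True for every line except one that holds only a tab character.
    return '\t' != line.rstrip('\n')

def block_generator(file):
    with open(file) as lines:
        while True:
            # takewhile() consumes lines up to and including the next tab-only line.
            block = list(itertools.takewhile(not_tabline, lines))
            if not block:
                # An empty list means the input is exhausted (or the block was empty).
                break
            yield block

for block in block_generator("test.txt"):
    print("BLOCK:")
    print(block)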
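
The difference from the question's version can be seen on a small in-memory example. This is only an illustrative sketch: the sample data and the names with_for_loop and with_while_loop are made up here, and a list iterator stands in for the open file.

```python
import itertools

def not_tabline(line):
    return line.rstrip("\n") != "\t"

# Stand-in for the file in the question: two blocks, each ended by a tab-only line.
SAMPLE = ["1\n", "2\n", "3\n", "\t\n", "1\n", "2\n", "3\n", "\t\n"]

def with_for_loop(data):
    """Mimics the question: the for loop and takewhile share one iterator."""
    lines = iter(data)
    blocks = []
    for line in lines:  # consumes the first line of each block
        blocks.append(list(itertools.takewhile(not_tabline, lines)))
    return blocks

def with_while_loop(data):
    """Mimics the answer: only takewhile reads from the iterator."""
    lines = iter(data)
    blocks = []
    while True:
        block = list(itertools.takewhile(not_tabline, lines))
        if not block:  # empty list: input exhausted (or an empty block)
            break
        blocks.append(block)
    return blocks

print(with_for_loop(SAMPLE))    # blocks start at the second line
print(with_while_loop(SAMPLE))  # blocks are complete
```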

As noted in a comment below, this has one flaw: if the input has two tab-only lines in a row, the empty block between them is indistinguishable from end-of-file, so the loop stops without reading the rest of the input. I cannot think of a clean way to handle this with takewhile(); unfortunately, the iterator it returns uses StopIteration both to mark the end of a group and to signal end-of-file. To make it worse, I cannot find any way to ask a file iterator object whether it has reached end-of-file. And lines.tell() does not help either: when I tried to use it to check our progress, it already reported end-of-file after the first group. That is most likely the file iterator's read-ahead buffering rather than takewhile() itself consuming extra lines, and in Python 3 calling tell() while iterating over a text file raises OSError.
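
One way around the empty-block ambiguity is to drop takewhile() and accumulate lines by hand, so that end-of-input is signalled by the for loop finishing rather than by an empty list. The sketch below is an alternative to the code above, not a fix from the original author. split_blocks is a hypothetical name, and yielding an empty list for two adjacent separators is an assumption about the desired behavior.

```python
def split_blocks(lines):
    """Yield one list of lines per tab-terminated block.

    `lines` is any iterable of strings, such as an open file.
    Two separators in a row produce an empty block instead of ending the loop.
    """
    block = []
    for line in lines:
        if line.rstrip("\n") == "\t":
            # Separator reached: hand over what was collected, even if empty.
            yield block
            block = []
        else:
            block.append(line)
    if block:
        # Trailing lines with no closing separator still form a block.
        yield block

# In-memory stand-in for a file containing two separators in a row.
for block in split_blocks(["1\n", "\t\n", "\t\n", "2\n", "\t\n"]):
    print(block)
```

With a real file, the same generator can be driven from inside a with open(...) block, passing the file object as lines.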

I suggest using the itertools.groupby() solution. It's cleaner.
