Why does takewhile() skip the first line?

Posted 2024-12-02 17:10:39 | Length 549 | Views 7 | Comments 0

I have a file like this:

1
2
3
TAB
1
2
3
TAB

I want to read the lines between the TAB lines as blocks (TAB stands for a line containing only a tab character).

import itertools

def block_generator(file):
    with open(file) as lines:
        for line in lines:
            block = list(itertools.takewhile(lambda x: x.rstrip('\n') != '\t',
                                             lines))
            yield block

I want to use it like this:

blocks = block_generator(myfile)
for block in blocks:
    do_something(block)

The blocks I get all start with the second line, like [2,3] [2,3]. Why?


Comments (4)

但可醉心 2024-12-09 17:10:39

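Here is another approach using groupby:

from itertools import groupby

def block_generator(filename):
    with open(filename) as lines:
        # groupby splits the file into alternating runs of non-TAB lines
        # (key True) and TAB-only lines (key False).
        for is_content, block in groupby(lines, "\t\n".__ne__):
            if is_content:
                # Materialize the run: the group iterator becomes invalid
                # as soon as groupby advances to the next run.
                yield list(block)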
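
To see what groupby hands back before the True runs are filtered out, here is a small sketch. It assumes the question's sample data held in an io.StringIO instead of a real file; the names sample and runs are only for illustration.

```python
import io
from itertools import groupby

# In-memory stand-in for the file from the question (assumed contents).
sample = io.StringIO("1\n2\n3\n\t\n1\n2\n3\n\t\n")

# Collect every run together with its key: True for content lines,
# False for lines that are exactly a tab followed by a newline.
runs = [(key, list(group)) for key, group in groupby(sample, "\t\n".__ne__)]

for key, group in runs:
    print(key, group)
```

One thing to keep in mind: consecutive TAB lines fall into a single False run, so this approach never produces an empty block for them. It also compares against "\t\n" exactly, so a final TAB line with no trailing newline would be treated as content.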

甲如呢乙后呢 2024-12-09 17:10:39

Here you go, tested code. Uses while True: to loop, and lets itertools.takewhile() do everything with lines. When itertools.takewhile() reaches the end of input, it returns an iterator that does nothing except raise StopIteration, which list() simply turns into an empty list, so a simple if not block: test detects the empty list and breaks out of the loop.

import itertools

def not_tabline(line):
    return '\t' != line.rstrip('\n')

def block_generator(file):
    with open(file) as lines:
        while True:
            block = list(itertools.takewhile(not_tabline, lines))
            if not block:
                break
            yield block

for block in block_generator("test.txt"):
    print("BLOCK:")
    print(block)

As noted in a comment below, this has one flaw: if the input has two consecutive lines containing only a tab character, the loop stops without reading the rest of the input. I cannot think of a clean way to handle this with takewhile(); unfortunately, the iterator returned by itertools.takewhile() signals the end of a group and end-of-file the same way, by raising StopIteration, so list() produces an empty list in both cases. To make it worse, I cannot find any way to ask a file iterator whether it has reached end-of-file. Checking progress with lines.tell() did not help either: when I tried it, the position was already at end-of-file after the first group. That looks as if takewhile() had raced ahead, but it is more likely the file object's read-ahead buffering during iteration, since takewhile() only pulls one line at a time.

I suggest using the itertools.groupby() solution. It's cleaner.
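
For the consecutive-TAB case described above, one option is to drop takewhile() and accumulate lines by hand, so that "separator seen" and "input exhausted" are distinct events. This is only a sketch: split_blocks is a hypothetical name, and yielding an empty list for back-to-back separators is an assumption about the desired behavior.

```python
import io


def split_blocks(lines):
    """Yield lists of lines separated by lines containing only a tab."""
    block = []
    for line in lines:
        if line.rstrip('\n') == '\t':
            # Separator reached: emit what was collected, even if empty.
            yield block
            block = []
        else:
            block.append(line)
    if block:
        # Trailing lines that were not closed by a separator.
        yield block


# Assumed sample with two consecutive TAB lines in the middle.
sample = io.StringIO("1\n2\n\t\n\t\n3\n\t\n")
result = list(split_blocks(sample))
print(result)
```

Because the for loop is the only consumer of lines, no line is dropped, and the loop only ends when the input is really exhausted.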

拿命拼未来 2024-12-09 17:10:39

I think the problem is that you are taking lines in your lambda function rather than line. What is your expected output?

懵少女 2024-12-09 17:10:39

itertools.takewhile implicitly iterates over the lines of the file in order to grab chunks, but so does for line in lines:. Each time through the loop, a line is grabbed, thrown away (since there is no code that uses line), and then some more are blocked together.
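
The effect described above can be reproduced without a file. The sketch below assumes the question's sample data in an io.StringIO; blocks_buggy mirrors the original generator, and blocks_fixed is one hypothetical way to keep the line that the for loop consumed.

```python
import io
import itertools


def is_not_tab(line):
    return line.rstrip('\n') != '\t'


def blocks_buggy(lines):
    # The for loop consumes one line per iteration and never uses it,
    # so takewhile() only sees the lines after it.
    for line in lines:
        yield list(itertools.takewhile(is_not_tab, lines))


def blocks_fixed(lines):
    # Keep the line taken by the for loop as the first line of the block.
    for first in lines:
        if not is_not_tab(first):
            yield []  # a separator right away means an empty block
            continue
        yield [first] + list(itertools.takewhile(is_not_tab, lines))


SAMPLE = "1\n2\n3\n\t\n1\n2\n3\n\t\n"
buggy = list(blocks_buggy(io.StringIO(SAMPLE)))
fixed = list(blocks_fixed(io.StringIO(SAMPLE)))
print(buggy)
print(fixed)
```

Both loops share a single iterator with takewhile(), which is why the order of consumption matters: whatever the for statement takes is gone by the time takewhile() runs.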
