Is .append() expensive?

Posted on 2024-12-01 05:12:07


I've been manipulating huge text files these days. Sometimes I need to delete lines.
My way of doing it is like below:

f = open('txt', 'r').readlines()
list = []
for line in f:
    if blablablabla:
        list.append(line)

I know that for large files .readlines() is the rate-limiting step, but what about the .append() step? Does appending cost a lot of extra time after readlines()?
If so, maybe I should find a way to directly delete the lines I don't want, instead of appending the lines I want.

thx


Comments (5)

时间你老了 2024-12-08 05:12:07


Why read the entire file in with readlines() if you're going to filter it later? Just iterate through it, saving the lines you want to keep. You can reduce this down to a couple of lines using a list comprehension instead:

with open('txt', 'r') as f:
    myList = [line for line in f if blablablabla]
痴情 2024-12-08 05:12:07


As a general hint, do this instead; there is no need to read the complete file before iterating through it...

my_list = []
with open('txt') as fd:
    for line in fd:
        if blablabla:
            my_list.append(line)

and don't call a list "list" (it shadows the built-in type)...

跨年 2024-12-08 05:12:07


You should use a list comprehension, as in Jeff's answer. Depending on how you need to process the data, you may be able to use a generator expression instead.
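
For instance, a generator expression defers all the work until you iterate, so memory use stays flat even for huge files. This is a quick sketch, not part of the original answer; the sample file and the startswith('ERROR') test stand in for the real data and condition:

```python
# Sketch: filter a file lazily with a generator expression.
# The file contents and the predicate are invented for illustration.
with open('sample.txt', 'w') as f:
    f.write("ok line\nERROR one\nok again\nERROR two\n")

with open('sample.txt') as f:
    # no intermediate list is built; lines stream through one at a time
    error_count = sum(1 for line in f if line.startswith('ERROR'))

print(error_count)  # 2
```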

To answer your question about append():

Python lists are preallocated with some extra space at the end. This means that append is very fast, right up until you run out of preallocated space. Whenever the list has to be extended, a new block of memory is allocated and all the references are copied over to it. As the list grows, so does the size of the extra preallocated space. This is done so that append is amortized O(1): the average time per append is fast and constant.
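
The over-allocation is easy to observe with sys.getsizeof, which reports the allocated size rather than the number of elements in use. This is a quick sketch, not part of the original answer:

```python
import sys

# Watch CPython over-allocate: the reported size stays flat between
# reallocations, so 20 appends produce only a handful of distinct sizes.
sizes = []
lst = []
for i in range(20):
    lst.append(i)
    sizes.append(sys.getsizeof(lst))

print(sorted(set(sizes)))  # far fewer entries than 20
```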

以往的大感动 2024-12-08 05:12:07


In this post I try to explain how lists work and why append is not very expensive. I also posted a solution at the bottom which you can use to delete lines.

The structure of Python's lists can be pictured as a chain of nodes (strictly speaking, a CPython list is a contiguous array of pointers rather than a linked list, but the picture is enough for the cost comparison below):

>>> class ListItem:
        def __init__(self, value, next=None):
            self.value = value
            self.next = next
        def __repr__(self):
            return "Item: %s"%self.value


>>> ListItem("a", ListItem("b", ListItem("c")))
Item: a
>>> mylist = ListItem("a", ListItem("b", ListItem("c")))
>>> mylist.next.next
Item: c

In this analogy, append is basically just:

ListItem(mynewvalue, oldlistitem)

Append doesn't have much overhead, but insert(), on the other hand, has to shift every element after the insertion point, and will therefore take much more time.

>>> from timeit import timeit
>>> timeit('a=[]\nfor i in range(100): a.append(i)', number=1000)
0.03651859015577941
>>> timeit('a=[]\nfor i in range(100): a.insert(0, i)', number=1000)
0.047090002177625934
>>> timeit('a=[]\nfor i in range(100): a.append(i)', number=10000)
0.18015429656996673
>>> timeit('a=[]\nfor i in range(100): a.insert(0, i)', number=10000)
0.35550057300308424
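
As an aside not in the original answer: if you genuinely need cheap insertion at the front, collections.deque offers O(1) appendleft, where list.insert(0, ...) is O(n). A minimal comparison:

```python
from collections import deque
from timeit import timeit

# Front-insertion: list.insert(0, x) shifts every existing element,
# deque.appendleft(x) does not.
t_list = timeit('a.insert(0, 1)', setup='a = []', number=50_000)
t_deque = timeit('a.appendleft(1)',
                 setup='from collections import deque; a = deque()',
                 number=50_000)

print(t_deque < t_list)  # deque wins by a wide margin
```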

As you can see, insert is much slower. If I were you, I would just eliminate the lines you don't need by writing the lines you do want back out right away.

# write to a separate file: opening the same file in "w" mode
# while you are still reading it would truncate it immediately
with open("large.txt", "r") as fin:
    with open("filtered.txt", "w") as fout:
        for line in fin:
            if myfancyconditionismet:
                # keep the line (it already ends with "\n")
                fout.write(line)
            # otherwise it is gone

There is my explanation and solution.

-Sunjay03

无人接听 2024-12-08 05:12:07


Maybe you want to pull it all into memory and then operate on it. Maybe it makes more sense to operate on one line at a time. It's not clear from your explanation which is better.

In any event, here is pretty standard code for whichever approach you take:

# Pull one line into memory at a time
with open('txt','r') as f:
    lineiter = (line for line in f if blablablabla)
    for line in lineiter:
        # Do stuff

# Read the whole file into memory then work on it
with open('txt','r') as f:
    lineiter = (line for line in f if blablablabla)
    mylines = [line for line in lineiter]

If you go the former route, I recommend that you read up on generators. Dave Beazley has an awesome article on generators called "Generator Tricks for Systems Programmers". Highly recommended.
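
In that spirit, a minimal pipeline might look like this (the file name, its contents, and the predicate are invented for the sketch); each stage consumes the previous one lazily, so only one line is in memory at a time:

```python
import os

def read_lines(path):
    # stream lines from a file, one at a time
    with open(path) as f:
        yield from f

def keep_matching(lines, predicate):
    # lazily pass through only the lines we want
    return (line for line in lines if predicate(line))

# tiny demo file, invented for the sketch
with open('demo.txt', 'w') as f:
    f.write("keep me\ndrop this\nkeep me too\n")

kept = list(keep_matching(read_lines('demo.txt'),
                          lambda line: line.startswith('keep')))
print(kept)  # ['keep me\n', 'keep me too\n']
os.remove('demo.txt')
```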
