Python: Memory usage and optimization when modifying lists
The problem
My concern is the following: I am storing a relatively large dataset in a classic Python list, and in order to process the data I must iterate over the list several times, perform some operations on the elements, and often pop an item out of the list.
It seems that deleting one item out of a Python list costs O(N), since Python has to copy all the items above the element at hand down one place. Furthermore, since the number of items to delete is approximately proportional to the number of elements in the list, this results in an O(N^2) algorithm.
I am hoping to find a solution that is cost-effective (time- and memory-wise). I have studied what I could find on the internet and have summarized the different options below. Which one is the best candidate?
Keeping a local index:
while processingdata:
    index = 0
    while index < len(somelist):
        item = somelist[index]
        dosomestuff(item)
        if somecondition(item):
            del somelist[index]  # O(N): every later element shifts down one slot
        else:
            index += 1
This is the original solution I came up with. Not only is this not very elegant, but I am hoping there is better way to do it that remains time and memory efficient.
Walking the list backwards:
while processingdata:
    for i in xrange(len(somelist) - 1, -1, -1):
        item = somelist[i]
        dosomestuff(item)
        if somecondition(item):
            somelist.pop(i)  # still O(N) unless i is near the end
This avoids incrementing an index variable but ultimately has the same cost as the original version. It also breaks the logic of dosomestuff(item), which expects to process the items in the same order as they appear in the original list.
Making a new list:
import gc

while processingdata:
    for item in somelist:
        dosomestuff(item)
    newlist = []
    for item in somelist:
        if not somecondition(item):  # keep only the surviving items
            newlist.append(item)
    somelist = newlist
    gc.collect()
This is a very naive strategy for eliminating elements from a list, and it requires lots of memory since an almost-complete copy of the list must be made.
Using list comprehensions:
while processingdata:
    for item in somelist:
        dosomestuff(item)
    somelist[:] = [x for x in somelist if not somecondition(x)]
This is very elegant, but under the covers it walks the whole list one more time and must copy most of the elements in it. My intuition is that this operation probably costs more than the original del statement, at least memory-wise. Keep in mind that somelist can be huge and that any solution that iterates through it only once per run will probably always win.
Using the filter function:
while processingdata:
    for item in somelist:
        dosomestuff(item)
    somelist = filter(lambda x: not somecondition(x), somelist)
This also creates a new list occupying lots of RAM.
Using itertools' filter function:
from itertools import ifilterfalse

while processingdata:
    for item in ifilterfalse(somecondition, somelist):
        dosomestuff(item)
This version of the filter call does not create a new list, but it will not call dosomestuff on every item, which breaks the logic of the algorithm. I am including this example only for the sake of an exhaustive list.
Moving items up the list while walking:
while processingdata:
    index = 0
    for item in somelist:
        dosomestuff(item)
        if not somecondition(item):
            somelist[index] = item  # compact survivors toward the front
            index += 1
    del somelist[index:]  # truncate the slack in a single operation
This is a subtle method that seems cost-effective. I think it will move each item (or rather the pointer to each item?) exactly once, resulting in an O(N) algorithm. Finally, I hope Python will be intelligent enough to resize the list at the end without allocating memory for a new copy of the list. Not sure, though.
Abandoning Python lists:
class Doubly_Linked_List:
    def __init__(self):
        self.first = None
        self.last = None
        self.n = 0

    def __len__(self):
        return self.n

    def __iter__(self):
        return DLLIter(self)

    def iterator(self):
        return self.__iter__()

    def append(self, x):
        x = DLLElement(x)
        x.next = None
        if self.last is None:
            x.prev = None
            self.last = x
            self.first = x
            self.n = 1
        else:
            x.prev = self.last
            x.prev.next = x
            self.last = x
            self.n += 1

class DLLElement:
    def __init__(self, x):
        self.next = None
        self.data = x
        self.prev = None

class DLLIter:
    etc...
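Just to make the sketch concrete, a hypothetical completion of DLLIter might look like this (Python 2 iterator protocol, matching the xrange usage above):

class DLLIter:
    def __init__(self, dll):
        self.current = dll.first  # walk from the first node

    def __iter__(self):
        return self

    def next(self):
        if self.current is None:
            raise StopIteration
        data = self.current.data
        self.current = self.current.next
        return data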
This type of object resembles a Python list in a limited way. However, deletion of an element is guaranteed O(1) once you already hold a reference to its node. I would rather not go down this road, since it would require massive amounts of code refactoring almost everywhere.
6 Answers
Without knowing the specifics of what you're doing with this list, it's hard to know exactly what would be best in this case. If your processing stage depends on the current index of the list element, this won't work, but if not, it appears you've left off the most Pythonic (and in many ways, easiest) approach: generators.
If all you're doing is iterating over each element, processing it in some way, then either including that element in the list or not, use a generator. Then you never need to store the entire iterable in memory.
You would need to have a processing loop that deals with persisting the processed iterable (writing it back to a file, or whatever); or, if you have multiple processing stages that you'd prefer to separate into different generators, you could have your processing loop pass one generator to the next.
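A minimal sketch of that approach, reusing the question's hypothetical dosomestuff/somecondition names (source_of_items and persist stand in for whatever produces and consumes the data):

def process(items):
    # Lazily visit each item, yielding only the ones that survive.
    # Stages like this can be chained: process(process(items)).
    for item in items:
        dosomestuff(item)
        if not somecondition(item):
            yield item

for item in process(source_of_items()):
    persist(item)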
From your description it sounds like a deque ("deck") would be exactly what you are looking for:
http://docs.python.org/library/collections.html#deque-objects
"Iterate" across it by repeatedly calling pop() and then, if you want to keep the popped item in the deque, returning that item to the front with appendleft(item). To keep up with when you're done iterating and have seen everything in the deque, either put in a marker object like None that you watch for, or just ask for the deque's len() when you start a particular loop and use range() to pop() exactly that many items.
I believe you will find all of the operations you need are then O(1).
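A minimal sketch of that pattern, using the question's hypothetical names (note that it visits items in reverse order; use popleft()/append() instead to preserve the original order):

from collections import deque

d = deque(somelist)
for _ in range(len(d)):      # exactly one full pass over the current contents
    item = d.pop()           # O(1) removal from the right end
    dosomestuff(item)
    if not somecondition(item):
        d.appendleft(item)   # O(1): keep the item by moving it to the front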
Python stores only references to objects in the list - not the elements themselves. If you grow a list item by item, the list (that is, the list of references to the objects) will grow one by one, eventually reaching the end of the excess memory that Python preallocated at the end of the list (of references!). It then copies the list (of references!) into a new, larger place, while your list elements stay at their old location. As your code visits all the elements in the old list anyway, copying the references to a new list by new_list[i] = old_list[i] will be nearly no burden at all. The only performance hint is to allocate all new elements at once instead of appending them (OTOH the Python docs say that amortized append is still O(1), as the number of excess elements grows with the list size). If you lack the space for the new list (of references), then I fear you are out of luck - any data structure that evades the O(n) in-place insert/delete will likely be bigger than a simple array of 4- or 8-byte entries.
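The over-allocation is easy to observe in CPython (exact sizes vary by version and platform; sys.getsizeof reports only the list object and its reference array, never the elements):

import sys

lst = []
for i in range(20):
    lst.append(i)
    # The reported size only jumps when the spare capacity runs out.
    print len(lst), sys.getsizeof(lst)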
A doubly linked list is worse than just reallocating the list. A Python list uses 5 words + one word per element. A doubly linked list will use 5 words per element. Even if you use a singly linked list, it's still going to be 4 words per element - a lot worse than the less than 2 words per element that rebuilding the list will take.
From a memory usage perspective, moving items up the list and deleting the slack at the end is the best approach. Python will release the memory if the list drops below half full. The question to ask yourself is: does it really matter? The list entries probably point to some data, and unless you have lots of duplicate objects in the list, the memory used for the list itself is insignificant compared to the data. Given that, you might as well just build a new list.
For building a new list, the approach you suggested is not that good. There's no apparent reason why you couldn't just go over the list once. Also, calling gc.collect() is unnecessary and actually harmful - CPython reference counting will release the old list immediately anyway, and even the other garbage collectors are better off collecting when they hit memory pressure. So something like this will work (a sketch reusing the question's hypothetical names):
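while processingdata:
    newlist = []
    for item in somelist:
        dosomestuff(item)
        if not somecondition(item):
            newlist.append(item)  # process and filter in the same pass
    somelist = newlist

If you don't mind using side effects in list comprehensions, then the following is also an option (visit_and_keep is a hypothetical helper whose side effect runs while the comprehension is built):

def visit_and_keep(item):
    dosomestuff(item)
    return not somecondition(item)

while processingdata:
    somelist = [item for item in somelist if visit_and_keep(item)]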
The inplace method can also be refactored so the mechanism and business logic are separated:
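One possible shape for that refactoring (a hypothetical sketch; inplace_filter is an invented name):

def inplace_filter(keep, a_list):
    # Mechanism: compact the items for which keep() is true, then truncate.
    pos = 0
    for item in a_list:
        if keep(item):
            a_list[pos] = item
            pos += 1
    del a_list[pos:]

# Business logic stays at the call site:
while processingdata:
    for item in somelist:
        dosomestuff(item)
    inplace_filter(lambda item: not somecondition(item), somelist)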
You do not provide enough information for me to answer this question really well. I don't know your use case well enough to tell you what data structures will get you the time complexities you want if you have to optimize for time. The typical solution is to build a new list rather than doing repeated deletions, but obviously this doubles(ish) memory usage.
If you have memory usage issues, you might want to abandon using in-memory Python constructs and go with an on-disk database. Many databases are available, and sqlite ships with Python. Depending on your usage and how tight your memory requirements are, array.array or numpy might help you, but this is highly dependent on what you need to do. array.array will have all the same time complexities as list; numpy arrays sort of will, but work in some different ways. Using lazy iterators (like generators and the stuff in the itertools module) can often reduce memory usage by a factor of n.
Using a database will improve the time to delete items from arbitrary locations (though order will be lost if this is important). Using a dict will do the same, but potentially at high memory usage.
You can also consider blist as a drop-in replacement for a list that might get some of the compromises you want; see the sketch below. I don't believe it will drastically increase memory usage, but it will change item removal to O(log n). This comes at the cost of making other operations more expensive, of course.
I would have to see testing to believe that the constant factor for memory use for your doubly linked list implementation would be less than the 2 that you get by simply creating a new list. I really doubt it.
You will have to share more about your problem class for a more concrete answer, I think, but the general advice is above.
Brandon Craig Rhodes suggests using a collections.deque, which can suit this problem: no additional memory is required for the operation and it is kept O(n). I do not know the total memory usage and how it compares to a list; it's worth noting that a deque has to store a lot more references, and I would not be surprised if it ends up as memory-intensive as using two lists. You would have to test or study it to know for yourself.
If you were to use a deque, I would deploy it slightly differently than Rhodes suggests:
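Something along these lines (a sketch that drains into a fresh deque rather than rotating the one being iterated, reusing the question's hypothetical names):

from collections import deque

unprocessed = deque(somelist)
processed = deque()
while unprocessed:
    item = unprocessed.popleft()  # O(1) removal from the front
    dosomestuff(item)
    if not somecondition(item):
        processed.append(item)    # O(1): survivors collect in a second deque
somelist = processed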
There is no significant memory difference doing it this way, but there is a lot less opportunity to flub up than when mutating the same deque as you go.