Python buffer that can be truncated from the left?
Right now, I am buffering bytes using strings, StringIO, or cStringIO. But I often need to remove bytes from the left side of the buffer. A naive approach would rebuild the entire buffer on every removal. Is there an optimal way to do this if left-truncating is a very common operation? The truncated bytes should actually be reclaimed by Python's garbage collector.
Any sort of algorithm for this (keep the buffer in small pieces?), or an existing implementation, would really help.
Edit:
I tried to use Python 2.7's memoryview for this, but sadly, the data outside the "view" isn't garbage-collected when the original reference is deleted:
# (This will use ~2GB of memory, not 50MB)
memoryview # Requires Python 2.7+
smalls = []
for i in xrange(10):
    big = memoryview('z'*(200*1000*1000))
    small = big[195*1000*1000:]
    del big
    smalls.append(small)
    print '.',
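A smaller sketch of the same effect may make the cause easier to see. It assumes Python 3.3 or later, where a memoryview exposes the object it was created from through its `obj` attribute; the sizes are scaled down and the variable names are illustrative only. The sliced view still refers to the whole backing object, so only an explicit copy (here via `tobytes()`) is independent of it.

```python
# Scaled-down stand-in for the 200 MB string above (assumes Python 3.3+).
big = memoryview(b"z" * 1000)
small = big[995:]          # a view of the last 5 bytes
del big                    # drops one reference, not the backing bytes

# The slice still points at the full 1000-byte object.
backing_len = len(small.obj)
view_len = len(small)

# Copying the bytes out gives an object that no longer needs the original.
detached = small.tobytes()
```

Copying is of course the naive rebuild again, so this only shows why the view alone does not release memory.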
Comments (2)
A deque will be efficient if left-removal operations are frequent: unlike a list, string or buffer, it offers amortised O(1) removal from either end. It costs more memory than a string, however, because each character is stored as a separate element (a reference to a one-character string object) rather than as part of a packed sequence.
Alternatively, you could create your own implementation (e.g. a linked list of fixed-size string / buffer objects), which may store the data more compactly.
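One way to combine the two ideas is to keep a deque of chunks instead of a deque of single characters. The following is only a minimal sketch under that assumption: the class name `ChunkBuffer` and its methods are hypothetical, not an existing library API, and the chunks are whatever sizes were appended rather than fixed-size blocks.

```python
from collections import deque


class ChunkBuffer(object):
    """FIFO byte buffer stored as a deque of chunks (hypothetical helper)."""

    def __init__(self):
        self._chunks = deque()
        self._size = 0

    def __len__(self):
        return self._size

    def append(self, data):
        # Store each write as its own chunk; empty writes are ignored.
        if data:
            self._chunks.append(bytes(data))
            self._size += len(data)

    def popleft(self, n):
        """Remove and return up to n bytes from the left side."""
        parts = []
        while n > 0 and self._chunks:
            chunk = self._chunks.popleft()
            if len(chunk) > n:
                # Split the chunk; only this one chunk is copied.
                self._chunks.appendleft(chunk[n:])
                chunk = chunk[:n]
            parts.append(chunk)
            n -= len(chunk)
            self._size -= len(chunk)
        return b"".join(parts)


buf = ChunkBuffer()
buf.append(b"hello ")
buf.append(b"world")
head = buf.popleft(8)     # spans two chunks
rest = buf.popleft(100)   # asking for more than is left returns the remainder
```

Whole chunks that have been consumed are dropped from the deque, so nothing keeps them alive; at most one chunk is re-sliced per call.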
Build your buffer as a list of characters or lines and slice the list. Only join it into a string on output. This is pretty efficient for most types of 'mutable string' behaviour.
The GC will collect the truncated bytes because they are no longer referenced in the list.
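A minimal sketch of that approach, assuming the buffer holds whole lines; the variable names are illustrative only. Note that deleting from the front of a list still shifts the remaining references, so the cost grows with the number of items left, not with the number of bytes.

```python
# Keep the buffer as a list of lines and join only on output.
lines = ["alpha\n", "beta\n", "gamma\n", "delta\n"]

# Drop the two oldest lines in place. The removed strings are no longer
# referenced by the list after this statement.
del lines[:2]

# Build the string form only when it is actually needed.
output = "".join(lines)
```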
UPDATE: To modify the head of the list, you can simply reverse the list so that the head becomes the tail, where items can be removed cheaply. Reversing sounds inefficient, but Python's list implementation optimises this internally.
from http://effbot.org/zone/python-list.htm :