Memory usage of a list of millions of strings in Python
As seen in Find the memory size of a set of strings vs. set of bytestrings, it's difficult to precisely measure the memory used by a set or list containing strings. But here is a good estimation/upper bound:
import os, psutil
process = psutil.Process(os.getpid())
a = process.memory_info().rss
L = [b"a%09i" % i for i in range(10_000_000)]
b = process.memory_info().rss
print(L[:10]) # [b'a000000000', b'a000000001', b'a000000002', b'a000000003', b'a000000004', b'a000000005', b'a000000006', b'a000000007', b'a000000008', b'a000000009']
print(b-a)
# 568762368 bytes
i.e. 569 MB for 100 MB of actual data.
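(A quick sanity check, not part of the original post: the "100 MB of actual data" figure follows from each formatted bytestring carrying exactly 10 payload bytes.)

```python
# Each formatted bytestring is exactly 10 bytes of payload:
# b"a" followed by a zero-padded 9-digit integer.
sample = b"a%09i" % 123
print(sample)       # b'a000000123'
print(len(sample))  # 10

# 10 million such strings at 10 bytes each of raw data:
payload = 10 * 10_000_000
print(payload)      # 100000000 bytes, i.e. ~100 MB
```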
Solutions to improve this (for example with other data structures) have been found in Memory-efficient data structure for a set of short bytes-strings and Set of 10-char strings in Python is 10 times bigger in RAM as expected, so my question here is not "how to improve", but:

How can we precisely explain this size in the case of a standard list of byte-strings?

How many bytes for each byte-string, and for each (linked?) list item, to finally arrive at 569 MB?

This will help in understanding the internals of lists and byte-strings in CPython (platform: Windows 64-bit).
Summary:

sys.getsizeof(L) will tell you the list object itself is about 89 MB. That's a few dozen bytes of bookkeeping, 8 bytes per bytestring reference, and up to 12.5% over-allocation to allow efficient insertions.

sys.getsizeof(one_of_your_bytestrings) will tell you they're 43 bytes each. But storing the objects every 43 bytes in memory would cross memory word boundaries, which is slower, so they're actually stored usually every 48 bytes. You can use id(one_of_your_bytestrings) to get the addresses and check.

(There's some variance here and there, partly due to the exact memory allocations that happen, but 569 MB is about what's expected given the above reasons, and it matches what you measured.)
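Putting those pieces together in numbers, a rough accounting sketch looks like this; the 48-byte slot size and the 12.5% over-allocation factor are the assumptions stated above, and exact values vary with CPython version, platform, and allocator state:

```python
import sys

n = 10_000_000

# One of the bytestrings from the question: 10 payload bytes
# plus CPython's bytes-object overhead (33 bytes on 64-bit builds).
s = b"a%09i" % 0
print(sys.getsizeof(s))          # 43 on a 64-bit CPython

# Assumption from the answer: the allocator rounds each 43-byte
# object up to a 48-byte slot to keep objects word-aligned.
per_bytestring = 48

# List: ~56 bytes of header, one 8-byte pointer per item, and up to
# 12.5% over-allocation of the pointer array (upper-bound estimate).
list_size = 56 + 8 * n * 1.125   # ~90 MB

total = list_size + per_bytestring * n
print(round(total / 1e6))        # 570 (MB), close to the measured 569 MB
```

The small gap between the ~570 MB estimate and the measured 569 MB is the variance mentioned above: the list's actual over-allocation at the end of the comprehension is usually below the 12.5% upper bound.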