Finding the memory size of a set of strings vs. a set of bytes
Edit: the answer from Memory usage of a list of millions of strings in Python can be adapted to sets too.
By analyzing the RAM usage on my machine (with the process manager), I noticed that a set of millions of strings like 'abcd' takes much less memory than a set of millions of bytes objects like b'abcd' (Edit: I was wrong, this was due to an error elsewhere). I would like to test this:
import random, string, sys
randomstring = lambda length: ''.join(random.choice(string.ascii_lowercase) for _ in range(length))
s1 = {randomstring(10) for i in range(100_000)}
s2 = {randomstring(50) for i in range(100_000)}
s3 = {randomstring(10).encode() for i in range(100_000)}
s4 = {randomstring(50).encode() for i in range(100_000)}
print(sys.getsizeof(s1), sys.getsizeof(s2), sys.getsizeof(s3), sys.getsizeof(s4))
but here it always gives the same size (4194528), whereas the size should vary by a factor of 5 and probably differ between the string and bytes cases.
How can I measure the memory size taken by these sets and all their elements?
Note: I know that finding the whole memory taken by a structure is not easy in Python (see also In-memory size of a Python structure), because we need to take into account all the linked elements.
TL;DR: Is there a tool in Python to automatically measure the memory size of a set + the memory taken by the internal references (pointers?), the hashtable buckets, the elements (strings here) that are hosted in the set...? In short: every byte that is necessary for this set of strings. Does such a memory measurement tool exist?
Answer:
sys.getsizeof does not measure the size of the full target data structure. It only measures the memory taken by the set object itself, which holds references to the string/bytes objects. The referenced objects are not included in the returned memory consumption (i.e. it does not walk recursively through each object of the target data structure). A reference typically takes 8 bytes on a 64-bit platform, and a CPython set is not as compact as a list: it is implemented as a hash table with many buckets, some of which are unused. In fact, this is mandatory for the data structure to be fast (in general, the occupancy should be 50%-90%). Moreover, each bucket also stores a hash, which usually takes 8 bytes. The strings themselves take much more space than a bucket (at least on my machine); a sketch of how to account for them follows.
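A minimal sketch of that accounting, assuming the elements are plain str/bytes objects that hold no references of their own (so a single pass over the set is enough; an arbitrary nested structure would need a recursive walk instead):

import random
import string
import sys

randomstring = lambda length: ''.join(random.choice(string.ascii_lowercase) for _ in range(length))

s1 = {randomstring(10) for _ in range(100_000)}           # strings of length 10
s3 = {randomstring(10).encode() for _ in range(100_000)}  # bytes of length 10

def set_footprint(s):
    # Shallow part: the set object itself (buckets, stored hashes, references).
    container = sys.getsizeof(s)
    # Deep part: the str/bytes objects referenced by the set. Summing
    # sys.getsizeof over the elements is enough here because str/bytes do
    # not reference further objects.
    elements = sum(sys.getsizeof(x) for x in s)
    return container + elements

print(set_footprint(s1))
print(set_footprint(s3))

For arbitrary nested structures, the third-party pympler package provides asizeof.asizeof(obj), which performs such a recursive walk automatically and is probably the closest thing to the ready-made tool the question asks for.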
On my machine, it turns out that CPython str objects are 16 bytes bigger than the equivalent bytes objects.
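That per-object difference can be checked directly with sys.getsizeof on individual objects. The exact numbers below are only indicative: they depend on the CPython version and build (the values in the comments were observed on a 64-bit CPython 3.10, where ASCII str objects carry 16 bytes more overhead than bytes objects):

import sys

text = 'a' * 10    # 10-character ASCII string
data = b'a' * 10   # 10-byte bytes object

print(sys.getsizeof(text))                         # e.g. 59 on a 64-bit CPython 3.10
print(sys.getsizeof(data))                         # e.g. 43 on a 64-bit CPython 3.10
print(sys.getsizeof(text) - sys.getsizeof(data))   # e.g. 16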