Memory usage of a list of millions of strings in Python
As seen in Find the memory size of a set of strings vs. set of bytestrings, it's difficult to precisely measure the memory used by a set or list containing strings. But here is a good estimation/upper bound:
import os, psutil
process = psutil.Process(os.getpid())
a = process.memory_info().rss
L = [b"a%09i" % i for i in range(10_000_000)]
b = process.memory_info().rss
print(L[:10]) # [b'a000000000', b'a000000001', b'a000000002', b'a000000003', b'a000000004', b'a000000005', b'a000000006', b'a000000007', b'a000000008', b'a000000009']
print(b-a)
# 568762368 bytes
i.e. 569 MB for 100 MB of actual data.
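(A quick sanity check, not part of the original post: the "100 MB of actual data" figure follows from each formatted bytestring carrying exactly 10 payload bytes.)

```python
# Each formatted bytestring is exactly 10 bytes of payload:
# b"a" followed by a zero-padded 9-digit integer.
sample = b"a%09i" % 123
print(sample)       # b'a000000123'
print(len(sample))  # 10

# 10 million such strings at 10 bytes each of raw data:
payload = 10 * 10_000_000
print(payload)      # 100000000 bytes, i.e. ~100 MB
```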
Solutions to improve this (for example with other data structures) have been found in Memory-efficient data structure for a set of short bytes-strings and Set of 10-char strings in Python is 10 times bigger in RAM as expected, so my question here is not "how to improve", but:

How can we precisely explain this size in the case of a standard list of byte-strings?

How many bytes for each byte-string, and for each (linked?) list item, to finally arrive at 569 MB?

This will help in understanding the internals of lists and byte-strings in CPython (platform: Windows 64-bit).
Summary:

sys.getsizeof(L) will tell you the list object itself is about 89 MB. That's a few dozen bytes of bookkeeping, 8 bytes per bytestring reference, and up to 12.5% over-allocation to allow efficient insertions.

sys.getsizeof(one_of_your_bytestrings) will tell you they're 43 bytes each. But storing the objects every 43 bytes in memory would cross memory word boundaries, which is slower, so they're actually stored usually every 48 bytes. You can use id(one_of_your_bytestrings) to get the addresses and check.

(There's some variance here and there, partly due to the exact memory allocations that happen, but 569 MB is about what's expected given the above reasons, and it matches what you measured.)
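Putting those pieces together in numbers, a rough accounting sketch looks like this; the 48-byte slot size and the 12.5% over-allocation factor are the assumptions stated above, and exact values vary with CPython version, platform, and allocator state:

```python
import sys

n = 10_000_000

# One of the bytestrings from the question: 10 payload bytes
# plus CPython's bytes-object overhead (33 bytes on 64-bit builds).
s = b"a%09i" % 0
print(sys.getsizeof(s))          # 43 on a 64-bit CPython

# Assumption from the answer: the allocator rounds each 43-byte
# object up to a 48-byte slot to keep objects word-aligned.
per_bytestring = 48

# List: ~56 bytes of header, one 8-byte pointer per item, and up to
# 12.5% over-allocation of the pointer array (upper-bound estimate).
list_size = 56 + 8 * n * 1.125   # ~90 MB

total = list_size + per_bytestring * n
print(round(total / 1e6))        # 570 (MB), close to the measured 569 MB
```

The small gap between the ~570 MB estimate and the measured 569 MB is the variance mentioned above: the list's actual over-allocation at the end of the comprehension is usually below the 12.5% upper bound.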