Memory usage of a list of millions of strings in Python

Posted on 2025-01-09 06:18:13

As seen in "Find the memory size of a set of strings vs. set of bytestrings", it's difficult to precisely measure the memory used by a set or list containing strings. But here is a good estimation/upper bound:

import os, psutil  # psutil is a third-party package
process = psutil.Process(os.getpid())
a = process.memory_info().rss  # resident set size before building the list
L = [b"a%09i" % i for i in range(10_000_000)]
b = process.memory_info().rss  # resident set size after building the list
print(L[:10])  # [b'a000000000', b'a000000001', b'a000000002', b'a000000003', b'a000000004', b'a000000005', b'a000000006', b'a000000007', b'a000000008', b'a000000009']
print(b-a)  # growth of the process's resident memory, in bytes
# 568762368 bytes

i.e. 569 MB for 100 MB of actual data.

Solutions to improve this (for example with other data structures) have been found in "Memory-efficient data structure for a set of short bytes-strings" and "Set of 10-char strings in Python is 10 times bigger in RAM as expected", so here my question is not "how to improve", but:

How can we precisely explain this size in the case of a standard list of byte-strings?

How many bytes are used for each byte-string and for each (linked?) list item, to finally arrive at 569 MB?

This will help to understand the internals of lists and bytes-strings in CPython (platform: Windows 64 bit).

Comments (1)

动听の歌 2025-01-16 06:18:13

Summary:

  • 89 MB for the list object
  • 480 MB for the bytestring objects (10 million × 48 bytes)
  • => total 569 MB

sys.getsizeof(L) will tell you the list object itself is about 89 MB. That's a few dozen bytes of header, 8 bytes per bytestring reference (a list is a contiguous array of pointers, not a linked list), and up to 12.5% overallocation to allow efficient appends.
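A rough way to check the list figure yourself is sketched below. It uses a smaller list than the question (1 million items, an arbitrary choice to keep it quick) and assumes a standard CPython build; the variable names are only illustrative.

```python
import struct
import sys

n = 1_000_000  # smaller than the 10 million in the question, to run quickly
L = [b"a%09i" % i for i in range(n)]

pointer_size = struct.calcsize("P")   # size of a C pointer; 8 on a 64-bit build
list_bytes = sys.getsizeof(L)         # list header plus pointer array, not the items
empty_list_bytes = sys.getsizeof([])  # the fixed header part alone

print("pointer size:", pointer_size)
print("list object :", list_bytes, "bytes")
print("per element :", (list_bytes - empty_list_bytes) / n, "bytes")
```

The per-element figure should land between the pointer size and about 12.5% above it, depending on how much spare capacity the last resize left.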

sys.getsizeof(one_of_your_bytestrings) will tell you they're 43 bytes each. That's:

  • 8 bytes for the reference counter
  • 8 bytes for the pointer to the type
  • 8 bytes for the length (since bytestrings aren't fixed size)
  • 8 bytes for the cached hash
  • 10 bytes for your actual bytestring content
  • 1 byte for a terminating 0-byte
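The breakdown above can be reproduced as a small sketch. The four-field header layout is taken from the list above and assumed to hold for a standard (non-free-threaded) CPython build; `header` and `expected` are just illustrative names.

```python
import struct
import sys

s = b"a000000000"  # one 10-byte bytestring like those in the question

pointer_size = struct.calcsize("P")  # 8 on a 64-bit build
# Assumed header layout: refcount, type pointer, length, cached hash.
header = 4 * pointer_size
expected = header + len(s) + 1       # +1 for the terminating 0-byte

print("sys.getsizeof(s):", sys.getsizeof(s))
print("header + content + terminator:", expected)
```

On a 64-bit build the second line works out to 32 + 10 + 1 = 43, which is the figure quoted above.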

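Storing the objects every 43 bytes in memory would leave most of them misaligned with memory word boundaries, which is slower. So the allocator rounds each allocation up, and the objects usually end up 48 bytes apart. You can compare id(one_of_your_bytestrings) for consecutive items to check the addresses (in CPython, id returns the object's address).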
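One way to do that address check is sketched here, assuming CPython (where id returns the memory address) and using a shorter list of 100,000 items for speed. The most frequent gap is what matters; a few other values are normal because the allocator reuses freed blocks and starts new pools.

```python
from collections import Counter

L = [b"a%09i" % i for i in range(100_000)]

# In CPython, id(obj) is the object's memory address.
addresses = [id(s) for s in L]
gaps = Counter(b - a for a, b in zip(addresses, addresses[1:]))

# Show the most frequent distances between consecutive objects.
for gap, count in gaps.most_common(3):
    print(f"gap of {gap} bytes: {count} times")
```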

(There's some variance here and there, partly due to the exact memory allocations that happen, but 569 MB is about what you'd expect given the reasons above, and it matches what you measured.)
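The arithmetic behind the total can be written out as a back-of-the-envelope sketch. It assumes 8-byte pointers, rounding of each 43-byte object up to 48 bytes, and the full 12.5% list overallocation as an upper bound; `round_up` is a helper defined here, not a library function.

```python
def round_up(size, alignment):
    """Round size up to the next multiple of alignment."""
    return -(-size // alignment) * alignment

n = 10_000_000
object_size = 43                       # sys.getsizeof of one bytestring, per the answer
slot_size = round_up(object_size, 8)   # assumed 8-byte alignment -> 48
objects_total = n * slot_size          # 480_000_000
list_upper_bound = int(n * 8 * 1.125)  # pointers plus at most 12.5% spare capacity
estimate = objects_total + list_upper_bound
measured = 568_762_368                 # RSS difference reported in the question

print(slot_size, objects_total, list_upper_bound, estimate)
print(f"estimate / measured = {estimate / measured:.4f}")
```

This gives an upper bound of 570 MB (the list is at most 90 MB here, versus the roughly 89 MB that sys.getsizeof reports), which is within one percent of the measured value.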
