Why does a dict of defaultdict(int) use so much memory? (and other simple Python performance questions)

I do understand that querying a non-existent key in a defaultdict the way I do will add items to the defaultdict. That is why it is fair to compare my 2nd code snippet to my first one in terms of performance.

import numpy as num
from collections import defaultdict

topKeys = range(16384)
keys = range(8192)

table = dict((k,defaultdict(int)) for k in topKeys)

dat = num.zeros((16384,8192), dtype="int32")

print "looping begins"
#how much memory should this use? I think it shouldn't use more than a few
#times the memory required to hold (16384*8192) int32's (512 mb), but
#it uses 11 GB!
for k in topKeys:
    for j in keys:
        dat[k,j] = table[k][j]

print "done"

What is going on here? Furthermore, this similar script takes eons to run compared to the first one, and also uses an absurd quantity of memory.

topKeys = range(16384)
keys = range(8192)
table = [(j,0) for k in topKeys for j in keys]

I guess python ints might be 64 bit ints, which would account for some of this, but do these relatively natural and simple constructions really produce such a massive overhead?
I guess these scripts show that they do, so my question is: what exactly is causing the high memory usage in the first script, and the long runtime and high memory usage of the second, and is there any way to avoid these costs?

Edit:
Python 2.6.4 on a 64-bit machine.

Edit 2: I can see why, to a first approximation, my table should take up 3 GB (16384*8192*(12+12) bytes), and 6 GB with a defaultdict load factor that forces it to reserve double the space. Then inefficiencies in memory allocation eat up another factor of 2.

So here are my remaining questions:
Is there a way for me to tell it to use 32 bit ints somehow?

And why does my second code snippet take FOREVER to run compared to the first one? The first one takes about a minute and I killed the second one after 80 minutes.

3 Answers

稀香 2024-09-06 13:46:59

Python ints are internally represented as C longs (it's actually a bit more complicated than that), but that's not really the root of your problem.

The biggest overhead is your usage of dicts. (defaultdicts and dicts are about the same in this description). dicts are implemented using hash tables, which is nice because it gives quick lookup of pretty general keys. (It's not so necessary when you only need to look up sequential numerical keys, since they can be laid out in an easy way to get to them.)

A dict can have many more slots than it has items. Let's say you have a dict with 3x as many slots as items. Each of these slots needs room for a pointer to a key and a pointer serving as the end of a linked list. That's 6x as many pointers as numbers, plus all the pointers to the items you're interested in. Consider that each of these pointers is 8 bytes on your system and that you have 16384 defaultdicts in this situation. As a rough, handwavey look at this, 16384 occurrences * (8192 items/occurrence) * 7 (pointers/item) * 8 (bytes/pointer) = 7 GB. This is before I've gotten to the actual numbers you're storing (each unique number of which is itself a Python object), the outer dict, that numpy array, or the stuff Python keeps track of to try to optimize things.
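
A quick back-of-the-envelope check of that estimate; the 3x slot-to-item ratio and the 8-byte pointer size are assumptions carried over from the paragraph above, not measured values:

top_dicts = 16384      # number of defaultdicts
items = 8192           # entries stored per defaultdict
ptrs_per_item = 7      # ~3x slots * 2 pointers each, plus one pointer to the value
ptr_bytes = 8          # pointer size on a 64-bit build

print top_dicts * items * ptrs_per_item * ptr_bytes / float(2**30), "GB"   # ~7.0 GB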

Your overhead sounds a little higher than I would suspect, and I would be interested in knowing whether that 11 GB was for the whole process or whether you calculated it for just table. In any event, I do expect the size of this dict-of-defaultdicts data structure to be orders of magnitude bigger than the numpy array representation.

As to "is there any way to avoid these costs?" the answer is "use numpy for storing large, fixed-size contiguous numerical arrays, not dicts!" You'll have to be more specific and concrete about why you found such a structure necessary for better advice about what the best solution is.
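
For what it's worth, here is a minimal sketch of the numpy-only approach this answer is pointing at, assuming the counts fit in an int32 and that the (k, j) pairs are just indices; the increment below is a hypothetical update, not code from the question:

import numpy as num

# Keep the counts directly in a fixed-size int32 array: ~512 MB total,
# with no per-entry Python objects or hash-table slots.
dat = num.zeros((16384, 8192), dtype="int32")

k, j = 12, 345        # hypothetical indices
dat[k, j] += 1        # update a cell in place instead of table[k][j] += 1

print dat[k, j]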

木槿暧夏七纪年 2024-09-06 13:46:59

Well, look at what your code is actually doing:

topKeys = range(16384)
table = dict((k,defaultdict(int)) for k in topKeys)

This creates a dict holding 16384 defaultdict(int)'s. A dict has a certain amount of overhead: the dict object itself is between 60 and 120 bytes (depending on the size of pointers and ssize_t's in your build.) That's just the object itself; unless the dict holds fewer than a couple of items, the data is a separate block of memory, between 12 and 24 bytes per entry, and it's always between 1/2 and 2/3 full. And defaultdicts are 4 to 8 bytes bigger because they have this extra thing to store. And ints are 12 bytes each, and although they're reused where possible, that snippet won't reuse most of them. So, realistically, in a 32-bit build, that snippet will take up 60 + (16384*12) * 1.8 (fill factor) bytes for the table dict, 16384 * 64 bytes for the defaultdicts it stores as values, and 16384 * 12 bytes for the integers. So that's just over a megabyte and a half without storing anything in your defaultdicts. And that's in a 32-bit build; a 64-bit build would be twice that size.
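
If you want to check the per-object figures on your own build, sys.getsizeof (available since Python 2.6) reports the size of the object itself, not of anything it references; the exact numbers vary between 32-bit and 64-bit builds:

import sys
from collections import defaultdict

print sys.getsizeof({})                  # the dict object itself
print sys.getsizeof(defaultdict(int))    # typically a bit bigger, for the extra default_factory
print sys.getsizeof(0)                   # a Python int object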

Then you create a numpy array, which is actually pretty conservative with memory:

dat = num.zeros((16384,8192), dtype="int32")

This will have some overhead for the array itself: the usual Python object overhead plus the dimensions and type of the array and such, but it wouldn't be much more than 100 bytes, and only for the one array. It does store 16384*8192 int32's, which is your 512 MB, though.
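
You can confirm that figure directly from the array; nbytes counts only the data buffer, not the small Python object wrapping it:

import numpy as num

dat = num.zeros((16384, 8192), dtype="int32")
print dat.nbytes / float(2**20), "MB"    # 16384 * 8192 * 4 bytes = 512.0 MB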

And then you have this rather peculiar way of filling this numpy array:

for k in topKeys:
    for j in keys:
        dat[k,j] = table[k][j]

The two loops themselves don't use much memory, and they re-use it each iteration. However, table[k][j] creates a new Python integer for each value you request, and stores it in the defaultdict. The integer created is always 0, and it so happens that that always gets reused, but storing the reference to it still uses up space in the defaultdict: the aforementioned 12 bytes per entry, times the fill factor (between 1.66 and 2). That lands you close to 3 GB of actual data right there, and 6 GB in a 64-bit build.
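
A minimal illustration of that insertion behaviour (the keys here are arbitrary): indexing a missing key stores it, while .get() returns a default without storing anything.

from collections import defaultdict

d = defaultdict(int)
x = d[5]            # __getitem__ on a missing key inserts 5 -> 0
print len(d)        # 1
y = d.get(6, 0)     # .get() returns 0 but does not insert a new entry
print len(d)        # still 1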

On top of that, the defaultdicts, because you keep adding data, have to keep growing, which means they have to keep reallocating. Because of Python's malloc frontend (obmalloc) and how it allocates smaller objects in blocks of its own, and how process memory works on most operating systems, this means your process will allocate more and not be able to free it; it won't actually use all of the 11 GB, and Python will re-use the available memory in between the large blocks for the defaultdicts, but the total mapped address space will be that 11 GB.

千紇 2024-09-06 13:46:59

Mike Graham gives a good explanation of why dictionaries use more memory, but I thought that I'd explain why your table dict of defaultdicts starts to take up so much memory.

The way the defaultdict (DD) is set up right now, whenever you retrieve an element that isn't in the DD, you get the default value for the DD (0 in your case), but the DD also now stores that previously missing key with the default value of 0. I personally don't like this, but that's how it goes. However, it means that for every iteration of the inner loop new memory is being allocated, which is why it is taking forever. If you change the lines

for k in topKeys:
    for j in keys:
        dat[k,j] = table[k][j]

to

for k in topKeys:
    for j in keys:
        if j in table[k]:
            dat[k,j] = table[k][j]
        else:
            dat[k,j] = 0

then default values aren't being assigned to keys in the DDs, and so the memory stays around 540 MB for me, which is mostly just the memory allocated for dat. DDs are decent for sparse matrices, though you probably should just use the sparse matrices in SciPy if that's what you want.
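
If sparse data really is what you have, here is a minimal sketch using scipy.sparse.dok_matrix (assuming SciPy is installed; the shape simply mirrors the question, and the indices are arbitrary):

from scipy.sparse import dok_matrix

# A dictionary-of-keys sparse matrix only stores the cells you actually set.
m = dok_matrix((16384, 8192), dtype="int32")
m[12, 345] = 7            # set one cell
print m[12, 345]          # 7
print m[0, 0]             # 0, without allocating anything for that cell
dense = m.toarray()       # expand to a regular numpy array only when needed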
