How does the operating system handle a Python dict larger than memory?
I have a Python program that is going to use a lot of memory, primarily in a dict. This dict will be responsible for assigning a unique integer value to each of a very large set of keys. Because I am working with large matrices, I need a key-to-index correspondence that can also be inverted (i.e., once the matrix computations are complete, I need to map the indices back to the original keys).
I believe the dict will eventually exceed available memory. How will this be handled with regard to swap space? Perhaps there is a better data structure for this purpose.
Comments (4)
You need a database if the data will exceed memory. Dictionary indexing isn't designed to perform well when the dictionary is bigger than memory.
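To make the database suggestion concrete, here is a minimal sketch using the standard-library sqlite3 module. The `KeyIndex` class and its method names are hypothetical, and it assumes the keys are strings and that rows are never deleted (so the indices stay dense). It is one possible layout, not a tuned implementation.

```python
import sqlite3


class KeyIndex:
    """Disk-backed key <-> integer index mapping (hypothetical helper)."""

    def __init__(self, path=":memory:"):
        # ":memory:" is only convenient for a demo; pass a file path so the
        # table actually lives on disk instead of in RAM.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS keys ("
            " idx INTEGER PRIMARY KEY,"  # alias for SQLite's rowid
            " key TEXT UNIQUE NOT NULL)"  # UNIQUE also creates a lookup index
        )

    def index_of(self, key):
        """Return the index for key, assigning the next free one if unseen."""
        row = self.conn.execute(
            "SELECT idx FROM keys WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0] - 1
        cur = self.conn.execute("INSERT INTO keys (key) VALUES (?)", (key,))
        # rowids start at 1; shift so matrix indices start at 0.
        return cur.lastrowid - 1

    def key_of(self, index):
        """Map a matrix index back to the original key."""
        row = self.conn.execute(
            "SELECT key FROM keys WHERE idx = ?", (index + 1,)
        ).fetchone()
        if row is None:
            raise KeyError(index)
        return row[0]

    def close(self):
        self.conn.commit()
        self.conn.close()
```

Calling `index_of` for each key while building the matrix, then `key_of` once the computation is done, gives the two directions the question asks for. For bulk loads it would likely be worth batching the inserts in a single transaction.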
Swap space is a kernel feature and is transparent to the user process (here, Python).
If you do have a huge dict and don't need all the data at once, you could look at redis, which might do what you want. Or maybe not :)
It will just end up thrashing swap, because a hash table has a highly randomized memory access pattern.
If you know that the map exceeds the size of physical memory, you could consider using an on-disk data structure in the first place, especially if you don't need the data structure during the computation. When the hash table triggers swapping, it also causes problems outside the hash table itself.
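As one illustration of an on-disk structure, the standard-library dbm module offers a dict-like key-value file. The sketch below is only an assumption about how the mapping could be laid out: the function names and the `k:` / `i:` / `n` record prefixes are made up for this example. Which backend `dbm.open` picks depends on the Python build, and the pure-Python fallback `dbm.dumb` keeps its whole index in memory, so it would not help with this particular problem.

```python
import dbm


def open_index(path):
    # "c" opens the database for reading and writing, creating it if needed.
    return dbm.open(path, "c")


def index_of(db, key):
    """Return the index stored for key, assigning the next one if unseen."""
    k = b"k:" + key.encode("utf-8")
    existing = db.get(k)
    if existing is not None:
        return int(existing)
    # b"n" holds the number of keys assigned so far (hypothetical layout).
    idx = int(db.get(b"n", b"0"))
    db[k] = str(idx).encode("ascii")
    db[b"i:" + str(idx).encode("ascii")] = key.encode("utf-8")
    db[b"n"] = str(idx + 1).encode("ascii")
    return idx


def key_of(db, idx):
    """Map an index back to its key; raises KeyError if it was never assigned."""
    return db[b"i:" + str(idx).encode("ascii")].decode("utf-8")
```

Since the reverse records are only needed after the computation, the file can be closed while the matrix work runs and reopened afterwards, so it takes up no process memory in between.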
As far as I can remember, when a dict is expanded it just relies on C's malloc. The program will keep running as long as malloc keeps succeeding. Most OSes will keep malloc working as long as there is enough physical memory, and after that as long as there are pages that can be swapped out. In either case, Python will raise a MemoryError exception when malloc fails, as per the documentation. As far as the data structure goes, dict is going to be very efficient space-wise. The only way to really do better is to use an analytical function to map the values back and forth.
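An analytical mapping is only possible when the keys have structure that can be computed on. As a purely hypothetical example, suppose every key is a (row, col) pair on a grid of known width `ncols`. The index and its inverse then need no storage at all:

```python
def key_to_index(row, col, ncols):
    """Map a (row, col) key to a unique integer, assuming 0 <= col < ncols."""
    return row * ncols + col


def index_to_key(index, ncols):
    """Invert key_to_index: recover the (row, col) pair from the integer."""
    return divmod(index, ncols)
```

With `ncols = 10`, the key (2, 3) maps to 23 and `index_to_key(23, 10)` gives back (2, 3). Arbitrary keys such as free-form strings have no inverse of this kind, which is why a stored mapping is needed in the general case.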