Compiling to byte code takes up too much memory

Posted 2024-11-27 07:48:00

I need to import a very large dictionary into python and I'm running into some unexpected memory bottlenecks. The dictionary has the form,

d = {(1,2,3):(1,2,3,4), (2,5,6):(4,2,3,4,5,6), ... }

So each key is a 3-tuple and each value is a relatively small tuple of arbitrary size (probably never more than 30 elements). What makes the dictionary large is the number of keys. A smaller example of what I'm working with has roughly 247257 keys. I generate this dictionary through a simulation, so I can write out a text file that defines it, and for the example I just mentioned this is a 94MB file. The bottleneck I am running into is that the initial compile to Python byte code eats up about 14GB of RAM. So the first time I import the dictionary I see the RAM usage spike up, and after a good 10 seconds everything is loaded. If the .pyc file is already generated, the import is nearly instant. Using pympler, I've determined that this dictionary is only about 200 MB in memory. What is the deal here? Do I have any other options for getting this dictionary loaded into Python, or at least compiled to byte code? I'm running the generating simulations in C++ and I can write files in whatever format I need. Are there any options there (Python libraries, etc.)? I'm interfacing with some software that needs this data as a dictionary, so please no other suggestions in that realm. Also, just in case you are wondering, I have defined the dictionary in the text file like the definition above, as well as like so,

d = {}
d[1,2,3] = (1,2,3,4)
d[2,5,6] = (4,2,3,4,5,6)
...

Both give the same memory spike when compiled to byte code. In fact, the second one seems to be slightly worse, which is surprising to me. There's got to be some way to tame the amount of RAM the initial compile needs. It seems like it should somehow be able to do the compile one key-value pair at a time. Any ideas?

Other info:
using Python 2.6.5

゛时过境迁 2024-12-04 07:48:00

I guess the problem is that while parsing your file, an enormous syntax tree is built, with a small overhead for each element that all adds up. Once the byte code is generated, the syntax tree is no longer needed and is dumped, leaving you with your 200MB of data.

Have you tried storing the data in a separate file in the following format, and then loading it dynamically in Python?

1,2,3=1,2,3
2,5,6=4,2,3,4,5,6

The Python script should look something like this:

f = open("filename")   # 'f' avoids shadowing the built-in 'file'
d = {}

for line in f:
    key, val = line.strip().split("=")
    # convert both sides back to integer tuples so lookups such as
    # d[1,2,3] behave like the original dictionary
    key = tuple(int(x) for x in key.split(","))
    d[key] = tuple(int(x) for x in val.split(","))

f.close()
对风讲故事 2024-12-04 07:48:00

I'm guessing that your big compile spike happens when you do "import module_containing_humungous_dict_statement". Then it doesn't matter whether you've got just one statement or 247257 separate assignment statements; the whole module will still get compiled at once. You could try using the separate-assignment-statement form, and then opening the file, reading one line at a time, and exec'ing each line, as in the sketch below. Then you will only be compiling one line at a time. It will probably take a while.
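
A minimal sketch of that approach, assuming the separate-assignment form of the file ("dict_assignments.txt" is a hypothetical filename; Python 2 syntax to match the question):

namespace = {}
f = open("dict_assignments.txt")

for line in f:
    # compile and run one small statement at a time; the first line of
    # the file is expected to be "d = {}", as in the question's second format
    exec line in namespace

f.close()
d = namespace["d"]  # the fully built dictionary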

七月上 2024-12-04 07:48:00

I suspect creating the tuple to use as a key is what is expensive. Define a function that takes the three parts of the triple as input and returns a pipe-delimited string. Use that as your key, as in the sketch below.
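
A minimal sketch of that idea (make_key is a hypothetical name, not part of the question's code):

def make_key(a, b, c):
    # build a pipe-delimited string key from the three parts of the triple
    return "%d|%d|%d" % (a, b, c)

d = {}
d[make_key(1, 2, 3)] = (1, 2, 3, 4)
d[make_key(2, 5, 6)] = (4, 2, 3, 4, 5, 6)

print d[make_key(1, 2, 3)]  # -> (1, 2, 3, 4)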

心在旅行 2024-12-04 07:48:00

The way I read your question, you are generating Python source in your simulator, and the generated source has the contents of the giant dictionary hard-coded. If that is true, then you might just as easily generate this:

def giantdict():
  d0 = {(1, 2): (3, 4), (3, 4): (5, 6), ...}  # first 1000 key/value pairs here
  d1 = {(1, 2): (3, 4), (3, 4): (5, 6), ...}  # next 1000 key/value pairs
  d2 = {(1, 2): (3, 4), (3, 4): (5, 6), ...}  # next 1000 key/value pairs
  d3 = {(1, 2): (3, 4), (3, 4): (5, 6), ...}  # next 1000 key/value pairs
  # ... until you're done
  bigd = d0
  bigd.update(d1)
  del d1
  bigd.update(d2)
  del d2
  # ... continue updating with all the dN dictionaries
  return bigd

I'm not sure that this will improve the compile time, but it would be something to try. If there is a penalty for putting everything in one data structure at compile time, splitting it up and assembling the pieces at run time may work around it.

While this kind of code (mine or yours) would draw my fury and ire if a human wrote it, I see no need for generated code to be "nice", as long as you know that no human will ever need to read it or maintain it.
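
Since the source is machine-generated anyway, here is a hypothetical sketch of the generator side, in Python for brevity (the question's real generator is C++; write_giantdict and the 1000-pair chunk size are assumptions):

def write_giantdict(pairs, out, chunk=1000):
    # emit a giantdict() function that builds the dict in chunk-sized pieces;
    # 'pairs' is a list of (key_tuple, value_tuple), 'out' is a writable file
    out.write("def giantdict():\n")
    names = []
    for i in range(0, len(pairs), chunk):
        name = "d%d" % (i // chunk)
        names.append(name)
        body = ", ".join("%r: %r" % kv for kv in pairs[i:i + chunk])
        out.write("  %s = {%s}\n" % (name, body))
    out.write("  bigd = %s\n" % names[0])
    for name in names[1:]:
        out.write("  bigd.update(%s)\n" % name)
        out.write("  del %s\n" % name)
    out.write("  return bigd\n")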

花落人断肠 2024-12-04 07:48:00

Here's a class that uses a defaultdict for the automatic nesting of indexed values, with some special __getitem__ and __setitem__ methods to accept tuples as arguments:

from collections import defaultdict

# factory for a three-level nesting of defaultdicts; a missing innermost
# entry defaults to the empty tuple
defdict3level = lambda: defaultdict(lambda:
                            defaultdict(lambda:
                                defaultdict(tuple)))

class dict3level(object):
    def __init__(self):
        self.defdict = defdict3level()

    def __getitem__(self, key):
        # accept 1-, 2-, or 3-tuples and descend that many levels
        if isinstance(key, tuple):
            if len(key) == 3:
                return self.defdict[key[0]][key[1]][key[2]]
            elif len(key) == 2:
                return self.defdict[key[0]][key[1]]
            elif len(key) == 1:
                return self.defdict[key[0]]
        else:
            return self.defdict[key]

    def __setitem__(self, key, value):
        if isinstance(key, tuple) and len(key) == 3:
            self.defdict[key[0]][key[1]][key[2]] = value
        else:
            self.defdict[key] = value

    def __getattr__(self, attr):
        # delegate other attribute access (keys, items, etc.) to the defaultdict
        return getattr(self.defdict, attr)

Now exec all your assignments like before:

d = dict3level()
d[1,2,3] = (1,2,3,4)
d[1,2,7] = (3,4,5,6)
d[2,5,6] = (4,2,3,4,5,6)

You can still get a specific entry for a specific tuple:

# get a specific entry
print d[1,2,3]

But you can also navigate your dict by levels:

# get all different 0'th index values
print d.keys()

# get all sub values in d[1,2,*]
print d[1,2].keys()
for key in d[1,2]:
    print "d[1,2,%d] = %s" % (key, d[1,2][key])

# no such entry, return empty tuple
print d[1,2,0]

Gives:

print d[1,2,3] -> (1, 2, 3, 4)
print d.keys() -> [1, 2]
print d[1,2].keys() -> [3, 7]
for key in d[1,2]:... -> 
    d[1,2,3] = (1, 2, 3, 4)
    d[1,2,7] = (3, 4, 5, 6)
print d[1,2,0] -> ()

(Don't know how this will affect your memory and/or pickling issues, but the resulting structure has a lot more capability to it.)
