Memory usage of a probabilistic parser
I am writing a CKY parser for a Range Concatenation Grammar. I want to use a treebank as the grammar, so the grammar will be large. I've written a prototype 1 in Python, and it seems to work well when I simulate a treebank of a couple dozen sentences, but the memory usage is unacceptable. I tried writing it in C++, but so far that has been very frustrating, as I have never used C++ before. Here's some data (n is the number of sentences the grammar is based on):
n mem
9 173M
18 486M
36 836M
This growth pattern is to be expected given the best-first algorithm, but the amount of overhead is what concerns me. The memory usage according to heapy is a factor of ten smaller than these numbers; valgrind reported something similar. What causes this discrepancy, and is there anything I can do about it in Python (or Cython)? Perhaps it's due to fragmentation? Or maybe it is the overhead of Python dictionaries?
Some background: the two important data structures are the agenda, which maps edges to probabilities, and the chart, a dictionary mapping nonterminals and positions to edges. The agenda is implemented with a heapdict (which internally uses a dict and a heapq list). The agenda sees frequent insertions and removals; the chart only gets insertions and lookups. I represent edges with tuples like this:
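The agenda/chart interplay described above can be sketched with plain stdlib pieces (a dict plus a heapq list, which is what heapdict uses internally). This is a minimal illustration, not the actual parser's code; the Agenda class and the decrease-key-by-lazy-deletion scheme are assumptions about how such an agenda can be built:

```python
import heapq

class Agenda:
    """Priority queue mapping edges to probabilities, with decrease-key
    implemented by lazy deletion (stale heap entries are skipped on pop)."""

    def __init__(self):
        self.priorities = {}   # edge -> current priority
        self.heap = []         # (priority, edge) pairs; may hold stale entries

    def setitem(self, edge, priority):
        self.priorities[edge] = priority
        heapq.heappush(self.heap, (priority, edge))

    def popitem(self):
        # Skip heap entries whose priority was updated after insertion.
        while self.heap:
            priority, edge = heapq.heappop(self.heap)
            if self.priorities.get(edge) == priority:
                del self.priorities[edge]
                return edge, priority
        raise KeyError("agenda is empty")

# Chart: (nonterminal, position-bitmask) -> edges that derived that item.
chart = {}
edge = (("S", 0b111), ("NP", 0b010), ("VP", 0b100, 0b001))
chart.setdefault(edge[0], []).append(edge)

agenda = Agenda()
agenda.setitem(edge, 0.5)
agenda.setitem(edge, 0.25)   # a better probability was found for this edge
print(agenda.popitem())      # returns the edge with its best priority, 0.25
```

heapdict itself offers the same operations through the mapping protocol (`hd[edge] = priority`, `hd.popitem()`), so this sketch only shows the mechanics, not an API to copy.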
(("S", 111), ("NP", 010), ("VP", 100, 001))
The strings are the nonterminal labels from the grammar; the positions are encoded as bitmasks. There can be multiple positions when a constituent is discontinuous. So this edge could represent an analysis of "is Mary happy", where "is" and "happy" both belong to the VP. The chart dictionary is indexed by the first element of this edge, ("S", 111) in this case. In a new version I tried transposing this representation in the hope that it would save memory due to reuse:
(("S", "NP", "VP"), (111, 100, 011))
I figured that Python would store the first part only once if it would occur in combination with different positions, although I'm not actually sure this is true. In either case, it didn't seem to make any difference.
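Encoding positions as bitmasks, as described above, makes the usual span operations cheap bitwise tests. The sketch below follows the "is Mary happy" example from the question; treating the leftmost word as the highest bit is an assumption about the convention, and `combine` is an illustrative helper, not part of the parser:

```python
# One bit per word position in "is Mary happy";
# leftmost word = highest bit (assumed convention).
IS, MARY, HAPPY = 0b100, 0b010, 0b001

np_span = MARY              # "Mary"
vp_parts = (IS, HAPPY)      # discontinuous VP: "is ... happy"

def combine(*masks):
    """Union of spans; valid only if the spans do not overlap."""
    result = 0
    for mask in masks:
        if result & mask:   # bitwise AND detects overlapping coverage
            raise ValueError("overlapping spans")
        result |= mask      # bitwise OR merges the spans
    return result

vp_span = combine(*vp_parts)     # 0b101: discontinuous coverage of the VP
s_span = combine(np_span, vp_span)
print(bin(s_span))               # -> 0b111, the whole sentence
```

Since the masks are plain small ints, they also hash and compare cheaply as dictionary keys, which is what the chart relies on.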
So basically what I am wondering is whether it is worth pursuing my Python implementation any further, including doing things with Cython and different data structures, or whether writing it from the ground up in C++ is the only viable option.
UPDATE: After some improvements I no longer have issues with memory usage. I'm working on an optimized Cython version. I'll award the bounty to the most useful suggestion for increasing efficiency of the code. There is an annotated version at http://student.science.uva.nl/~acranenb/plcfrs_cython.html
1 https://github.com/andreasvc/disco-dop/
-- run test.py to parse some sentences. Requires python 2.6, nltk and heapdict
Not necessarily:

You might want to intern all strings referring to non-terminals, since you seem to be creating a lot of these in rcgrules.py. If you want to intern a tuple, then turn it into a string first; otherwise, you'll have to "copy" the tuples instead of constructing them afresh.
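A minimal illustration of the interning idea (using sys.intern, the Python 3 spelling of Python 2's built-in intern; the label values here are made up for the example):

```python
import sys

# Interning guarantees one shared object per distinct string, so
# repeated nonterminal labels cost a single object instead of many copies.
label_a = sys.intern("".join(["N", "P"]))   # built at runtime, not a literal
label_b = sys.intern("NP")
print(label_a is label_b)                   # -> True: same object

# Tuples cannot be interned directly; one workaround is to intern a
# string rendering of the tuple and use that as the shared key.
edge_key = sys.intern(repr(("S", 0b111)))
same_key = sys.intern(repr(("S", 0b111)))
print(edge_key is same_key)                 # -> True: same object
```

The payoff is both memory (one object per label) and speed: identity comparison of interned strings short-circuits the dictionary lookups that the chart performs constantly.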
(If you're new to C++, then rewriting such an algorithm in it is unlikely to provide much of a memory benefit. You'd first have to evaluate various hash-table implementations and learn about the copying behavior of their containers. I've found boost::unordered_map to be quite wasteful with lots of small hash tables.)
Have you tried running your application with PyPy rather than CPython?
PyPy is a lot smarter than CPython about noticing commonalities and avoiding the memory overhead associated with duplicating things unnecessarily.
It's worth trying, anyway: http://pypy.org/
The first thing to do in these cases is always to profile.

The first thing I noticed is that very few functions come from the Cython module itself. Most of the time is spent in the tree.py module, and that may be the bottleneck.
Focusing on the Cython side, I see the richcmp function. We can optimize it simply by adding the types of the values in the method declaration, which brings its cost down. Using an elif chain instead of a single if additionally enables Cython's switch optimization.
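The two changes just described (typed arguments in the declaration, and an if/elif chain over the comparison opcode) might look roughly like this; the class and field names are guesses for illustration, not the actual disco-dop code:

```cython
# Hypothetical sketch -- ChartItem, label and vec are illustrative names.
cdef class ChartItem:
    cdef public int label
    cdef public unsigned long vec

    # Typing `other` and `op` lets Cython skip generic-object dispatch;
    # the elif chain over `op` can compile down to a C switch.
    def __richcmp__(ChartItem self, ChartItem other, int op):
        if op == 2:    # ==
            return self.label == other.label and self.vec == other.vec
        elif op == 3:  # !=
            return self.label != other.label or self.vec != other.vec
        elif op == 0:  # <
            return self.vec < other.vec
        elif op == 4:  # >
            return self.vec > other.vec
        elif op == 1:  # <=
            return self.vec <= other.vec
        elif op == 5:  # >=
            return self.vec >= other.vec
```

The integer opcodes (0 through 5) are the values Cython passes to __richcmp__ for <, <=, ==, !=, > and >=.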
Trying to figure out where that tree.py:399 convert comes from, I found that this function inside dopg.py takes all that time. Now, I am not sure whether each node in the tree is a ChartItem and whether the getitem value is being used somewhere else, but adding this change, plus another inside mostprobableparse, improved the timings further.
So the next steps are to optimize heapify and deduce_from; deduce_from can be optimized a bit more.
I will stop here, although I am confident we could keep optimizing as more insight into the problem is gained.

A series of unit tests would be useful to assert that each optimization doesn't introduce any subtle errors.

As a side note, try to use spaces instead of tabs.