Memory usage of a probabilistic parser

Posted 2024-10-25


I am writing a CKY parser for a Range Concatenation Grammar. I want to use a treebank as the grammar, so the grammar will be large. I've written a prototype 1 in Python, and it seems to work well when I simulate a treebank of a couple dozen sentences, but the memory usage is unacceptable. I tried writing it in C++, but so far that has been very frustrating, as I have never used C++ before. Here's some data (n is the number of sentences the grammar is based on):

n    mem
9    173M
18   486M
36   836M

This growth pattern is what is to be expected given the best-first algorithm, but the amount of overhead is what concerns me. The memory usage according to heapy is a factor ten smaller than these numbers, valgrind reported something similar. What causes this discrepancy and is there anything I can do about it in Python (or Cython)? Perhaps it's due to fragmentation? Or maybe it is the overhead of python dictionaries?

Some background: the two important datastructures are the agenda mapping edges to probabilities, and the chart, which is a dictionary mapping nonterminals and positions to edges. The agenda is implemented with a heapdict (which internally uses a dict and a heapq list), the chart with a dictionary mapping nonterminals and positions to edges. The agenda is frequently inserted and removed from, the chart only gets insertions and lookups. I represent edges with tuples like this:

(("S", 111), ("NP", 010), ("VP", 100, 001))

The strings are the nonterminal labels from the grammar, and the positions are encoded as bitmasks. There can be multiple positions when a constituent is discontinuous, so this edge could represent an analysis of "is Mary happy", where "is" and "happy" both belong to the VP. The chart dictionary is indexed by the first element of this edge, ("S", 111) in this case. In a new version I tried transposing this representation, in the hope that it would save memory due to reuse:

(("S", "NP", "VP), (111, 100, 011))

I figured that Python would store the first part only once if it would occur in combination with different positions, although I'm not actually sure this is true. In either case, it didn't seem to make any difference.
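To make the bitmask encoding concrete, here is a small pure-Python sketch (the helper names are mine, not from the parser): positions over "is Mary happy" are sets of word indices packed into integers, so checking that two spans don't overlap, and merging them, are single bitwise operations:

```python
# Positions of constituents over the sentence "is Mary happy",
# encoded as bitmasks: bit i set means word i is covered.
IS, MARY, HAPPY = 0b001, 0b010, 0b100

vp = IS | HAPPY   # discontinuous VP covering "is" and "happy" -> 0b101
np = MARY         # NP covering "Mary" -> 0b010

def disjoint(a, b):
    """Two spans may combine only if they cover no word twice."""
    return a & b == 0

def combine(a, b):
    """Union of two disjoint spans."""
    return a | b

assert disjoint(vp, np)
s = combine(vp, np)   # S covering the whole sentence -> 0b111
```

The "111", "010", etc. in the edges above correspond to these binary masks.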

So basically what I am wondering is whether it is worth pursuing my Python implementation any further, including doing things with Cython and different data structures, or whether writing it from the ground up in C++ is the only viable option.

UPDATE: After some improvements I no longer have issues with memory usage. I'm working on an optimized Cython version. I'll award the bounty to the most useful suggestion for increasing efficiency of the code. There is an annotated version at http://student.science.uva.nl/~acranenb/plcfrs_cython.html

1 https://github.com/andreasvc/disco-dop/
-- run test.py to parse some sentences. Requires Python 2.6, nltk, and heapdict

Comments (3)

扎心 2024-11-01 01:34:03


I figured that Python would store the first part only once if it would occur in combination with different positions

Not necessarily:

>>> ("S", "NP", "VP") is ("S", "NP", "VP")
False

You might want to intern all strings referring to non-terminals, since you seem to be creating a lot of these in rcgrules.py. If you want to intern a tuple, then turn it into a string first:

>>> intern("S NP VP") is intern(' '.join('S', 'NP', 'VP'))
True

Otherwise, you'll have to "copy" the tuples instead of constructing them afresh.
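A quick sketch of the interning behavior (in Python 3 the builtin moved to sys.intern; the Python 2.6 the question targets has a bare intern()):

```python
import sys

# Equal strings built at runtime are, in general, distinct objects;
# interning maps them to one shared object, so identity checks (and
# dict lookups, which try identity before equality) become cheaper.
suffix = "@1"
a = sys.intern("NP" + suffix)
b = sys.intern("NP" + suffix)
assert a is b  # one shared object

# There is no built-in interning for tuples; join to a string first:
key1 = sys.intern(" ".join(("S", "NP", "VP")))
key2 = sys.intern(" ".join(("S", "NP", "VP")))
assert key1 is key2
```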

(If you're new to C++, then rewriting such an algorithm in it is unlikely to provide much of a memory benefit. You'd have to evaluate various hash table implementations first and learn about the copying behavior in its containers. I've found boost::unordered_map to be quite wasteful with lots of small hashtables.)

莫多说 2024-11-01 01:34:03


Have you tried running your application with PyPy rather than CPython?

PyPy is a lot smarter than CPython about noticing commonalities and avoiding the memory overhead associated with duplicating things unnecessarily.

It's worth trying, anyway: http://pypy.org/

寻找我们的幸福 2024-11-01 01:34:03


The first thing to do in these cases is always to profile:

15147/297    0.032    0.000    0.041    0.000 tree.py:102(__eq__)
15400/200    0.031    0.000    0.106    0.001 tree.py:399(convert)
        1    0.023    0.023    0.129    0.129 plcfrs_cython.pyx:52(parse)
6701/1143    0.022    0.000    0.043    0.000 heapdict.py:45(_min_heapify)
    18212    0.017    0.000    0.023    0.000 plcfrs_cython.pyx:38(__richcmp__)
10975/10875    0.017    0.000    0.035    0.000 tree.py:75(__init__)
     5772    0.016    0.000    0.050    0.000 tree.py:665(__init__)
      960    0.016    0.000    0.025    0.000 plcfrs_cython.pyx:118(deduced_from)
    46938    0.014    0.000    0.014    0.000 tree.py:708(_get_node)
25220/2190    0.014    0.000    0.016    0.000 tree.py:231(subtrees)
    10975    0.013    0.000    0.023    0.000 tree.py:60(__new__)
    49441    0.013    0.000    0.013    0.000 {isinstance}
    16748    0.008    0.000    0.015    0.000 {hasattr}
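For reference, a listing like the one above can be produced with Python's built-in cProfile and pstats modules. A generic sketch (work() is a stand-in for the real entry point, e.g. a call into plcfrs_cython.parse):

```python
import cProfile
import io
import pstats

def work():
    # stand-in for the real entry point being profiled
    return sum(i * i for i in range(1000))

pr = cProfile.Profile()
pr.enable()
work()
pr.disable()

# Sort by internal time and show the top 10 entries, as in the listing above.
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("tottime").print_stats(10)
print(buf.getvalue())
```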

The first thing I noticed is that very few functions are from the Cython module itself.
Most of them come from the tree.py module, which may be the bottleneck.

Focusing on the Cython side, I see the __richcmp__ function. We can optimize it simply by adding the types of the values in the method declaration:

def __richcmp__(ChartItem self, ChartItem other, int op):
        ....

This brings the value down:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
....
18212    0.011    0.000    0.015    0.000 plcfrs_cython.pyx:38(__richcmp__)

Using elif instead of a series of independent if statements enables Cython's switch optimization:

    if op == 0: return self.label < other.label or self.vec < other.vec
    elif op == 1: return self.label <= other.label or self.vec <= other.vec
    elif op == 2: return self.label == other.label and self.vec == other.vec
    elif op == 3: return self.label != other.label or self.vec != other.vec
    elif op == 4: return self.label > other.label or self.vec > other.vec
    elif op == 5: return self.label >= other.label or self.vec >= other.vec

obtaining:

17963    0.002    0.000    0.002    0.000 plcfrs_cython.pyx:38(__richcmp__)

Trying to figure out where that tree.py:399 convert comes from, I found that this function inside dopg.py takes all that time:

def removeids(tree):
    """ remove unique IDs introduced by the Goodman reduction """
    result = Tree.convert(tree)
    for a in result.subtrees(lambda t: '@' in t.node):
        a.node = a.node.rsplit('@', 1)[0]
    if isinstance(tree, ImmutableTree): return result.freeze()
    return result

Now, I am not sure whether each node in the tree is a ChartItem, or whether the __getitem__ value is used somewhere else, but adding these changes:

cdef class ChartItem:
    cdef public str label
    cdef public str root
    cdef public long vec
    cdef int _hash
    __slots__ = ("label", "vec", "_hash")
    def __init__(ChartItem self, label, int vec):
        self.label = intern(label) #.rsplit('@', 1)[0])
        self.root = intern(label.rsplit('@', 1)[0])
        self.vec = vec
        self._hash = hash((self.label, self.vec))
    def __hash__(self):
        return self._hash
    def __richcmp__(ChartItem self, ChartItem other, int op):
        if op == 0: return self.label < other.label or self.vec < other.vec
        elif op == 1: return self.label <= other.label or self.vec <= other.vec
        elif op == 2: return self.label == other.label and self.vec == other.vec
        elif op == 3: return self.label != other.label or self.vec != other.vec
        elif op == 4: return self.label > other.label or self.vec > other.vec
        elif op == 5: return self.label >= other.label or self.vec >= other.vec
    def __getitem__(ChartItem self, int n):
        if n == 0: return self.root
        elif n == 1: return self.vec
    def __repr__(self):
        #would need bitlen for proper padding
        return "%s[%s]" % (self.label, bin(self.vec)[2:][::-1])

and inside of mostprobableparse:

from libc.math cimport pow

def mostprobableparse(...):
    ...
    cdef dict parsetrees = <dict>defaultdict(float)
    cdef float prob
    m = 0
    for n, (a, prob) in enumerate(derivations):
        parsetrees[a] += pow(e, prob)
        m += 1
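For context, this loop sums the probability mass of all derivations that yield the same parse tree. In plain Python, with made-up trees and log probabilities, the idea looks like:

```python
from collections import defaultdict
from math import exp

# (tree, log probability) pairs for a handful of derivations;
# the first two derivations happen to produce the same tree.
derivations = [
    ("(S (NP Mary) (VP is happy))", -1.0),
    ("(S (NP Mary) (VP is happy))", -2.0),
    ("(S (NP Mary) (VP happy is))", -3.0),
]

parsetrees = defaultdict(float)
for tree, logprob in derivations:
    parsetrees[tree] += exp(logprob)  # back from log space, then sum

# The most probable parse is the tree with the largest total mass.
best = max(parsetrees, key=parsetrees.get)
```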

I get:

         189345 function calls (173785 primitive calls) in 0.162 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
6701/1143    0.025    0.000    0.037    0.000 heapdict.py:45(_min_heapify)
        1    0.023    0.023    0.120    0.120 plcfrs_cython.pyx:54(parse)
      960    0.018    0.000    0.030    0.000 plcfrs_cython.pyx:122(deduced_from)
 5190/198    0.011    0.000    0.015    0.000 tree.py:102(__eq__)
     6619    0.006    0.000    0.006    0.000 heapdict.py:67(_swap)
     9678    0.006    0.000    0.008    0.000 plcfrs_cython.pyx:137(concat)

So the next steps are to optimize _min_heapify and deduced_from.

deduced_from can be optimized a bit more:

cdef inline deduced_from(ChartItem Ih, double x, pyCx, pyunary, pylbinary, pyrbinary, int bitlen):
    cdef str I = Ih.label
    cdef int Ir = Ih.vec
    cdef list result = []
    cdef dict Cx = <dict>pyCx
    cdef dict unary = <dict>pyunary
    cdef dict lbinary = <dict>pylbinary
    cdef dict rbinary = <dict>pyrbinary
    cdef ChartItem Ilh
    cdef double z
    cdef double y
    cdef ChartItem I1h
    for rule, z in unary[I]:
        result.append((ChartItem(rule[0][0], Ir), ((x+z,z), (Ih,))))
    for rule, z in lbinary[I]:
        for I1h, y in Cx[rule[0][2]].items():
            if concat(rule[1], Ir, I1h.vec, bitlen):
                result.append((ChartItem(rule[0][0], Ir ^ I1h.vec), ((x+y+z, z), (Ih, I1h))))
    for rule, z in rbinary[I]:
        for I1h, y in Cx[rule[0][1]].items():
            if concat(rule[1], I1h.vec, Ir, bitlen):
                result.append((ChartItem(rule[0][0], I1h.vec ^ Ir), ((x+y+z, z), (I1h, Ih))))
    return result

I will stop here, although I am confident that we can keep optimizing as more insight into the problem is acquired.

A suite of unit tests would be useful to assert that each optimization doesn't introduce any subtle errors.
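A minimal shape such tests could take (a sketch with hypothetical stand-in functions; the reference would be the original, unoptimized parser):

```python
import unittest

def reference_parse(sentence):
    # stand-in for the original, unoptimized implementation
    return sorted(sentence.split())

def optimized_parse(sentence):
    # stand-in for the optimized Cython version under test
    return sorted(sentence.split())

class TestOptimizations(unittest.TestCase):
    def test_results_unchanged(self):
        # every optimization must leave the parser's output intact
        for sent in ["is Mary happy", "Mary is happy"]:
            self.assertEqual(reference_parse(sent), optimized_parse(sent))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestOptimizations)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```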

As a side note, try to use spaces instead of tabs.
