PLY：快速解析长列表？

发布于 2024-11-16 11:23:24 字数 630 浏览 5 评论 0原文

我正在 PLY 中使用一个相当简单的解析器，我的规则之一采用以下形式：

def p_things(p):
    '''
    things : thing things
    things : thing
    '''
    p[0] = [p[1]]
    if len(p) == 3:
        p[0] += p[2]

输入文件通常是简单的事物列表，因此解析本身并不复杂。然而，我的一些输入文件非常大（经常超过 100,000 行，在极端情况下超过 1,000,000 行）。在分析中（通过 cProfile 和 pstats），大部分运行时间被重复占用调用 p_things - 大概是对 things 列表中的每一项进行一次调用。

有没有办法减少这个时间，或者有更有效的方法来构建这个规则？到目前为止，我看到的大多数答案（以及我找到的规范编译器信息）都将此方法列为构造可解析项目列表的普遍接受的方法，无论长度如何。

原文

I'm working with a fairly simple parser in PLY, and one of my rules takes on the following form:

def p_things(p):
    '''
    things : thing things
    things : thing
    '''
    p[0] = [p[1]]
    if len(p) == 3:
        p[0] += p[2]

Input files are generally simple lists of things, and so the parsing itself is not complex. Some of my input files are very large, however (exceeding 100,000 lines fairly regularly, and over 1,000,000 in extreme cases). In profiling (via cProfile and pstats), the bulk of the runtime is taken up by repeated calls to p_things - presumably, one call for each item in a things list.

Is there a way to reduce this time, or a more efficient way to structure this rule? Most answers I've seen so far (and the canonical compilers info I've found) have listed this method as the generally accepted way to construct a list of parseable items, no matter the length.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

萌逼全场 2024-11-23 11:23:24

事实证明我忘记了一些基本的编译器理论。 PLY 是一个 LALR(1) 解析器，因此最好将规则编写为：

def p_things(p):
    '''
    things : things thing
    things : thing
    '''
    if len(p) == 2:
        p[0] = [p[1]]
    else:
        p[0] = p[1]
        p[0].append(p[2])

虽然它可能看起来更冗长，但实际上有一个显着的改进 - 在 PLY 或 Python 中的某个地方，解析器能够在左侧应用一些优化- 递归形式。我发现在较大的输入文件上，性能从指数下降到线性；一个示例的 things 列表中包含超过一百万个项目，其运行时间不到 20%。

Turns out I'm forgetting some of my basic compilers theory. PLY is a LALR(1) parser, and so it's better to write the rule as:

def p_things(p):
    '''
    things : things thing
    things : thing
    '''
    if len(p) == 2:
        p[0] = [p[1]]
    else:
        p[0] = p[1]
        p[0].append(p[2])

Though it may look more verbose, there's actually a significant improvement - somewhere in either PLY or Python, the parser was able to apply some optimization on the left-recursive form. I've seen performance drop from exponential to linear on my larger input files; one sample, with over a million items in the things list, ran in under 20% of the time.

回复收藏 0 原文

挽清梦 2024-11-23 11:23:24

在不更改代码的情况下，您可以尝试使用“PyPy”版本的 python 进行即时编译——它可能比普通 CPython 更快地运行您的代码。

回复收藏 0 原文

寄与心 2024-11-23 11:23:24

总结一下：

性能问题与解析器本身无关（而是与 += 的使用有关）
通过使 RHS 明确，使 LALR(1) 解析器的事情变得更容易。
如果可能的话，最好通过将规则拆分来避免规则中不必要的选择语句 (if)。

为了更好地理解Ioannis Filippidis的评论，更容易实际可视化它。我认为这就是我的意思，也是我最终得到的结果。

def p_things_iter(p):
    '''
    things_iter : things thing
    '''
        p[0] = p[1]
        p[0].append(p[2])

def p_things_end(p):
    '''
    things_iter : thing
    '''
    p[0] = [p[1]]

def p_things(p):
    '''
    things : things_iter
    things : things_end
    '''
    p[0] = p[1]

So to summarize:

The performance issue was not related to the parser per se (but the usage of +=)
Make things easier for LALR(1) parsers by making RHSs unambiguous.
It's better to avoid unnecessary selection statements (ifs) within a rule by splitting it up if possible.

To better understand Ioannis Filippidis' comment it's easier to actually visualize it. This is what I think was meant and approximately what I ended up with as well.

def p_things_iter(p):
    '''
    things_iter : things thing
    '''
        p[0] = p[1]
        p[0].append(p[2])

def p_things_end(p):
    '''
    things_iter : thing
    '''
    p[0] = [p[1]]

def p_things(p):
    '''
    things : things_iter
    things : things_end
    '''
    p[0] = p[1]

回复收藏 0 原文

~没有更多了~