What next after pyparsing?
I have a huge grammar developed for pyparsing as part of a large, pure Python application.
I have reached the limit of performance tweaking and I'm at the point where the diminishing returns make me start to look elsewhere. Yes, I think I know most of the tips and tricks and I've profiled my grammar and my application to dust.
What next?
I hope to find a parser that gives me the same readability and usability (I'm using many advanced features of pyparsing, such as parse actions to kick off post-processing of the input as it is being parsed) and Python integration, but at 10× the performance.
I love the fact that the grammar is pure Python.
All my basic blocks are regular expressions, so reusing them would be nice.
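To make that concrete, here is a stripped-down sketch of the kind of thing I'm doing today (the grammar itself is hypothetical, but the pattern of regex building blocks plus parse actions is the one I rely on):

```python
from pyparsing import Regex, Suppress, delimitedList

# Basic building blocks are plain regular expressions.
integer = Regex(r"[+-]?\d+")
ident = Regex(r"[A-Za-z_]\w*")

# A parse action starts post-processing while the input is being parsed.
integer.setParseAction(lambda toks: int(toks[0]))

# e.g. "total = 1, 2, 3"
assignment = ident("lhs") + Suppress("=") + delimitedList(integer)("vals")

result = assignment.parseString("total = 1, 2, 3")
print(result["lhs"], list(result["vals"]))  # total [1, 2, 3]
```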
I know I can't have everything so I am willing to give up on some of the features I have today to get to the requested 10× performance.
Where do I go from here?
5 Answers
It looks like the pyparsing folks have anticipated your problem: https://github.com/pyparsing/pyparsing/blob/master/docs/HowToUsePyparsing.rst suggested, at the time, using the psyco package to speed up pyparsing.

However, as Vangel noted in the comments below, psyco is an obsolete project as of March 2012. Its successor is the PyPy project, which starts from the same basic approach to performance: use a JIT native-code compiler instead of a bytecode interpreter. You should be able to achieve similar or greater gains with PyPy if switching Python implementations will work for you.

If you're really a speed demon, but want to keep some of the legibility and declarative syntax, I'd suggest having a look at ANTLR. Probably not the Python-generating backend; I'm skeptical whether that's mature or high-performance enough for your needs. I'm talking about the goods: the C backend that started it all.
Wrap a Python C extension module around the entry point to the parser, and turn it loose.
Having said that, you'll be giving up a lot in this transition: basically any Python you want to do in your parser will have to be done through the C API (not altogether pretty). Also, you'll have to get used to very different ways of doing things. ANTLR has its charms, but it's not based on combinators, so there's not the easy and fluid relationship between your grammar and your language that there is in pyparsing. Plus, it's its own DSL, much like lex/yacc, which can present a learning curve – but, because it's LL based, you'll probably find it easier to adapt to your needs.
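If hand-writing C API glue sounds unappealing, a thinner option is to call the compiled parser through ctypes instead of a full extension module. A minimal sketch, assuming a shared library with the entry point shown in the comments (both names are hypothetical):

```python
import ctypes

# Hypothetical: a parser generated by ANTLR's C backend, compiled to a shared
# library exposing: int parse_input(const char *text, char **ast_out);
lib = ctypes.CDLL("./libmyparser.so")
lib.parse_input.argtypes = [ctypes.c_char_p, ctypes.POINTER(ctypes.c_char_p)]
lib.parse_input.restype = ctypes.c_int

def parse(text):
    out = ctypes.c_char_p()
    if lib.parse_input(text.encode("utf-8"), ctypes.byref(out)) != 0:
        raise ValueError("parse failed")
    # e.g. a serialized AST, to be post-processed back in Python
    return out.value.decode("utf-8")
```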
Switch to a generated C/C++ parser (using ANTLR, flex/bison, etc.). If you can delay all the action rules until after you are done parsing, you might be able to build an AST with trivial code and then pass that back to your Python code via something like SWIG and process it with your current action rules. OTOH, for that to give you a speed boost, the parsing has to be the heavy lifting. If your action rules are the big cost, then this will buy you nothing unless you write your action rules in C as well (but you might have to do that anyway to avoid paying for whatever impedance mismatch you get between the Python and C code).
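As a sketch of what "delay the action rules" can look like on the Python side, suppose the generated parser (wrapped via SWIG or similar; the module and AST shape here are hypothetical) hands back a plain nested-tuple AST. Your current action rules then become a post-order walk:

```python
# Assumed AST shape: (node_type, value, [children]) -- hypothetical.
def run_actions(node, actions):
    node_type, value, children = node
    for child in children:
        run_actions(child, actions)      # post-order: children first
    if node_type in actions:
        actions[node_type](value)        # fire the deferred action rule

actions = {
    "number": lambda v: print("saw number", v),
    "ident":  lambda v: print("saw ident", v),
}

# ast = fast_parser.parse(text)  # hypothetical SWIG-wrapped C/C++ parser
ast = ("expr", None, [("number", "42", []), ("ident", "x", [])])
run_actions(ast, actions)
```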
If you really want performance for large grammars, look no farther than SimpleParse (which itself relies on mxTextTools, a C extension). However, know now that it comes at the cost of being more cryptic and requiring that you be well-versed in EBNF.
It's definitely not the more Pythonic route, and you're going to have to start all over with an EBNF grammar to use SimpleParse.
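For a taste of what that looks like, here is a minimal sketch (a toy grammar for whitespace-separated integers; SimpleParse's Parser takes the EBNF declaration plus the name of the root production):

```python
from simpleparse.parser import Parser

# SimpleParse EBNF: a toy grammar for whitespace-separated integers.
declaration = r'''
ints := (int, ws?)+
int  := [0-9]+
ws   := [ \t]+
'''

parser = Parser(declaration, "ints")
success, children, next_char = parser.parse("10 20 30")
print(success, next_char)  # 1 8 -- parsed the whole 8-character input
```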
A bit late to the party, but PLY (Python Lex-Yacc) has served me very well. PLY gives you a pure Python framework for constructing lex-based tokenizers and yacc-based LR parsers.
I went this route when I hit performance issues with pyparsing.
Here is a somewhat old but still interesting article on Python parsing which includes benchmarks for ANTLR, PLY and pyparsing. PLY is roughly 4 times faster than pyparsing in this test.
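A minimal sketch of the shape a PLY grammar takes (a toy additive-expression grammar, not your grammar, but it shows the lex/yacc split):

```python
import ply.lex as lex
import ply.yacc as yacc

# --- lex-based tokenizer ---
tokens = ("NUMBER", "PLUS")

t_PLUS = r"\+"
t_ignore = " \t"

def t_NUMBER(t):
    r"\d+"
    t.value = int(t.value)
    return t

def t_error(t):
    raise SyntaxError("illegal character %r" % t.value[0])

lexer = lex.lex()

# --- yacc-based LR parser ---
def p_expr_plus(p):
    "expr : expr PLUS NUMBER"
    p[0] = p[1] + p[3]

def p_expr_number(p):
    "expr : NUMBER"
    p[0] = p[1]

def p_error(p):
    raise SyntaxError("syntax error at %r" % (p,))

parser = yacc.yacc()
print(parser.parse("1 + 2 + 3"))  # 6
```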
There's no way to know what kind of benefit you'll get without just testing it, but it's within the range of possibility that you could get 10x benefit just from using Unladen Swallow if your process is long-running and repetitive. (Also, if you have many things to parse and you typically start a new interpreter for each one, Unladen Swallow gets faster - to a point - the longer you run your process, so while parsing one input might not show much gain, you might get significant gains on the 2nd and 3rd inputs in the same process).
(Note: pull the latest out of SVN - you'll get far better performance than the latest tarball)
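The process-reuse point applies regardless of which JIT you use: keep one long-lived process and push many inputs through the same parser instance. A trivial sketch of the pattern (the grammar here is a stand-in):

```python
import pyparsing as pp

# Stand-in grammar; the point is the long-lived process, not the grammar.
grammar = pp.delimitedList(pp.Word(pp.nums))

inputs = ["1,2,3", "4,5", "6,7,8,9"]  # imagine many real inputs

# One process, many inputs: later inputs benefit once the JIT has warmed up.
for text in inputs:
    print(grammar.parseString(text).asList())
```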