Writing a fast parser in Python

Posted 2024-08-30 20:37:30 · 583 words · 0 views · 0 comments

I've written a hand-rolled, recursive, pure-Python parser for a file format (ARFF) that we use in one of my lectures. Running my exercise submission is now awfully slow, and it turns out that by far the most time is spent in my parser. It consumes a lot of CPU time; the hard disk is not the bottleneck.

What are the performant ways to write a parser in Python? I'd rather not rewrite it in C. I tried Jython, but that decreased performance a lot! Some of the files I parse are huge (> 150 MB) and have very long lines.

My current parser only needs a look-ahead of one character. I'd post the source here, but I don't know if that's such a good idea, since the submission deadline has not passed yet. Then again, the parser is not the focus of this exercise: you can choose whatever language you want, and a parser for Java already exists.

Note: I have an x86_64 system, so psyco (and, it seems, PyPy as well) is not an option.

Update: I have now uploaded my parser/writer to Bitbucket.

Comments (2)

远昼 2024-09-06 20:37:30

The most general tip I'd give without further information would be to read the entire file, or at least a substantial section of it, into memory at once. You don't want to be reading it one character at a time and seeking here and there; regardless of the buffering that's going on under the hood it's probably a good idea just to have the whole thing in memory so you can operate on it however you want.
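As a rough illustration of that tip, here is a minimal sketch that reads the file with a single read() call and then walks the in-memory string by index. The helper names load_text and iter_lines are hypothetical, and the UTF-8 encoding is an assumption about the input, not something taken from the question.

```python
def load_text(path, encoding="utf-8"):
    """Read the whole file into one string with a single read() call.

    The encoding default is an assumption; adjust it to the real files.
    """
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


def iter_lines(text):
    """Yield the lines of an in-memory string by slicing between newlines.

    str.find does the scanning in C, so no Python-level work happens
    per character.
    """
    start = 0
    while True:
        end = text.find("\n", start)
        if end == -1:
            # Last line without a trailing newline, if any.
            if start < len(text):
                yield text[start:]
            return
        yield text[start:end]
        start = end + 1
```

A 150 MB file fits comfortably in memory on most current machines; if it does not, reading large blocks (several megabytes at a time) is the usual fallback.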

I have written parsers in Python, and there is no inherent reason for them to be much slower than a parser written in any other language. As usual with these things, it's more likely that you're doing work you don't need to do. For example, creating, destroying, and recreating the same object is more costly than keeping it around, and recomputing a value over and over is more costly than storing it once.
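One way to picture "store it instead of recreating it" is the sketch below, which builds a regular expression once at import time and reuses it for every field. The names _NUMBER, convert, and convert_row are hypothetical, and the number pattern is a simplified assumption rather than the ARFF grammar. Python's re module keeps its own cache of compiled patterns, so the gain from this particular change may be modest; the same idea applies to any object that is expensive to rebuild.

```python
import re

# Compiled once when the module is imported, then reused for every field.
# The pattern is a simplified assumption: optional sign, digits, optional fraction.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def convert(field):
    """Return a float for numeric-looking fields, the original string otherwise."""
    return float(field) if _NUMBER.fullmatch(field) else field


def convert_row(fields):
    """Convert every field of one already-tokenized row."""
    return [convert(field) for field in fields]
```

For example, convert_row(["5.1", "3", "Iris-setosa"]) returns [5.1, 3.0, "Iris-setosa"].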

In Python specifically, one trap people fall into is doing a lot of unnecessary string manipulation. Don't append to strings one character at a time; when you're building up your tokens, work on the "master" string and cut out each token in one fell swoop. (In other words, index into the "master" string, figure out the start and end points, and then grab the token with token = master[start:end].) Concatenating strings one character at a time is a short path to performance misery. Even if you want or need, for some reason, to do for c in master: newstr += c, I suspect you'll have better luck appending each c to a list and then calling newstr = ''.join(newstr_charlist).
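The two suggestions in that paragraph can be sketched as follows. The function names tokenize and strip_char are hypothetical, and the comma delimiter is an assumption chosen for illustration; a real ARFF tokenizer also has to handle quoting, which this sketch leaves out.

```python
def tokenize(master, delimiter=","):
    """Cut tokens out of `master` with one slice per token.

    Only the start and end indices are tracked while scanning; no
    intermediate string is built character by character.
    """
    tokens = []
    start = 0
    while True:
        end = master.find(delimiter, start)
        if end == -1:
            tokens.append(master[start:])
            return tokens
        tokens.append(master[start:end])
        start = end + 1


def strip_char(master, unwanted):
    """Copy `master` without `unwanted`, collecting characters in a list.

    The pieces are joined once at the end instead of using += on a
    string inside the loop.
    """
    chars = []
    for c in master:
        if c != unwanted:
            chars.append(c)
    return "".join(chars)
```

For plain delimiter splitting like this, master.split(",") gives the same result and is faster still; the index-and-slice pattern matters once the scanning logic is too irregular for the built-in methods.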

为你鎻心 2024-09-06 20:37:30

You could use ANTLR or pyparsing; they might speed up your parsing process.

And if you want to keep your current code, you might want to look at Cython or PyPy, which can improve your performance (sometimes by up to 4x).
