Are recursive descent parsers simple?
I'm writing a parser for a templating language which compiles into JS (if that's relevant). I started out with a few simple regexes, which seemed to work, but regexes are very fragile, so I decided to write a parser instead. I started by writing a simple parser that remembered state by pushing/popping off of a stack, but things kept escalating until I had a recursive descent parser on my hands.
Soon after, I compared the performance of all my previous parsing methods. The recursive descent parser was by far the slowest. I'm stuck: Is it worth using a recursive descent parser for something simple, or am I justified in taking shortcuts? I would love to go the pure regex route, which is insanely fast (almost 3 times faster than the RD parser), but is very hacky and unmaintainable to a degree. I suppose performance isn't terribly important because compiled templates are cached, but is a recursive descent parser the right tool for every task? I guess my question could be viewed as more of a philosophical one: to what degree is it worth sacrificing maintainability/flexibility for performance?
4 Answers
Recursive descent parsers can be extremely fast.
These are usually organized with a lexer, which uses regular expressions to recognize the language tokens that are fed to the parser. Most of the work in processing the source text is done character by character by the lexer, using the insanely fast FSAs that REs are often compiled into.
The parser only sees tokens occasionally compared to the rate at which the lexer sees characters, so its speed often doesn't matter. However, when comparing parser-to-parser speeds (ignoring the time required to lex the tokens), recursive descent parsers can be very fast, because they implement the parser stack with function calls, which are already very efficient compared to a general parser pushing its current state onto a simulated stack.
So, you can have your cake and eat it, too. Use regexes for the lexemes. Use the parser (any kind; recursive descent is just fine) to process the lexemes. You should be pleased with the performance.
This approach also satisfies the observation made in other answers: write it in a way that makes it maintainable. Lexer/parser separation does this very nicely, I assure you.
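A minimal sketch of that lexer/parser split, assuming a hypothetical {{ name }} / {{#if}} template syntax (the grammar, token names, and node shapes below are all invented for illustration):

```typescript
// Regex lexer + recursive descent parser for a made-up {{ name }} / {{#if}} syntax.

type Token =
  | { kind: "text"; value: string }
  | { kind: "var"; name: string }
  | { kind: "ifOpen"; cond: string }
  | { kind: "ifClose" };

// Lexer: the regexes recognize the lexemes; the parser never touches raw characters.
function tokenize(src: string): Token[] {
  const re = /\{\{#if\s+([\w.]+)\s*\}\}|\{\{\/if\}\}|\{\{\s*([\w.]+)\s*\}\}|([^{]+|\{)/g;
  const tokens: Token[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(src)) !== null) {
    if (m[1] !== undefined) tokens.push({ kind: "ifOpen", cond: m[1] });
    else if (m[0] === "{{/if}}") tokens.push({ kind: "ifClose" });
    else if (m[2] !== undefined) tokens.push({ kind: "var", name: m[2] });
    else tokens.push({ kind: "text", value: m[3] });
  }
  return tokens;
}

type Node =
  | { type: "Text"; value: string }
  | { type: "Var"; name: string }
  | { type: "If"; cond: string; body: Node[] };

// Recursive descent: one function per rule; the JS call stack *is* the parser stack.
function parse(tokens: Token[]): Node[] {
  let pos = 0;

  function parseNodes(insideIf: boolean): Node[] {
    const nodes: Node[] = [];
    while (pos < tokens.length) {
      const t = tokens[pos];
      if (t.kind === "ifClose") {
        if (!insideIf) throw new Error("unexpected {{/if}}");
        pos++;
        return nodes;
      }
      pos++;
      if (t.kind === "text") nodes.push({ type: "Text", value: t.value });
      else if (t.kind === "var") nodes.push({ type: "Var", name: t.name });
      else nodes.push({ type: "If", cond: t.cond, body: parseNodes(true) }); // recurse for the block body
    }
    if (insideIf) throw new Error("missing {{/if}}");
    return nodes;
  }

  return parseNodes(false);
}

console.log(JSON.stringify(parse(tokenize("Hi {{ user }}{{#if vip}} (VIP){{/if}}"))));
```

The regex does all the character-level work; the parse functions only ever look at whole tokens, and nesting is handled by ordinary recursion.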
Readability first, performance later...
So if your parser makes the code more readable, then it is the right tool.
I think it's very important to write clear, maintainable code as a first priority. Until your code not only indicates that it is a bottleneck but your application's performance actually suffers from it, you should always consider clear code to be the best code.
It's also important not to reinvent the wheel. The comment about taking a look at another parser is a very good one. Common solutions for writing routines like this can often be found.
Recursion is very elegant when applied to something applicable. In my own experience, slow code due to recursion is the exception, not the norm.
A Recursive Descent Parser should be faster
...or you're doing something wrong.
First off, your code should be broken into two distinct steps: lexer + parser.
Some reference examples online will first tokenize the entire input into a large intermediate data structure, then pass that along to the parser. While that's good for demonstration, don't do it; it doubles the time and memory complexity. Instead, as soon as a match is determined by the lexer, notify the parser of either a state transition or a state transition plus its data.
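A rough sketch of that push-style arrangement, assuming a hypothetical {{ name }} syntax; the TokenSink interface and every identifier below are invented for illustration:

```typescript
// Push-style lexer: it notifies the consumer as soon as each match is determined,
// instead of building a token array first.

interface TokenSink {
  feed(kind: "text" | "var", data: string): void;
  end(): void;
}

function lex(src: string, sink: TokenSink): void {
  const re = /\{\{\s*([\w.]+)\s*\}\}|([^{]+|\{)/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(src)) !== null) {
    // No intermediate token list: hand each match straight to the consumer.
    if (m[1] !== undefined) sink.feed("var", m[1]);
    else sink.feed("text", m[2]);
  }
  sink.end();
}

// A trivial consumer that compiles straight to a JS function body as it is fed.
class TemplateCompiler implements TokenSink {
  code = "let out = '';\n";
  feed(kind: "text" | "var", data: string): void {
    this.code +=
      kind === "var"
        ? `out += ctx[${JSON.stringify(data)}];\n`
        : `out += ${JSON.stringify(data)};\n`;
  }
  end(): void {
    this.code += "return out;";
  }
}

const compiler = new TemplateCompiler();
lex("Hello, {{ name }}!", compiler);
const render = new Function("ctx", compiler.code);
console.log(render({ name: "world" })); // "Hello, world!"
```

Because each match is handed off immediately, nothing proportional to the input accumulates besides the compiled output itself.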
As for the lexer: this is probably where you'll find your current bottleneck. If the lexer is cleanly separated from your parser, you can try swapping between Regex and non-Regex implementations to compare performance.
Regex isn't, by any means, faster than reading raw strings. It just avoids some common mistakes by default, specifically the unnecessary creation of string objects. Ideally, your lexer should scan your code and produce its output with zero intermediate data beyond the bare minimum required to track state within your parser; memory-wise, that means holding little more than the input text and the parser's current state.
For instance, if your current lexer matches a non-terminal and copies every char over one by one until it reaches the next terminal, you're essentially recreating that string for every letter matched. Keep in mind that string data types are immutable; concat will always create a new string. You should be scanning the text using pointer arithmetic or some equivalent.
To fix this problem, you need to scan from the startPos of the non-terminal to the end of the non-terminal and copy only when a match is complete.
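A small sketch of the difference, assuming a hypothetical syntax whose tags start with {{; both function names are invented for illustration:

```typescript
// Copy-once scanning: track the start index and slice a single time when the
// match is complete, instead of concatenating character by character.

// Anti-pattern: builds a new string on every character appended.
function readTextSlow(src: string, pos: number): [string, number] {
  let text = "";
  while (pos < src.length && src[pos] !== "{") {
    text += src[pos]; // each += creates a fresh string (strings are immutable)
    pos++;
  }
  return [text, pos];
}

// Better: remember startPos, advance an index, and copy exactly once.
function readTextFast(src: string, pos: number): [string, number] {
  const startPos = pos;
  while (pos < src.length && src[pos] !== "{") pos++;
  return [src.slice(startPos, pos), pos]; // one copy per lexeme
}

console.log(readTextSlow("Hello, {{ name }}", 0)); // [ "Hello, ", 7 ], with per-char allocation
console.log(readTextFast("Hello, {{ name }}", 0)); // [ "Hello, ", 7 ], with a single slice
```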
Regex supports all of this by default out of the box, which is why it's a preferred tool for writing lexers. Instead of trying to write a Regex that parses your entire grammar, write one that only focuses on matching terminals and non-terminals as capture groups. Skip tokenization, and pass the results directly into your parser/state machine.
The key here is: don't try to use Regex as a state machine. At best it will only work for Regular (i.e. Chomsky Type III, no stack) declarative syntaxes -- hence the name Regular Expression. For example, HTML is a Context-Free (i.e. Chomsky Type II, stack based) declarative syntax, which is why a Regex alone is never enough to parse it. Your grammar, and generally all templating syntaxes, fall into this category. You've clearly hit the limit of Regex already, so you're on the right track.
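One way to sketch that division of labor, assuming the same hypothetical {{ name }} syntax: the sticky-flag regexes below match only the terminals, while the surrounding code owns the control flow (collecting the results into an array here is only for display; all names are invented):

```typescript
// Sticky ("y") regexes match only the terminals; the explicit states decide
// which pattern is allowed to appear where.

const TEXT = /[^{]+|\{(?!\{)/y; // raw text, up to (but not including) a "{{"
const OPEN = /\{\{\s*/y;        // terminal: "{{"
const IDENT = /[\w.]+/y;        // identifier inside a tag
const CLOSE = /\s*\}\}/y;       // terminal: "}}"

function scan(src: string): Array<["text" | "var", string]> {
  const out: Array<["text" | "var", string]> = [];
  let pos = 0;
  while (pos < src.length) {
    OPEN.lastIndex = pos;
    if (OPEN.exec(src)) {
      // State: inside a tag; only IDENT followed by CLOSE is legal here.
      IDENT.lastIndex = OPEN.lastIndex;
      const id = IDENT.exec(src);
      if (!id) throw new Error(`expected identifier at ${OPEN.lastIndex}`);
      CLOSE.lastIndex = IDENT.lastIndex;
      if (!CLOSE.exec(src)) throw new Error(`expected "}}" at ${IDENT.lastIndex}`);
      out.push(["var", id[0]]);
      pos = CLOSE.lastIndex;
    } else {
      // State: plain text.
      TEXT.lastIndex = pos;
      const text = TEXT.exec(src)!; // always matches here if OPEN did not
      out.push(["text", text[0]]);
      pos = TEXT.lastIndex;
    }
  }
  return out;
}

console.log(scan("Hello, {{ name }}!"));
// [ [ "text", "Hello, " ], [ "var", "name" ], [ "text", "!" ] ]
```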
Use Regex for tokenization only. If you're really concerned with performance, rewrite your lexer to eliminate any and all unnecessary string copying and/or intermediate data. See if you can outperform the Regex version.
The key point being: the Regex version is easier to understand and maintain, whereas your hand-rolled lexer will likely be just a tinge faster if written correctly. Conventional wisdom says do yourself a favor and prefer the former. In terms of Big-O complexity, there shouldn't be any difference between the two; they're two forms of the same thing.