Should I use regular expressions when writing a tokenizer?
I've written a small and simple tokenizer, but without the use of regular expressions.
It starts at the first index, iterates through every character until the end, and creates the required tokens.
I showed it to a colleague, who said it would have been much simpler to do /that/ with a regex, without going into any depth.
So should I rewrite it and expect it to be "better"?
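For concreteness, here is a minimal sketch of the hand-rolled approach the question describes: walk the input one character at a time and emit tokens. The token grammar (integers, identifiers, a few single-character operators) is my assumption for illustration; the original post does not specify one.

```python
# A minimal character-by-character tokenizer, sketching the approach
# described above. The token set is an illustrative assumption.

def tokenize(source: str):
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():                         # skip whitespace
            i += 1
        elif ch.isdigit():                       # integer literal
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1
            tokens.append(("NUMBER", source[start:i]))
        elif ch.isalpha() or ch == "_":          # identifier
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            tokens.append(("IDENT", source[start:i]))
        elif ch in "+-*/=();":                   # single-character operator
            tokens.append(("OP", ch))
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at index {i}")
    return tokens

print(tokenize("total = price * 3"))
# [('IDENT', 'total'), ('OP', '='), ('IDENT', 'price'), ('OP', '*'), ('NUMBER', '3')]
```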
Comments (4)
Depends very much on the language being parsed and on your definition of "better".
I don't think so. A regex engine has to be very feature-rich, and because of that your program may actually run slower.
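If performance is the concern, it is easy to measure rather than guess. Below is a rough, self-contained harness; the workload (splitting off runs of digits) and the input are my assumptions for demonstration. Actual numbers depend on your grammar, input, and regex engine, so treat this only as a way to test the claim.

```python
import re
import timeit

# Time a hand-rolled scan against a compiled regex on the same task.
# The workload and input are illustrative assumptions.

SOURCE = "abc123 def456 ghi789 " * 1000

def scan_numbers_loop(s):
    out, i = [], 0
    while i < len(s):
        if s[i].isdigit():
            start = i
            while i < len(s) and s[i].isdigit():
                i += 1
            out.append(s[start:i])
        else:
            i += 1
    return out

NUMBER_RE = re.compile(r"\d+")

def scan_numbers_regex(s):
    return NUMBER_RE.findall(s)

# Both approaches must agree before timing them.
assert scan_numbers_loop(SOURCE) == scan_numbers_regex(SOURCE)

print("loop :", timeit.timeit(lambda: scan_numbers_loop(SOURCE), number=100))
print("regex:", timeit.timeit(lambda: scan_numbers_regex(SOURCE), number=100))
```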
Usually tokenization can be done with a finite state machine, which is equivalent in power to canonical regular expressions. If you write sane regexps, they will be much easier to read and maintain than a home-brewed FSA. Use tools like flex or jflex; they compile the regexps into minimal FSAs, giving very good performance. Doing this manually should only be done as an exercise.
Lexer generators exist in several implementations, quite possibly for your favourite language.
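To illustrate that style, here is the toy tokenizer from the question's sketch rewritten as a single alternation of named groups, which is roughly the shape of the rules you would feed to flex or jflex; here Python's re module does the scanning instead of a generated FSA. The token set is the same illustrative assumption as before.

```python
import re

# A regex-based tokenizer: one alternation of named groups, scanned
# left to right. Tools like flex/jflex compile rules of this shape
# into a minimal FSA; here Python's re module does the matching.

TOKEN_RE = re.compile(r"""
    (?P<NUMBER> \d+          )    # integer literal
  | (?P<IDENT>  [A-Za-z_]\w* )    # identifier
  | (?P<OP>     [+\-*/=();]  )    # single-character operator
  | (?P<SKIP>   \s+          )    # whitespace: matched but discarded
  | (?P<ERROR>  .            )    # anything else is an error
""", re.VERBOSE)

def tokenize(source: str):
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        text = match.group()
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise ValueError(f"unexpected character {text!r} at index {match.start()}")
        tokens.append((kind, text))
    return tokens

print(tokenize("total = price * 3"))
# [('IDENT', 'total'), ('OP', '='), ('IDENT', 'price'), ('OP', '*'), ('NUMBER', '3')]
```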
Two questions you should ask:
a) If something should change, which one would be the easiest to maintain?
b) If it is working and you don't expect any change, do you really want to spend more time on it?
I'm sure the performance difference is small enough to ignore. The programming experience, and minimizing potential bugs, are the most important issues.