用于将正则表达式转换为 NFA 的库?
是否有一个好的库可以将正则表达式转换为NFA? 我看到了很多关于这个主题的学术论文,它们很有帮助,但在工作代码方面却没有太多帮助。
我的问题部分是出于好奇,部分是由于在我正在开发的生产系统上加速正则表达式匹配的实际需要。 尽管为了学习而探索这个主题可能很有趣,但我不确定这是否是加速模式匹配的“实用”解决方案。 我们是一家 Java 商店,但很乐意接受任何语言的优秀代码的指导。
编辑:
有趣的是,我不知道Java的正则表达式已经是NFA了。 本文的标题让我相信事实并非如此。 顺便说一句,我们目前正在 Postgres 中进行正则表达式匹配; 如果简单的解决方案是将匹配移至 Java 代码中那就太好了。
Is there a good library for converting Regular Expressions into NFAs? I see lots of academic papers on the subject, which are helpful, but not much in the way of working code.
My question is due partially to curiosity, and partially to an actual need to speed up regular expression matching on a production system I'm working on. Although it might be fun to explore this subject for learning's sake, I'm not sure it's a "practical" solution to speeding up our pattern matching. We're a Java shop, but would happily take pointers to good code in any language.
Edit:
Interesting, I did not know that Java's regexps were already NFAs. The title of this paper lead me to believe otherwise. Incidentally, we are currently doing our regexp matching in Postgres; if the simple solution is to move the matching into the Java code that would be great.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
满足您加速正则表达式的需求:
Java 的正则表达式引擎实现是基于 NFA 的。 因此,为了调整你的正则表达式,我想说你会受益于对引擎如何实现的更深入的了解。
因此,我指导您:掌握正则表达式这本书对 NFA 引擎进行了大量处理,并且它如何执行匹配,包括如何调整特定于 NFA 引擎的正则表达式。
此外,请查看原子分组来调整您的正则表达式。
Addressing your need to speed up your regexes:
Java's implementation of its regex engine is NFA based. As such, to tune your regexes, I would say that you would benefit from a deeper understanding of how the engine is implemented.
And as such I direct you to: Mastering Regular Expressions The book gives substantial treatment to the NFA engine and how it performs matches, including how to tune your regex specific to the NFA engine.
Additionally, look into Atomic Grouping for tuning your regex.
免责声明:我不是 java+regexes 方面的专家。 但是,如果我理解正确的话...
如果 Java 的正则表达式匹配器与大多数其他匹配器类似,那么它确实使用 NFA - 但不是您所期望的方式。 它不是您可能听说过的仅前向实现,而是使用回溯解决方案,该解决方案简化了子表达式匹配,并且可能是后向引用使用所必需的。 然而,它的交替性能很差。
您想看到: http://swtch.com/~rsc/regexp/regexp1.html (关于在这种改变的架构上表现不佳的边缘情况)。
我还写了一个问题,我认为可以归结为同一件事:
可以处理机器生成的正则表达式的正则表达式实现:*非回溯*,O(n)?
但基本上,由于一些非常奇怪的原因,所有常见的主要供应商正则表达式实现都非常糟糕在某些正则表达式上使用时的性能,即使这是不必要的。
Disclaimer: I'm not an expert on java+regexes. But, if I understand correctly...
If Java's regular expression matcher is similar to most others, it does use NFA's - but not the way you might expect. Instead of the forward-only implementation you may have heard about, it's using a backtracking solution which simplifies subexpression matching, and is probably required for Backreference usage. However, it performs alternation poorly.
You want to see: http://swtch.com/~rsc/regexp/regexp1.html (concerning edge cases which perform poorly on this altered architecture).
I've also written a question which I suppose comes down to the same thing:
Regex implementation that can handle machine-generated regex's: *non-backtracking*, O(n)?
But basically, it looks like for some very odd reason all common major-vendor regex implementaions have terrible performance when used on certain regexes, even though this is unnecessary.
免责声明:我是一名谷歌员工,而不是正则表达式专家。
有一堆比 JDK 更快的正则表达式库,其中之一是 dk.brics.automaton。 根据文章中链接的基准,它是比 JDK 实现快大约 20 倍。
该库由 Anders Møller 编写,并且已mavenized。
Disclaimer: I'm a googler, not an expert on regexes.
There is a bunch of faster-than-JDK regex libraries one of which is dk.brics.automaton. According to the benchmark linked in the article, it is approximately x20 faster than the JDK implementation.
This library was written by Anders Møller and had also been mavenized.