Java tool for matching multiple regular expressions, in priority order, against multiple strings

Posted 2024-11-30 13:55:48

I have an unlimited sequence of strings and a large set of regular expressions ordered by priority. For each string in the sequence I have to find the first matching regular expression and the matched substring. The strings are not very long (under 1 KB), while the number of regular expressions may vary from hundreds to thousands.

I'm looking for a Java tool that would do this job efficiently. I guess the technique should be to build a DFA ahead of time.

My current option is JFlex. The problem I can't work around in JFlex is that its rules have no priorities; JFlex picks the rule that matches the longest part of the text.

Can my problem be solved with JFlex? If not, can you suggest another Java tool or technique that would work?

Comments (1)

聚集的泪 2024-12-07 13:55:48

You could use Java regexps (java.util.regex). Combine the alternatives into a single RE string, with each alternative surrounded by '(' and ')+?' and separated by '|', highest-priority REs first. The ')+?' suffix is a reluctant quantifier, so each group is matched once rather than repeated greedily, and '|' alternatives are evaluated left to right, so at any given starting position the highest-priority RE is tried first.
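A minimal sketch of how such a combined pattern could be built and queried. It is a variation on the description above, not the answer's exact construction: each alternative is wrapped in a named group (`p0`, `p1`, ...) instead of plain '(' and ')+?', so the matching alternative can be identified even when the sub-expressions contain capturing groups of their own. The class name, method names, and group names are hypothetical, and the sketch assumes the sub-expressions do not already use those group names.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PriorityAlternation {
    /** Joins the regexes, highest priority first, into one alternation of named groups p0, p1, ... */
    static Pattern combine(List<String> regexes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < regexes.size(); i++) {
            if (i > 0) sb.append('|');
            sb.append("(?<p").append(i).append('>').append(regexes.get(i)).append(')');
        }
        return Pattern.compile(sb.toString());
    }

    /** Returns "index:matchedText" for the first match that find() reports, or null if there is none. */
    static String firstMatch(Pattern combined, int count, String input) {
        Matcher m = combined.matcher(input);
        if (!m.find()) return null;
        for (int i = 0; i < count; i++) {
            // A group that did not take part in the match returns null.
            if (m.group("p" + i) != null) return i + ":" + m.group();
        }
        return null;
    }

    public static void main(String[] args) {
        String input = "zeroonetwothreefour";
        System.out.println(firstMatch(combine(List.of("one", "onetwo")), 2, input));           // 0:one
        System.out.println(firstMatch(combine(List.of("onetwo", "one")), 2, input));           // 0:onetwo
        System.out.println(firstMatch(combine(List.of("twothree", "onetwothree")), 2, input)); // 1:onetwothree
    }
}
```

The third call shows the limit of the approach: `Matcher.find()` reports the leftmost match, so a lower-priority alternative wins whenever it starts earlier in the input than any higher-priority one.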

For example, given the string "zeroonetwothreefour":

'(one)+?|(onetwo)+?' will match 'one'
'(onetwo)+?|(one)+?' will match 'onetwo'
'(twothree)+?|(onetwothree)+?' will match 'onetwothree'

Note especially the last example: 'onetwothree' matches even though 'twothree' is listed first, because the engine reports the leftmost match and 'twothree' only starts later in the target string. The order of the alternatives decides the winner only among alternatives that can match at the same starting position.
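If the highest-priority expression must win regardless of where in the string it matches, a single alternation is not sufficient, since `java.util.regex` always reports the leftmost match. One straightforward alternative, sketched here as an assumption rather than as part of the approach above, is to precompile every expression once and try them in priority order. The class and method names are hypothetical, and this is not a DFA: the cost per string grows linearly with the number of expressions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StrictPriorityMatcher {
    private final List<Pattern> patterns = new ArrayList<>();

    /** Compiles each regex once, up front; the list is expected in priority order, highest first. */
    StrictPriorityMatcher(List<String> regexesByPriority) {
        for (String regex : regexesByPriority) {
            patterns.add(Pattern.compile(regex));
        }
    }

    /** Returns "index:matchedText" for the highest-priority regex that matches anywhere, or null. */
    String firstMatch(String input) {
        for (int i = 0; i < patterns.size(); i++) {
            Matcher m = patterns.get(i).matcher(input);
            if (m.find()) return i + ":" + m.group();
        }
        return null;
    }

    public static void main(String[] args) {
        StrictPriorityMatcher matcher =
                new StrictPriorityMatcher(List.of("twothree", "onetwothree"));
        System.out.println(matcher.firstMatch("zeroonetwothreefour")); // 0:twothree
    }
}
```

With inputs under 1 KB this loop may be acceptable for a few hundred expressions, but it should be measured against the real workload before ruling out a DFA-based tool.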
