Can I determine the set of first characters matched by a regex pattern?

Posted 2024-07-17 06:03:32

I would like to be able to compute the set of all characters which may be matched as the first character in a string by a given instance of java.util.regex.Pattern. More formally, given the DFA equivalent to a certain regular expression, I want the set of all outgoing transitions from the start state.

An example:

Pattern p = Pattern.compile("[abc]def|daniel|chris|\\s+");
Set<Character> first = getFirstSet(p);

The set first should contain the following elements:

{ 'a', 'b', 'c', 'd', ' ', '\t', '\n', '\u000B', '\f', '\r' }

Any ideas? I'm well aware that I could construct the DFA myself and determine the relevant states that way, but I'd like to avoid that kind of hassle (read: it's not worth that much to me). Note that my host language is actually Scala, so I have access to all of the core Scala libs (for what it's worth).

Comments (2)

没︽人懂的悲伤 2024-07-24 06:03:32

I think you could parse the regular expression and define a recursive function that walks the parsed expression from left to right, building up such a set of first characters.

Some things are simple:

  • Sequence: first(r1r2) = first(r1) if '' is not in first(r1); otherwise (first(r1) without '') + first(r2)
  • Alternation: first(r1|r2) = first(r1) + first(r2)
  • Iteration: first(r*) = first(r) + { '' }
  • Characters: first(c) = { c }
  • Character classes: first([c1-cn]) = set(c1, c2, ..., cn)
    ...

Extend this to all primitives and special flags your regular expression dialect knows and you are good to go.
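
The rules above, where '' stands for the empty string, can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: java.util.regex does not expose a public parse tree, so the `Node` classes, the `First` result type, and the `lit`/`plus` helpers are hypothetical names introduced here, and the AST is built by hand instead of by a real parser. A `nullable` flag plays the role of '' being in the set.

```java
import java.util.Set;
import java.util.TreeSet;

final class FirstSets {
    /** Hypothetical mini-AST; a real solution would need its own regex parser. */
    interface Node {}

    /** A single literal character or a (non-negated) character class. */
    static final class Chars implements Node {
        final Set<Character> set = new TreeSet<>();
        Chars(String members) { for (char c : members.toCharArray()) set.add(c); }
    }
    static final class Seq implements Node {
        final Node left, right;
        Seq(Node left, Node right) { this.left = left; this.right = right; }
    }
    static final class Alt implements Node {
        final Node left, right;
        Alt(Node left, Node right) { this.left = left; this.right = right; }
    }
    static final class Star implements Node {
        final Node inner;
        Star(Node inner) { this.inner = inner; }
    }

    /** First characters plus a flag that is true when the empty string matches. */
    static final class First {
        final Set<Character> chars = new TreeSet<>();
        boolean nullable;
    }

    static First first(Node n) {
        First out = new First();
        if (n instanceof Chars) {
            out.chars.addAll(((Chars) n).set);
        } else if (n instanceof Alt) {
            First l = first(((Alt) n).left);
            First r = first(((Alt) n).right);
            out.chars.addAll(l.chars);
            out.chars.addAll(r.chars);
            out.nullable = l.nullable || r.nullable;
        } else if (n instanceof Seq) {
            First l = first(((Seq) n).left);
            out.chars.addAll(l.chars);
            if (l.nullable) {               // the left part may be skipped entirely
                First r = first(((Seq) n).right);
                out.chars.addAll(r.chars);
                out.nullable = r.nullable;
            }
        } else if (n instanceof Star) {
            out.chars.addAll(first(((Star) n).inner).chars);
            out.nullable = true;            // zero repetitions match the empty string
        } else {
            throw new IllegalArgumentException("unsupported node: " + n);
        }
        return out;
    }

    /** Builds a sequence of single characters from a non-empty literal. */
    static Node lit(String s) {
        Node n = new Chars(s.substring(0, 1));
        for (int i = 1; i < s.length(); i++) {
            n = new Seq(n, new Chars(s.substring(i, i + 1)));
        }
        return n;
    }

    /** r+ is treated as r followed by r*. */
    static Node plus(Node r) { return new Seq(r, new Star(r)); }

    public static void main(String[] args) {
        // Hand-built AST for [abc]def|daniel|chris|\s+
        // (\s is written out as the six characters Java documents for it).
        Node ws = new Chars(" \t\n\u000B\f\r");
        Node p = new Alt(new Alt(new Alt(
                new Seq(new Chars("abc"), lit("def")), lit("daniel")), lit("chris")), plus(ws));
        First f = first(p);
        System.out.println(f.chars.size() + " first characters, nullable = " + f.nullable);
    }
}
```

Returning a flag instead of putting '' into the set keeps the element type a plain `Character`. Negated classes, anchors, lookarounds and flags are not covered and would each need their own case.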

浮华 2024-07-24 06:03:32

You could solve it recursively ...

  • Strip off enclosing parentheses and call recursively.
  • Split at top-level alternatives and call recursively for each part.
  • If there are no alternatives,
    • output all symbols starting from the left up to and including the first non-optional symbol.
    • If there are character groups, output all of their symbols.

There are probably a lot of errors in this idea, but this is what I would try. You have to strip out assertions, group names and a thousand other things. And if you find an inverted character class like [^0-9], you have to output a lot of characters.

So I assume it is really a complex problem.
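
As a rough illustration of just the "split at top-level alternatives" step, here is a minimal Java sketch. The class and method names (`TopLevelSplit`, `splitAlternatives`) are made up for this example, and the scanner makes simplifying assumptions: it tracks backslash escapes, parenthesis depth and simple `[...]` classes, but does not handle `\Q...\E` quoting or nested character classes.

```java
import java.util.ArrayList;
import java.util.List;

final class TopLevelSplit {
    /** Splits a regex at every '|' that is outside groups and character classes. */
    static List<String> splitAlternatives(String regex) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;              // nesting level of '(' ... ')'
        boolean inClass = false;    // true while inside '[' ... ']'
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                // Copy an escaped character verbatim so "\|" or "\(" is never interpreted.
                current.append(c).append(regex.charAt(++i));
                continue;
            }
            if (inClass) {
                if (c == ']') inClass = false;
            } else if (c == '[') {
                inClass = true;
            } else if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            } else if (c == '|' && depth == 0) {
                parts.add(current.toString());
                current.setLength(0);
                continue;           // the separator itself is dropped
            }
            current.append(c);
        }
        parts.add(current.toString());
        return parts;
    }

    public static void main(String[] args) {
        for (String part : splitAlternatives("[abc]def|daniel|chris|\\s+")) {
            System.out.println(part);
        }
    }
}
```

Each returned part would then be handed to the recursive step, which still has to deal with optional symbols, quantifiers and negated classes.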
