How to fix a BBcode regular expression

Posted on 2024-11-28 17:30:57

I have a regular expression that grabs BBcode tags. It works great except for a minor glitch.

Here is the current expression:

\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]

Here is some text it successfully matches against and the groups it builds:

[url=http://www.google.com]Go to google![/url]
1: url
2: http://www.google.com
3: Go to google!
[img]http://www.somesite.com/someimage.jpg[/img]
1: img
2: NULL
3: http://www.somesite.com/someimage.jpg
[quote][quote]first nested quote[/quote][quote]second nested quote[/quote][/quote]
1: quote
2: NULL
3: [quote]first nested quote[/quote][quote]second nested quote[/quote]

All of this is great. I can handle nested tags by running the 3rd match group against the same regex and recursively handle all tags that are nested. The problem is with the example using the [quote] tags. Notice that the 3rd match group is a set of two quote tags, so we would expect two matches. However, we get one match, like this:

[quote]first nested quote[/quote][quote]second nested quote[/quote]
1: quote
2: NULL
3: first nested quote[/quote][quote]second nested quote
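
The recursive approach described above, and the way it collapses sibling tags, can be reproduced with a short sketch. Python's `re` module is an assumption here (the question does not name a regex engine), and `PATTERN` and `parse` are illustrative names. The pattern itself is the question's original, unchanged.

```python
import re

# The question's original pattern, unchanged. \x22 is a double quote.
PATTERN = re.compile(r"\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]")


def parse(text):
    """Return (tag, value, content, children) tuples for every match.

    Children come from re-running the same pattern on group 3,
    which is the recursive handling the question describes.
    """
    return [
        (m.group(1), m.group(2), m.group(3), parse(m.group(3)))
        for m in PATTERN.finditer(text)
    ]


nested = (
    "[quote][quote]first nested quote[/quote]"
    "[quote]second nested quote[/quote][/quote]"
)
outer = parse(nested)[0]

print(len(outer[3]))   # 1, although two [quote] children were expected
print(outer[3][0][2])  # first nested quote[/quote][quote]second nested quote
```

The greedy `.+` runs to the last `[/quote]` it can find, so the two siblings are swallowed as one tag whose content contains a stray closing and opening tag.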

Ahhhh! That's not what we wanted at all. There is a fairly simple fix: I change the regex from this:

\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]

To this:

\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](((?!\[/\1\]).)+)\[/\1\]

The added ((?!\[/\1\]).) consumes one character at a time and refuses any position where the closing BBcode tag begins, so the 3rd match group can never contain that closing tag. This works, and we now get two matches:

[quote]first nested quote[/quote][quote]second nested quote[/quote]
[quote]first nested quote[/quote]
1: quote
2: NULL
3: first nested quote
[quote]second nested quote[/quote]
1: quote
2: NULL
3: second nested quote

I was happy that fixed it, but now we have another problem. This new regex fails on the earlier example, where two quote tags are nested under one larger quote tag. We get two matches instead of one:

[quote][quote]first nested quote[/quote][quote]second nested quote[/quote][/quote]
[quote][quote]first nested quote[/quote]
1: quote
2: NULL
3: [quote]first nested quote
[quote]second nested quote[/quote]
1: quote
2: NULL
3: second nested quote

The first match is all wrong and the second match, while well-formed, is not a desired match. We wanted one big match with the 3rd match group being the two nested quote tags, like when we used the first expression.
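
Both behaviours of the modified pattern can be checked side by side. This is a minimal sketch assuming Python's `re` module; `TEMPERED` and `contents` are illustrative names, and the pattern is the question's modified one, unchanged.

```python
import re

# The question's modified pattern, unchanged. The extra parentheses around
# the lookahead-plus-dot add a fourth group, so group 3 is read by number.
TEMPERED = re.compile(
    r"\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](((?!\[/\1\]).)+)\[/\1\]"
)


def contents(text):
    """Return group 3 (the tag content) of every match in text."""
    return [m.group(3) for m in TEMPERED.finditer(text)]


siblings = "[quote]first nested quote[/quote][quote]second nested quote[/quote]"
nested = "[quote]" + siblings + "[/quote]"

print(contents(siblings))  # ['first nested quote', 'second nested quote']
print(contents(nested))    # ['[quote]first nested quote', 'second nested quote']
```

The lookahead stops at the first closing tag with the same name regardless of how many opening tags came before it. A pattern of this shape has no way to count depth, which is why the nested case breaks.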

Any suggestions? If I can just cross this gap I should have a fairly powerful BBcode expression.


冬天旳寂寞 2024-12-05 17:30:58

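Using balancing groups (a .NET regex feature) you can construct a regex like the one below. It is written in free-spacing style, so it needs RegexOptions.IgnorePatternWhitespace: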

(?>
  \[ (?<tag>[^][/=\s]+) \s*
  (?: = \s* (?<val>[^][]*) \s*)?
  ]
)

(?<content>
  (?>
    \[(?<innertag>[^][/=\s]+)[^][]*]
    |
    \[/(?<-innertag>\k<innertag>)]
    |
    [^][]+
  )*
  (?(innertag)(?!))
)

\[/\k<tag>]

Simplified according to Kobi's example.

Given the following input:

[foo=bar]baz[/foo]
[b]foo[/b]
[i][i][foo=bar]baz[/foo]foo[/i][/i]
[i][i][i][i]foo[/i][/i][/i][i][i]foo[/i][/i][/i]
[quote][quote][b][img]foo[/img][b]bold[/b][b][b]deep[/b][/b][/b][/quote]bar[quote]baz[/quote][/quote]

It finds these matches:

[foo=bar]baz[/foo]
[b]foo[/b]
[i][i][foo=bar]baz[/foo]foo[/i][/i]
[i][i][i][i]foo[/i][/i][/i][i][i]foo[/i][/i][/i]
[quote][quote][b][img]foo[/img][b]bold[/b][b][b]deep[/b][/b][/b][/quote]bar[quote]baz[/quote][/quote]

Full example at http://ideone.com/uULOs

(Old version http://ideone.com/AXzxW)
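
Balancing groups are specific to .NET. In an engine without them, the same depth tracking can be done with an explicit stack instead of a single regex. The sketch below is a different technique from the answer's, written in Python as an assumption; `TAG`, `parse_bbcode` and the dictionary keys are illustrative names. It also assumes well-formed input: where the regex would simply skip an unbalanced tag, this sketch raises `ValueError`.

```python
import re

# One token: an opening tag [name] / [name=value] or a closing tag [/name].
TAG = re.compile(r"\[(/?)([^\]\[/=\s]+)\s*(?:=\s*([^\]\[]*))?\]")


def parse_bbcode(text):
    """Return the top-level tags in text as nested dicts.

    Each dict has the keys tag, value, content, text and children.
    Raises ValueError on a mismatched or unclosed tag.
    """
    roots = []
    stack = []  # currently open tags, innermost last
    for m in TAG.finditer(text):
        closing, name, value = m.group(1), m.group(2), m.group(3)
        if not closing:
            stack.append({"tag": name, "value": value, "children": [],
                          "start": m.start(), "content_start": m.end()})
            continue
        if not stack or stack[-1]["tag"] != name:
            raise ValueError(f"unexpected closing tag {m.group(0)!r}")
        node = stack.pop()
        node["content"] = text[node.pop("content_start"):m.start()]
        node["text"] = text[node.pop("start"):m.end()]
        # Attach to the enclosing tag, or to the result if none is open.
        (stack[-1]["children"] if stack else roots).append(node)
    if stack:
        raise ValueError(f"unclosed tag [{stack[-1]['tag']}]")
    return roots


nested = (
    "[quote][quote]first nested quote[/quote]"
    "[quote]second nested quote[/quote][/quote]"
)
root = parse_bbcode(nested)[0]
print(root["content"])
# [quote]first nested quote[/quote][quote]second nested quote[/quote]
print([child["content"] for child in root["children"]])
# ['first nested quote', 'second nested quote']
```

Because the children are collected during the same pass, no recursive re-matching of the content is needed.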
