PHP: How to remove nested tags and relocate them un-nested?

Posted on 2024-10-24 07:40:07

I need to remove all occurrences of a BBCode-style tag from a string. The tags can be nested, and this is where I am failing. I also need to relocate each tag and its contents to the end of the string, and replace the tag with an HTML element. I have tried regex and preg_replace_callback, but so far without success. I also tried to adapt the answers to the following questions, again with no luck:
"Removing nested bbcode (quotes) in PHP"
and
"How can I remove an html element and it's contents using RegEx"

I don't think I can use an HTML parser for this, because the HTML is malformed (children in elements that can't have children).

Here is what the string looks like:

This is some 
[tag] attribute=1 attribute2=1 
     [tag] attribute=1 attribute2=1 [/tag] 
     [tag] attribute=1 attribute2=1 [/tag]
[/tag]
 text.

The result should look like this:

This is some text.
<br attribute=1 attribute2=1>
<br attribute=1 attribute2=1>
<br attribute=1 attribute2=1>

Any help would be appreciated.

Answer by 记忆消瘦, 2024-10-31 07:40:07

Street cred: I worked for Infopop (later known as Groupee, now Social Strata), the creators of UBBCode, the thing that was copied and transformed into just plain old regular "BBCode."

tl;dr: Time to write your own non-regex parser.

Most BBCode parsers use regexes, and that works for most cases, but you're doing something custom here. Plain old regular expressions are not going to help you. Regexes have two modes of operation that get in our way: we can either match everything between two tags in "greedy" mode, or in "not greedy" mode.

In "greedy" mode, we'll capture everything between the very first opening tag and the very last closing tag. This breaks things horribly. Take this case:

[a][b][c]...[/c][/b][/a]...[a]...[/a]

A greedy regex like \[a\].+\[/a\] is going to grab everything from that first opening tag to that last closing tag, ignoring the fact that the closer isn't closing the opener.
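
To make that failure concrete, here is a minimal PHP sketch. The sample string and variable names are made up for illustration; the pattern is the greedy one quoted above.

```php
<?php
// Illustrative input: one nested group, some text, then a second [a] pair.
$input = '[a][b][c]x[/c][/b][/a] between [a]y[/a]';

// Greedy: .+ consumes as much as possible, then backs up to the LAST [/a].
preg_match('~\[a\].+\[/a\]~s', $input, $m);

// The match spans both [a] groups and the text between them.
var_dump($m[0] === $input); // bool(true)
```

The two separate [a] elements are swallowed as one, so any replacement built on this match would treat " between " as tag content.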

The other option is worse. Take this case:

[a][b][a]...[/a][/b][/a]

An ungreedy regex like \[a\].+?\[/a\] (the only change is the question mark) is going to match the first opening tag, but then it'll match the first closing tag, again ignoring that the closing tag doesn't belong to the opening tag.
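
The same kind of check for the ungreedy pattern, again as a small sketch with an invented sample string:

```php
<?php
// Illustrative input: an [a] nested inside another [a].
$input = '[a][b][a]x[/a][/b][/a]';

// Ungreedy: .+? stops at the FIRST [/a] it can reach.
preg_match('~\[a\].+?\[/a\]~s', $input, $m);

echo $m[0], "\n"; // [a][b][a]x[/a]
```

The outer opening tag is paired with the inner closing tag, leaving "[/b][/a]" dangling after the match.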

The way I solved this way, way back in the primitive days was to completely ignore the fact that the opening and closing tags didn't match. I simply looped the entire chain of tag transformation regexes until the output stopped changing. It was simple and effective, mainly because the available tag set was intentionally limited, so nesting was never an issue.
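
A rough sketch of what such a loop could look like in PHP. This is not the original UBBCode implementation; the function name convertUntilStable, the $rules array and the pass limit are all assumptions made for illustration.

```php
<?php
/**
 * Apply every regex rule repeatedly until a pass changes nothing.
 * $rules maps a pattern to its replacement; $maxPasses is a safety cap.
 */
function convertUntilStable(string $text, array $rules, int $maxPasses = 50): string
{
    for ($pass = 0; $pass < $maxPasses; $pass++) {
        $next = preg_replace(array_keys($rules), array_values($rules), $text);
        if ($next === null || $next === $text) {
            break; // regex error, or the output has stopped changing
        }
        $text = $next;
    }
    return $text;
}

// Hypothetical, deliberately tiny tag set.
$rules = [
    '~\[b\](.+?)\[/b\]~s' => '<b>$1</b>',
    '~\[i\](.+?)\[/i\]~s' => '<i>$1</i>',
];

echo convertUntilStable('[b]bold [i]both[/i][/b]', $rules), "\n";
// <b>bold <i>both</i></b>
```

Each pass pairs tags without checking that they belong together, which is tolerable only while the tag set is small and identical tags are never nested on purpose.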

The instant you allow nesting of identical tags, blind, brute force is no longer a suitable tool.

If none of the BBCode parsing engines out there are going to work for you, you might have to write your own. Check all of them out first. There are some on PEAR, there's a PECL extension, etc. Also check other languages for inspiration: Perl's CPAN has a dozen different implementations, some of which are very powerful and complex (if there isn't a proper recursive descent parser in that mix, I'll be depressed). Writing your own is a good challenge, but it's not too hard. Then again, I've written like five now (none of which I can release), so maybe I'm biased?

Start by exploding the string on [ and ]. Go through the resulting array, keeping track of when the element following an opening bracket and preceding the next closing bracket looks like a valid tag and/or attributes. You're going to need to think about what happens when an attribute can contain a bracket, or worse, when URLs are bracket-heavy (like PHP array syntax). You'll also need to think about attributes in general, including how (if?) they are quoted, whether multiple attributes per tag are allowed (as in your example), and what to do with invalid attributes.
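
One possible shape for that first pass, as a hedged sketch. It assumes attributes sit inside the brackets as unquoted name=value pairs (for example [tag attribute=1]), which may not match your real syntax. The function name tokenize, the token layout and the $knownTags whitelist are invented here. Brackets inside attributes or URLs are not handled.

```php
<?php
/**
 * Split BBCode-like input into tokens:
 *   ['text', string], ['open', name, attributes], ['close', name].
 * $knownTags is a hypothetical whitelist of lowercase tag names.
 */
function tokenize(string $input, array $knownTags): array
{
    // Split on "[" and "]" but keep the brackets, so parts alternate
    // text, bracket, text, bracket, ..., text.
    $parts = preg_split('~([\[\]])~', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
    $tagBody = '~^(/?)([a-z]+)((?:\s+[a-z0-9]+=[a-z0-9]+)*)\s*$~i';

    $tokens = [];
    $text = '';
    for ($i = 0, $n = count($parts); $i < $n; $i++) {
        $isTag = $parts[$i] === '['
            && ($parts[$i + 2] ?? '') === ']'
            && preg_match($tagBody, $parts[$i + 1], $m)
            && in_array(strtolower($m[2]), $knownTags, true)
            && !($m[1] === '/' && $m[3] !== '');   // closers take no attributes

        if (!$isTag) {
            $text .= $parts[$i];   // not a valid tag: keep it as literal text
            continue;
        }
        if ($text !== '') {
            $tokens[] = ['text', $text];
            $text = '';
        }
        $name = strtolower($m[2]);
        $tokens[] = $m[1] === '/' ? ['close', $name] : ['open', $name, trim($m[3])];
        $i += 2;   // skip the tag body and the closing bracket
    }
    if ($text !== '') {
        $tokens[] = ['text', $text];
    }
    return $tokens;
}

print_r(tokenize('a [tag x=1]b[/tag] [nope]', ['tag']));
```

Anything that does not look like a whitelisted tag, such as [nope] here, falls through as plain text instead of being dropped.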

As you continue to process the string, you will also need to keep track of what tags are open, and in what order. You'll have to think about what tags are permitted inside other tags. You'll also have to deal with mis-nesting, like [a][b][/a][/b]. Your options will be either re-opening the inner tag after the outer closes, or closing the inner as soon as the outer does. Worse, different behavior might make sense depending on the situation. Worse-worse are wacky tags like [*] inside [list], which traditionally doesn't have a closing tag!
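
The bookkeeping described above can be sketched with a plain stack. Everything here is illustrative: the token shape (['open', name], ['close', name], ['text', string]), the function names rebalance and render, and the $reopen flag that switches between the two mis-nesting policies. Attributes and rules about which tags may contain which are left out to keep it short.

```php
<?php
/** Rebalance tokens so every open tag gets a matching close. */
function rebalance(array $tokens, bool $reopen = false): array
{
    $stack = [];   // names of currently open tags, outermost first
    $out = [];
    foreach ($tokens as $token) {
        if ($token[0] === 'open') {
            $stack[] = $token[1];
            $out[] = $token;
            continue;
        }
        if ($token[0] !== 'close') {
            $out[] = $token;   // plain text passes through
            continue;
        }
        // Find the innermost open tag with this name.
        $depth = -1;
        for ($i = count($stack) - 1; $i >= 0; $i--) {
            if ($stack[$i] === $token[1]) {
                $depth = $i;
                break;
            }
        }
        if ($depth < 0) {
            continue;   // stray closer with no opener: drop it
        }
        $inner = array_splice($stack, $depth + 1);   // tags still open inside it
        array_pop($stack);                           // the tag being closed
        foreach (array_reverse($inner) as $name) {
            $out[] = ['close', $name];               // close inner tags first
        }
        $out[] = $token;
        if ($reopen) {
            foreach ($inner as $name) {              // policy 2: re-open them
                $out[] = ['open', $name];
                $stack[] = $name;
            }
        }
    }
    while ($stack) {
        $out[] = ['close', array_pop($stack)];       // close leftovers at the end
    }
    return $out;
}

/** Debug helper: turn tokens back into bracket notation. */
function render(array $tokens): string
{
    $out = '';
    foreach ($tokens as $t) {
        $out .= $t[0] === 'text' ? $t[1] : ($t[0] === 'open' ? "[{$t[1]}]" : "[/{$t[1]}]");
    }
    return $out;
}

$misNested = [['open', 'a'], ['open', 'b'], ['close', 'a'], ['close', 'b']];
echo render(rebalance($misNested)), "\n";        // [a][b][/b][/a]
echo render(rebalance($misNested, true)), "\n";  // [a][b][/b][/a][b][/b]
```

Which policy is right depends on the tag, so a real engine would probably make the choice per tag and not globally.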

Once you've processed the string and have created a list of open and closing tags (and possibly re-balanced the opens and closes), then you can transform the result into HTML, or whatever your output ends up being. This is when and how you'd move the output of those specific tags to the end of the new document.
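
For the specific transformation in the question, the whole pipeline can be compressed into a small sketch. It rests on several assumptions that are not confirmed by the question: attributes are written inside the brackets as unquoted name=value pairs ([tag attribute=1 attribute2=1]), text nested inside a tag is discarded, and whitespace in the remaining text is collapsed. The function name relocateTags and its parameters are invented for illustration.

```php
<?php
/**
 * Remove every [tag ...] ... [/tag] (nested or not) from $input and append
 * one <element attrs> per opening tag, in document order.
 */
function relocateTags(string $input, string $tagName = 'tag', string $element = 'br'): string
{
    $name = preg_quote($tagName, '~');
    // The capture group keeps the tags in the result, so parts alternate
    // text, tag, text, tag, ..., text.
    $pattern = '~(\[/?' . $name . '(?:\s[^\[\]]*)?\])~i';
    $parts = preg_split($pattern, $input, -1, PREG_SPLIT_DELIM_CAPTURE);

    $body = '';
    $moved = [];
    $depth = 0;   // how many tags are currently open
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {              // even index: plain text
            if ($depth === 0) {
                $body .= $part;          // keep only text outside all tags
            }
            continue;
        }
        if ($part[1] === '/') {          // closing tag
            $depth = max(0, $depth - 1); // tolerate stray closers
            continue;
        }
        $depth++;
        // Whitelist attributes instead of copying raw user input into HTML.
        preg_match_all('~\b([a-z][a-z0-9]*)=([a-z0-9]+)~i', $part, $pairs, PREG_SET_ORDER);
        $attrs = '';
        foreach ($pairs as [, $key, $value]) {
            $attrs .= " $key=$value";
        }
        $moved[] = "<{$element}{$attrs}>";
    }

    $body = trim(preg_replace('~\s+~', ' ', $body));
    return $moved ? $body . "\n" . implode("\n", $moved) : $body;
}

$input = "This is some \n"
    . "[tag attribute=1 attribute2=1]\n"
    . "     [tag attribute=1 attribute2=1][/tag]\n"
    . "     [tag attribute=1 attribute2=1][/tag]\n"
    . "[/tag]\n"
    . " text.";

echo relocateTags($input), "\n";
```

Under those assumptions this prints "This is some text." followed by three `<br attribute=1 attribute2=1>` lines. It only counts depth and does not validate pairing, so it is a shortcut for this single-tag case and no substitute for a general BBCode engine.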

Once you've finished up, write a thousand test cases. Try to break it, blow it into itty bitty chunks, produce XSS vulnerabilities, and otherwise do your best to make your life hell. It will be worth it, because the result will be a BBCode engine that will do what you're trying to do.
