Using regex to make an XHTML file valid

Posted on 2024-10-17 05:12:42

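I'm trying to use PHP with SimpleXML to parse an XHTML file, but the file contains < and > signs that are not part of the markup, and they cause parsing to fail (opening and ending tag mismatches).

How can I convert these to HTML entities before parsing, without changing the file or affecting the markup?

Example:

<p> a < b </p>

Would become:

<p> a &lt; b </p>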



不奢求什么 2024-10-24 05:12:42

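The short answer is: you can't parse HTML with regex.

Maybe you could try another XML parser that doesn't choke on < and >?

Better yet, don't try to parse the XHTML file as XML at all, because, as you have already pointed out yourself, it isn't really an XML file: it contains illegal characters.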
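One way to follow this advice, sketched here under assumptions: hand the file to libxml's lenient HTML parser through `DOMDocument::loadHTML()` and then bridge to SimpleXML with `simplexml_import_dom()`. The helper name `loadLenientHtml` is hypothetical and not from the thread. How the stray `<` itself ends up in the tree depends on the libxml version, so check the text content before relying on it.

```php
<?php
// Hypothetical helper (an assumption, not from the thread): parse tag soup
// with libxml's forgiving HTML parser instead of the strict XML parser.
function loadLenientHtml(string $html): ?SimpleXMLElement
{
    // Collect parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    // Expose the repaired DOM tree through the SimpleXML API.
    return simplexml_import_dom($doc);
}

$xml = loadLenientHtml('<html><body><p> a < b </p></body></html>');
```

The source file is left untouched; only the in-memory tree is repaired, which matches the constraint in the question.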


自由如风 2024-10-24 05:12:42

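As Martin Jespersen already said, there is no good way to parse (invalid or valid) markup with regexes, at least not with PHP regexes.

That said, if you're only looking for a way to escape unbalanced angle brackets that sit between valid tags, and those tags do not contain angle brackets anywhere inside their attribute values, then you might get away with doing this:

// Escape a lone "<" found between the end of one tag and the start of the next.
$intermediate = preg_replace('/(>[^<>]*)<([^<>]*<)/', '\1&lt;\2', $subject);
// Escape a lone ">" found in the same position.
$result = preg_replace('/(>[^<>]*)>([^<>]*<)/', '\1&gt;\2', $intermediate);

But you'd have to run this several times, until there are no more matches, because each pass only catches one stray < or > between a given pair of tags. It will also fail on pseudo-balanced brackets like <p> a <> b </p>.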
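The "run it until nothing matches" step can be sketched as a loop that stops once a full pass leaves the string unchanged. This is a minimal sketch, assuming the replacements are meant to emit `&lt;` and `&gt;`; the function name `escapeStrayBrackets` is hypothetical, and the same limitations described in the answer still apply.

```php
<?php
// Hypothetical helper (not from the original answer): repeat both
// replacements until a full pass changes nothing.
function escapeStrayBrackets(string $subject): string
{
    do {
        $before = $subject;

        // A lone "<" between the end of one tag and the start of the next.
        $subject = preg_replace('/(>[^<>]*)<([^<>]*<)/', '\1&lt;\2', $subject) ?? $subject;

        // A lone ">" between the end of one tag and the start of the next.
        $subject = preg_replace('/(>[^<>]*)>([^<>]*<)/', '\1&gt;\2', $subject) ?? $subject;
    } while ($subject !== $before);

    return $subject;
}

echo escapeStrayBrackets('<p> a < b < c </p>'), "\n";
```

Each pass fixes at most one stray bracket per gap between tags, so the loop runs once more than the largest number of stray brackets in any single gap. Pseudo-balanced input such as `<p> a <> b </p>` comes back unchanged.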

