Making an XHTML file valid with regular expressions
I'm trying to use PHP with SimpleXML to parse an XHTML file, however the file contains < and > signs which are not part of the markup and cause parsing to fail (opening and end tag mismatches).
How can I convert these to HTML entities before parsing without changing the file or affecting the markup?
Example:
<p> a < b </p>
Would become:
<p> a &lt; b </p>
Comments (2)
Well, the short answer is: you can't parse HTML with regex.
Maybe you could try using another XML parser that doesn't choke on the < and >? Better yet, don't try to parse an XHTML file as XML, since, as you already point out yourself, it isn't really an XML file and has illegal characters in it.
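One way to follow that advice in PHP, sketched under the assumption that the libxml-based DOM extension is available: load the markup with DOMDocument's lenient HTML parser, then hand the resulting tree to SimpleXML. The variable names and the sample string are illustrative only.

```php
<?php
// Sample input with a stray "<" in the text (illustrative only).
$markup = '<p> a < b </p>';

// Collect libxml parse warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
// loadHTML() uses libxml's forgiving HTML parser, not the strict XML parser.
$loaded = $doc->loadHTML($markup);

// Discard the collected warnings and restore the previous error mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xml = null;
$paragraphs = [];
if ($loaded) {
    // Continue with the familiar SimpleXML API on the repaired tree.
    $xml = simplexml_import_dom($doc);
    $paragraphs = $xml->xpath('//p');
    foreach ($paragraphs as $p) {
        echo trim((string) $p), PHP_EOL;
    }
}
```

Note that loadHTML() wraps a fragment in html and body elements, so an XPath query such as //p is a convenient way to reach the content regardless of the wrapper. It also guesses the character encoding when none is declared, which may matter for non-ASCII files.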
As Martin Jespersen already said, there is no good way to parse (invalid or valid) markup with regexes, at least not with PHP regexes.
That said, if you're only looking for a way to get rid of stray < and > characters, then you might get away with a regex replacement, but you'd have to run it several times until there are no more matches, because it will only catch one stray < or > between tags at a time. It will also fail on pseudo-balanced brackets like <p> a <> b </p>.
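A minimal sketch of the kind of repeated replacement described above. This is not necessarily the pattern the answer had in mind: the function name escapeStrayBrackets and the pattern are assumptions made for illustration, and the sketch converts the stray characters to entities (as the question asks) rather than deleting them.

```php
<?php
/**
 * Hypothetical helper: escape stray "<" / ">" found in the text between two tags.
 */
function escapeStrayBrackets(string $xhtml): string
{
    // ">" ending a tag, text, ONE stray bracket, more text, "<" starting the next tag.
    $pattern = '~(>[^<>]*)([<>])([^<>]*<)~';

    do {
        $result = preg_replace_callback(
            $pattern,
            function (array $m): string {
                // Pick the entity that corresponds to the stray character.
                $entity = $m[2] === '<' ? '&lt;' : '&gt;';
                return $m[1] . $entity . $m[3];
            },
            $xhtml,
            -1,
            $count
        );
        if ($result === null) {
            break; // regex engine error: keep the last good string
        }
        $xhtml = $result;
        // Each pass fixes at most one stray bracket per text run, so repeat.
    } while ($count > 0);

    return $xhtml;
}

$fixed = escapeStrayBrackets('<p> a < b </p>');
$xml = simplexml_load_string($fixed);
```

As the answer warns, this leaves pseudo-balanced input such as <p> a <> b </p> untouched, and it only looks at text that sits between two tags, so treat it as a stopgap rather than a general fix.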