Remove all tag attributes from all closing tags found in a poorly written XML string
I'm trying to use preg_replace() to sanitize poorly written XML.
$x = '<abc x="y"><def x="g">more test</def x="g"><blah>test data</blah></abc x="y">';
The logic is to check whether there is a space inside a closing tag (</ ... >) and, if so, delete everything from the space to the end of the tag.
Desired result:
<abc x="y"><def x="g">more test</def><blah>test data</blah></abc>
Comments (6)
This should do it:
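The snippet that originally followed is not preserved here. As a minimal sketch of the logic the question describes (not necessarily the code this answer posted), one option is to capture the closing tag's name and drop whatever follows the first whitespace. The helper name strip_closing_tag_attributes is hypothetical, and \w+ assumes tag names without namespace prefixes or hyphens.

```php
<?php
// Hypothetical helper: the name and the pattern are illustrative assumptions.
function strip_closing_tag_attributes(string $xml): string
{
    // </      start of a closing tag
    // (\w+)   tag name, captured as $1 (assumes no ':' or '-' in names)
    // \s      at least one whitespace character must follow the name
    // [^>]*   everything up to, but not including, the closing '>'
    // preg_replace() returns null on a regex error; fall back to the input then.
    return preg_replace('~</(\w+)\s[^>]*>~', '</$1>', $xml) ?? $xml;
}

$x = '<abc x="y"><def x="g">more test</def x="g"><blah>test data</blah></abc x="y">';
echo strip_closing_tag_attributes($x), "\n";
```

Closing tags that contain no whitespace, such as </blah>, do not match and are left untouched.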
A regex might actually be feasible in this case:
Edit: fixed as per @netcoder's hint. Made space mandatory before garbage.
The obvious pitfalls are of course comments (unlikely for data XML), and CDATA sections (from the looks of your xml also not likely).
Though you could still try QueryPath, it's supposed to work with XML too and might be resilient about these cases. How did it get garbled anyway?
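If comments or CDATA sections did turn up, one possible way to keep a regex from touching them is PCRE's (*SKIP)(*FAIL) idiom: match the whole comment or CDATA section first, then reject it so the engine moves past it. This is a sketch under that assumption, not something the answer proposes; the helper name is hypothetical, and whitespace after the tag name is mandatory, as in the edit above.

```php
<?php
// Hypothetical helper; the (*SKIP)(*FAIL) approach is an assumption of this sketch.
function strip_closing_tag_attributes_safely(string $xml): string
{
    $pattern = '~
        (?: <!--.*?--> | <!\[CDATA\[.*?\]\]> )  # a whole comment or CDATA section
        (*SKIP)(*FAIL)                          # is consumed, then rejected as a match
        |
        </\w+ \K \s [^>]*+ (?=>)                # whitespace and garbage after a closing tag name
    ~xs';

    // preg_replace() returns null on a regex error; fall back to the input then.
    return preg_replace($pattern, '', $xml) ?? $xml;
}

$x = '<a><!-- </b x="1"> --><![CDATA[</c x="2">]]></a x="y">';
echo strip_closing_tag_attributes_safely($x), "\n";
```

Only the real closing tag </a x="y"> is cleaned here; the look-alike closing tags inside the comment and the CDATA section are left as they are.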
Edit: tested, works.
Try:
Code not tested
You can also use the T-Regx library:
This with @Jonah's example:
PS: Notice that using with() would quote the placeholders.
Match the leading portion of the closing tag with </\w+ and then forget those characters with \K. Next, match a literal space followed by zero or more non-greater-than characters ([^>]*), then look ahead for the literal closing greater-than symbol with (?=>). Replace that match with an empty string.
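Assembled from the pieces described above, the call might look like the following sketch. The ~ delimiters and the variable name $clean are assumptions; the pattern components are the ones named in the answer.

```php
<?php
$x = '<abc x="y"><def x="g">more test</def x="g"><blah>test data</blah></abc x="y">';

// </\w+   leading portion of the closing tag
// \K      reset the match start, so the tag name is kept
// ' '     literal space, then [^>]* up to the end of the garbage
// (?=>)   lookahead for the closing '>' without consuming it
$clean = preg_replace('~</\w+\K [^>]*(?=>)~', '', $x);

echo $clean, "\n";
```

Because \K and the lookahead keep the tag name and the closing > out of the match, the replacement can be an empty string and no capture group is needed.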