Remove all tag attributes from every closing tag found in a poorly written XML string

Posted on 2024-10-08 17:29:24

I'm trying to use preg_replace() to sanitize poorly written XML.

$x = '<abc x="y"><def x="g">more test</def x="g"><blah>test data</blah></abc x="y">';

The logic is to check if there's a space within a closing tag </ > and delete everything from the space to the end of the tag.

Desired result:

<abc x="y"><def x="g">more test</def><blah>test data</blah></abc>

戏剧牡丹亭 2024-10-15 17:29:24

This should do it:

preg_replace('/<\/(\w+)\s*[^>]*>/', '</\1>', $x);

少跟Wǒ拽 2024-10-15 17:29:24

A regex might actually be feasible in this case:

$xml = preg_replace("#(</(\w+:)?\w+)\s[^>]+>#", "$1>", $xml);

Edit: fixed as per @netcoder's hint. Made space mandatory before garbage.

The obvious pitfalls are of course comments (unlikely for data XML), and CDATA sections (from the looks of your xml also not likely).

Alternatively, you could try QueryPath; it's supposed to work with XML too and might be more resilient in these cases. How did the XML get garbled anyway?
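The comment and CDATA pitfalls named above can be sidestepped by letting the regex match those regions first and hand them back unchanged. The following is only a sketch under assumptions: the function name `stripClosingTagAttributesSafely` is hypothetical, the arrow function needs PHP 7.4 or later, and the junk part uses `[^>]*` rather than `[^>]+` so that a lone trailing space is also removed.

```php
<?php
// Hypothetical helper, not part of the original answer.
function stripClosingTagAttributesSafely(string $xml): string
{
    // Alternatives are tried left to right at each position: comments and
    // CDATA sections are consumed whole, so closing-tag lookalikes inside
    // them are never rewritten. Only the third alternative sets group 1.
    $pattern = '#<!--.*?-->|<!\[CDATA\[.*?\]\]>|(</(?:\w+:)?\w+)\s[^>]*>#s';

    return preg_replace_callback(
        $pattern,
        // Group 1 is present only for a real closing tag; otherwise keep the match as is.
        static fn(array $m): string => isset($m[1]) ? $m[1] . '>' : $m[0],
        $xml
    ) ?? $xml; // fall back to the input if PCRE reports an error
}

$xml = '<a><!-- </b x="1"> --><![CDATA[</c y="2">]]><d>t</d z="3"></a q="4">';
echo stripClosingTagAttributesSafely($xml), "\n";
// <a><!-- </b x="1"> --><![CDATA[</c y="2">]]><d>t</d></a>
```

The single-pass alternation keeps the skip logic and the rewrite in one pattern, so there is no need to cut the document into pieces first.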

雨巷深深 2024-10-15 17:29:24

preg_replace('/<\/(.*?)\s+[^>]+>/', '</$1>', $string);

Edit: tested, works.
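One edge case may be worth checking with this pattern: `.*?` is allowed to match `>`, so a clean closing tag followed by whitespace can be pulled into the match. The sketch below compares the pattern above with a variant that restricts the tag name to `[^\s>]+`; the function name `stripClosingTagAttributes` and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper: the tag name may contain neither whitespace nor ">",
// so a match can never run past the end of a single closing tag.
function stripClosingTagAttributes(string $xml): string
{
    return preg_replace('/<\/([^\s>]+)\s+[^>]*>/', '</$1>', $xml) ?? $xml;
}

$sample = '<a>one</a> <b>two</b x="1">';

// The lazy version starts at "</a", crosses the ">" and swallows " <b>".
echo preg_replace('/<\/(.*?)\s+[^>]+>/', '</$1>', $sample), "\n";
// <a>one</a>>two</b>

echo stripClosingTagAttributes($sample), "\n";
// <a>one</a> <b>two</b>
```

On the string from the question both versions give the same result, since no whitespace follows a closing tag there.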

如歌彻婉言 2024-10-15 17:29:24

Try:

preg_replace('/<\/(\w+)[^>]*>/', '</$1>', $x); // (\w+) captures the whole tag name; [^>]* cannot run past the tag's own ">"

Code not tested

所有深爱都是秘密 2024-10-15 17:29:24

You can also use T-Regx library:

Here is @Jonah's example rewritten with it:

pattern('<\/(.*?)\s+[^>]+>')->replace($string)->all()->withReferences('</$1>');

PS: Note that using with() instead would quote the placeholders, so $1 would be inserted literally rather than as a group reference.

就像说晚安 2024-10-15 17:29:24

Match the leading portion of the closing tag with </\w+, then forget those characters with \K so they are excluded from the match, then match a literal space followed by zero or more non-greater-than symbols with [^>]*, then look ahead for the closing greater-than symbol with (?=>). Replace that match with an empty string.

$x = '<abc x="y"><def x="g">more test</def x="g"><blah>test data</blah></abc x="y">';

echo preg_replace('#</\w+\K [^>]*(?=>)#', '', $x);
// <abc x="y"><def x="g">more test</def><blah>test data</blah></abc>
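Because `\w+` stops at `:` and `-`, the pattern above leaves namespaced or hyphenated closing tags untouched. A possible extension, offered only as a sketch: the character class `[\w:.-]+` is an assumed approximation of XML name characters, `\s+` accepts any whitespace instead of a single literal space, and the function name `trimClosingTags` is hypothetical.

```php
<?php
// Hypothetical helper building on the \K technique shown above.
function trimClosingTags(string $xml): string
{
    // "</" plus the tag name is matched and then dropped from the match by \K;
    // what remains (whitespace and junk up to, but not including, ">") is deleted.
    return preg_replace('#</[\w:.-]+\K\s+[^>]*(?=>)#', '', $xml) ?? $xml;
}

$xml = '<ns:a x="y"><b-c>t</b-c  id="1"></ns:a' . "\n" . 'x="y">';
echo trimClosingTags($xml), "\n";
// <ns:a x="y"><b-c>t</b-c></ns:a>
```

Opening tags are never touched because every match has to begin with `</`.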
