Remove anything between XML tags and their content

Posted on 2024-07-27 22:41:34

I need to remove anything between XML tags, especially whitespace and line breaks.

For example, removing the whitespace and line breaks from:
\n<node id="any">

gives:
<node id="any">

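This is not meant for parsing XML by hand, but for preparing XML data before a tool parses it. More specifically, I'm using Hpricot (Ruby) to parse XML, and unfortunately we're currently stuck on version 0.6.164. I don't know about newer versions, but this one often returns weird nodes (objects) that contain only whitespace and line breaks. So the idea is to clean up the XML before converting it into an Hpricot document. Alternative solutions are appreciated.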

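An example from a test: NoMethodError: undefined method `children' for "\n ":Hpricot::Text
The interesting part here is not the NoMethodError, which is fine, but that the Hpricot::Text element contains just a line break and nothing more.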

不弃不离 2024-08-03 22:41:34

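Please don't use regular expressions to parse XML. It is very error prone.

Use a proper XML library, which will make this trivial. XML libraries are available for almost every programming platform you might need, so there is really no reason to use a regular expression on XML.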

苦妄 2024-08-03 22:41:34

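A solution is to select all "blank" (whitespace-only) text nodes and remove them.

require 'nokogiri'
doc = Nokogiri::XML(xml_source)
# normalize-space() is empty for whitespace-only text, so not(...) selects exactly those nodes
doc.xpath('//text()[not(normalize-space())]').remove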
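If Nokogiri is not available, the same idea (find whitespace-only text nodes and remove them) can be sketched with REXML, the XML library that ships with Ruby. This swaps in a different library from the one used in the answer; the helper name `strip_blank_text_nodes!` and the sample document are made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical helper: recursively removes text nodes that hold only whitespace.
def strip_blank_text_nodes!(parent)
  # Iterate over a copy so that removing nodes does not disturb the loop.
  parent.children.dup.each do |child|
    if child.instance_of?(REXML::Text)
      child.remove if child.value.strip.empty?
    elsif child.is_a?(REXML::Element)
      strip_blank_text_nodes!(child)
    end
  end
  parent
end

# Made-up sample input with indentation between the tags.
sample = "<root>\n  <node id=\"a\">text</node>\n  <node id=\"b\">\n  </node>\n</root>"
doc = REXML::Document.new(sample)
strip_blank_text_nodes!(doc)
```

After the call, `doc.root.children` should contain only the two `node` elements, and the text inside the first `node` is left untouched.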


行至春深 2024-08-03 22:41:34

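It is generally not a good idea to parse XML using regular expressions. One of the major benefits of XML is that there are many well-tested parsers for any language or framework you might need. XML has some tricky rules that prevent any regular expression from parsing it properly.

That said, something like:

s/>.*?</></gs

(that is Perl syntax) might do what you want. It takes everything from a greater-than sign up to the next less-than sign and replaces it with "><", which strips it away. The "g" at the end performs the substitution as many times as needed, and the "s" makes "." match all characters including newlines. Without it, newlines would not be matched, so the pattern would have to be run once per line and would not cover gaps between tags that span multiple lines.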
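Since the question is about Ruby, here is a rough equivalent of that substitution using `String#gsub`, where Ruby's `/m` flag plays the role of Perl's `/s`. The sample string and variable names are made up. The second pattern is a narrower variant that is not part of the answer: it only touches gaps consisting purely of whitespace, which may be closer to what the question needs.

```ruby
# Made-up sample input with indentation between the tags.
xml = "<root>\n  <node id=\"a\">keep me</node>\n  <node id=\"b\"/>\n</root>"

# Equivalent of s/>.*?</></gs: removes everything between ">" and the next "<",
# including real text content such as "keep me".
everything_removed = xml.gsub(/>.*?</m, '><')

# Narrower variant (an assumption about the intent): only whitespace-only gaps.
whitespace_removed = xml.gsub(/>\s+</, '><')

puts everything_removed
puts whitespace_removed
```

Both are plain text substitutions and know nothing about comments or CDATA sections, so the caveats about regular expressions and XML still apply.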
