Remove anything between XML tags and their content
I need to remove anything between XML tags, especially whitespace and newlines.
For example, removing whitespace and newlines from:
</node> \n<node id="whatever">
to get:
</node><node id="whatever">
This is not meant for parsing XML by hand, but rather for preparing XML data before it is parsed by a tool. To be more specific, I'm using Hpricot (Ruby) to parse XML, and unfortunately we're currently stuck on version 0.6.164. I don't know about more recent versions, but this one often returns odd nodes (objects) that contain only whitespace and line breaks. So the idea is to clean up the XML before converting it into an Hpricot document. Alternative solutions are appreciated.
An example from a test: NoMethodError: undefined method `children' for "\n ":Hpricot::Text
The interesting part here is not the NoMethodError, because that's just fine, but that the Hpricot::Text element only contains a newline and nothing more.
Comments (5)
Please don't use regular expressions to parse XML. It's horribly error-prone.
Use a proper XML library, which will make this trivial. There are XML libraries available for just about every programming platform you could ask for - there's really no excuse to use a regular expression for XML.
A solution is to select all "blank" text nodes and remove them.
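A minimal sketch of that idea, using REXML from Ruby's standard distribution as a stand-in because the Hpricot 0.6 API is not shown in this thread. The helper name `remove_blank_text_nodes` and the sample document are made up for illustration; the same traversal would have to be adapted to whatever node classes the real parser exposes.

```ruby
require 'rexml/document'

# Hypothetical helper: recursively delete text nodes that hold only whitespace.
def remove_blank_text_nodes(element)
  # children returns a copy of the child list, so removing nodes
  # while iterating does not disturb the loop
  element.children.each do |child|
    case child
    when REXML::Element
      remove_blank_text_nodes(child)
    when REXML::Text
      child.remove if child.to_s.strip.empty?
    end
  end
  element
end

# Sample input (an assumption, modeled on the snippet in the question)
xml = "<root>\n  <node id=\"a\">text</node> \n<node id=\"whatever\"/>\n</root>"
doc = REXML::Document.new(xml)
remove_blank_text_nodes(doc.root)
```

After the call, `doc.root` has only element children, while the non-blank text inside the first `node` is left alone. Note that whitespace-only CDATA sections would also be removed by this sketch, since REXML models them as a kind of text node.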
It is generally not a good idea to parse XML using regular expressions. One of the major benefits of XML is that there are dozens of well-tested parsers out there for any language/framework that you might ever want. There are some tricky rules within XML that prevent any regular expression from being able to properly parse XML.
That said, something like:
(that is Perl syntax) might do what you want. It says to take anything from a greater-than sign up to a less-than sign and strip it away. The "g" at the end says to perform the substitution as many times as needed, and the "s" makes the "." match all characters, including newlines. Without "s", newlines would not be matched, so the pattern would need to be run once per line and would not cover gaps that span multiple lines.
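For the asker's Ruby setup, a narrower variation can be sketched as follows. It is an assumption rather than the pattern from this answer: it removes only runs of whitespace between a `>` and the next `<`, so text content is preserved. The method name `strip_inter_tag_whitespace` is hypothetical.

```ruby
# Hypothetical pre-cleaning step: drop whitespace-only gaps between tags.
# \s already matches newlines in Ruby, so no multiline flag is needed.
def strip_inter_tag_whitespace(xml)
  xml.gsub(/>\s+</, '><')
end

cleaned = strip_inter_tag_whitespace("</node> \n<node id=\"whatever\">")
```

Like any regex over XML, this is blind to context: it would also collapse whitespace inside CDATA sections and comments, and in mixed content where the whitespace between two elements is significant.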
You shouldn't use regex to parse XML or HTML. It's just not reliable, and there are far too many edge cases. Use an XML/HTML parser for this kind of thing instead.
Don't use regex. Try parsing the XML into a DOM and manipulating it from there. (What language/framework are you using?)