Java library to escape/clean XML?
I receive some malformed XML text input, for example:
"<Tag>something</Tag> 8 > 3, 2 < 3, ... <Tag>something</Tag>"
I want to clean the input to get:
"<Tag>something</Tag> 8 &gt; 3, 2 &lt; 3, ... <Tag>something</Tag>"
That is, escape special symbols such as < and > while keeping the valid tags ("<Tag>something</Tag>"; note, with the same case).
Do you know of any Java library that does this? Perhaps an XML/HTML parser? (Though I don't really need a parser, just a simple "clean" procedure.)
Comments (5)
JTidy is an "HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML".
It can also be used with XML; check the documentation. It's incredibly smart, and it will probably work for you.
I don't know of any library that would do that. Your input is malformed XML, and no proper XML parser would accept it. More importantly, it is not always possible to distinguish an actual tag from something that looks like a tag but is really text. Therefore any heuristic-based attempt to solve the problem will be fragile; that is, it could occasionally produce malformed XML.
The best approach is to address the problem before you assemble the XML. That means applying something like StringEscapeUtils.escapeXml to the relevant text chunks ... before the XML tags get incorporated. If you leave the problem until after the "XML" has been assembled, it cannot be properly fixed.
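To illustrate escaping before assembly, here is a minimal sketch that assumes the generating program builds its output by string concatenation. The class name, `escapeText`, and `element` are hypothetical names invented for this example; `escapeText` is a hand-written stand-in for a library escaper such as StringEscapeUtils.escapeXml, so the snippet needs no external dependency.

```java
// Sketch only: escape each text chunk before it is wrapped in tags.
// escapeText is a hand-written stand-in for a library escaper;
// element is a hypothetical helper, not part of any library.
class EscapeBeforeAssembly {

    // Replace the characters that are special in XML text content.
    static String escapeText(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Wrap a trusted tag name around escaped text.
    static String element(String tag, String text) {
        return "<" + tag + ">" + escapeText(text) + "</" + tag + ">";
    }

    public static void main(String[] args) {
        String xml = element("Tag", "something")
                + escapeText(" 8 > 3, 2 < 3, ... ")
                + element("Tag", "something");
        System.out.println(xml);
    }
}
```

Because the tags are added only after the text has been escaped, there is never any ambiguity about which angle brackets are markup.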
The best solution is to fix the program generating your text input. The easiest such fix would involve an escape utility like the ones the other answers suggest. If that's not an option, I'd use a regular expression to match the expected tags, and then split the string into tags (which you want to pass through unchanged) and the text between tags (to which you want to apply an escape method).
I wouldn't count on an XML parser to do this for you, because what you're dealing with isn't valid XML. The existing lack of escaping can produce ambiguities, so you might not be able to do a perfect job either.
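The split-and-escape idea can be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: the whitelist pattern assumes the only expected tag name is `Tag`, with no attributes, and `TagAwareEscaper`, `clean`, and `escapeText` are hypothetical names. Matching is case-sensitive, so only tags with the expected case pass through.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: pass whitelisted tags through unchanged and escape
// everything between them.
class TagAwareEscaper {

    // Assumption: "Tag" is the only expected tag name and tags carry no
    // attributes. Extend the alternation, e.g. (?:Tag|Other), as needed.
    private static final Pattern EXPECTED_TAG = Pattern.compile("</?(?:Tag)>");

    // Escape '&' first so the entities added afterwards are not re-escaped.
    static String escapeText(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    static String clean(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = EXPECTED_TAG.matcher(input);
        int last = 0;
        while (m.find()) {
            // Text before the tag gets escaped; the tag itself is copied as-is.
            out.append(escapeText(input.substring(last, m.start())));
            out.append(m.group());
            last = m.end();
        }
        // Escape whatever follows the final tag.
        out.append(escapeText(input.substring(last)));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            clean("<Tag>something</Tag> 8 > 3, 2 < 3, ... <Tag>something</Tag>"));
    }
}
```

The ambiguity mentioned above still applies: literal text that happens to look like a whitelisted tag is treated as a tag, and input that already contains entities such as `&amp;` would be escaped a second time.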
Check out Guava's XmlEscaper. It is in pre-release for version 11 but the code is available.
Apache Commons Lang contains a class named StringEscapeUtils which does exactly what you want! The method you'd want to use is escapeXml, I presume.