How do I load unsanitized XML?

Posted on 07-30 03:43

We have various XML files produced by an application in current distribution. Some of these files have turned out to contain invalid characters, which makes them invalid XML. In most cases they won't load unless all validation is turned off, and even then only with XmlDocument, not XDocument.

As this app is already out there, we have to cope with the files it produces. Now, I could keep adding to a Sanitizer type that knows what to look for and how to fix it before trying to load the document, but I was hoping that someone had already put in the effort to produce something that does this efficiently (such as a SanitizedXmlReader class).

This question touches on the same topic but I didn't find a satisfactory answer there. All we want is to remove the content that is invalid in any place in an XML file (rather than data that is valid in say CDATA only or when not used in a QName).

So, does such a thing exist that can take an "almost" XML file and turn it into an "at least there are no invalid characters" XML file? If not, rolling our own is the next option. In that case, instead of spending time interpreting the XML specification to determine which characters are illegal in all situations, is there a definitive list somewhere?

Comments (2)

债姬, 2024-08-06 03:43:44

I used SGMLReader a few years ago to load crappy HTML code. It may help you parse invalid XML too.

PS: Meanwhile there's a NuGet package, and the sources are available on GitHub.

醉生梦死, 2024-08-06 03:43:44

Problems

If you do end up writing your own, knowing which characters are valid is definitely a little tricky.

XML 1.1 changed the rules, but let's assume that nobody uses it ('cause hardly anyone does), and stick to 1.0.

XML 1.0 Fifth Edition also changed the rules relative to earlier editions, but not in any way you can tell from the document itself. It simplified some things with regard to Unicode, but against the recommendations of some of the original spec authors. Let's also pretend this issue doesn't exist.

Answer

Java has this nice little class, XmlChar, which has methods that you can use to determine which characters are valid for which constructs. .NET doesn't, but the Mono project includes the source to a System.Xml.XmlChar which might help you out.

You could probably start by filtering out all characters which are definitely not allowed anywhere. The XmlChar.IsValid(char c) method from the above Mono class should help.
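
As a rough sketch of that first filtering pass: the XML 1.0 specification (section 2.2, the Char production) allows only #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF, so anything outside those ranges is illegal everywhere in a document. The example below is written in Python purely for brevity; the names sanitize_xml10 and is_valid_xml10_char are made up for illustration, and the same ranges should port directly to a C# filter (in .NET 4.0 and later, XmlConvert.IsXmlChar appears to cover the per-character check). It also drops numeric character references such as &#x1; that point at illegal code points, on the assumption that the application may emit those as well.

```python
import re
import xml.etree.ElementTree as ET

# XML 1.0 "Char" production (spec section 2.2):
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# This pattern matches the complement: code points that are illegal anywhere.
_INVALID_XML10_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

# Numeric character references, e.g. &#x1; (hex) or &#11; (decimal).
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_valid_xml10_char(code_point: int) -> bool:
    """Return True if the code point matches the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def _drop_invalid_ref(match: "re.Match[str]") -> str:
    """Keep a character reference only if it points at a legal code point."""
    hex_digits, dec_digits = match.group(1), match.group(2)
    code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
    return match.group(0) if is_valid_xml10_char(code_point) else ""


def sanitize_xml10(text: str) -> str:
    """Remove raw characters and character references illegal in XML 1.0."""
    text = _INVALID_XML10_CHARS.sub("", text)
    return _CHAR_REF.sub(_drop_invalid_ref, text)


# Hypothetical input: a NUL, a vertical tab and an illegal &#x1; reference.
dirty = "<root>ok\x00\x0b&#x1;&#x41;\t</root>"
root = ET.fromstring(sanitize_xml10(dirty))  # parses once the debris is gone
```

This only deals with characters that are invalid everywhere. It does not repair other well-formedness problems (unescaped ampersands, mismatched tags), and it works on decoded text, so the file has to be read with the right encoding first.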

It would be interesting to know what other types of bad XML that application produces.
