Handling special characters in XML when transforming with Saxon
I'm attempting to apply a stylesheet to an XML document using Saxon. The XML file was generated in Microsoft Word and contains Word-style curly quotes, such as those around FOO in the following document:
<?xml version="1.0" encoding="UTF-8"?>
<doc>
<act>
<performer typeCode=“FOO“ />
<performer typeCode="BAR" />
</act>
</doc>
Saxon throws the following error:
SXXP0003: Error reported by XML parser: Invalid byte 1 of 1-byte UTF-8 sequence.
What is the best way to handle these kinds of "special" characters in XML that were intended to be valid but break parsing and transformation?
Comments (2)
Since the above is not well-formed XML, you will have to preprocess the input (say, with a FilterReader), because just about any XML parser will report an error (typically a fatal one, so you cannot handle the error and continue).
If the special quotes appear only in the XML markup, you can simply replace them with plain quotes (a little more work is needed if you have to check the XML declaration for the encoding). If you want to keep special quotes elsewhere in the document, you will have to do something a bit more complicated (mainly, keep track of whether you are inside a tag).
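A minimal sketch of the FilterReader idea, assuming it is acceptable to replace curly quotes everywhere in the file (text content included). The class name QuoteNormalizingReader and the helper normalizeAll are hypothetical, made up for this illustration; only java.io classes are used.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical filter: swaps Word-style curly quotes for plain ASCII quotes while reading. */
class QuoteNormalizingReader extends FilterReader {

    QuoteNormalizingReader(Reader in) {
        super(in);
    }

    // Map left/right curly double and single quotes to their ASCII counterparts.
    private static char normalize(char c) {
        switch (c) {
            case '\u201C':
            case '\u201D':
                return '"';
            case '\u2018':
            case '\u2019':
                return '\'';
            default:
                return c;
        }
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c < 0 ? c : normalize((char) c);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // n is -1 at end of stream, in which case the loop body never runs.
        for (int i = off; i < off + n; i++) {
            buf[i] = normalize(buf[i]);
        }
        return n;
    }

    /** Convenience helper for trying the filter on an in-memory string. */
    static String normalizeAll(String text) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new QuoteNormalizingReader(new StringReader(text))) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

To feed the result to the transformation, the filtered reader could be wrapped in a javax.xml.transform.stream.StreamSource, which accepts a Reader. The underlying InputStreamReader needs the file's real encoding: the "Invalid byte 1 of 1-byte UTF-8 sequence" message suggests the bytes are not actually UTF-8 despite the declaration, and Windows-1252 is a plausible guess for a Word export, but that is an assumption to verify against the actual file.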
The trouble is that those 'special' quotes are not valid XML. Saxon, or any other XML parser, is going to reject them and refuse to parse the document.
The only thing I can suggest is a search and replace that swaps them for the expected plain quotes.
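For a file small enough to hold in memory, the search and replace could be sketched as below. QuoteFixer and its methods are hypothetical names for this illustration, and the charset passed to decodeAndFix is something the caller has to determine; the code does not detect it.

```java
import java.nio.charset.Charset;

/** Hypothetical helper: whole-document search and replace of curly quotes. */
class QuoteFixer {

    // Replace curly double and single quotes with plain ASCII quotes.
    static String fixQuotes(String xml) {
        return xml.replace('\u201C', '"')
                  .replace('\u201D', '"')
                  .replace('\u2018', '\'')
                  .replace('\u2019', '\'');
    }

    // Decode the raw bytes with the caller-supplied charset, then fix the quotes.
    static String decodeAndFix(byte[] raw, Charset charset) {
        return fixQuotes(new String(raw, charset));
    }
}
```

The fixed string can then be handed to the parser through a StringReader. Because the parser is then reading characters rather than bytes, the encoding named in the XML declaration no longer has to match the original file.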