XML Parsing using SAX | How to handle special characters?
We have a Java application that pulls data from an SAP system, parses it, and renders it to users.
The data is pulled using the SAP JCo connector.
Recently we got the following exception:
org.xml.sax.SAXParseException: Character reference "" is an invalid XML character.
So, we are planning to write a new level of indirection where ALL special/illegal characters are replaced BEFORE parsing the XML.
My questions are:
- Is there an existing (open source) utility that replaces illegal characters in XML?
- Or, if I had to write such a utility, how should I handle them?
- Why is the above exception thrown?
Thank You.
Comments (4)
From my point of view, the source (SAP) should do the replacement. Otherwise, what it transmits to your program may look like XML, but is not.
Replacing '&' with '&amp;' can be done with a simple String.replaceAll(...) on the string returned by the toXML() call, but other characters can be harder to replace ('<' and '>', for example).
regards
Guillaume
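As a rough illustration of the '&' replacement described above, here is a minimal Java sketch. The class and method names are hypothetical (they are not part of SAP JCo or any library), and it assumes entity names consist only of ASCII letters and digits. It escapes only ampersands that do not already start an entity or character reference, so existing '&amp;' sequences are not double-escaped. It does not treat CDATA sections specially.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class AmpersandEscaper {
    // Matches '&' that is NOT the start of a named entity (&amp;),
    // a decimal reference (&#38;) or a hexadecimal reference (&#x26;).
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    static String escapeBareAmpersands(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // Only the first '&' is bare; the other two are left untouched.
        System.out.println(escapeBareAmpersands("<n>A & B &amp; C &#38; D</n>"));
    }
}
```

A blanket replaceAll("&", "&amp;") would also rewrite ampersands that are already escaped, which is why the pattern uses a negative lookahead.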
It sounds like a bug in their escaping. Depending on context, you might be best off writing your own version of their XMLWriter class that uses a real XML library, rather than trying to write your own XML utilities like the SAP developers did.
Alternatively, looking at the character code, �, you might be able to get away with replacing every occurrence of it with the empty string:
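The snippet that originally followed is not preserved in this copy, and the exact character reference from the error message is lost as well. As a stand-in, here is a minimal Java sketch that removes any numeric character reference whose code point falls outside the XML 1.0 `Char` range, rather than one specific reference. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class InvalidCharRefStripper {
    // Hexadecimal (&#x1F;) or decimal (&#31;) numeric character reference.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    // The Char production of XML 1.0.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String stripInvalidCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp;
            try {
                cp = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2), 10);
            } catch (NumberFormatException tooLarge) {
                cp = -1; // out of int range, so certainly not a legal character
            }
            // Keep legal references as they are, drop illegal ones.
            String replacement = isXml10Char(cp) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

This would be applied to the raw XML string before it is handed to the SAX parser. Dropping characters silently changes the data, so it may be worth logging what was removed.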
I've had a related, but opposite problem, where I was trying to insert character 1 into the output of an XSLT transformation. I considered post-processing to replace a marker with the zero, but instead chose to use an xsl:param.
If I were in your situation, I'd either come up with a bespoke encoding that replaces the characters which are invalid in XML and handle them as special cases in your parsing, or, if possible, replace them with whitespace.
I don't have experience with JCo, so I can't advise on how or where to replace the invalid characters.
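The whitespace option could look roughly like the following Java sketch. The class and method names are hypothetical. It walks the text by code point and replaces anything outside the XML 1.0 `Char` range with a space. Note that it handles raw characters only; an escaped reference to an illegal character is still plain ASCII text at this stage and would need separate treatment.

```java
// Hypothetical helper for illustration only.
final class InvalidXmlCharReplacer {
    // The Char production of XML 1.0.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String replaceInvalidChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            // Unpaired surrogates come back as 0xD800..0xDFFF and are replaced too.
            out.appendCodePoint(isXml10Char(cp) ? cp : ' ');
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Replacing with a space rather than deleting keeps word boundaries intact, at the cost of altering the original value.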
You can encode/decode non-ASCII characters in XML by using the escapeXml and unescapeXml methods of the Apache Commons Lang class StringEscapeUtils. See:
http://commons.apache.org/lang/api-2.4/index.html
To read about how XML character references work, search for "numeric character references" on Wikipedia.
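To make the idea of numeric character references concrete, here is a standard-library-only Java sketch that writes every non-ASCII code point as a decimal reference. It is not the Commons Lang implementation, and the class and method names are hypothetical. It does not escape markup characters such as '<' or '&'.

```java
// Hypothetical helper for illustration only.
final class NumericCharRefEncoder {
    static String encodeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp > 0x7F) {
                // e.g. U+00E9 becomes &#233;
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

A numeric reference cannot make an illegal character legal. XML 1.0 requires that a character reference resolve to a character in the allowed range, so a parser rejects a reference to, say, a control character, which is the kind of error reported in the question.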