String from a servlet with control characters in XML CDATA
My question is similar to Why are "control" characters illegal in XML 1.0? - however I'm looking for a solution to the problem below, rather than why the XML spec disallows control characters in XML.
I have a servlet that prints a String containing XML upon user request. One particular element contains a CDATA section, which is required to hold some user input text.
Now it so happens that in one particular case, our user input contains the character U+0001 (a control character). And even though I specify the charset as UTF-8, the servlet throws an error:
Error: not well-formed
Location:
<![CDATA[
Is there a way I can process the Java String to make it "XML safe"? In particular, to make it safe when placed inside the CDATA section?
I hope my question is clear!
Thanks in advance,
Raj
Comments (1)
The only conforming way to make this XML-safe is to add your own encoding.
You can do one of these two (for example): define your own escape syntax for the illegal characters, such as \u0001 for U+0001, or encode the content as binary data (Base64, for instance).
Both of those approaches need explicit support in both the consumer and the producer. The second approach has the advantage of using well-defined data types with wide support, but if your content is actually text, you need to specify (or communicate) the encoding used in the byte stream (a necessity that XML itself otherwise removes).
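As a rough illustration of both kinds of custom encoding, here is a minimal Java sketch. The first pair of methods uses backslash escapes in the style of the \u0001 example. The second pair treats the byte-stream option as Base64 over UTF-8 bytes, which is an assumption made for this sketch. The class and method names are hypothetical, and the consumer would need the matching decoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration; not part of the servlet API or any library.
class XmlSafeEncoding {

    // True if the code point matches the XML 1.0 "Char" production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Option 1: write each illegal character as backslash, 'u', four hex digits.
    // A literal backslash is doubled so the consumer can reverse the mapping.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '\\') {
                out.append("\\\\");
            } else if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Every illegal code point is below 0x10000, so four digits suffice.
                out.append(String.format("\\u%04X", cp));
            }
        }
        return out.toString();
    }

    // Consumer side of option 1: reverses escape().
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') {
                out.append(c);
            } else if (i + 1 < s.length() && s.charAt(i + 1) == '\\') {
                out.append('\\');
                i += 1;
            } else if (i + 6 <= s.length() && s.charAt(i + 1) == 'u') {
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 5;
            } else {
                throw new IllegalArgumentException("Malformed escape at index " + i);
            }
        }
        return out.toString();
    }

    // Option 2 (assumed to be Base64): the encoding of the bytes, UTF-8 here,
    // must be agreed with the consumer separately.
    static String toBase64(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    static String fromBase64(String b64) {
        return new String(Base64.getDecoder().decode(b64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String input = "bad" + (char) 1 + "char";
        String escaped = escape(input);
        System.out.println(escaped);
        System.out.println(unescape(escaped).equals(input));
        String b64 = toBase64(input);
        System.out.println(b64);
        System.out.println(fromBase64(b64).equals(input));
    }
}
```

Neither output contains a control character. The escaped form can still contain the sequence ]]>, which ends a CDATA section and would need separate handling; standard Base64 output cannot contain it.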
If removing all non-transferable characters would be appropriate, then this regex should do the trick:
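The regex itself does not appear in this copy of the answer. As a stand-in, the following sketch removes every code point outside the XML 1.0 "Char" production using `java.util.regex.Pattern`. It is an assumption about what such a regex could look like, not the author's original expression, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class XmlCharStripper {

    // Negated class of the XML 1.0 "Char" production:
    // tab, LF, CR, U+0020..U+D7FF, U+E000..U+FFFD, U+10000..U+10FFFF.
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
        "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Returns the input with all characters that XML 1.0 forbids removed.
    static String stripInvalid(String s) {
        return INVALID_XML_CHARS.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "user" + (char) 1 + "input";
        System.out.println(stripInvalid(input));
    }
}
```

Because the pattern is compiled once and held in a constant, the servlet could apply `stripInvalid` to each piece of user input before writing it into the CDATA section. Stripping loses information, so it only fits when the control characters carry no meaning.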
Note that the spec, in a note, suggests that document authors be even more restrictive about the set of characters they use. A regex covering that would be a bit longer.
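A stricter filter might look like the sketch below. It assumes "more restrictive" refers to the discouraged ranges listed in the note in section 2.2 of the XML 1.0 specification (U+007F..U+0084, U+0086..U+009F, U+FDD0..U+FDEF, and the last two code points of each supplementary plane). It uses a code point loop instead of a longer regex, ignores the spec's separate advice about compatibility characters, and all names are hypothetical.

```java
// Hypothetical helper for illustration only.
class StrictXmlFilter {

    // True if the code point matches the XML 1.0 "Char" production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // True for code points that are legal but discouraged by the spec's note.
    static boolean isDiscouraged(int cp) {
        return (cp >= 0x7F && cp <= 0x84)
            || (cp >= 0x86 && cp <= 0x9F)
            || (cp >= 0xFDD0 && cp <= 0xFDEF)
            // Last two code points of planes 1 through 16.
            || (cp >= 0x10000 && (cp & 0xFFFE) == 0xFFFE);
    }

    // Keeps only code points that are both legal and not discouraged.
    static String stripStrict(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints()
         .filter(cp -> isXmlChar(cp) && !isDiscouraged(cp))
         .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "del" + (char) 0x7F + "ete";
        System.out.println(stripStrict(input));
    }
}
```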