String from a Servlet with control characters in an XML CDATA section

Posted on 2024-11-17 10:38:46

My question is similar to Why are "control" characters illegal in XML 1.0? - however I'm looking for a solution to the problem below, rather than why the XML spec disallows control characters in XML.

I have a servlet which, upon user request, prints a String containing XML. One particular element contains a CDATA section that must hold some user input text.

In one particular case, our user input contains the character U+0001 (a control character). Even though I specify the charset as UTF-8, the servlet output fails with an error:

Error: not well-formed
Location: 

<![CDATA[

Is there a way I can process the Java String to make it "XML safe"? In particular, to make it safe when placed in the CDATA section?

I hope my question is clear!

Thanks in advance,
Raj

三月梨花 2024-11-24 10:38:46

The only conforming way to make this XML-safe is to add your own encoding.

You can do one of these two things (for example):

Store all data as textual data and replace every forbidden character using some Unicode-escape mechanism (other than the one defined in XML itself!). For example, you could use the one used by Java: \u0001 for U+0001. Or
Store the data as binary data and use base64Binary or hexBinary to store it in XML.
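
The first option could be sketched as follows. This is only an illustration under stated assumptions: the class name XmlTextEscaper and its methods are hypothetical, and doubling literal backslashes is a choice made here so that decoding stays unambiguous. It is not something the XML specification prescribes, and the consumer has to apply the matching unescape step.

```java
final class XmlTextEscaper {
    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces each forbidden character with a Java-style backslash-u escape
    // and doubles literal backslashes so the result can be decoded again.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '\\') {
                out.append("\\\\");
            } else if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Every forbidden code point is below U+10000, so four hex digits suffice.
                out.append(String.format("\\u%04X", cp));
            }
        }
        return out.toString();
    }

    // Reverses escape(); assumes the input was produced by escape().
    static String unescape(String escaped) {
        StringBuilder out = new StringBuilder(escaped.length());
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c != '\\') {
                out.append(c);
            } else if (escaped.charAt(i + 1) == '\\') {
                out.append('\\');
                i += 1;                       // skip the second backslash
            } else {
                // A 'u' followed by exactly four hex digits.
                out.append((char) Integer.parseInt(escaped.substring(i + 2, i + 6), 16));
                i += 5;                       // skip 'u' and the four digits
            }
        }
        return out.toString();
    }
}
```

With this sketch, the producer would call XmlTextEscaper.escape(userInput) before writing the CDATA section, and the consumer would call unescape on the text it reads back.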

Both of these approaches need explicit support in both the consumer and the producer. The second approach has the advantage of using well-defined data types with wide support, but if your content is actually text, you need to specify (or communicate) the encoding used in the byte stream (something that XML itself would otherwise make unnecessary).
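
As a rough illustration of the second option, the sketch below encodes the text as UTF-8 bytes and then as Base64 using java.util.Base64. The class name Base64XmlPayload, the element name comment, and the encoding attribute are assumptions made for this example; how the encoding is communicated to the consumer is up to your own format.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64XmlPayload {
    // Text -> UTF-8 bytes -> Base64, which only uses characters XML allows.
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Base64 -> bytes -> text, using the same agreed-upon encoding (UTF-8 here).
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String userInput = "hello" + (char) 0x1 + "world";
        // Hypothetical element and attribute names; the attribute tells the
        // consumer which charset to use when turning the bytes back into text.
        String element = "<comment encoding=\"UTF-8\">" + encode(userInput) + "</comment>";
        System.out.println(element);
    }
}
```

One caveat with this sketch: a String containing unpaired surrogates will not survive the UTF-8 round trip unchanged, because getBytes substitutes a replacement for them.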

If simply removing every character that XML cannot carry is acceptable, then this regex should do the trick:

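import java.util.regex.Pattern;
// Matches any run of characters outside the XML 1.0 Char production.
Pattern XML_INVALID_CHARS = Pattern.compile("[^\u0009\n\r\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF]+");
String xmlSafe = XML_INVALID_CHARS.matcher(input).replaceAll("");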
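
For the CDATA case from the question, the stripping step might be wrapped up roughly like this. The class CdataWriter and its methods are hypothetical names. The pattern is intended to be equivalent to the one above, only spelled with regex-level escapes. Splitting the sequence "]]>" across two CDATA sections is an addition of this sketch, not part of the answer: a CDATA section cannot contain that sequence, so user input that includes it would otherwise end the section early.

```java
import java.util.regex.Pattern;

final class CdataWriter {
    // Any run of characters outside the XML 1.0 Char production.
    private static final Pattern XML_INVALID_CHARS = Pattern.compile(
        "[^\\u0009\\n\\r\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]+");

    // Silently drops characters that XML 1.0 cannot represent.
    static String stripInvalid(String input) {
        return XML_INVALID_CHARS.matcher(input).replaceAll("");
    }

    // Wraps the cleaned text in a CDATA section, splitting any "]]>" so that
    // it cannot terminate the section prematurely.
    static String toCdata(String input) {
        String clean = stripInvalid(input).replace("]]>", "]]]]><![CDATA[>");
        return "<![CDATA[" + clean + "]]>";
    }

    public static void main(String[] args) {
        String userInput = "a" + (char) 0x1 + "b";
        System.out.println(toCdata(userInput));
    }
}
```

Keep in mind that this approach loses information: the consumer has no way to tell that a character was removed.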

Note that the spec, in a note, encourages document authors to be even more restrictive about the set of characters they use. A regex covering that would be a bit longer.
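
A stricter filter could look something like the sketch below. It swaps the regex for a plain code-point check, since the extra ranges make a pattern hard to read. The discouraged ranges are taken, as I understand them, from the note in section 2.2 of XML 1.0 (Fifth Edition): U+007F to U+0084, U+0086 to U+009F, U+FDD0 to U+FDEF, and the last two code points of each supplementary plane. Please verify them against the spec before relying on this. The "compatibility characters" that the same note mentions are not handled here, and the class name StrictXmlFilter is hypothetical.

```java
final class StrictXmlFilter {
    // Allowed by the XML 1.0 Char production.
    static boolean isAllowed(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Allowed but discouraged: control characters and permanently
    // unassigned code points (ranges assumed from the spec's note).
    static boolean isDiscouraged(int cp) {
        return (cp >= 0x7F && cp <= 0x84)
            || (cp >= 0x86 && cp <= 0x9F)
            || (cp >= 0xFDD0 && cp <= 0xFDEF)
            || (cp >= 0x10000 && (cp & 0xFFFE) == 0xFFFE);
    }

    // Keeps only code points that are allowed and not discouraged.
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(cp -> isAllowed(cp) && !isDiscouraged(cp))
             .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```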
