An invalid XML character (Unicode: 0xc) was found

Posted on 2024-11-02 19:50:31

Parsing an XML file with the Java DOM parser produces:

[Fatal Error] os__flag_8c.xml:103:135: An invalid XML character (Unicode: 0xc) was found in the element content of the document.
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0xc) was found in the element content of the document.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)

Comments (12)

美人如玉 2024-11-09 19:50:31

Some characters are disallowed in XML documents, even when you wrap the data in CDATA blocks.

If you generate the document yourself, you need to strip these characters out or encode the data some other way; XML 1.0 also rejects character references such as &#xC; for them. If you have been handed an erroneous document, strip these characters before trying to parse it.

See dolmen's answer in this thread: Invalid Characters in XML

He links to this section of the specification: http://www.w3.org/TR/xml/#charsets

Basically, all characters below 0x20 are disallowed, except 0x9 (TAB), 0xA (LF), and 0xD (CR).
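
To illustrate the CDATA point, here is a minimal sketch that feeds two tiny documents to the JDK's default DOM parser. The class and method names (CdataFormFeedDemo, parses) are made up for this example; only the standard JAXP calls are real.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class CdataFormFeedDemo {

    /** Returns true if the default DOM parser accepts the given XML text. */
    public static boolean parses(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them,
        // which keeps the "[Fatal Error]" line off stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // A TAB (0x9) inside CDATA is fine.
        System.out.println(parses("<a><![CDATA[tab\there]]></a>"));   // true
        // A form feed (0xC) is rejected even inside CDATA.
        System.out.println(parses("<a><![CDATA[form\ffeed]]></a>"));  // false
    }
}
```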

尘世孤行 2024-11-09 19:50:31
public String stripNonValidXMLCharacters(String in) {
    if (in == null || in.isEmpty()) return ""; // Nothing to filter.
    StringBuilder out = new StringBuilder(in.length()); // Holds the output.

    // Iterate by code point rather than by char: a char can never be >= 0x10000,
    // so a char-based loop would drop every character stored as a surrogate pair.
    for (int i = 0; i < in.length(); ) {
        int current = in.codePointAt(i);
        i += Character.charCount(current);
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))
            out.appendCodePoint(current);
    }
    return out.toString();
}
眼眸里的那抹悲凉 2024-11-09 19:50:31

This error appears whenever an invalid XML character turns up in the XML. If you open the file in Notepad++, such characters are displayed as VT, SOH, FF and the like; these are invalid XML characters. I am using XML version 1.0, and I clean text data with the following pattern before storing it in the database:

Pattern p = Pattern.compile("[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]+");
retunContent = p.matcher(retunContent).replaceAll("");

The backslashes are doubled so that the escapes are interpreted by the regex engine rather than by the Java compiler, and the supplementary range is written as \x{10000}-\x{10FFFF} because a \u escape takes exactly four hex digits.

This ensures that no invalid character ends up in the XML.
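
As a usage sketch, the pattern can be compiled once and wrapped in a small helper. The class name XmlTextSanitizer and the method sanitize are hypothetical; the regex is the XML 1.0 character range written with regex-level escapes.

```java
import java.util.regex.Pattern;

public class XmlTextSanitizer {

    // Matches runs of characters outside the XML 1.0 character range.
    // All escapes are regex escapes (doubled backslashes in the literal).
    private static final Pattern INVALID_XML10 = Pattern.compile(
        "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]+");

    /** Removes every character that XML 1.0 does not allow. */
    public static String sanitize(String text) {
        return text == null ? "" : INVALID_XML10.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Contains a form feed, an SOH, a VT and one emoji (a surrogate pair).
        String dirty = "page1\fpage2\u0001\u000B ok \uD83D\uDE00";
        // The control characters are removed; the emoji is kept.
        System.out.println(sanitize(dirty));
    }
}
```

Java's regex engine matches by code point, so a well-formed surrogate pair falls into the 10000-10FFFF range and is kept, while an unpaired surrogate is removed.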

痞味浪人 2024-11-09 19:50:31

The character 0x0C is invalid in XML 1.0 but is permitted in XML 1.1, where it has to be written as a character reference (&#xC;). So unless the XML file declares version 1.1 in its prolog, it is simply invalid and you should complain to the producer of the file.
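
A minimal sketch of the difference, assuming the parser in use supports XML 1.1 (the parser bundled with the JDK does, as far as I know). The names Xml11FormFeedDemo and rootText are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class Xml11FormFeedDemo {

    /** Parses the XML text and returns the text content of the root element. */
    public static String rootText(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setErrorHandler(new DefaultHandler()); // rethrow fatal errors quietly
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String body = "<a>x&#xC;y</a>"; // form feed written as a character reference

        // Declared as XML 1.1: the reference is expected to be accepted.
        String text = rootText("<?xml version=\"1.1\"?>" + body);
        System.out.println("1.1 accepted, length " + text.length());

        // Declared as XML 1.0: the same body is expected to be rejected.
        try {
            rootText("<?xml version=\"1.0\"?>" + body);
            System.out.println("1.0 accepted");
        } catch (SAXParseException e) {
            System.out.println("1.0 rejected: " + e.getMessage());
        }
    }
}
```

Note that a raw 0x0C byte in the file is still not allowed in XML 1.1; only the escaped form is.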

嘿看小鸭子会跑 2024-11-09 19:50:31

You can filter out all 'invalid' characters with a custom FilterReader class:

import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
// XMLChar is not part of the public JDK API; it is assumed here to be
// org.apache.xerces.util.XMLChar from Apache Xerces.

public class InvalidXmlCharacterFilter extends FilterReader {

    // Public (not protected) so the filter can be created from other packages.
    public InvalidXmlCharacterFilter(Reader in) {
        super(in);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int read = super.read(cbuf, off, len);
        if (read == -1) return read; // End of stream.

        // Overwrite every invalid character in the chunk that was just read.
        // Caveat: each UTF-16 char is tested on its own, so both halves of a
        // surrogate pair (characters above U+FFFF) are likely replaced as well.
        for (int i = off; i < off + read; i++) {
            if (!XMLChar.isValid(cbuf[i])) cbuf[i] = '?';
        }
        return read;
    }
}

And run it like this:

InputStream fileStream = new FileInputStream(xmlFile);
Reader reader = new BufferedReader(new InputStreamReader(fileStream, charset));
InvalidXmlCharacterFilter filter = new InvalidXmlCharacterFilter(reader);
InputSource is = new InputSource(filter);
xmlReader.parse(is);
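
If Xerces' XMLChar is not on the classpath, the same idea can be sketched with JDK classes only. This is a variant under my own assumptions, not the answer's code: the names Xml10FilterDemo, Xml10CharFilter, and parseFiltered are invented, and surrogates are deliberately passed through so that characters above U+FFFF survive.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class Xml10FilterDemo {

    /** Replaces characters outside the XML 1.0 range with '?' while reading. */
    static class Xml10CharFilter extends FilterReader {
        Xml10CharFilter(Reader in) {
            super(in);
        }

        private static boolean allowed(char c) {
            // Surrogates pass through unchanged; an unpaired surrogate
            // would therefore still reach the parser.
            return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || Character.isSurrogate(c)
                || (c >= 0xE000 && c <= 0xFFFD);
        }

        @Override
        public int read() throws IOException {
            int c = super.read();
            return (c == -1 || allowed((char) c)) ? c : '?';
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            int n = super.read(cbuf, off, len);
            for (int i = off; i < off + n; i++) { // loop body is skipped when n == -1
                if (!allowed(cbuf[i])) cbuf[i] = '?';
            }
            return n;
        }
    }

    /** Parses the text through the filter and returns the root element's text. */
    public static String parseFiltered(String xml) throws Exception {
        Reader filtered = new Xml10CharFilter(new StringReader(xml));
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(filtered));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // The form feed is replaced before the parser sees it.
        System.out.println(parseFiltered("<label>page1\fpage2</label>")); // page1?page2
    }
}
```

Replacing rather than deleting keeps the character count unchanged, so the line and column numbers the parser reports still match the original file.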
挽容 2024-11-09 19:50:31

For XML 1.0, the code points in these ranges (hexadecimal) are not allowed:

  • 0..8
  • B..C
  • E..1F
  • D800..DFFF
  • FFFE..FFFF

A regex that removes them (written as a Java string literal, hence the double quotes and doubled backslashes):

text.replaceAll("[\\x{0}-\\x{8}]|[\\x{B}-\\x{C}]|[\\x{E}-\\x{1F}]|[\\x{D800}-\\x{DFFF}]|[\\x{FFFE}-\\x{FFFF}]", "")

Note: if you are working with XML 1.1, you also need to remove these ranges:

  • 7F..84
  • 86..9F

尝蛊 2024-11-09 19:50:31

I just used this project and found it very handy: https://github.com/rwitzel/streamflyer

Use the InvalidXmlCharacterModifier, as its documentation describes.

For example:

// IOUtils is assumed to be org.apache.commons.io.IOUtils (Apache Commons IO).
public String stripNonValidXMLCharacters(final String in) throws IOException {

  // Replace each invalid XML 1.0 character with the empty string.
  final Modifier modifier = new InvalidXmlCharacterModifier("",
    InvalidXmlCharacterModifier.XML_10_VERSION);

  final ModifyingReader modifyingReader =
         new ModifyingReader(new StringReader(in), modifier);

  return IOUtils.toString(modifyingReader);
}
青衫负雪 2024-11-09 19:50:31

I faced a similar issue where the XML contained control characters. After looking into the code, I found that a deprecated class, StringBufferInputStream, was being used to read the string content.

http://docs.oracle.com/javase/7/docs/api/java/io/StringBufferInputStream.html

Its documentation states: "This class does not properly convert characters into bytes. As of JDK 1.1, the preferred way to create a stream from a string is via the StringReader class."

I changed it to ByteArrayInputStream and it worked fine.
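
A minimal sketch of the two usual replacements, with invented names (ParseFromStringDemo, viaBytes, viaReader). The byte-based variant assumes the XML text does not declare an encoding other than UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ParseFromStringDemo {

    private static DocumentBuilder newBuilder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    /** Encodes the string explicitly as UTF-8 and parses the bytes. */
    public static Document viaBytes(String xml) throws Exception {
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        return newBuilder().parse(new ByteArrayInputStream(utf8));
    }

    /** Hands the characters to the parser directly, with no byte conversion. */
    public static Document viaReader(String xml) throws Exception {
        return newBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a>caf\u00E9 \u20AC</a>"; // non-ASCII text survives both routes
        System.out.println(viaBytes(xml).getDocumentElement().getTextContent());
        System.out.println(viaReader(xml).getDocumentElement().getTextContent());
    }
}
```

The StringReader route avoids the character-to-byte conversion altogether, which is why the Javadoc recommends it.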

豆芽 2024-11-09 19:50:31

If you are reading a byte array into a String and then unmarshalling it into an object with JAXB, you can specify the "ISO-8859-1" encoding when creating the String from the byte array, like this:

String JAXBallowedString = new String(input, "ISO-8859-1"); // input is the byte[] that was read

This decodes every byte as a single ISO-8859-1 character, which JAXB can handle. Obviously, this workaround only helps with parsing the XML.

耀眼的星火 2024-11-09 19:50:31

All of these answers seem to assume that the user is generating the bad XML, rather than receiving it from gSOAP, which should know better!

于我来说 2024-11-09 19:50:31

Today I got a similar error:

Servlet.service() for servlet [remoting] in context with path [/***] threw exception [Request processing failed; nested exception is java.lang.RuntimeException: buildDocument failed.] with root cause
org.xml.sax.SAXParseException; lineNumber: 19; columnNumber: 91; An invalid XML character (Unicode: 0xc) was found in the value of attribute "text" and element is "label".

After my first encounter with the error, I retyped the entire line by hand so that no special character could creep in, and Notepad++ did not show any non-printable characters (black on white). Nevertheless, I got the same error over and over.

When I looked at what I had done differently from my predecessors, it turned out to be one additional space just before the closing /> (which I had heard was recommended for older parsers, but which should not make any difference according to the XML standard):

<label text="this label's text" layout="cell 0 0, align left" />

When I removed the space:

<label text="this label's text" layout="cell 0 0, align left"/>

everything worked just fine.

So it's definitely a misleading error message.

半枫 2024-11-09 19:50:31

org.xml.sax.SAXParseException carries the line number and column number of the invalid character.

To catch the exception and log these details:

try (InputStream inputStream = new FileInputStream(path)) {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document document = builder.parse(inputStream);
} catch (SAXParseException e) {
    // The exception knows where in the input the parser gave up.
    logger.error("XML parse error at line {}, column {}.", e.getLineNumber(), e.getColumnNumber(), e);
} catch (ParserConfigurationException | SAXException | IOException e) {
    // Other failures: parser setup, non-positional SAX errors, I/O problems.
    logger.error("Could not parse {}.", path, e);
}
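
A self-contained sketch of the same idea that can be run without a logger or a file. The names XmlErrorLocator and firstError are invented for the example; it returns the position as a "line:column" string instead of logging it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class XmlErrorLocator {

    /** Returns "line:column" of the first fatal parse error, or null if the text parses. */
    public static String firstError(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setErrorHandler(new DefaultHandler()); // rethrow fatal errors quietly
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return e.getLineNumber() + ":" + e.getColumnNumber();
        }
    }

    public static void main(String[] args) throws Exception {
        // The form feed sits on the second line of the input.
        System.out.println(firstError("<a>\n  x\fy</a>"));
        // A clean document yields null.
        System.out.println(firstError("<a>ok</a>"));
    }
}
```

With the position in hand, the offending line can be inspected in a hex editor, since most text editors render control characters invisibly.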