How do I check whether XML data is valid UTF-8 and detect invalid characters?

Posted 2024-10-19 15:58:36 · 1910 characters · 6 views · 0 comments

In my application I have to validate XML data and pick up all invalid characters (and put them in CDATA).

My question is quite simple: how do I do it?

I started with the Character.UnicodeBlock methods, but how does that work for characters encoded as several bytes, for example 'ï' or 'é'?

This is my code at the moment (for testing):

import java.io.UnsupportedEncodingException;
import java.lang.Character.UnicodeBlock;

public class Utf8Test {

    public static void main(String[] args) {
        try {
            byte[] data = "J'ai prïé et `".getBytes("UTF-8");

            // Print each byte cast directly to a char.
            System.out.print("Data: ");
            for (int i = 0; i < data.length; i++) {
                System.out.print((char) data[i]);
            }
            System.out.println("");

            UnicodeBlock myBlock = null;

            for (int i = 0; i < data.length; i++) {
                // Each single byte is widened to an int and passed as if it were a code point.
                int value = data[i];
                System.out.println("[" + i + " => '" + (char) data[i]
                        + "'] Is defined: " + Character.isDefined(value));
                try {
                    myBlock = Character.UnicodeBlock.of(value);
                } catch (IllegalArgumentException e) {
                    System.out.println("Count => " + Character.charCount(value));
                }
            }
        } catch (UnsupportedEncodingException e) {
            System.err.println("Unsupported encoding: " + e.getMessage());
        }
        System.out.println("Finished");
    }
}

And this is what I get when I run it:

Data: J'ai pr???? et `
[0 => 'J'] Is defined: true
[1 => '''] Is defined: true
[2 => 'a'] Is defined: true
[3 => 'i'] Is defined: true
[4 => ' '] Is defined: true
[5 => 'p'] Is defined: true
[6 => 'r'] Is defined: true
[7 => '?'] Is defined: false
Count => 1
[8 => '?'] Is defined: false
Count => 1
[9 => '?'] Is defined: false
Count => 1
[10 => '?'] Is defined: false
Count => 1
[11 => ' '] Is defined: true
[12 => 'e'] Is defined: true
[13 => 't'] Is defined: true
[14 => ' '] Is defined: true
[15 => '`'] Is defined: true
Finished

I'm trying to find a way to handle multi-byte characters as well, so that I only get a 'false' result for characters that are really invalid.

Maybe a Java library already exists for this?

I would be very grateful if someone could help me.
Thanks in advance.

Regards.


Comments (1)

你げ笑在眉眼 2024-10-26 15:58:36

A few things:

  • CDATA will not protect you from invalid characters: the junk data will still be illegal UTF-8 sequences and may be rejected by XML parsers.
  • Use a CharsetDecoder configured to report errors, together with an InputStreamReader, to validate character sequences; alternatively, check that the byte sequences are valid as described in RFC 2279 (see the UTF-8 definition), which has since been obsoleted by RFC 3629.
  • I wouldn't try parsing XML without an XML parser.
  • Character.isDefined expects a UTF-16 char (or an int holding a Unicode code point), not UTF-8 encoded bytes.
  • In Java 6, Character.isDefined is limited to code points defined in version 4.0 of the Unicode Standard; valid UTF-8 documents that use characters defined by later versions (version 6 is out now) may fail this check. The latest list of valid code points is defined in UnicodeData.txt.
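
As a rough illustration of the CharsetDecoder suggestion, here is a minimal sketch. It assumes Java 8 or later (for StandardCharsets and String.codePoints), and the class and method names (Utf8Check, decodeStrict, isValidUtf8, countUndefined) are made up for this example. The decoder is set to REPORT so that malformed byte sequences raise an exception instead of being silently replaced; Character.isDefined is then applied to the decoded code points rather than to raw bytes. It decodes an in-memory byte array directly instead of going through an InputStreamReader, which is a simplification.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {

    /** Decodes strictly; returns null if the bytes are not well-formed UTF-8. */
    public static String decodeStrict(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            return null; // malformed or unmappable input
        }
    }

    public static boolean isValidUtf8(byte[] data) {
        return decodeStrict(data) != null;
    }

    /** Counts decoded code points that Character.isDefined rejects. */
    public static int countUndefined(String text) {
        return (int) text.codePoints()
                .filter(cp -> !Character.isDefined(cp))
                .count();
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source independent of the file encoding.
        byte[] good = "J'ai pr\u00EF\u00E9 et `".getBytes(StandardCharsets.UTF_8);
        // 0xC3 starts a two-byte sequence but is followed by a plain ASCII byte.
        byte[] bad = {'a', (byte) 0xC3, 'b'};

        System.out.println(isValidUtf8(good));                  // true
        System.out.println(isValidUtf8(bad));                   // false
        System.out.println(countUndefined(decodeStrict(good))); // 0
    }
}
```

Note that this only checks that the bytes are well-formed UTF-8 and that the code points are known to the running JVM's Unicode tables; whether a character is allowed in an XML document is a separate question that is best left to an XML parser.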