How do I check whether XML data is valid UTF-8 and detect incorrect characters?

Posted on 2024-10-19 15:58:36

In my application I have to validate XML data and pick out all invalid characters (putting them in CDATA).

My question is quite simple... ^^ how do I do it?

I started with the Character.UnicodeBlock methods, but how does that work for characters encoded as several bytes, for example 'ï' or 'é'?

This is my code at the moment (for testing):

import java.io.UnsupportedEncodingException;
import java.lang.Character.UnicodeBlock;

public static void main(String[] args) {

    try {
        byte[] data = "J'ai prïé et `".getBytes("UTF-8");

        // Print each raw byte as if it were a char.
        System.out.print("Data: ");
        for (int i = 0; i < data.length; i++) {
            System.out.print((char) data[i]);
        }

        System.out.println("");

        UnicodeBlock myBlock = null;

        // Test each byte on its own (not each decoded character).
        for (int i = 0; i < data.length; i++) {
            System.out.println("[" + i + " => '" + (char) data[i]
                    + "'] Is defined: "
                    + Character.isDefined(new Byte(data[i]).intValue()));
            try {
                myBlock = Character.UnicodeBlock.of(new Byte(data[i])
                        .intValue());
            } catch (IllegalArgumentException e) {
                System.out.println("Count => "
                        + Character.charCount(new Byte(data[i])
                                .intValue()));
            }
        }
    } catch (UnsupportedEncodingException e) {
        System.err.println("Unsupported encoding: " + e.getMessage());
    }
    System.out.println("Finished");
}

And this is what I get when I run it:

Data: J'ai pr???? et `
[0 => 'J'] Is defined: true
[1 => '''] Is defined: true
[2 => 'a'] Is defined: true
[3 => 'i'] Is defined: true
[4 => ' '] Is defined: true
[5 => 'p'] Is defined: true
[6 => 'r'] Is defined: true
[7 => '?'] Is defined: false
Count => 1
[8 => '?'] Is defined: false
Count => 1
[9 => '?'] Is defined: false
Count => 1
[10 => '?'] Is defined: false
Count => 1
[11 => ' '] Is defined: true
[12 => 'e'] Is defined: true
[13 => 't'] Is defined: true
[14 => ' '] Is defined: true
[15 => '`'] Is defined: true
Finished

I'm trying to find a way to handle multi-byte characters too, and to get a 'false' result only for characters that really are invalid.

Maybe a Java library that does this already exists?

It would be very kind if someone could help me.
Thanks in advance.

Regards.

你げ笑在眉眼 2024-10-26 15:58:36

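A few points:

CDATA will not protect you from invalid characters; your junk data will still be illegal UTF-8 sequences and may be rejected by an XML parser.
Use an InputStreamReader to validate the character sequences, constructing it with a CharsetDecoder set to report malformed input (by default, malformed bytes are silently replaced with U+FFFD rather than reported); or check that the byte sequences are valid as described in RFC 2279, since obsoleted by RFC 3629 (see the UTF-8 definition).
I wouldn't try to parse XML without an XML parser.
Character.isDefined expects a UTF-16 code unit (char) or a Unicode code point (int), not a UTF-8 encoded byte.
In Java 6, Character.isDefined is limited to the Unicode Standard, version 4.0; valid UTF-8 documents that use characters defined by later versions of the standard may fail (version 6 is out now). The most recent list of valid code points is defined in UnicodeData.txt.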
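
As a rough sketch of the points above (not a definitive implementation): the bytes are first decoded with a strict CharsetDecoder, so malformed UTF-8 is reported instead of being silently replaced, and the decoded code points are then checked against the XML 1.0 Char production. The class and method names (Utf8XmlCheck, decodeStrictUtf8, isXmlChar, countIllegalXmlChars) are made up for illustration; only the java.nio.charset and java.lang calls are standard API. It assumes Java 8 or later.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
class Utf8XmlCheck {

    // Returns the decoded text, or null if the bytes are not well-formed UTF-8.
    static String decodeStrictUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // True if the code point is allowed by the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Counts the code points in already-decoded text that XML 1.0 does not allow.
    static long countIllegalXmlChars(String text) {
        return text.codePoints().filter(cp -> !isXmlChar(cp)).count();
    }

    public static void main(String[] args) {
        // Unicode escapes keep the sample independent of the source file encoding.
        byte[] sample = "J'ai pr\u00EF\u00E9 et `".getBytes(StandardCharsets.UTF_8);
        String text = decodeStrictUtf8(sample);
        System.out.println("Well-formed UTF-8: " + (text != null));
        System.out.println("Illegal XML chars: " + countIllegalXmlChars(text));

        // After decoding, each multi-byte sequence is a single code point.
        text.codePoints().forEach(cp -> System.out.printf(
                "U+%04X defined=%b%n", cp, Character.isDefined(cp)));

        // A lead byte with no continuation byte is malformed.
        byte[] truncated = {'a', (byte) 0xC3};
        System.out.println("Truncated input well-formed: "
                + (decodeStrictUtf8(truncated) != null));
    }
}
```

The point of decoding first is that Character.isDefined and the XML range check then see whole code points such as U+00EF, not the individual bytes 0xC3 and 0xAF. Whether to reject, escape, or drop an illegal code point is left to the application.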