How to check whether XML data is valid UTF-8 and detect incorrect characters?
In my application I have to validate XML data and pick out all invalid characters (and put them in CDATA).
My question is quite simple... ^^ How do I do it?
I started with the Character.UnicodeBlock methods, but how does this work for characters encoded as several bytes, for example 'ï' or 'é'?
This is my code at the moment (for testing):
public static void main(String[] args) {
    try {
        byte[] data = "J'ai prïé et `".getBytes("UTF-8");
        System.out.print("Data: ");
        for (int i = 0; i < data.length; i++) {
            System.out.print((char) data[i]);
        }
        System.out.println("");
        UnicodeBlock myBlock = null;
        for (int i = 0; i < data.length; i++) {
            System.out.println("[" + i + " => '" + (char) data[i]
                    + "'] Is defined: "
                    + Character.isDefined(new Byte(data[i]).intValue()));
            try {
                myBlock = Character.UnicodeBlock.of(new Byte(data[i])
                        .intValue());
            } catch (IllegalArgumentException e) {
                System.out
                        .println("Count => "
                                + Character.charCount(new Byte(data[i])
                                        .intValue()));
            }
        }
    } catch (UnsupportedEncodingException e) {
        System.err.println("Unsupported encoding: " + e.getMessage());
    }
    System.out.println("Finished");
}
And this is what I get when I run it:
Data: J'ai pr???? et `
[0 => 'J'] Is defined: true
[1 => '''] Is defined: true
[2 => 'a'] Is defined: true
[3 => 'i'] Is defined: true
[4 => ' '] Is defined: true
[5 => 'p'] Is defined: true
[6 => 'r'] Is defined: true
[7 => '?'] Is defined: false
Count => 1
[8 => '?'] Is defined: false
Count => 1
[9 => '?'] Is defined: false
Count => 1
[10 => '?'] Is defined: false
Count => 1
[11 => ' '] Is defined: true
[12 => 'e'] Is defined: true
[13 => 't'] Is defined: true
[14 => ' '] Is defined: true
[15 => '`'] Is defined: true
Finished
I'm trying to find a way to also detect multi-byte characters, so that only genuinely incorrect characters give a 'false' result.
Maybe a Java library already exists that does this?
It would be very kind if someone could help me.
Thanks in advance.
Regards.
Comments (1)
A few things:
- Character.isDefined expects a UTF-16 code unit (a char) or a Unicode code point (an int), not UTF-8 encoded bytes. Casting each byte of a multi-byte sequence to an int gives a negative value, which is why every byte of 'ï' and 'é' is reported as undefined.
- In Java 6, Character.isDefined is limited to code points defined in the Unicode Standard, version 4.0; there may be valid UTF-8 documents using characters defined by later versions of the standard for which this will fail (version 6 is out now). The latest list of valid code points is defined in UnicodeData.txt.
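To illustrate the points above, here is a minimal sketch of one possible approach: decode the bytes with a strict UTF-8 decoder first, then walk the resulting string by code point rather than by byte. It checks each code point against the Char production of XML 1.0 instead of calling Character.isDefined, so the result does not depend on which Unicode version the JVM knows about. The class and method names (Utf8XmlCheck, decodeStrict, isXml10Char, firstInvalidXmlChar) are made up for this example, and it assumes Java 7 or later for StandardCharsets (on Java 6, Charset.forName("UTF-8") should work instead).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
public class Utf8XmlCheck {

    /** Decodes the bytes as strict UTF-8; returns null if they are malformed. */
    static String decodeStrict(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            return null; // not valid UTF-8
        }
    }

    /** True if the code point is allowed by the Char production of XML 1.0. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Char index of the first code point not allowed in XML 1.0, or -1 if none. */
    static int firstInvalidXmlChar(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp); // 2 for supplementary characters
        }
        return -1;
    }

    public static void main(String[] args) {
        // Same text as in the question, with escapes to avoid source-encoding issues.
        byte[] good = "J'ai pr\u00EF\u00E9 et `".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {'a', (byte) 0xC3, 'b'}; // lead byte without its continuation byte

        String decoded = decodeStrict(good);
        System.out.println("good decodes: " + (decoded != null));
        System.out.println("first invalid XML char: " + firstInvalidXmlChar(decoded));
        System.out.println("bad decodes: " + (decodeStrict(bad) != null));
    }
}
```

With this split, a null from decodeStrict means the input is not valid UTF-8 at all, while a non-negative index from firstInvalidXmlChar points at a character that decoded correctly but is not permitted in an XML 1.0 document (for example most control characters below 0x20).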