Confusion about Java converting bytes to a String to compare against a byte order mark
I'm trying to recognize a UTF-8 BOM when reading a file. Of course, Java likes to deal with 16-bit chars, and the BOM is made up of eight-bit bytes.
My test code looks like:
public void testByteOrderMarks() {
    System.out.println("test byte order marks");
    byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'a', (byte) 'b', (byte) 'c'};
    String test = new String(bytes, Charset.availableCharsets().get("UTF-8"));
    System.out.printf("test len: %s value %s\n", test.length(), test);
    String three = test.substring(0, 3);
    System.out.printf("len %d >%s<\n", three.length(), three);
    for (int i = 0; i < test.length(); i++) {
        byte b = bytes[i];
        char c = test.charAt(i);
        System.out.printf("b: %s %x c: %s %x\n", (char) b, b, c, (int) c);
    }
}
and the result is:
test byte order marks
test len: 4 value ?abc
len 3 >?ab<
b: ? ef c: ? feff
b: ? bb c: a 61
b: ? bf c: b 62
b: a 61 c: c 63
I can't figure out why the length of "test" is 4 and not 6.
I can't figure out why I don't pick up each 8 bit byte to do the comparison.
Thanks
Comments (3)
Don't use characters when trying to detect the BOM. The BOM is two or three bytes long, so you should open a (File)InputStream, read the first two or three bytes, and inspect them.
Incidentally, the XML header (<?xml version=... encoding=...>) is pure ASCII, so it's safe to read that as a byte stream too (unless a BOM indicates that the file is saved with 16-bit characters rather than as UTF-8).
My solution (see DecentXML's XMLInputStreamReader) is to load the first few bytes of the file and analyze them. That gives me enough information to create a properly decoding Reader out of an InputStream.
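A minimal sketch of the byte-level approach described above. The class and method names (BomSniffer, detectBom, sniff) are hypothetical and are not taken from DecentXML; the sketch only checks the UTF-8 and UTF-16 marks and ignores UTF-32. It assumes the stream is wrapped in a PushbackInputStream with a pushback buffer of at least 3 bytes, so that non-BOM bytes can be returned to the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

// Hypothetical helper: looks for a BOM in the raw bytes, before any decoding.
class BomSniffer {

    /** Returns the charset name implied by a leading BOM, or null if there is none. */
    static String detectBom(byte[] head, int len) {
        // Mask with 0xFF because Java bytes are signed.
        if (len >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (len >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (len >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE"; // UTF-32LE starts the same way; not handled here
        }
        return null;
    }

    /** Reads up to 3 bytes, consumes a BOM if present, and pushes back the rest. */
    static String sniff(PushbackInputStream in) throws IOException {
        byte[] head = new byte[3];
        int len = 0;
        while (len < head.length) {
            int n = in.read(head, len, head.length - len);
            if (n < 0) {
                break; // end of stream before 3 bytes were available
            }
            len += n;
        }
        String charset = detectBom(head, len);
        int bomLen = (charset == null) ? 0 : (charset.equals("UTF-8") ? 3 : 2);
        if (len > bomLen) {
            in.unread(head, bomLen, len - bomLen);
        }
        return charset;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'b', 'c'};
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(bytes), 3);
        System.out.println(sniff(in));        // UTF-8
        System.out.println((char) in.read()); // a
    }
}
```

After sniff returns, the same stream can be handed to an InputStreamReader with the detected charset (or a default when the result is null), and the decoded text will no longer start with the BOM.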
A character is a character. The byte order mark is the Unicode character U+FEFF; in Java it is the char '\uFEFF'. There is no need to delve into bytes. Just read the first character of the file: if it equals '\uFEFF', it is the BOM; if it doesn't, the file was written without a BOM.
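A minimal sketch of the character-level check, using the same six bytes as the question. It also shows why the decoded string has length 4 rather than 6: the three bytes EF BB BF are the UTF-8 encoding of the single character U+FEFF, so they decode to one char. The class name BomCharDemo and the stripBom helper are hypothetical, and the approach assumes the bytes are already known to be UTF-8 (or another Unicode encoding chosen in advance), since the check happens after decoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: checks for the BOM on decoded text instead of raw bytes.
class BomCharDemo {
    static final char BOM = '\uFEFF';

    /** Drops a leading U+FEFF from already-decoded text, if present. */
    static String stripBom(String text) {
        return (!text.isEmpty() && text.charAt(0) == BOM) ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'b', 'c'};
        // Java's UTF-8 decoder keeps the BOM as an ordinary character.
        String test = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(test.length());         // 4
        System.out.println(test.charAt(0) == BOM); // true
        System.out.println(stripBom(test));        // abc
    }
}
```

The loop in the question pairs bytes[i] with test.charAt(i), which only lines up when every character is encoded as exactly one byte; here index 0 of the string corresponds to bytes 0 through 2.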
If you want to recognize a file with a BOM, a better solution (one that works for me) is to use Mozilla's encoding detector library: http://code.google.com/p/juniversalchardet/
That page describes how to use it:
If you are using Maven, the dependency is: