Confusion about converting bytes to a String in Java to compare against a byte order mark

Posted on 2024-12-01 14:54:56

I'm trying to recognize a UTF-8 BOM when reading a file. Of course, Java likes to deal with 16-bit chars, while the BOM is a sequence of 8-bit bytes.

My test code looks like:

import java.nio.charset.StandardCharsets;

public void testByteOrderMarks() {
    System.out.println("test byte order marks");

    // UTF-8 BOM (EF BB BF) followed by "abc"
    byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'a', (byte) 'b', (byte) 'c'};
    String test = new String(bytes, StandardCharsets.UTF_8);
    System.out.printf("test len: %s  value %s\n", test.length(), test);
    String three = test.substring(0, 3);
    System.out.printf("len %d  >%s<\n", three.length(), three);
    // Print the i-th byte next to the i-th decoded char
    for (int i = 0; i < test.length(); i++) {
        byte b = bytes[i];
        char c = test.charAt(i);
        System.out.printf("b: %s %x c: %s %x\n", (char) b, b, c, (int) c);
    }
}

and the result is:

test byte order marks
test len: 4 value ?abc
len 3 >?ab<
b: ? ef c: ? feff
b: ? bb c: a 61
b: ? bf c: b 62
b: a 61 c: c 63

I can't figure out why the length of "test" is 4 and not 6.
I also can't figure out why I don't get each 8-bit byte back as its own char to do the comparison.

Thanks


Comments (3)

夜未央樱花落 2024-12-08 14:54:57

Don't use characters when trying to figure out the BOM header. The BOM is two or three bytes (three for UTF-8: EF BB BF; two for UTF-16), so you should open a (File)InputStream, read the first few bytes, and process them.

Incidentally, the XML header (<?xml version=... encoding=...?>) is pure ASCII, so it's safe to load that as a byte stream, too (well, unless there is a BOM to indicate that the file is saved with 16-bit characters and not as UTF-8).

My solution (see DecentXML's XMLInputStreamReader) is to load the first few bytes of the file and analyze them. That gives me enough information to create a properly decoding Reader out of an InputStream.
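
As a minimal sketch of the byte-level approach (this is not the DecentXML implementation), the following peeks at the first three bytes of a stream and pushes back whatever is not part of a BOM. The class name BomSniffer and the method detectBom are made up for illustration, and only the UTF-8 and UTF-16 BOMs are checked; UTF-32 is ignored.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

// Hypothetical helper for illustration; not part of any library.
final class BomSniffer {

    /**
     * Returns the charset name implied by a leading BOM, or null if there is none.
     * On return the stream is positioned just after the BOM (or at the start if no BOM).
     * The stream must have a pushback buffer of at least 3 bytes.
     */
    static String detectBom(PushbackInputStream in) throws IOException {
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }

        String charset = null;
        int bomLen = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            charset = "UTF-8";
            bomLen = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            charset = "UTF-16BE";
            bomLen = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            charset = "UTF-16LE";
            bomLen = 2;
        }

        // Push back everything that was read but is not part of the BOM
        in.unread(head, bomLen, n - bomLen);
        return charset;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'b', 'c'};
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(data), 3);
        System.out.println(detectBom(in));      // UTF-8
        System.out.println((char) in.read());   // a
    }
}
```

After the call, the stream can be wrapped in an InputStreamReader using the returned charset name, falling back to a default of your choice when the result is null.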

玉环 2024-12-08 14:54:57

A character is a character. The Byte Order Mark is the Unicode character U+FEFF. In Java it is the character '\uFEFF'. There is no need to delve into bytes. Just read the first character of the file (decoded with the right charset, e.g. UTF-8), and if it matches '\uFEFF' it is the BOM. If it doesn't match, the file was written without a BOM.

private static final char BOM = '\uFEFF';    // Unicode Byte Order Mark

// readFirstLineOfFile is a placeholder for however you read the first line
String firstLine = readFirstLineOfFile("filename.txt");
if (!firstLine.isEmpty() && firstLine.charAt(0) == BOM) {
    // We have a BOM
} else {
    // No BOM present.
}
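
To connect this back to the question's output, here is a small sketch of the character-level view. The three bytes EF BB BF are the UTF-8 encoding of the single character U+FEFF, so decoding the six test bytes yields four chars, not six. The class name BomDecodeDemo and the helper stripBom are hypothetical names used only for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class BomDecodeDemo {
    static final char BOM = '\uFEFF';

    /** Returns s without a leading U+FEFF, or s unchanged if it has none. */
    static String stripBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        // Same bytes as in the question: UTF-8 BOM followed by "abc"
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'b', 'c'};
        String decoded = new String(bytes, StandardCharsets.UTF_8);

        System.out.println(decoded.length());          // 4
        System.out.println(decoded.charAt(0) == BOM);  // true
        System.out.println(stripBom(decoded));         // abc
    }
}
```

This is also why indexing the byte array and the String with the same loop counter, as the question's loop does, pairs up unrelated positions: char 1 is 'a', but byte 1 is still part of the BOM.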

一个人练习一个人 2024-12-08 14:54:57

If you want to recognize a file with a BOM, a better solution (and one that works for me) is to use juniversalchardet, a Java port of Mozilla's encoding detector library: http://code.google.com/p/juniversalchardet/
That page describes how to use it:

import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
  public static void main(String[] args) throws java.io.IOException {
    byte[] buf = new byte[4096];
    String fileName = "testFile.";

    // (1) Construct a detector; null means no listener is registered.
    UniversalDetector detector = new UniversalDetector(null);

    // try-with-resources closes the stream even if reading fails.
    try (java.io.FileInputStream fis = new java.io.FileInputStream(fileName)) {
      // (2) Feed data until the detector is done or the file ends.
      int nread;
      while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
      }
    }
    // (3) Tell the detector there is no more data.
    detector.dataEnd();

    // (4) Get the detected encoding name (null if nothing was detected).
    String encoding = detector.getDetectedCharset();
    if (encoding != null) {
      System.out.println("Detected encoding = " + encoding);
    } else {
      System.out.println("No encoding detected.");
    }

    // (5) Reset the detector before reusing it.
    detector.reset();
  }
}

If you are using Maven, the dependency is:

<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>