How can I identify the different encodings of a file that has no BOM and begins with non-ASCII characters?

Posted 2024-11-01 14:01:20


I ran into a problem trying to identify the encoding of a file without a BOM, particularly when the file begins with non-ASCII characters.

I found the following two topics about how to identify the encoding of a file:

Currently I have a class that identifies different file encodings (e.g. UTF-8, UTF-16, UTF-32, UTF-16 without BOM, etc.) like the following:

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class UnicodeReader extends Reader {
private static final int BOM_SIZE = 4;
private final InputStreamReader reader;

/**
 * Construct UnicodeReader
 * @param in Input stream.
 * @param defaultEncoding Default encoding to be used if BOM is not found,
 * or <code>null</code> to use system default encoding.
 * @throws IOException If an I/O error occurs.
 */
public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
    byte bom[] = new byte[BOM_SIZE];
    String encoding;
    int unread;
    PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
    int n = pushbackStream.read(bom, 0, bom.length);

    // Read ahead four bytes and check for BOM marks.
    if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
        encoding = "UTF-8";
        unread = n - 3;
    } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
        encoding = "UTF-16BE";
        unread = n - 2;
    } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
        encoding = "UTF-16LE";
        unread = n - 2;
    } else if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
        encoding = "UTF-32BE";
        unread = n - 4;
    } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
        encoding = "UTF-32LE";
        unread = n - 4;
    } else {
        // No BOM detected but still could be UTF-16
        int found = 0;
        for (int i = 0; i < 4; i++) {
            if (bom[i] == (byte) 0x00)
                found++;
        }

        if(found >= 2) {
            if(bom[0] == (byte) 0x00){
                encoding = "UTF-16BE";
            }
            else {
                encoding = "UTF-16LE";
            }
            unread = n;
        }
        else {
            encoding = defaultEncoding;
            unread = n;
        }
    }

    // Unread any bytes that are not part of a BOM.
    if (unread > 0) {
        pushbackStream.unread(bom, (n - unread), unread);
    }

    // Use given encoding.
    if (encoding == null) {
        reader = new InputStreamReader(pushbackStream);
    } else {
        reader = new InputStreamReader(pushbackStream, encoding);
    }
}

public String getEncoding() {
    return reader.getEncoding();
}

public int read(char[] cbuf, int off, int len) throws IOException {
    return reader.read(cbuf, off, len);
}

public void close() throws IOException {
    reader.close();
}

}

The code above works correctly in every case except a file that has no BOM and begins with non-ASCII characters. In that case the logic that checks whether the file might still be BOM-less UTF-16 fails, and the encoding falls back to the default (UTF-8).

Is there a way to detect the encoding of a file that has no BOM and begins with non-ASCII characters, especially for UTF-16 files without a BOM?

Thanks, any idea would be appreciated.
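One way to make the BOM-less UTF-16 check more robust than inspecting only four bytes is to sample a larger window and count zero bytes at even versus odd offsets: ASCII-range characters encode as 0x00,0xXX in UTF-16BE and 0xXX,0x00 in UTF-16LE, so one side dominates when the text contains any ASCII at all, even if the file starts with non-ASCII characters. This is only a sketch, not part of the class above; the thresholds are arbitrary assumptions, and it still fails for UTF-16 text that contains no ASCII-range characters anywhere.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Heuristic {
    /**
     * Guess whether a byte sample is BOM-less UTF-16 by counting zero
     * bytes at even vs. odd offsets. Returns "UTF-16BE", "UTF-16LE",
     * or null if the sample is undecided. The 10% floor and the 4x
     * dominance ratio are arbitrary assumptions.
     */
    public static String guessUtf16(byte[] sample, int len) {
        if (len < 2) return null;
        int evenZeros = 0, oddZeros = 0;
        for (int i = 0; i < len; i++) {
            if (sample[i] == 0) {
                if (i % 2 == 0) evenZeros++; else oddZeros++;
            }
        }
        int pairs = len / 2;
        // High bytes at even offsets => big-endian; at odd offsets => little-endian.
        if (evenZeros > pairs / 10 && evenZeros > 4 * oddZeros) return "UTF-16BE";
        if (oddZeros > pairs / 10 && oddZeros > 4 * evenZeros) return "UTF-16LE";
        return null;
    }

    public static void main(String[] args) {
        byte[] be = "Hello, world".getBytes(StandardCharsets.UTF_16BE);
        byte[] le = "Hello, world".getBytes(StandardCharsets.UTF_16LE);
        byte[] u8 = "Hello, world".getBytes(StandardCharsets.UTF_8);
        System.out.println(guessUtf16(be, be.length)); // UTF-16BE
        System.out.println(guessUtf16(le, le.length)); // UTF-16LE
        System.out.println(guessUtf16(u8, u8.length)); // null
    }
}
```

In practice you would read a few kilobytes into a `PushbackInputStream` (or mark/reset a `BufferedInputStream`) before falling back to the default encoding.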


Comments (3)

输什么也不输骨气 2024-11-08 14:01:20


Generally speaking, there is no way to know encoding for sure if it is not provided.

You may guess UTF-8 from specific bit patterns in the text (a lead byte with its high bits set followed by continuation bytes of the form 10xxxxxx), but it is still a guess.

UTF-16 is a hard one; you can successfully parse BE and LE on the same stream; both ways it will produce some characters (potentially meaningless text though).

Some code out there uses statistical analysis to guess the encoding by the frequency of the symbols, but that requires some assumptions about the text (i.e. "this is a Mongolian text") and frequencies tables (which may not match the text). At the end of the day this remains just a guess, and cannot help in 100% of cases.
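The UTF-8 guess this answer describes can be made concrete with the JDK's strict `CharsetDecoder`: if decoding with `CodingErrorAction.REPORT` succeeds, the bytes are at least *valid* UTF-8. As the answer says, that is still only a guess; plain ASCII passes trivially, and a few byte sequences from other encodings happen to be valid UTF-8 too. A minimal sketch:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    /** Returns true if the bytes decode as strictly valid UTF-8. */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8("héllo".getBytes(StandardCharsets.UTF_8)));      // true
        System.out.println(isValidUtf8(new byte[] {(byte) 0xC3, (byte) 0x28}));         // false: 0x28 is not a continuation byte
        System.out.println(isValidUtf8("héllo".getBytes(StandardCharsets.ISO_8859_1))); // false: bare 0xE9
    }
}
```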

随梦而飞# 2024-11-08 14:01:20


The best approach is not to try and implement this yourself. Instead use an existing library to do this; see Java : How to determine the correct charset encoding of a stream. For instance:

It should be noted that the best that can be done is to guess at the most likely encoding for the file. In the general case, it is impossible to be 100% sure that you've figured out the correct encoding; i.e. the encoding that was used when creating the file.


I would say these third-party libraries also cannot identify the encoding of the files I encountered [...] they could be improved to meet my requirement.

Alternatively, you could recognize that your requirement is exceedingly hard to meet ... and change it; e.g.

  • restrict yourself to a certain set of encodings,
  • insist that the person who provides / uploads the file correctly state what its encoding (or primary language) is, and/or
  • accept that your system is going to get it wrong a certain percent of the time, and provide the means whereby someone can correct incorrectly stated / guessed encodings.

Face the facts: this is a THEORETICALLY UNSOLVABLE problem.
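Following the "restrict yourself to a certain set of encodings" suggestion, one pragmatic sketch (the candidate list and its ordering below are assumptions, not part of this answer) is to try strict decoding with each candidate and take the first that succeeds. Order matters: put strict encodings first, because permissive ones such as ISO-8859-1 accept any byte sequence.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetGuesser {
    /**
     * Try each candidate charset in order and return the first one that
     * decodes the sample without error, or null if none does.
     */
    public static Charset firstThatDecodes(byte[] sample, Charset... candidates) {
        for (Charset cs : candidates) {
            try {
                cs.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(sample));
                return cs;
            } catch (CharacterCodingException e) {
                // Not this one; try the next candidate.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] latin1 = "héllo".getBytes(StandardCharsets.ISO_8859_1);
        Charset guess = firstThatDecodes(latin1,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(guess); // ISO-8859-1
    }
}
```

Note the limitation the first answer already pointed out: BOM-less UTF-16 decodes "successfully" as both UTF-16BE and UTF-16LE, so a pass here only narrows the field; it cannot pick between them.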

べ繥欢鉨o。 2024-11-08 14:01:20


If you are certain that it is a valid Unicode stream, it must be UTF-8 if it has no BOM (since a BOM is neither required nor recommended), and if it does have one, then you know what it is.

If it is just some random encoding, there is no way to know for certain. The best you can hope for is to be wrong only sometimes, since it is impossible to guess correctly in all cases.

If you can limit the possibilities to a very small subset, it is possible to improve the odds of your guess being right.

The only reliable way is to require the provider to tell you what they are providing. If you want complete reliability, that is your only choice. If you do not require reliability, then you guess — but sometimes guess wrong.

I have the feeling that you must be a Windows person, since the rest of us seldom have cause for BOMs in the first place. I know that I regularly deal with terabytes of text (on Macs, Linux, Solaris, and BSD systems), more than 99% of it UTF-8, and only twice have I come across BOM-laden text. I have heard that Windows people get stuck with it all the time, though. If true, this may or may not make your choices easier.
