Java text file encoding

Posted 2024-08-02 12:16:51

I have a text file and it can be ANSI (with ISO-8859-2 charset), UTF-8, UCS-2 Big or Little Endian.

Is there any way to detect the encoding of the file to read it properly?

Or is it possible to read a file without giving the encoding? (and it reads the file as it is)

(There are several programs that can detect and convert the encoding/format of text files.)

4 Answers

Bonjour°[大白 2024-08-09 12:16:51

Yes, there are a number of ways to do character encoding detection, specifically in Java. Take a look at jchardet, which is based on the Mozilla algorithm. There's also cpdetector and a project by IBM called ICU4j. I'd take a look at the latter, as it seems to be more reliable than the other two. They work by statistical analysis of the binary file; ICU4j will also give you a confidence level for the character encoding it detects, so you can use that in the case above. It works pretty well.

清晨说晚安 2024-08-09 12:16:51

UTF-8 and UCS-2/UTF-16 can be distinguished reasonably easily via a byte order mark at the start of the file. If this exists then it's a pretty good bet that the file is in that encoding - but it's not a dead certainty. You may well also find that the file is in one of those encodings, but doesn't have a byte order mark.

I don't know much about ISO-8859-2, but I wouldn't be surprised if almost every file is a valid text file in that encoding. The best you'll be able to do is check it heuristically. Indeed, the Wikipedia page talking about it would suggest that only byte 0x7f is invalid.

There's no way of reading a file "as it is" and yet getting text out - a file is a sequence of bytes, so you have to apply a character encoding in order to decode those bytes into characters.
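The BOM check described above needs no external library. A minimal pure-JDK sketch of the idea (the class and method names are mine, not from any library; UCS-2 big/little endian is treated as UTF-16BE/LE here, which is equivalent for BMP text):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    /** Returns the charset implied by a leading BOM, or null when no BOM is present. */
    static Charset fromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
            return StandardCharsets.UTF_8;    // EF BB BF
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
            return StandardCharsets.UTF_16BE; // FE FF
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
            return StandardCharsets.UTF_16LE; // FF FE
        return null; // no BOM: fall back to heuristics or a default such as ISO-8859-2
    }
}
```

As the answer notes, a null result does not prove the file is not UTF-8 or UTF-16; it only means there is no BOM to go on.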

千仐 2024-08-09 12:16:51

You can use ICU4J (http://icu-project.org/apiref/icu4j/)

Here is my code:

    import java.nio.file.Files;

    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;

    String charset = "ISO-8859-1"; // default charset, use whatever you want

    // Read the whole file into a byte array
    // (for very large files, CharsetDetector only needs the first few KB)
    byte[] data = Files.readAllBytes(file.toPath());

    CharsetDetector detector = new CharsetDetector();
    detector.setText(data);

    CharsetMatch cm = detector.detect();

    if (cm != null) {
        int confidence = cm.getConfidence();
        System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
        // Here you have the encoding name and the confidence.
        // In my case, if the confidence is > 50 I use the detected encoding,
        // else I keep the default value.
        if (confidence > 50) {
            charset = cm.getName();
        }
    }

Remember to add all the try/catch handling you need (Files.readAllBytes throws IOException).

I hope this works for you.
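Once a charset name has been picked, decoding the bytes needs no extra library. A minimal sketch, with a hard-coded charset name and byte array standing in for the values detected above:

```java
public class DecodeExample {
    public static void main(String[] args) throws Exception {
        // Assumed inputs: the detected charset name and the raw file bytes
        String charset = "UTF-8";
        byte[] data = {(byte) 0xC5, (byte) 0xBB, 'a', 'b'}; // "Żab" encoded in UTF-8

        // Decode the bytes using the detected charset
        String text = new String(data, charset);
        System.out.println(text); // prints "Żab"
    }
}
```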

花开柳相依 2024-08-09 12:16:51

If your text file is a properly created Unicode text file then the Byte Order Mark (BOM) should tell you all the information you need. See here for more details about BOM

If it's not then you'll have to use some encoding detection library.
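Short of a full detection library, one simple heuristic for the no-BOM case is a strict UTF-8 decode: accented ISO-8859-2 text is very unlikely to happen to be valid UTF-8. A pure-JDK sketch (the class and method names are mine; note a pure-ASCII file passes this check too, since ASCII is valid UTF-8 and ISO-8859-2 alike):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    /** Returns true if the bytes decode as valid UTF-8; if not, an 8-bit
     *  charset such as ISO-8859-2 is the likelier candidate. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

The key is CodingErrorAction.REPORT: the default behavior silently replaces malformed input, which would defeat the check.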
