Java text file encoding

Posted 2024-08-02 12:16:51

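I have a text file that can be ANSI (using the ISO-8859-2 charset), UTF-8, or UCS-2 big or little endian.

Is there a way to detect the file's encoding so that I can read it correctly?

Or is it possible to read the file without specifying an encoding, so that it is read as is?

(There are several programs that can detect and convert the encoding/format of text files.)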


Bonjour°[大白 2024-08-09 12:16:51

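Yes, there are a number of ways to do character encoding detection, specifically in Java. Take a look at jchardet, which is based on the Mozilla algorithm. There are also cpdetector and an IBM project called ICU4J. I would look at the latter, as it seems to be more reliable than the other two. All of them work by statistical analysis of the file's bytes. ICU4J also reports a confidence level for the encoding it detects, so you can use that to decide between the candidates in your case. It works pretty well.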


清晨说晚安 2024-08-09 12:16:51

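UTF-8 and UCS-2/UTF-16 can be told apart fairly easily by a byte order mark at the start of the file. If one is present, it is a pretty good bet that the file is in that encoding, but it is not a dead certainty. You may also find that the file is in one of those encodings but has no byte order mark.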
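To make the byte order mark check concrete, here is a minimal sketch using only the JDK. The class name `BomSniffer` and its `detect` method are hypothetical names chosen for illustration; the byte patterns are the standard BOMs for UTF-8 (EF BB BF), UTF-16 big endian (FE FF) and UTF-16 little endian (FF FE). It only covers the encodings mentioned in the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical helper: guesses an encoding from a leading byte order mark. */
final class BomSniffer {

    /** Returns the charset implied by a BOM at the start of data, if there is one. */
    static Optional<Charset> detect(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            // FF FE 00 00 would be a UTF-32LE BOM; that case is deliberately
            // ignored here because the question only involves UTF-8 and UCS-2.
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        // No BOM: the file may still be UTF-8 or UTF-16, we just cannot tell from this.
        return Optional.empty();
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

An empty result does not mean the file is ISO-8859-2. It only means this check gave no answer, so a heuristic or a detection library still has to decide.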

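I don't know much about ISO-8859-2, but I wouldn't be surprised if almost every file were a valid text file in that encoding. The best you can do is check it heuristically. Indeed, the Wikipedia page about it suggests that only byte 0x7f is invalid.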
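One common heuristic (a swapped-in technique, not something the answer above prescribes) is to decode strictly as UTF-8 first and fall back to ISO-8859-2 only when the bytes are not valid UTF-8. The sketch below assumes the runtime provides the `ISO-8859-2` charset, and `FallbackDecoder` is a hypothetical name. It does not handle UTF-16 without a byte order mark.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: strict UTF-8 first, ISO-8859-2 as the fallback. */
final class FallbackDecoder {

    static String decode(byte[] data) {
        try {
            // REPORT makes the decoder throw on invalid input instead of
            // silently substituting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8. Almost any byte sequence decodes as ISO-8859-2,
            // so this branch cannot fail, but it can still be the wrong guess.
            return new String(data, Charset.forName("ISO-8859-2"));
        }
    }
}
```

This works because non-ASCII text in ISO-8859-2 rarely happens to form valid UTF-8 sequences, while pure ASCII text decodes identically either way. It remains a guess: a short ISO-8859-2 file can by coincidence be valid UTF-8.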

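There is no way to read a file "as is" and get text out of it: a file is a sequence of bytes, so you have to apply a character encoding to decode those bytes into characters.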
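A small illustration of why an encoding is always involved: the same two bytes decode to different text depending on the charset applied. The class and method names are hypothetical, and the sketch assumes the runtime provides `ISO-8859-2`.

```java
import java.nio.charset.Charset;

/** Hypothetical demo: identical bytes, three different decodings. */
final class SameBytesDifferentText {

    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    /** Formats a string as a list of code points, e.g. "U+0139 U+0104". */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC5, (byte) 0xA1};
        System.out.println(codePoints(decode(bytes, "UTF-8")));      // U+0161
        System.out.println(codePoints(decode(bytes, "ISO-8859-2"))); // U+0139 U+0104
        System.out.println(codePoints(decode(bytes, "UTF-16BE")));   // U+C5A1
    }
}
```

All three results are valid text, which is why the bytes alone cannot tell you which reading was intended.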


千仐 2024-08-09 12:16:51

You can use ICU4J (http://icu-project.org/apiref/icu4j/)

Here is my code:

            // Requires: com.ibm.icu.text.CharsetDetector, com.ibm.icu.text.CharsetMatch,
            // java.nio.file.Files. "file" is the java.io.File to inspect.

            // Default charset, used when detection is not confident enough
            String charset = "ISO-8859-1";

            // Read the whole file into a byte array. A single
            // FileInputStream.read(byte[]) call is not guaranteed to fill the
            // array, and the stream was never closed, so use readAllBytes.
            byte[] data = Files.readAllBytes(file.toPath());

            CharsetDetector detector = new CharsetDetector();
            detector.setText(data);

            // Best match, or null if nothing matched
            CharsetMatch cm = detector.detect();

            if (cm != null) {
                int confidence = cm.getConfidence();
                System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
                // Here you have the encoding name and the confidence.
                // In my case, if the confidence is > 50 I use the detected
                // encoding, otherwise I keep the default value.
                if (confidence > 50) {
                    charset = cm.getName();
                }
            }

Remember to add all the try/catch blocks that are needed.

I hope this works for you.
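Once a charset name has been chosen, the file still has to be decoded with it. The following is a minimal sketch of that last step, using only the JDK; `ReadWithCharset` and `readAll` are hypothetical names, and the charset name is assumed to be one the runtime supports.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: decodes a whole file with a charset chosen earlier. */
final class ReadWithCharset {

    static String readAll(Path path, String charsetName) throws IOException {
        Charset cs = Charset.forName(charsetName);
        // Malformed input is replaced with U+FFFD rather than reported.
        String text = new String(Files.readAllBytes(path), cs);
        // Java's UTF-8, UTF-16BE and UTF-16LE decoders keep a leading byte
        // order mark as the character U+FEFF, so drop it if it is there.
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        return text;
    }
}
```

Reading everything into memory matches the detection code above. For large files, a `BufferedReader` from `Files.newBufferedReader(path, cs)` would be the streaming alternative.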
