Java text file encoding
I have a text file that can be ANSI (with the ISO-8859-2 charset), UTF-8, or UCS-2 big- or little-endian.
Is there any way to detect the encoding of the file so that I can read it properly?
Or is it possible to read a file without specifying the encoding, so that it is read as it is?
(There are several programs that can detect and convert the encoding/format of text files.)
Comments (4)
Yes, there are a number of ways to do character encoding detection in Java. Take a look at jchardet, which is based on the Mozilla algorithm. There's also cpdetector and a project by IBM called ICU4J. I'd look at the latter first, as it seems to be more reliable than the other two. They all work by statistical analysis of the file's bytes. ICU4J also reports a confidence level for the encoding it detects, which you can use in the case above. It works pretty well.
UTF-8 and UCS-2/UTF-16 can be distinguished reasonably easily via a byte order mark at the start of the file. If one exists, it's a pretty good bet that the file is in that encoding - but it's not a dead certainty. You may well also find that the file is in one of those encodings but doesn't have a byte order mark.
I don't know much about ISO-8859-2, but I wouldn't be surprised if almost every file is a valid text file in that encoding. The best you'll be able to do is check it heuristically. Indeed, the Wikipedia page on it suggests that only byte 0x7f is invalid.
There's no such thing as reading a file "as it is" and still getting text out - a file is a sequence of bytes, so you have to apply a character encoding in order to decode those bytes into characters.
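The byte order mark check described above can be sketched with the standard library alone. This is a minimal illustration, not a full detector: the class name `BomSniffer` and its `detect` method are hypothetical, and it only covers the encodings mentioned in the question (a file starting with FF FE 00 00 could also be UTF-32LE, which is ignored here).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical helper: guesses a charset from a leading byte order mark, if any. */
final class BomSniffer {
    private BomSniffer() {}

    /**
     * Returns the charset implied by a BOM at the start of {@code head},
     * or an empty Optional when no known BOM is present.
     * A match is only a strong hint, not proof of the encoding.
     */
    static Optional<Charset> detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            // Assumption: UTF-32LE (FF FE 00 00) is out of scope for this sketch.
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    }

    /** Compares the first bytes of data against the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would read the first few bytes of the file, pass them to `BomSniffer.detect`, and fall back to a heuristic when the result is empty. Note that Java's UTF-8, UTF-16BE and UTF-16LE decoders do not strip the BOM, so the decoded text starts with U+FEFF unless the BOM bytes are skipped first.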
You can use ICU4J (http://icu-project.org/apiref/icu4j/)
Here is my code:
Remember to add all the try/catch blocks it needs.
I hope this works for you.
If your text file is a properly created Unicode text file, then the byte order mark (BOM) should tell you everything you need. See here for more details about the BOM.
If it's not, then you'll have to use an encoding detection library.
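For files without a BOM, a crude standard-library fallback (not a substitute for a real detection library) is to attempt a strict UTF-8 decode and treat failure as a sign of a legacy single-byte encoding. The sketch below assumes the only alternative is ISO-8859-2, as in the question, and that the running JVM provides that charset. The names `FallbackDecoder`, `decodeStrict` and `decodeUtf8OrLatin2` are hypothetical. It does not handle UTF-16 without a BOM.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: strict UTF-8 decode with an ISO-8859-2 fallback. */
final class FallbackDecoder {
    private FallbackDecoder() {}

    /** Decodes strictly: throws on malformed input instead of substituting U+FFFD. */
    static String decodeStrict(byte[] data, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data))
                .toString();
    }

    /** Tries UTF-8 first; if the bytes are not valid UTF-8, decodes them as ISO-8859-2. */
    static String decodeUtf8OrLatin2(byte[] data) {
        try {
            return decodeStrict(data, StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            // Assumption: ISO-8859-2 is the only other candidate and is available in this JVM.
            return new String(data, Charset.forName("ISO-8859-2"));
        }
    }
}
```

This works reasonably well in practice because non-ASCII text in ISO-8859-2 rarely happens to form valid UTF-8 sequences, but it remains a heuristic: pure ASCII content decodes identically either way, and a short Latin-2 file can occasionally pass the UTF-8 check.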