How to determine text encoding
I know that a UTF file can carry a BOM that identifies its encoding, but what about other encodings, which give no such clue? I don't know how to guess those.
I am a new Java programmer. I have written code that guesses the UTF encoding from the BOM, but I am stuck on the other encodings. How do I guess them?
Can anybody help? Thanks in advance.
Comments (3)
This question is a duplicate of several previous ones. There are at least two libraries for Java that attempt to guess the encoding (although keep in mind that there is no way to guess right 100% of the time).
Of course, if you know the encoding will only be one of three or four options, you might be able to write a more accurate guessing algorithm.
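As a rough illustration of the "three or four options" case, here is a minimal sketch that tries each candidate charset with a strict decoder and returns the first one that accepts the bytes. The class and method names (`CandidateCharsetGuesser`, `firstThatDecodes`) are made up for this example. The approach assumes you list the strictest candidates first: a single-byte charset such as ISO-8859-1 accepts any byte sequence, so it only works as a last fallback.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of any library.
final class CandidateCharsetGuesser {

    /**
     * Returns the first candidate whose strict decoder accepts every byte,
     * or Optional.empty() if none does. A match means "valid in this
     * charset", not "certainly written in this charset".
     */
    static Optional<Charset> firstThatDecodes(byte[] data, List<Charset> candidates) {
        for (Charset cs : candidates) {
            try {
                cs.newDecoder()
                  // REPORT makes the decoder throw instead of substituting U+FFFD.
                  .onMalformedInput(CodingErrorAction.REPORT)
                  .onUnmappableCharacter(CodingErrorAction.REPORT)
                  .decode(ByteBuffer.wrap(data));
                return Optional.of(cs);
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset; try the next one.
            }
        }
        return Optional.empty();
    }
}
```

For example, calling it with the candidates US-ASCII, UTF-8, ISO-8859-1 (in that order) on the bytes `63 61 66 E9` rejects the first two and settles on ISO-8859-1, because `E9` is not ASCII and is an incomplete sequence in UTF-8.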
The short answer is: you cannot.
Even in UTF-8, the BOM is entirely optional, and it is often recommended not to use it, since many applications do not handle it properly and simply display it as if it were a printable character. The original purpose of the byte order mark was to indicate the endianness of UTF-16 files.
That said, most applications that handle Unicode implement some sort of guessing algorithm: they read the beginning of the file and look for certain signatures.
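To make the "look for certain signatures" step concrete, here is a minimal sketch that checks the first bytes of a file against the standard Unicode BOM byte sequences. The class name `BomSniffer` and the choice to return a charset name as a string are assumptions for illustration only. An empty result means "no BOM found", which says nothing about the actual encoding.

```java
import java.util.Optional;

// Hypothetical helper, not part of any library.
final class BomSniffer {

    /** Returns the encoding named by a leading BOM, if there is one. */
    static Optional<String> detect(byte[] head) {
        // The 4-byte UTF-32 marks are tested first because FF FE 00 00 also
        // begins with the UTF-16LE mark FF FE. (Those four bytes could in
        // theory be UTF-16LE text starting with U+0000; that is rare.)
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return Optional.of("UTF-32BE");
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return Optional.of("UTF-32LE");
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return Optional.of("UTF-8");
        if (startsWith(head, 0xFE, 0xFF)) return Optional.of("UTF-16BE");
        if (startsWith(head, 0xFF, 0xFE)) return Optional.of("UTF-16LE");
        return Optional.empty();
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values; Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Passing in only the first four bytes of the file is enough, since no BOM is longer than that.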
If you don't know the encoding and don't have any indicators (like a BOM), it's not always possible to accurately "guess" the encoding. There are, however, some clues that can give you hints.
For example, an ISO-8859-1 file will (usually) not contain any 0x00 bytes, whereas a UTF-16 file containing mostly Latin text has loads of them.
The most common solution is to let the user select the encoding if you cannot detect it.
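The 0x00 hint mentioned above could be turned into a rough heuristic along the following lines. This is only a sketch: the class name `NulByteHeuristic`, the `Guess` values, and the thresholds (one position parity must hold at least a quarter of all bytes as NUL, and more than twice as many as the other parity) are arbitrary assumptions, and the idea only works for UTF-16 text that is mostly Latin characters.

```java
// Hypothetical helper, not part of any library.
final class NulByteHeuristic {

    enum Guess { NO_NULS, LIKELY_UTF16_BE, LIKELY_UTF16_LE, UNKNOWN }

    /** Guesses from where the 0x00 bytes fall in a sample of the file. */
    static Guess guess(byte[] sample) {
        int evenZeros = 0; // NULs at offsets 0, 2, 4, ...
        int oddZeros = 0;  // NULs at offsets 1, 3, 5, ...
        for (int i = 0; i < sample.length; i++) {
            if (sample[i] == 0) {
                if (i % 2 == 0) evenZeros++; else oddZeros++;
            }
        }
        if (evenZeros + oddZeros == 0) {
            // Consistent with ISO-8859-1, UTF-8 and other byte-oriented encodings.
            return Guess.NO_NULS;
        }
        // In big-endian UTF-16 the high (often zero) byte comes first,
        // so NULs pile up at even offsets; little-endian is the reverse.
        if (evenZeros > oddZeros * 2 && evenZeros * 4 >= sample.length) {
            return Guess.LIKELY_UTF16_BE;
        }
        if (oddZeros > evenZeros * 2 && oddZeros * 4 >= sample.length) {
            return Guess.LIKELY_UTF16_LE;
        }
        return Guess.UNKNOWN;
    }
}
```

A result of `UNKNOWN`, or any guess the user disagrees with, is where falling back to asking the user for the encoding makes sense.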