Getting the encoding of a file in Java
Possible Duplicate:
Java : How to determine the correct charset encoding of a stream
Users upload a CSV file to the server, and the server needs to check whether the file is encoded as UTF-8. If it is not, the server should tell the user that the uploaded file has the wrong encoding. How can I detect whether the uploaded file is UTF-8 encoded? The back end is written in Java. Any suggestions?
Comments (3)
At least in the general case, there's no way to be certain what encoding is used for a file -- the best you can do is a reasonable guess based on heuristics. You can eliminate some possibilities, but at best you're narrowing down the possibilities without confirming any one. For example, most of the ISO 8859 variants allow any byte value (or pattern of byte values), so almost any content could be encoded with almost any ISO 8859 variant (and I'm only using "almost" out of caution, not any certainty that you could eliminate any of the possibilities).
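To illustrate why an ISO 8859 variant can rarely be ruled out, here is a minimal sketch (the class and method names are made up for this example). It takes bytes that were written as UTF-8 and decodes them as ISO-8859-1. Java's ISO-8859-1 decoder maps every byte value to a character, so the decode succeeds and simply yields different text.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Latin1AcceptsAnything {

    /** Decodes the bytes as ISO-8859-1. Every byte value maps to a character, so this never fails. */
    static String decodeAsLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    /** Encodes the text as UTF-8, standing in for the bytes of an uploaded file. */
    static byte[] utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

For example, `decodeAsLatin1(utf8Bytes("\u00e9"))` returns the two characters `"\u00c3\u00a9"` (the familiar "Ã©" mojibake) rather than raising an error, so nothing in the bytes themselves says which reading was intended.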
You can, however, make some reasonable guesses. For example, if a file starts with the three bytes of the UTF-8 encoded BOM (EF BB BF), it's probably safe to assume it really is UTF-8. Likewise, if you see sequences like 110xxxxx 10xxxxxx, it's a pretty fair guess that what you're seeing is encoded with UTF-8. You can eliminate the possibility that something is (correctly) UTF-8 encoded if you ever see a sequence like 110xxxxx 110xxxxx. (110xxxxx is the lead byte of a two-byte sequence, which must be followed by a continuation byte of the form 10xxxxxx, not by another lead byte, in properly encoded UTF-8.)
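One way to apply these checks in Java, as a rough sketch rather than a definitive implementation: look for the BOM by hand, and let the standard `CharsetDecoder` do the lead-byte/continuation-byte validation by telling it to report malformed input instead of silently replacing it. The class and method names (`Utf8Check`, `hasUtf8Bom`, `isValidUtf8`) are made up for this example, and it assumes the upload is small enough to hold in a byte array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Utf8Check {

    /** True if the data starts with the UTF-8 encoded BOM, EF BB BF. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /**
     * True if the whole array decodes as well-formed UTF-8.
     * A false result rules UTF-8 out; a true result only means
     * UTF-8 has not been ruled out.
     */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Throw on bad sequences instead of substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Note that plain ASCII passes `isValidUtf8`, since ASCII is a subset of UTF-8, and that a file in some other encoding can pass by coincidence. For text with a fair number of non-ASCII characters that is unlikely, which is why rejecting the upload only when this check fails is usually a reasonable policy.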
You can try to guess the encoding using a third-party library, for example: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding
Well, you can't. You could show a kind of "preview" (or should I say review?) with some sample data from the file so the user can check whether it looks okay, perhaps with the option of selecting different encodings to help determine the correct one.
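A minimal sketch of such a preview, assuming the server already has the upload as a byte array: decode the first few bytes once per candidate charset and show each result to the user. The names `EncodingPreview` and `previews` are invented for this example, and which charsets to offer is up to you.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class EncodingPreview {

    /**
     * Decodes at most maxBytes of the data with each candidate charset.
     * Returns charset name -> decoded sample, in the order given.
     * Cutting at maxBytes may split a multi-byte character, in which
     * case the sample ends with a replacement character.
     */
    static Map<String, String> previews(byte[] data, int maxBytes, List<Charset> candidates) {
        int len = Math.min(data.length, maxBytes);
        Map<String, String> out = new LinkedHashMap<>();
        for (Charset cs : candidates) {
            out.put(cs.name(), new String(data, 0, len, cs));
        }
        return out;
    }
}
```

Calling it with, say, `StandardCharsets.UTF_8` and `StandardCharsets.ISO_8859_1` as candidates gives two samples of the same bytes; the user picks the one where the accented characters look right.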