How do I detect the character encoding of a file?
Our application receives files from our users, and those files must be validated to confirm they use one of the encodings we support (i.e. UTF-8, Shift-JIS, EUC-JP). Once a file is validated, we also need to save it in our system, along with its encoding as metadata.
Currently we're using JCharDet (a Java port of Mozilla's character detector), but there are some Shift-JIS characters that it seems to fail to detect as valid Shift-JIS.
Any ideas what else we can use?
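Since the goal here is to accept only a fixed whitelist of encodings, strict decoding with the JDK's own `CharsetDecoder` may be enough on its own, independent of any detector library: a file either decodes cleanly under a candidate charset or it doesn't. A minimal sketch (class and method names are my own, not from any library):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class EncodingValidator {

    // True if the bytes decode cleanly under the named charset.
    // A clean decode proves the file is *valid* in that encoding, not that
    // it is the encoding the user intended: many byte sequences are valid
    // in several encodings at once, so several candidates may pass.
    static boolean isValid(byte[] data, String charsetName) {
        try {
            Charset.forName(charsetName)
                   .newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Check a sample against each whitelisted encoding.
        byte[] data = "テスト".getBytes("UTF-8");
        for (String cs : new String[] {"UTF-8", "Shift_JIS", "EUC-JP"}) {
            System.out.println(cs + " valid: " + isValid(data, cs));
        }
    }
}
```

If more than one whitelisted encoding accepts the bytes, you would still need a detector (or a policy such as "prefer UTF-8") to decide which one to record as metadata.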
2 Answers
ICU4J's CharsetDetector will help you.
By the way, what kind of characters caused the error, and what kind of error was it? I suspect ICU4J could have the same problem, depending on the characters and the error.
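For reference, basic usage of ICU4J's detector looks roughly like this (a sketch assuming the `com.ibm.icu:icu4j` jar is on the classpath; `detectAll()` returns candidates ranked by a 0-100 confidence score, which lets you check whether Shift-JIS appears at all rather than trusting only the top guess):

```java
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class Icu4jDetectorDemo {
    public static void main(String[] args) throws Exception {
        // Sample Japanese text encoded as Shift_JIS; in practice the bytes
        // would come from the uploaded file.
        byte[] data = "これはテストです。文字コード判定のサンプル文です。".getBytes("Shift_JIS");

        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);

        // detect() returns only the best match; detectAll() lists every
        // candidate with its confidence.
        for (CharsetMatch match : detector.detectAll()) {
            System.out.println(match.getName()
                    + " (confidence " + match.getConfidence() + ")");
        }
    }
}
```

Note that detection on short inputs is statistical guesswork in any library, so confidence scores on small files should be treated with caution.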
Apache Tika is a content analysis toolkit that is mainly useful for determining file types, as opposed to encoding schemes, but it does return content-encoding information for text file types. I don't know if its algorithms are as advanced as JCharDet's, but it might be worth a try...
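If it helps, Tika ships its own copy of a charset detector under `org.apache.tika.parser.txt`, with an API very similar to ICU4J's. A sketch, assuming the Tika parsers jar is on the classpath:

```java
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

public class TikaEncodingDemo {
    public static void main(String[] args) throws Exception {
        // In practice the bytes would come from the uploaded file.
        byte[] data = "日本語のサンプルテキストです。".getBytes("EUC-JP");

        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);

        // Best single guess; detectAll() is also available for the
        // full ranked list of candidates.
        CharsetMatch match = detector.detect();
        System.out.println(match.getName()
                + " (confidence " + match.getConfidence() + ")");
    }
}
```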