How can I detect the character encoding of a file?

Posted on 2024-09-18 05:17:37

Our application receives files from users. Each file must be validated as being in one of the encodings we support (UTF-8, Shift-JIS, or EUC-JP). Once a file is validated, we also need to save it in our system and store its encoding as metadata.

Currently we're using JCharDet (a Java port of Mozilla's character detector), but it fails to recognize some valid Shift-JIS characters as Shift-JIS.

Any ideas on what else we could use?

Comments (2)

夏见 2024-09-25 05:17:38

ICU4J's CharsetDetector will help you.

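// Needs com.ibm.icu.text.CharsetDetector, com.ibm.icu.text.CharsetMatch and java.io.*
String charsetName;
try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
    CharsetDetector cd = new CharsetDetector();
    cd.setText(in); // setText needs a stream that supports mark/reset
    CharsetMatch match = cd.detect(); // null when nothing matches
    charsetName = (match != null) ? match.getName() : null;
}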

By the way, which characters caused the failure, and what kind of error did you get? I suspect ICU4J could have the same problem, depending on the characters and the error.
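Since any detector only makes a statistical guess, and the question is really about validating against three known encodings, a stricter check may be a useful complement. Below is a minimal sketch of that different technique, using only the JDK: try to decode the bytes with each candidate charset and reject any that reports malformed or unmappable input. The class and method names (`EncodingValidator`, `decodesCleanly`, `plausibleEncodings`) are made up for illustration, and whether `Shift_JIS` and `EUC-JP` are available depends on the charsets installed with your JDK.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

public class EncodingValidator {
    // Candidate names taken from the question; this list is an assumption,
    // adjust it to the encodings you actually accept.
    static final String[] SUPPORTED = {"UTF-8", "Shift_JIS", "EUC-JP"};

    // Returns true if every byte sequence in data is valid in the given charset.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting
                   .onUnmappableCharacter(CodingErrorAction.REPORT) // fail instead of substituting
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns the names of all supported charsets under which data decodes without error.
    static List<String> plausibleEncodings(byte[] data) {
        List<String> result = new ArrayList<>();
        for (String name : SUPPORTED) {
            // Skip charsets this JVM does not provide rather than throwing.
            if (Charset.isSupported(name) && decodesCleanly(data, Charset.forName(name))) {
                result.add(name);
            }
        }
        return result;
    }
}
```

Note that more than one charset can pass (plain ASCII text is valid in all three), so this narrows the candidates rather than picking a winner. When several remain, a detector's confidence ranking can still be used to choose among them.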

荒岛晴空 2024-09-25 05:17:38

Apache Tika is a content analysis toolkit that is mainly useful for determining file types (as opposed to encoding schemes), but it does return content encoding information for text file types. I don't know whether its algorithms are as advanced as JCharDet's, but it might be worth a try.
