How to normalize text content to UTF-8 in Java

Posted 2024-08-25 22:46:52

We have a CMS which has several thousand text/html files in it. It turns out that users have been uploading text/html files using various character encodings (UTF-8, UTF-8 with BOM, windows-1252, ISO-8859-1).

When these files are read in and written to the response our CMS's framework forces a charset=UTF-8 on the response's content-type attribute.

Because of this, any non-UTF-8 content is displayed to the user with mangled characters (?, black diamonds, etc., wherever the bytes are not correctly translated from the "native" character encoding to UTF-8). Also, there is no metadata attached to these documents that indicates the charset. As far as I know, the only way to tell what charset they use is to open them in a text rendering app (Firefox, Notepad++, etc.) and check whether the content "looks" right.

Does anyone know how to automatically/intelligently convert files of unknown encoding to UTF-8? I've read this can be accomplished with statistical modeling, but that is way above my head.

Thoughts on how to best approach the problem?

Thanks

Comments (3)

初心 2024-09-01 22:46:52

You can use ICU4J's CharsetDetector

岁月无声 2024-09-01 22:46:52

Try to decode it as UTF-8. If this fails then look for \x92, and decode as CP1252 if found. Otherwise, decode as Latin-1.
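
A minimal Java sketch of this cascade, assuming each file fits in memory as a byte array. The class and method names (FallbackDecoder, decode, toUtf8) are made up for illustration and are not part of any CMS or library API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 first, then CP1252 if byte 0x92 occurs, else Latin-1.
final class FallbackDecoder {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    static String decode(byte[] bytes) {
        try {
            // REPORT makes the decoder throw on invalid input instead of
            // silently substituting U+FFFD, so a failure is detectable.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException notUtf8) {
            // 0x92 is the right single quote in CP1252 but a rarely used
            // control character in Latin-1, so it hints at CP1252.
            for (byte b : bytes) {
                if ((b & 0xFF) == 0x92) {
                    return new String(bytes, CP1252);
                }
            }
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }

    // Re-encodes the decoded text so the stored file is UTF-8.
    static byte[] toUtf8(byte[] bytes) {
        return decode(bytes).getBytes(StandardCharsets.UTF_8);
    }
}
```

Calling FallbackDecoder.toUtf8(Files.readAllBytes(path)) on each file would be one way to apply it. Note that a UTF-8 BOM, if present, survives decoding as a leading U+FEFF character in this sketch.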

厌味 2024-09-01 22:46:52

In general, there is no way to tell. The byte sequence 63 61 66 C3 A9 is equally valid as "cafÃ©" in windows-1252, "caf├⌐" in IBM437, or "café" in UTF-8. The last is statistically more likely, though.
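
To see the ambiguity concretely, here is a small Java sketch that decodes those same five bytes under each charset. The class name AmbiguousBytes is made up for illustration, and IBM437 is an extended charset that a stripped-down Java runtime may not ship, so treat its availability as an assumption.

```java
import java.nio.charset.Charset;

// Hypothetical demo: one byte sequence, several equally "valid" readings.
final class AmbiguousBytes {
    // The bytes 63 61 66 C3 A9 from the text above.
    static final byte[] SAMPLE = {0x63, 0x61, 0x66, (byte) 0xC3, (byte) 0xA9};

    // Decodes SAMPLE with the named charset; throws
    // UnsupportedCharsetException if the runtime lacks that charset.
    static String decodeAs(String charsetName) {
        return new String(SAMPLE, Charset.forName(charsetName));
    }
}
```

None of the three decodes reports an error, which is why detection has to rely on heuristics or statistics rather than on validity alone.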

If you don't want to deal with statistical methods, an approach that works much of the time is to assume that anything that looks like UTF-8 is, and that anything else is in windows-1252.

Or if UTF-16 is a possibility, look for FE FF or FF FE at the beginning of the file.
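
A byte order mark check along these lines could be sketched in Java as follows. The class name BomSniffer is hypothetical, and the UTF-8 BOM (EF BB BF) is added here as an assumption because the question mentions files saved as UTF-8 with BOM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper: guesses a charset from a leading byte order mark.
final class BomSniffer {
    static Optional<Charset> detect(byte[] b) {
        // EF BB BF: UTF-8 BOM (an addition beyond the two UTF-16 marks).
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        // FE FF: UTF-16, big-endian.
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        // FF FE: UTF-16, little-endian. (FF FE 00 00 could also be
        // UTF-32LE; this sketch does not distinguish that case.)
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        // No BOM: the caller has to fall back to another heuristic.
        return Optional.empty();
    }
}
```

An empty result only means no BOM was found; it says nothing about which encoding the file actually uses.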
