Detecting multi-byte character encoding

Posted 2024-12-10 17:36:18

What C/C++ libraries are there for detecting the multi-byte character encoding (UTF-8, UTF-16, etc.) of a character array (char*)? A bonus would be to also detect when the matcher halted, that is, to detect prefix match ranges for a given set of possible encodings.


Comments (3)

地狱即天堂 2024-12-17 17:36:18

ICU does character set detection. You must note that, as the ICU documentation states:

This is, at best, an imprecise operation using statistics and heuristics. Because of this, detection works best if you supply at least a few hundred bytes of character data that's mostly in a single language.
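For reference, a minimal sketch of calling ICU's charset detection C API (the ucsdet_* functions from ICU4C); the sample input and the comment about link flags are assumptions, not part of the answer above:

```cpp
// Minimal sketch of ICU's charset detection API (requires ICU4C;
// typically linked with -licuuc -licui18n).
#include <unicode/ucsdet.h>
#include <cstdio>

int main() {
    // ICU's heuristics work best on at least a few hundred bytes of text.
    const char data[] = "A few hundred bytes of mostly single-language text works best...";
    UErrorCode status = U_ZERO_ERROR;

    // Create a detector and feed it the raw bytes.
    UCharsetDetector* detector = ucsdet_open(&status);
    ucsdet_setText(detector, data, (int32_t)(sizeof(data) - 1), &status);

    // Ask ICU for its best single guess (ucsdet_detectAll returns all candidates).
    const UCharsetMatch* match = ucsdet_detect(detector, &status);
    if (U_SUCCESS(status) && match != nullptr) {
        printf("Detected charset: %s (confidence %d)\n",
               ucsdet_getName(match, &status),
               (int)ucsdet_getConfidence(match, &status));
    }

    ucsdet_close(detector);
    return 0;
}
```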

郁金香雨 2024-12-17 17:36:18

If the input is only ASCII, there's no way to detect what should be done had there been any high-bit-set bytes in the stream. You may as well just pick UTF-8 in that case.

As for UTF-8 vs. ISO-8859-x, you could try parsing the input as UTF-8 and fall back to ISO-8859 if the parse fails, but that's about it. There's not really a way to detect which ISO-8859 variant is there. I'd recommend looking at the way Firefox tries to auto-detect, but it's not foolproof and probably depends on knowing the input is HTML.
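To illustrate the fall-back idea, here is a hand-rolled sketch; the function names and sample bytes are assumptions, and it deliberately skips the overlong-sequence and surrogate checks a production validator would need:

```cpp
// Sketch of "try UTF-8, fall back to ISO-8859" (not a library API).
#include <cstddef>
#include <cstdio>

// Returns true if data[0..len) is a structurally well-formed UTF-8 sequence.
// Note: does not reject overlong encodings or surrogate code points.
bool is_valid_utf8(const unsigned char* data, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char b = data[i];
        size_t extra;
        if (b < 0x80)                extra = 0;   // single-byte ASCII
        else if ((b & 0xE0) == 0xC0) extra = 1;   // 2-byte sequence
        else if ((b & 0xF0) == 0xE0) extra = 2;   // 3-byte sequence
        else if ((b & 0xF8) == 0xF0) extra = 3;   // 4-byte sequence
        else return false;                        // invalid lead byte
        if (i + extra >= len) return false;       // sequence runs past the buffer
        for (size_t k = 1; k <= extra; ++k)
            if ((data[i + k] & 0xC0) != 0x80)
                return false;                     // not a continuation byte
        i += extra + 1;
    }
    return true;
}

int main() {
    const unsigned char text[] = {0xC3, 0xA4, 'b', 'c'};  // "ä" in UTF-8 plus ASCII
    if (is_valid_utf8(text, sizeof(text)))
        printf("treat as UTF-8\n");
    else
        printf("fall back to ISO-8859-1 (or another single-byte encoding)\n");
    return 0;
}
```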

冷︶言冷语的世界 2024-12-17 17:36:18

In general, it is not possible to detect the character encoding, except if the text has some special mark denoting the encoding. You could heuristically detect an encoding using dictionaries that contain words with characters that are only present in some encodings.

This can of course only be a heuristic, and you need to scan the whole text.

Example: "an English text can be written in multiple encodings". This sentence could, for example, be written using a German codepage. It's indistinguishable from most "western" encodings (including UTF-8) unless you add some special characters (like ä) that are not present in ASCII.
