Detecting multi-byte character encoding
What C/C++ libraries are there for detecting the multi-byte character encoding (UTF-8, UTF-16, etc.) of a character array (char*)? A bonus would be to also detect where the matcher halted, that is, to detect the prefix match ranges for a given set of possible encodings.
Comments (3)
ICU does character set detection. You must note that, as the ICU documentation states:
If the input is only ASCII, there's no way to detect what should be done had there been any high-bit-set bytes in the stream. May as well just pick UTF-8 in that case.
As for UTF-8 vs. ISO-8859-x, you could try parsing the input as UTF-8 and fall back to ISO-8859 if the parse fails, but that's about it. There's not really a way to detect which ISO-8859 variant is there. I'd recommend looking at the way Firefox tries to auto-detect, but it's not foolproof and probably depends on knowing the input is HTML.
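The parse-then-fall-back idea can be sketched in a few lines of C++. This is only an illustrative sketch, not code from any library: the names valid_utf8_prefix, guess_encoding and Guess are made up for this example. The validator returns how many leading bytes are well-formed UTF-8, which also gives the "where did the matcher halt" information asked for in the question.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns the number of leading bytes of data[0..len)
// that form complete, well-formed UTF-8 sequences. A result equal to len
// means the whole buffer is valid UTF-8.
std::size_t valid_utf8_prefix(const char* data, std::size_t len) {
    const unsigned char* s = reinterpret_cast<const unsigned char*>(data);
    std::size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        std::size_t n;      // total length of this sequence in bytes
        std::uint32_t cp;   // code point being decoded
        std::uint32_t min;  // smallest code point allowed for this length
        if (b < 0x80)                { n = 1; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { n = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { n = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { n = 4; cp = b & 0x07; min = 0x10000; }
        else break;                  // stray continuation or invalid lead byte
        if (n > len - i) break;      // sequence is truncated
        bool ok = true;
        for (std::size_t k = 1; k < n; ++k) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) { ok = false; break; }  // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) break;
        if (cp < min) break;                       // overlong encoding
        if (cp > 0x10FFFF) break;                  // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) break;   // UTF-16 surrogate
        i += n;
    }
    return i;
}

// Hypothetical result type: either the buffer parsed as UTF-8, or the caller
// should fall back to some single-byte encoding such as ISO-8859-1.
enum class Guess { Utf8, Latin1Fallback };

Guess guess_encoding(const char* data, std::size_t len) {
    return valid_utf8_prefix(data, len) == len ? Guess::Utf8
                                               : Guess::Latin1Fallback;
}
```

For pure ASCII input this reports UTF-8, matching the advice above. A failed parse only says "not UTF-8"; which ISO-8859 variant to fall back to still has to be chosen by other means.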
In general, it is not possible to detect the character encoding unless the text carries some special mark denoting the encoding. You could heuristically detect an encoding using dictionaries that contain words with characters that are only present in some encodings.
This can of course only be a heuristic and you need to scan the whole text.
Example: "an English text can be written in multiple encodings". This sentence could be written using, for example, a German code page. It is indistinguishable from the same sentence in most "western" encodings (including UTF-8) unless you add some special characters (like ä) that are not present in ASCII.
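One common example of such a "special mark" is a Unicode byte order mark (BOM) at the start of the data. The following is a minimal sketch of checking for one; the function name detect_bom and its signature are invented for this illustration and do not come from any library.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: returns the encoding name implied by a leading byte
// order mark, or nullptr when the buffer does not start with one.
// bom_len receives the number of BOM bytes the caller should skip.
const char* detect_bom(const char* data, std::size_t len, std::size_t& bom_len) {
    struct Bom { const char* bytes; std::size_t size; const char* name; };
    // Longer marks are listed first because the UTF-32LE BOM begins with
    // the same two bytes as the UTF-16LE BOM.
    static const Bom table[] = {
        {"\x00\x00\xFE\xFF", 4, "UTF-32BE"},
        {"\xFF\xFE\x00\x00", 4, "UTF-32LE"},
        {"\xEF\xBB\xBF",     3, "UTF-8"},
        {"\xFE\xFF",         2, "UTF-16BE"},
        {"\xFF\xFE",         2, "UTF-16LE"},
    };
    for (const Bom& b : table) {
        if (len >= b.size && std::memcmp(data, b.bytes, b.size) == 0) {
            bom_len = b.size;
            return b.name;
        }
    }
    bom_len = 0;
    return nullptr;  // no mark present: only heuristics remain
}
```

Even this is not fully unambiguous: the bytes FF FE 00 00 could also be a UTF-16LE BOM followed by a NUL character, and this sketch simply assumes UTF-32LE in that case. Text without a BOM, which is the usual situation for UTF-8 and all ISO-8859 data, still needs the heuristics described above.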