Detecting the encoding of a string in C/C++

Posted on 2024-12-06 02:48:24

Given a string in the form of a pointer to an array of bytes (chars), how can I detect the encoding of the string in C/C++ (I am using Visual Studio 2008)? I did a search, but most of the samples are written in C#.

Thanks

Comments (3)

梦纸 2024-12-13 02:48:24

Assuming you know the length of the input array, you can make the following guesses:

  1. First, check whether the first few bytes match any of the well-known Unicode byte order marks (BOMs). If they do, you're done!
  2. Next, search for '\0' before the last byte. If you find one, you might be dealing with UTF-16 or UTF-32. If you find multiple consecutive '\0's, it's probably UTF-32.
  3. If any byte is in the range 0x80 to 0xff, it's certainly not ASCII or UTF-7. If you are restricting your input to some variant of Unicode, you can assume it's UTF-8. Otherwise, you have to do some guessing to determine which multi-byte character set it is. That will not be fun.
  4. At this point it is either ASCII, UTF-7, Base64, or a stretch of UTF-16 or UTF-32 that just happens not to use the top bit of any byte and not to contain any null bytes.
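
The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: `guess_encoding` and `looks_like_utf8` are hypothetical names, the returned labels are made up for the example, and the UTF-8 shape check is an extra that the steps do not mention (it only checks lead and continuation bytes, not overlong forms or surrogates).

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: true if the bytes follow UTF-8 lead/continuation
// structure. Simplified: overlong forms and surrogates are not rejected.
bool looks_like_utf8(const unsigned char* p, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        unsigned char c = p[i];
        std::size_t extra;
        if (c < 0x80) extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return false;                    // stray continuation or invalid lead byte
        if (extra > n - i - 1) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k)
            if ((p[i + k] & 0xC0) != 0x80) return false;
        i += extra + 1;
    }
    return true;
}

// Hypothetical helper: returns a best-guess label, never a certain answer.
std::string guess_encoding(const unsigned char* p, std::size_t n) {
    static const unsigned char utf32le[] = {0xFF, 0xFE, 0x00, 0x00};
    static const unsigned char utf32be[] = {0x00, 0x00, 0xFE, 0xFF};
    static const unsigned char utf8[]    = {0xEF, 0xBB, 0xBF};
    static const unsigned char utf16le[] = {0xFF, 0xFE};
    static const unsigned char utf16be[] = {0xFE, 0xFF};

    // Step 1: byte order marks. UTF-32LE is tested before UTF-16LE because
    // its mark FF FE 00 00 also starts with the UTF-16LE mark FF FE.
    if (n >= 4 && std::memcmp(p, utf32le, 4) == 0) return "UTF-32LE";
    if (n >= 4 && std::memcmp(p, utf32be, 4) == 0) return "UTF-32BE";
    if (n >= 3 && std::memcmp(p, utf8, 3) == 0)    return "UTF-8";
    if (n >= 2 && std::memcmp(p, utf16le, 2) == 0) return "UTF-16LE";
    if (n >= 2 && std::memcmp(p, utf16be, 2) == 0) return "UTF-16BE";

    // Steps 2 and 3: one pass looking for null bytes (before the last byte)
    // and for bytes with the top bit set.
    bool has_nul = false, has_nul_run = false, has_high = false;
    for (std::size_t i = 0; i < n; ++i) {
        if (p[i] >= 0x80) has_high = true;
        if (p[i] == 0 && i + 1 < n) {
            if (i > 0 && p[i - 1] == 0) has_nul_run = true;
            has_nul = true;
        }
    }
    if (has_nul_run) return "UTF-32 (probable)";
    if (has_nul)     return "UTF-16 or UTF-32";
    if (has_high)
        return looks_like_utf8(p, n) ? "UTF-8 (probable)"
                                     : "unknown 8-bit or multi-byte";

    // Step 4: only 7-bit bytes and no nulls.
    return "7-bit (ASCII, UTF-7, or Base64)";
}
```

Calling `guess_encoding(reinterpret_cast<const unsigned char*>(buf), len)` on a `char` buffer returns one of the labels above. Treat the result as a hint; short inputs in particular can easily be misclassified.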

假面具 2024-12-13 02:48:24

It's not an easy problem to solve. Detection generally relies on heuristics to take a best guess at the input encoding, and those heuristics can be tripped up by relatively innocuous inputs. For example, take a look at this Wikipedia article and The Notepad file encoding Redux for more details.

If you're looking for a Windows-only solution with minimal dependencies, you can look at using a combination of IsTextUnicode and MLang's DetectInputCodePage to attempt character set detection.

If you are looking for portability, but don't mind taking on a fairly large dependency in the form of ICU, then you can make use of its character set detection routines to achieve the same thing in a portable manner.

以往的大感动 2024-12-13 02:48:24

I have written a small C++ library for detecting text file encoding. It uses Qt, but it could just as easily be implemented using only the standard library.

It operates by measuring symbol occurrence statistics and comparing them to pre-computed reference values for different encodings and languages. As a result, it detects not only the encoding but also the language of the text. The downside is that pre-computed statistics must be provided for the target language in order to detect that language properly.

https://github.com/VioletGiraffe/text-encoding-detector
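
To give a feel for the statistical idea, here is a minimal sketch using only the standard library. It is not the library's actual code: it compares raw byte frequencies rather than decoded symbols, and `Histogram`, `Profile`, `byte_frequencies`, and `closest_profile` are hypothetical names introduced for this example. The reference profiles are assumed to be built ahead of time from sample text in each encoding and language of interest.

```cpp
#include <array>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

using Histogram = std::array<double, 256>;

// Relative frequency of each byte value in the input (all zeros if empty).
Histogram byte_frequencies(const unsigned char* p, std::size_t n) {
    Histogram h{};
    if (n == 0) return h;
    for (std::size_t i = 0; i < n; ++i) h[p[i]] += 1.0;
    for (double& v : h) v /= static_cast<double>(n);
    return h;
}

// Hypothetical reference entry: one table per (encoding, language) pair.
struct Profile {
    std::string label;
    Histogram freq;
};

// Returns the label of the profile with the smallest squared distance
// to the sample, or an empty string if no profiles were supplied.
std::string closest_profile(const Histogram& sample,
                            const std::vector<Profile>& refs) {
    std::string best;
    double best_dist = std::numeric_limits<double>::infinity();
    for (const Profile& r : refs) {
        double d = 0.0;
        for (std::size_t b = 0; b < 256; ++b) {
            const double diff = sample[b] - r.freq[b];
            d += diff * diff;
        }
        if (d < best_dist) {
            best_dist = d;
            best = r.label;
        }
    }
    return best;
}
```

Squared distance is just one possible similarity measure, chosen here for brevity. As noted above, the quality of the guess depends entirely on having good reference statistics for the languages you expect.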
