Distinguishing string formats
I have an untyped pointer to a buffer that can hold either an ANSI or a Unicode string. How do I tell whether the string it currently holds is multibyte or not?
Comments (3)
Unless the string itself contains information about its format (e.g. a header or a byte order mark), there is no foolproof way to detect whether a string is ANSI or Unicode. The Windows API includes a function called IsTextUnicode() that basically guesses whether a string is ANSI or Unicode, but then you run into exactly this problem because you are forced to guess.
Why do you have an untyped pointer to a string in the first place? You must know exactly what your data represents and how, either by using a typed pointer in the first place or by passing an ANSI/Unicode flag alongside the buffer. A string of bytes is meaningless unless you know exactly what it represents.
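For illustration, here is a minimal sketch (untested, and assuming the buffer length in bytes is known) of how IsTextUnicode() might be called; its result is only the API's best guess, and you may need to link against Advapi32.

```cpp
// Minimal sketch of calling the Windows IsTextUnicode() heuristic.
// Assumes the size of the buffer in bytes is known; the result is only a guess.
#include <windows.h>
#include <cstdio>

bool LooksLikeUnicode(const void* buffer, int sizeInBytes)
{
    // Passing NULL for the third parameter asks IsTextUnicode() to run its
    // default set of statistical tests on the buffer.
    return IsTextUnicode(buffer, sizeInBytes, NULL) != FALSE;
}

int main()
{
    const wchar_t wide[]   = L"Hello, world";
    const char    narrow[] = "Hello, world";

    std::printf("wide buffer:   %s\n",
                LooksLikeUnicode(wide, sizeof(wide)) ? "probably Unicode" : "probably ANSI");
    std::printf("narrow buffer: %s\n",
                LooksLikeUnicode(narrow, sizeof(narrow)) ? "probably Unicode" : "probably ANSI");
}
```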
Unicode is not an encoding, it's a mapping of code points to characters. The encoding is UTF8 or UCS2, for example.
And, given that there is zero difference between ASCII and UTF8 encoding if you restrict yourself to the lower 128 characters, you can't actually tell the difference.
You'd be better off asking if there were a way to tell the difference between ASCII and a particular encoding of Unicode. And the answer to that is to use statistical analysis, with the inherent possibility of inaccuracy.
For example, if the entire string consists of bytes less than 128, it's ASCII (it could be UTF8 but there's no way to tell and no difference in that case).
If it's primarily English/Roman and consists of lots of two-byte sequences with a zero as one of the bytes, it's probably UTF16. And so on. I don't believe there's a foolproof method without actually having an indicator of some sort (e.g., BOM).
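To make that concrete, a rough sketch of such a statistical check might look like the following; the function name, thresholds, and return values are invented here, and it is a heuristic that can misclassify, not a reliable test.

```cpp
// Heuristic sketch: classify a byte buffer as ASCII/UTF8, UTF-16LE, or unknown.
// The thresholds are arbitrary; a BOM is the only indicator that is close to reliable.
#include <cstddef>

enum class GuessedEncoding { Ascii, Utf16LE, Unknown };

GuessedEncoding GuessEncoding(const unsigned char* data, std::size_t size)
{
    // 1. UTF-16LE byte order mark.
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return GuessedEncoding::Utf16LE;

    std::size_t highBitBytes = 0;  // bytes >= 128
    std::size_t zeroOddBytes = 0;  // zeros where the high byte of a UTF-16LE unit would be
    for (std::size_t i = 0; i < size; ++i)
    {
        if (data[i] & 0x80) ++highBitBytes;
        if ((i % 2 == 1) && data[i] == 0) ++zeroOddBytes;
    }

    // 2. English/Roman text in UTF-16LE has a zero in (almost) every odd position.
    if (size >= 4 && zeroOddBytes * 4 >= size)   // at least half of the odd bytes are zero
        return GuessedEncoding::Utf16LE;

    // 3. Every byte below 128: ASCII (byte-identical to UTF8 in this case).
    if (highBitBytes == 0)
        return GuessedEncoding::Ascii;

    return GuessedEncoding::Unknown;
}
```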
My suggestion is not to put yourself in the position where you have to guess. If the data type itself can't contain an indicator, provide different functions for ASCII and for a particular encoding of Unicode, and push the work of deciding onto your client. At some point in the calling hierarchy, someone should know the encoding.
Or, better yet, ditch ASCII altogether, embrace the new world and use Unicode exclusively. With UTF8 encoding, ASCII has exactly no advantages over Unicode :-)
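For the "separate functions or an explicit flag" approach, a hypothetical interface might look like the sketch below, mirroring the Win32 -A/-W naming convention; the names PrintMessageA, PrintMessageW, and TextEncoding are invented for illustration, not an existing API.

```cpp
// Hypothetical sketch: the caller states the encoding instead of the callee guessing it.
#include <cstdio>

void PrintMessageA(const char* ansiText)    { std::printf("%s\n", ansiText); }   // caller promises ANSI
void PrintMessageW(const wchar_t* wideText) { std::printf("%ls\n", wideText); }  // caller promises UTF-16

// Alternative: a single entry point plus an explicit flag supplied by the caller.
enum class TextEncoding { Ansi, Utf16 };

void PrintMessage(const void* text, TextEncoding encoding)
{
    if (encoding == TextEncoding::Ansi)
        PrintMessageA(static_cast<const char*>(text));
    else
        PrintMessageW(static_cast<const wchar_t*>(text));
}

int main()
{
    PrintMessage("narrow text", TextEncoding::Ansi);
    PrintMessage(L"wide text", TextEncoding::Utf16);
}
```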
In general you can't.
You could check the pattern of zeros: a single zero byte at the end probably means an ANSI 'C' string, a zero in every other byte probably means ANSI-range text encoded as UTF16, and three zeros out of every four bytes might mean UTF32.