How can I determine whether a byte array contains an ANSI or Unicode string?

Posted 2024-12-07 17:56:07


Say I have a function that receives a byte array:

void fcn(byte* data)
{
...
}

Does anyone know a reliable way for fcn() to determine if data is an ANSI string or a Unicode string?

Note that I'm intentionally NOT passing a length argument; all I receive is the pointer to the array. A length argument would be a great help, but I don't receive one, so I must do without.

This article mentions an OLE API that apparently does it, but of course they don't tell you WHICH API function: http://support.microsoft.com/kb/138142


香草可樂 2024-12-14 17:56:08


First, a word on terminology. There is no such thing as an ANSI string; there are ASCII strings, which use a particular character encoding. ASCII was developed by ANSI, but the two terms are not interchangeable.

Also, there is no such thing as a Unicode string. There are Unicode encodings, but those are only a part of Unicode itself.

I will assume that by "Unicode string" you mean "UTF-8 encoded codepoint sequence." And by ANSI string, I'll assume you mean ASCII.

If so, then every ASCII string is also a UTF-8 string, by the definition of UTF-8's encoding. ASCII only defines characters up to 0x7F, and all UTF-8 code units (bytes) up to 0x7F mean the same thing as they do under ASCII.

Therefore, your concern would be for the other 128 possible values. That is... complicated.

The only reason you would ask this question is if you have no control over the encoding of the string input. And therefore, the problem is that ASCII and UTF-8 are not the only possible choices.

There's Latin-1, for example. There are many strings out there that are encoded in Latin-1, which takes the other 128 byte values that ASCII doesn't use and defines characters for them. That's bad, because UTF-8 uses those same byte values only as parts of multi-byte sequences, so the two encodings conflict.

There are also code pages. Many strings were encoded against a particular code page; this is particularly so on Windows. Decoding them requires knowing which code page you're working with.

If you are in a situation where you are certain that a string is either ASCII (7-bit, with the high bit always 0) or UTF-8, then you can make the determination easily. Either the string is ASCII (and therefore also UTF-8), or one or more of the bytes will have the high bit set to 1, in which case you must use UTF-8 decoding logic.
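The high-bit test described above can be sketched in C as follows. This is only a minimal illustration: the function name is_ascii_cstr is made up for this example, and it assumes the data is terminated by a zero byte, which the question does not actually guarantee since no length is passed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if every byte before the terminating NUL has its high bit
 * clear, i.e. the data is pure 7-bit ASCII (and therefore also UTF-8).
 * ASSUMPTION: data is non-NULL and NUL-terminated. */
static bool is_ascii_cstr(const unsigned char *data)
{
    for (size_t i = 0; data[i] != 0; ++i) {
        if (data[i] & 0x80) {
            return false; /* high bit set: not ASCII */
        }
    }
    return true;
}
```

If this returns false, the data is not ASCII, but that alone does not say whether it is UTF-8, Latin-1, or some code page.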

Unless you are truly certain that these are the only possibilities, you are going to need to do a bit more. You can validate the data by trying to run it through a UTF-8 decoder. If it runs into an invalid code unit sequence, then you know it isn't UTF-8. The problem is that it is theoretically possible to create a Latin-1 string that is technically valid UTF-8. You're kinda screwed at that point. The same goes for code page-based strings.
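As a rough sketch of that validation step, the function below checks whether a byte string is well-formed UTF-8 according to the byte ranges in RFC 3629. The name is_valid_utf8_cstr is hypothetical, and the code again assumes NUL-terminated input because the question provides no length.

```c
#include <stdbool.h>

/* Returns true if the NUL-terminated byte string is well-formed UTF-8.
 * Rejects overlong forms, UTF-16 surrogates, and values above U+10FFFF.
 * ASSUMPTION: s is non-NULL and NUL-terminated. */
static bool is_valid_utf8_cstr(const unsigned char *s)
{
    while (*s != 0) {
        unsigned char lead = *s++;
        unsigned char lo = 0x80, hi = 0xBF; /* allowed range of next byte */
        int need;                           /* continuation bytes expected */

        if (lead < 0x80) {
            continue;                       /* plain ASCII byte */
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            need = 1;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            need = 2;
            if (lead == 0xE0) lo = 0xA0;    /* no overlong 3-byte forms */
            if (lead == 0xED) hi = 0x9F;    /* no surrogates */
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            need = 3;
            if (lead == 0xF0) lo = 0x90;    /* no overlong 4-byte forms */
            if (lead == 0xF4) hi = 0x8F;    /* nothing above U+10FFFF */
        } else {
            return false;                   /* 0x80..0xC1 or 0xF5..0xFF */
        }

        for (int i = 0; i < need; ++i) {
            unsigned char c = *s;
            if (c < lo || c > hi) {
                return false;               /* also catches an early NUL */
            }
            ++s;
            lo = 0x80;                      /* later bytes use full range */
            hi = 0xBF;
        }
    }
    return true;
}
```

Note the limitation the answer describes: the Latin-1 bytes 0xC3 0xA9 (the two characters "Ã©") pass this check because they are also the valid UTF-8 encoding of "é". A passing result means the data could be UTF-8, not that it is.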

Ultimately, if you don't know what encoding the string is, there's no guarantee you can display it properly. That's why it's important to know where your strings come from and what they mean.
