Using narrow string manipulation functions on wide data

Posted on 2024-11-11 01:27:10

I'm parsing an XML file which can contain localized strings in different languages (at the moment it's just English and Spanish, but in the future it could be any language). The API for the XML parser returns all data within the XML via a char* which is UTF-8 encoded.

Some manipulation of the data is required after it's been parsed (searching within it for substrings, concatenating strings, determining the length of substrings, etc.).

It would be convenient to use standard functions such as strlen, strcat etc. As the raw data I'm receiving from the XML parser is a char* I can do all manipulation readily using these standard string handling functions.

However, these all of course assume and require that the strings are null-terminated.
My question therefore is: if you have wide data represented as a char*, can a null terminator character occur within the data rather than at the end?

i.e. if a character in a certain language doesn't require 2 bytes to represent it, and it is represented in one byte, will/can the other byte be NULL?

Comments (2)

撧情箌佬 2024-11-18 01:27:10

UTF-8 is not "wide". UTF-8 is a multibyte encoding in which a Unicode character takes 1 to 4 bytes. Valid UTF-8 never has a zero byte inside a character; the only zero byte is the encoding of U+0000 itself, so it can still serve as the string terminator. Make sure you are not confused about what your parser is giving you. It could be UTF-16 or UCS-2, or their 4-byte equivalents, placed in wide character strings, in which case you have to treat them as wide strings.
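As a rough illustration of the point about zero bytes, here is a minimal sketch, assuming the parser really does hand back valid, null-terminated UTF-8. The helper names `no_embedded_zero` and `utf8_join` and the two sample strings are made up for this example; the strings are written as explicit byte escapes so the result does not depend on the source file's encoding.

```c
#include <stddef.h>
#include <string.h>

/* "año" (Spanish) and "日本語" (Japanese) as raw UTF-8 bytes. */
const char SPANISH[]  = "a\xC3\xB1o";                             /* 4 bytes */
const char JAPANESE[] = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";   /* 9 bytes */

/* Hypothetical helper: returns 1 when none of the first n bytes of s is zero. */
int no_embedded_zero(const char *s, size_t n)
{
    return memchr(s, 0, n) == NULL;
}

/* Hypothetical helper: joins a and b into out (capacity cap, in bytes).
   Returns 0 on success, -1 if out is too small. Because UTF-8 text has no
   embedded zero bytes, strlen/strcpy/strcat see the whole string. */
int utf8_join(char *out, size_t cap, const char *a, const char *b)
{
    size_t need = strlen(a) + strlen(b) + 1;   /* +1 for the terminator */
    if (need > cap)
        return -1;
    strcpy(out, a);
    strcat(out, b);
    return 0;
}
```

Searching works the same way: `strstr(JAPANESE, "\xE6\x9C\xAC")` finds the second character at byte offset 3, because a complete UTF-8 sequence can only match at a character boundary in valid UTF-8.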

空‖城人不在 2024-11-18 01:27:10

C distinguishes between multibyte characters and wide characters:

  • Wide characters must be able to represent any character of the execution character set using exactly the same number of bytes (e.g. if 兀 takes 4 bytes to be represented, A must also take 4 bytes to be represented). Examples of wide character encodings are UCS-4, and the deprecated UCS-2.

  • Multibyte characters can take a varying number of bytes to be represented. Examples of multibyte encodings are UTF-8 and UTF-16.

When using UTF-8, you can continue to use the str* functions, but bear in mind that they don't provide a way to return the length of a string in characters; for that you need to convert to wide characters and use wcslen. strlen returns the length in bytes, not characters, which is useful in different situations.
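To make the bytes-versus-characters difference concrete, here is a small sketch. Note that it does not convert to wide characters as described above; it swaps in a different technique and counts every byte that is not a UTF-8 continuation byte (those of the form 10xxxxxx). The function name `utf8_codepoints` is made up for this example, and it assumes the input is valid, null-terminated UTF-8. It counts code points, which is not always the same as what a user would perceive as characters (combining marks, for instance, are counted separately).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of Unicode code points in a valid,
   null-terminated UTF-8 string. Every code point starts with exactly one
   byte that is NOT of the form 10xxxxxx, so counting those bytes is enough. */
size_t utf8_codepoints(const char *s)
{
    size_t count = 0;
    const unsigned char *p;

    for (p = (const unsigned char *)s; *p != '\0'; ++p) {
        if ((*p & 0xC0) != 0x80)   /* not a continuation byte */
            ++count;
    }
    return count;
}
```

For "año" encoded as the bytes `a C3 B1 o`, strlen reports 4 while `utf8_codepoints` reports 3. The byte length is still the right number for buffer sizes and memcpy; the code point count is only needed when the number of characters itself matters.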

I can't stress enough that every element of the execution character set must be representable as a single wide character of a predefined size in bytes. Some systems use UTF-16 for their wide characters; as a result, the implementation cannot conform to the C standard, and some wc* functions can't possibly work right.
