Using narrow string manipulation functions on wide data
I'm parsing an XML file which can contain localized strings in different languages (at the moment it's just English and Spanish, but in the future it could be any language). The API for the XML parser returns all data within the XML via a char* which is UTF-8 encoded.
Some manipulation of the data is required after it has been parsed (searching within it for substrings, concatenating strings, determining the length of substrings, etc.).
It would be convenient to use standard functions such as strlen, strcat, etc. As the raw data I'm receiving from the XML parser is a char*, I can do all manipulation readily using these standard string handling functions.
However, these all of course assume and require that the strings are NULL terminated.
My question therefore is: if you have wide data represented as a char*, can a NULL terminator character occur within the data rather than at the end?
That is, if a character in a certain language doesn't require 2 bytes to represent it, and it is represented in one byte, will/can the other byte be NULL?
Comments (2)
UTF-8 is not "wide". UTF-8 is a multibyte encoding in which a Unicode character takes 1 to 4 bytes. A valid UTF-8 character never contains a zero byte (the only zero byte is the encoding of U+0000 itself), so nothing inside your text can be mistaken for the terminator. Make sure you are not confused about what your parser is giving you. It could be UTF-16 or UCS-2 or their 4-byte equivalents placed in wide character strings, in which case you have to treat them as wide strings.
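To make the difference concrete, here is a minimal sketch comparing the same Spanish word encoded as UTF-8 and as UTF-16LE. The helper names (has_zero_byte, find_bytes) and the sample arrays are made up for illustration; only memchr, strstr and strlen are standard library calls.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if any of the first n bytes of buf is zero. */
int has_zero_byte(const void *buf, size_t n)
{
    return memchr(buf, 0, n) != NULL;
}

/* Hypothetical helper: byte offset of needle inside haystack, or -1 if absent.
   strstr compares bytes, so it works on UTF-8 as long as both arguments
   are valid UTF-8. */
long find_bytes(const char *haystack, const char *needle)
{
    const char *p = strstr(haystack, needle);
    return p ? (long)(p - haystack) : -1L;
}

/* "niño" in UTF-8: the ñ (U+00F1) is the two bytes C3 B1, neither is zero. */
static const char NINO_UTF8[] = "ni\xC3\xB1o";

/* The same word in UTF-16LE (no terminator): every code unit is two bytes,
   and for these characters the high byte is zero. */
static const unsigned char NINO_UTF16LE[] = {
    'n', 0x00, 'i', 0x00, 0xF1, 0x00, 'o', 0x00
};
```

With these definitions, strlen(NINO_UTF8) is 5 (bytes, not characters) and find_bytes(NINO_UTF8, "\xC3\xB1") is 2, while strlen applied to the UTF-16LE bytes stops after the first byte. That is why str* functions are safe on UTF-8 but not on UTF-16 data pushed through a char*.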
C distinguishes between multibyte characters and wide characters:
Wide characters must be able to represent any character of the execution character set using exactly the same number of bytes (e.g. if 兀 takes 4 bytes to be represented, A must also take 4 bytes to be represented). Examples of wide character encodings are UCS-4, and the deprecated UCS-2.
Multibyte characters can take a varying number of bytes to be represented. Examples of multibyte encodings are UTF-8 and UTF-16.
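As a rough illustration of "varying number of bytes", the sketch below computes how much storage one Unicode code point needs in UTF-8 and in UTF-16, next to the fixed 4 bytes of UCS-4. The function names utf8_bytes and utf16_units are hypothetical, not part of any standard library.

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes needed to encode code point cp
   in UTF-8, or 0 if cp is a surrogate or lies beyond U+10FFFF. */
int utf8_bytes(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates are not characters */
    if (cp < 0x80)      return 1;
    if (cp < 0x800)     return 2;
    if (cp < 0x10000)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

/* Hypothetical helper: number of 16-bit code units needed in UTF-16,
   or 0 for an invalid code point. Anything above U+FFFF needs a
   surrogate pair, i.e. two units. */
int utf16_units(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp < 0x10000)   return 1;
    if (cp <= 0x10FFFF) return 2;
    return 0;
}
```

For example, A (U+0041) takes 1 byte in UTF-8 and 兀 (U+5140) takes 3, whereas both occupy exactly one 4-byte unit in UCS-4. A code point such as U+1F600 takes 4 bytes in UTF-8 and two 16-bit units in UTF-16.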
When using UTF-8, you can continue to use the str* functions, but you have to bear in mind that they don't provide a way to return the length of a string in characters; for that you need to convert to wide characters and use wcslen. strlen returns the length in bytes, not characters, which is useful in different situations.
I can't stress enough that every element of the execution character set needs to be representable as a single wide character of a predefined size in bytes. Some systems use UTF-16 for their wide characters; the result is that the implementation can't conform to the C standard, and some wc* functions can't possibly work right.
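If all you need is the character count, one alternative to converting to wide characters is to count the bytes that start a UTF-8 sequence, since continuation bytes always have the form 10xxxxxx. This is a different technique from the wcslen route described above, shown only as a sketch: the function name utf8_count_code_points is made up, the input is assumed to be valid, NUL-terminated UTF-8, and the result is a count of code points, which is not always the same as what a user perceives as characters (combining marks count separately).

```c
#include <stddef.h>

/* Hypothetical helper: number of Unicode code points in a valid,
   NUL-terminated UTF-8 string. Assumes the input is well-formed;
   no validation is performed. */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;
    const unsigned char *p;

    for (p = (const unsigned char *)s; *p != 0; ++p) {
        /* Continuation bytes look like 10xxxxxx; every other byte
           begins a new code point. */
        if ((*p & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the UTF-8 string "ni\xC3\xB1o" (niño), strlen reports 5 bytes while utf8_count_code_points reports 4 code points.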