I'm working with a database that includes hex codes for UTF32 characters. I would like to take these characters and store them in an NSString. I need routines to convert in both directions.
To convert the first character of an NSString to a unicode value, this routine seems to work:
const unsigned char *cs = (const unsigned char *)
    [s cStringUsingEncoding:NSUTF32StringEncoding];
uint32_t code = 0;
for ( int i = 3 ; i >= 0 ; i-- ) {
    code <<= 8;
    code += cs[i];
}
return code;
However, I am unable to do the reverse (i.e. take a single code and convert it into an NSString). I thought I could just do the reverse of what I do above: create a C string containing the UTF32 character with its bytes in the correct order, and then create an NSString from it using the correct encoding.
However, converting to/from C strings does not seem to be reversible for me.
For example, I've tried this code, and the "tmp" string is not equal to the original string "s".
const char *cs = [s cStringUsingEncoding:NSUTF32StringEncoding];
NSString *tmp = [NSString stringWithCString:cs encoding:NSUTF32StringEncoding];
What am I doing wrong? Should I be using "wchar_t" for the cstring instead of char *?
Comments (2)
You have a couple of reasonable options.
1. Conversion
The first is to convert your UTF32 to UTF16 and use those with NSString, as UTF16 is the "native" encoding of NSString. It's not actually all that hard. If the UTF32 character is in the BMP (i.e. its high two bytes are 0), you can just cast it to unichar directly. If it's in any other plane, you can convert it to a surrogate pair of UTF16 characters. You can find the rules on the Wikipedia page. A quick (untested) conversion would look like the sketch below.
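Here is one way to write it (a sketch; the variable names are illustrative, and codepoint is assumed to hold a valid Unicode scalar value):

uint32_t codepoint = 0x1D11E;   // example UTF-32 value; in practice this comes from your database
unichar units[2];
NSUInteger unitCount;
if (codepoint < 0x10000) {
    // BMP character: fits in a single UTF-16 code unit
    units[0] = (unichar)codepoint;
    unitCount = 1;
} else {
    // any other plane: encode as a surrogate pair
    uint32_t v = codepoint - 0x10000;
    units[0] = (unichar)(0xD800 + (v >> 10));    // high surrogate
    units[1] = (unichar)(0xDC00 + (v & 0x3FF));  // low surrogate
    unitCount = 2;
}

Now you can create an NSString using both characters at the same time:

NSString *s = [NSString stringWithCharacters:units length:unitCount];   // one way to build it (sketch)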
To go backwards, you can use [NSString getCharacters:range:] to get the unichars back and then reverse the surrogate pair algorithm to recover your UTF32 character (any character which isn't in the range 0xD800-0xDFFF should just be cast to UTF32 directly).
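A sketch of that reverse step, assuming s holds exactly one character (so at most two UTF-16 code units):

NSUInteger len = MIN([s length], (NSUInteger)2);   // a single character is at most two UTF-16 units
unichar units[2] = {0, 0};
[s getCharacters:units range:NSMakeRange(0, len)];

uint32_t codepoint;
if (units[0] >= 0xD800 && units[0] <= 0xDBFF) {
    // high surrogate: combine it with the low surrogate that follows
    codepoint = 0x10000
              + (((uint32_t)(units[0] - 0xD800)) << 10)
              + (uint32_t)(units[1] - 0xDC00);
} else {
    // not a surrogate: the unichar already is the code point
    codepoint = units[0];
}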
2. Byte buffers
Your other option is to let NSString do the conversion directly without using cStrings. To convert a UTF32 value into an NSString you can use something like the following:
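A sketch of that approach, assuming the uint32_t value is in host (little-endian) byte order; the encoding constant makes the byte order explicit, so no BOM is needed:

uint32_t codepoint = 0x1D11E;   // hypothetical UTF-32 value from the database
NSString *s = [[NSString alloc] initWithBytes:&codepoint
                                       length:sizeof(codepoint)
                                     encoding:NSUTF32LittleEndianStringEncoding];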
To get it back out again, you can use a byte-oriented method such as -getBytes:maxLength:usedLength:encoding:options:range:remainingRange:.
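For example (a sketch; it assumes s holds a single character and the host is little-endian):

uint32_t codepoint = 0;
NSUInteger usedLength = 0;
[s getBytes:&codepoint
  maxLength:sizeof(codepoint)
 usedLength:&usedLength
   encoding:NSUTF32LittleEndianStringEncoding
    options:0
      range:NSMakeRange(0, [s length])
remainingRange:NULL];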
There are two problems here:
1:
The first one is that both [NSString cStringUsingEncoding:] and [NSString getCString:maxLength:encoding:] return the C-string in native endianness (little) without adding a BOM to it when using NSUTF32StringEncoding and NSUTF16StringEncoding.
The Unicode standard states (see "How I should deal with BOMs"):
"If there is no BOM, the text should be interpreted as big-endian."
This is also stated in NSString's documentation (see "Interpreting UTF-16-Encoded Data"):
"... if the byte order is not otherwise specified, NSString assumes that the UTF-16 characters are big-endian, unless there is a BOM (byte-order mark), in which case the BOM dictates the byte order."
Although they're referring to UTF-16, the same applies to UTF-32.
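To see this for yourself, here is a quick (illustrative) way to dump the bytes that cStringUsingEncoding: produces:

NSString *s = @"A";
const char *cs = [s cStringUsingEncoding:NSUTF32StringEncoding];
NSUInteger len = [s lengthOfBytesUsingEncoding:NSUTF32StringEncoding];
for (NSUInteger i = 0; i < len; i++) {
    // per the point above, expect native-endian bytes with no BOM,
    // e.g. 41 00 00 00 for "A" on little-endian hardware
    printf("%02x ", (unsigned char)cs[i]);
}
printf("\n");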
2:
The second one is that [NSString stringWithCString:encoding:] internally uses CFStringCreateWithCString to create the string from the C-string. The problem with this is that CFStringCreateWithCString only accepts strings using 8-bit encodings. From the documentation (see the "Parameters" section):
"The string must use an 8-bit encoding."
To solve this issue:
1: Use an encoding variant with an explicit byte order (or add a BOM yourself) when converting in either direction (NSString -> C-string and C-string -> NSString).
2: Use [NSString initWithBytes:length:encoding:] when trying to create an NSString from a C-string encoded in UTF-32 or UTF-16.
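For example, a round trip along those lines might look like this (a sketch; the string literal and variable names are only for illustration):

NSString *original = @"\U0001D11E";   // hypothetical test character outside the BMP

// NSString -> bytes, with the byte order explicit in the encoding constant
NSData *bytes = [original dataUsingEncoding:NSUTF32LittleEndianStringEncoding];

// bytes -> NSString, avoiding stringWithCString:encoding:, which only
// handles 8-bit encodings
NSString *roundTripped = [[NSString alloc] initWithBytes:[bytes bytes]
                                                  length:[bytes length]
                                                encoding:NSUTF32LittleEndianStringEncoding];

NSLog(@"round trip ok? %d", [roundTripped isEqualToString:original]);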