Difference between UTF-8 and UTF-16 strings as bytes in Objective-C
I am trying to convert NSString objects to byte arrays and then back to NSString. I have tried both NSUnicodeStringEncoding and NSUTF8StringEncoding. My problem is that when I iterate over the byte arrays, I see different data.
The only change between the two versions of this code is that I replace NSUTF8StringEncoding with NSUnicodeStringEncoding and add dataLength += 2 so that it accounts for the BOM.
NSString *message = @"testing";
NSUInteger dataLength = [message lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
void *byteData = malloc( dataLength );
NSUInteger actualLength = 0;   // number of bytes actually written by getBytes
NSRange remain;                // part of the range that was not converted
NSRange range = NSMakeRange(0, [message length]);
BOOL result = [message getBytes:byteData maxLength:dataLength usedLength:&actualLength encoding:NSUTF8StringEncoding options:0 range:range remainingRange:&remain];
char *cursor = (char *)byteData;   // pointer arithmetic on void * is not standard C
for( NSUInteger x = 0; x < dataLength; x++ )
{
    NSLog( @"byte data: %s", cursor );   // %s prints from cursor up to the next zero byte
    int t = (int)*cursor;
    cursor++;
}
The difference is in the NSLog output.
With NSUTF8StringEncoding I see
- testing`
- esting`
- sting`
- ting`
- ...
With NSUnicodeStringEncoding I see
- null
- t
- null
- e
- ...
The int t value is correct for the given character, but I don't understand why the logged byteData is so different. I would expect both encodings to behave like the NSUnicodeStringEncoding case.
Comments (2)
In UTF-8, the letter F is represented by a single byte, the ASCII code for F. The string "FU" is represented by an ASCII F byte followed by an ASCII U byte. In Unicode (as used here, that is, UTF-16), each of these characters occupies two bytes, so every standard ASCII character is paired with a zero byte; in the output you show, the zero byte comes first.
It's not clear why the behavior you see isn't exactly what you'd expect. In UTF-8, standard ASCII characters occupy one byte. In your Unicode encoding, they occupy two. Since %s treats a zero byte as the end of the string, the logged output certainly won't be the same.
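The byte layouts described above can be reproduced without Foundation. The following is a minimal sketch in plain C (which also compiles as Objective-C), assuming ASCII-only input and the zero-byte-first order seen in the question's output. The helper names `ascii_to_utf8`, `ascii_to_utf16be`, and `c_string_span` are made up for this illustration and are not part of any Apple API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode an ASCII string as UTF-8.
 * For ASCII input, UTF-8 is one byte per character, equal to the ASCII code.
 * Returns the number of bytes written to out. */
size_t ascii_to_utf8(const char *text, uint8_t *out)
{
    size_t n = 0;
    for (; text[n] != '\0'; n++) {
        out[n] = (uint8_t)text[n];
    }
    return n;
}

/* Hypothetical helper: encode an ASCII string as big-endian UTF-16
 * without a BOM. Each character becomes a zero byte followed by the
 * ASCII code. Returns the number of bytes written to out. */
size_t ascii_to_utf16be(const char *text, uint8_t *out)
{
    size_t n = 0;
    for (size_t i = 0; text[i] != '\0'; i++) {
        out[n++] = 0x00;
        out[n++] = (uint8_t)text[i];
    }
    return n;
}

/* How many bytes a "%s" format would print when started at buf[offset]:
 * it counts bytes up to the first zero byte (or the end of the buffer). */
size_t c_string_span(const uint8_t *buf, size_t len, size_t offset)
{
    size_t n = 0;
    while (offset + n < len && buf[offset + n] != 0) {
        n++;
    }
    return n;
}
```

For "FU", `ascii_to_utf8` writes 2 bytes and `ascii_to_utf16be` writes 4. Calling `c_string_span` at successive offsets gives 2, 1 for the UTF-8 buffer (the shrinking "testing", "esting" pattern) and 0, 1, 0, 1 for the UTF-16 buffer (the alternating empty and single-character pattern).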
According to this answer, NSUnicodeStringEncoding "is little-endian UTF-16 preceded with a byte order mark", so it should be expected that the result is totally different from UTF-8.
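To make the quoted description concrete, here is a small sketch in plain C (also valid Objective-C) of what "little-endian UTF-16 preceded with a byte order mark" looks like for ASCII-only text. It only illustrates the layout as quoted; whether a particular NSString call actually emits the BOM or uses this byte order is not something this sketch establishes. The helper names `ascii_to_utf16le_bom` and `utf16_bom_kind` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write the UTF-16 little-endian BOM (FF FE),
 * then each ASCII character as two bytes, low byte first.
 * Returns the number of bytes written to out. */
size_t ascii_to_utf16le_bom(const char *text, uint8_t *out)
{
    size_t n = 0;
    out[n++] = 0xFF;
    out[n++] = 0xFE;
    for (size_t i = 0; text[i] != '\0'; i++) {
        out[n++] = (uint8_t)text[i];
        out[n++] = 0x00;
    }
    return n;
}

/* Hypothetical helper: inspect the first two bytes of a buffer.
 * Returns 1 for a little-endian BOM (FF FE), 2 for a big-endian
 * BOM (FE FF), and 0 if no UTF-16 BOM is present. */
int utf16_bom_kind(const uint8_t *buf, size_t len)
{
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        return 1;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        return 2;
    }
    return 0;
}
```

For "testing" this layout takes 16 bytes: 2 for the BOM plus 2 for each of the 7 characters, which is where the extra 2 bytes in the question's dataLength come from. Checking the first two bytes with something like `utf16_bom_kind` before decoding is one way to tell which byte order a buffer uses.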