Reading and outputting UTF-8 strings in C/Cocoa
In an Objective-C/Cocoa app, I am using C functions to open a text file, read it line by line, and pass some of the lines to a third-party function. In pseudo-code:
char *line = fgets(aFile);
library_function(line); // This function calls for a utf-8 encoded char * string
This works fine until the input file contains special characters (such as accents or the UTF-8 BOM) whereupon the library function outputs mangled characters.
However, if I do this:
char *line = fgets(aFile);
NSString *stringObj = [NSString stringWithUTF8String:line];
library_function([stringObj UTF8String]);
Then it all works fine and the string is output correctly.
What is that [NSString... line doing that I'm not?
Am I doing something wrong with how the line is fetched initially? Or is it something else entirely?
Comments (1)
UTF-8 is a multi-byte character encoding (see Wikipedia), which means some characters require multiple bytes (such as the accented ones you've run into). C's char type is a single byte, so C's definition of "character" doesn't match Unicode's. If you want to read Unicode with the standard C runtime library, you'll also need to use a Unicode conversion library, such as libiconv.
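To make the byte-versus-character mismatch concrete, here is a minimal C sketch. The helper names skip_utf8_bom and utf8_codepoint_count are hypothetical (they are not part of the C library or of Cocoa), and the code assumes its input is already valid, NUL-terminated UTF-8; it does no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a pointer just past a leading UTF-8 BOM
 * (the bytes EF BB BF), or s itself when no BOM is present. */
const char *skip_utf8_bom(const char *s)
{
    assert(s != NULL);
    if (strncmp(s, "\xEF\xBB\xBF", 3) == 0)
        return s + 3;
    return s;
}

/* Hypothetical helper: counts Unicode code points in a string that is
 * assumed to be valid UTF-8. Continuation bytes have the bit pattern
 * 10xxxxxx, so every byte that does NOT match it starts a new code point. */
size_t utf8_codepoint_count(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; ++p) {
        if ((*p & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the word "café" encoded as the bytes 63 61 66 C3 A9, strlen reports 5 because it counts bytes, while utf8_codepoint_count reports 4. Byte-oriented functions such as fgets and strlen work in units of char, not in units of Unicode characters.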
(Using wchar_t may also work; I've never researched it.)
Or you can use NSString, which already supports Unicode.