A few questions about NSFileHandle and Obj-C
I'm working with files in Obj-C. My application needs to read some large text files (e.g. 5 MB) that are encoded in UTF-16.
The first problem: how do I detect the size of the file I'm going to read from?
The second problem: when I read from the file only once, I get the right text, but when I seek or read a second time, I no longer get my original text. Here is my code segment:
NSFileHandle *sourceFile;
NSData *d1;
NSString *st1,*st2 = @"";
sourceFile = [NSFileHandle fileHandleForReadingAtPath : filePath]; // my file's size is 5 MB
for (int i = 0; i < 500; i ++) {
d1 = [sourceFile readDataOfLength:20];
st1 = [[NSString alloc] initWithData:d1 encoding:NSUTF16StringEncoding]; // converting my raw data into a UTF16 string
st2 = [st2 stringByAppendingFormat:@"%@",st1];
st1 = @"";
}
[sourceFile closeFile];
After this runs, st2 starts with some readable characters (as in the original file), but then continues with a mess of unreadable characters (e.g. 䠆⠆䀆䀆䀆ㄆ䌆✆⨆䜆). I've been up all night trying to figure it out, but couldn't :(
Comments (2)
@Neovibrant:
Sorry to prove you wrong, but UTF-16 is not always 2 bytes (16 bits) per character. As the Wikipedia article explains, every character at U+10000 and above takes 4 bytes.
So watching for an even offset is not sufficient, because you can still cut a 4-byte character in half that way.
The best way is always to use the correct encoding and leave it to the file manager to determine the size of a character.
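One way to act on this advice, sketched here under some assumptions, is to collect the raw bytes first and decode them in a single step, so that no chunk boundary can fall inside a character. The helper name `ReadUTF16Prefix` and its parameters are hypothetical, not from the original post. The sketch also assumes `maxBytes` itself does not end in the middle of a character; to read the whole file, pass the file size.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: reads up to `maxBytes` from the file at `path` in
// pieces of `chunkSize` bytes, then decodes the accumulated bytes once.
// Returns nil if the file cannot be opened or the bytes cannot be decoded.
static NSString *ReadUTF16Prefix(NSString *path, NSUInteger chunkSize, NSUInteger maxBytes) {
    NSFileHandle *handle = [NSFileHandle fileHandleForReadingAtPath:path];
    if (handle == nil) {
        return nil;
    }

    NSMutableData *buffer = [NSMutableData data];
    while (buffer.length < maxBytes) {
        NSUInteger want = MIN(chunkSize, maxBytes - buffer.length);
        NSData *chunk = [handle readDataOfLength:want];
        if (chunk.length == 0) {
            break; // end of file (or chunkSize was 0)
        }
        [buffer appendData:chunk]; // keep raw bytes; do not decode per chunk
    }
    [handle closeFile];

    // Decode once, so a byte-order mark at the start of the file (if present)
    // is seen together with all of the text that follows it.
    return [[NSString alloc] initWithData:buffer encoding:NSUTF16StringEncoding];
}
```

Compared with decoding each 20-byte chunk separately, this keeps all string conversion in one place. For very large files the same idea could be applied to larger blocks, as long as each block ends on a character boundary.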
To get the file size you can simply use the NSFileManager:
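The code that originally followed this sentence is missing from the page. A minimal sketch of what such a lookup can look like is shown below; the helper name `FileSizeAtPath` and the choice to return -1 on failure are assumptions made for illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the size in bytes of the file at `path`,
// or -1 if its attributes cannot be read (for example, the file is missing).
static long long FileSizeAtPath(NSString *path) {
    NSError *error = nil;
    NSDictionary *attributes =
        [[NSFileManager defaultManager] attributesOfItemAtPath:path error:&error];
    if (attributes == nil) {
        NSLog(@"Could not read attributes of %@: %@", path, error);
        return -1;
    }
    // -fileSize is a convenience accessor for the NSFileSize attribute.
    return (long long)[attributes fileSize];
}
```

With the question's variable, this would be called as `FileSizeAtPath(filePath)`.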
The second problem is caused by the UTF-16 encoding. In UTF-16, a character is represented by 2 or 4 bytes (http://en.wikipedia.org/wiki/UTF-16).
Let's assume you have a UTF-16 text file containing the text Hello. Assuming big-endian byte order and no byte-order mark, the bytes will be 00 48 00 65 00 6C 00 6C 00 6F.
Everything is fine if you start reading from byte 0 (or any even index): you'll get the expected result. But if you start reading from an odd byte (like 1), all characters will be garbled because the bytes are shifted, so the decoder pairs each low byte with the high byte of the next character.
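The shift can be reproduced in a few lines. The sketch below assumes the text Hello is stored as big-endian UTF-16 without a byte-order mark; the helper name `DecodeUTF16BE` is hypothetical.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: decodes `length` bytes starting at `offset`
// as big-endian UTF-16.
static NSString *DecodeUTF16BE(const unsigned char *bytes, NSUInteger offset, NSUInteger length) {
    NSData *slice = [NSData dataWithBytes:bytes + offset length:length];
    return [[NSString alloc] initWithData:slice
                                 encoding:NSUTF16BigEndianStringEncoding];
}

// Prints the same bytes decoded from an even and from an odd offset.
static void ShowShiftedDecoding(void) {
    // "Hello" in big-endian UTF-16 (assumed byte order, no BOM).
    const unsigned char hello[] = {0x00, 0x48, 0x00, 0x65, 0x00, 0x6C,
                                   0x00, 0x6C, 0x00, 0x6F};

    // Even offset: code units are 0048 0065 006C 006C 006F.
    NSLog(@"offset 0: %@", DecodeUTF16BE(hello, 0, 10));

    // Odd offset: code units become 4800 6500 6C00 6C00, unrelated characters.
    NSLog(@"offset 1: %@", DecodeUTF16BE(hello, 1, 8));
}
```

Starting at offset 1 turns the code unit 0x0048 (H) into 0x4800, which is why the misaligned read produces characters that have nothing to do with the original text.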