Reading a UTF-8 Unicode file through non-Unicode code
I have to read a UTF-8 encoded Unicode text file and write its data to another text file. The file contains tab-separated data in lines.
My reading code is C++ code without Unicode support. I read the file line by line into a string/char* and write that string as-is to the destination file. I can't change the code, so code-change suggestions are not welcome.
What I want to know is whether, while reading line by line, I can encounter a NUL terminating character ('\0') within a line, since the file is Unicode and one character can span multiple bytes.
My thinking was that it is quite possible that a NUL terminating character could be encountered within a line. Your thoughts?
Comments (2)
UTF-8 uses 1 byte for all ASCII characters, which have the same code values as in the standard ASCII encoding, and up to 4 bytes for other characters. The upper bits of each byte are reserved as control bits. For code points using more than 1 byte, the control bits are set in every byte of the sequence, so none of those bytes can be zero.
Thus there will be no 0 byte in your UTF-8 file unless the text itself contains the NUL character (U+0000).
Check Wikipedia for UTF-8
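If you want to convince yourself of this, here is a minimal sketch of the UTF-8 bit layout. The names `encode_utf8`, `has_zero_byte` and `only_nul_encodes_to_zero` are made up for this illustration and are not part of any library; the encoder does not reject surrogate code points, which does not matter for the point being made.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point (U+0000..U+10FFFF) as UTF-8.
// Hypothetical helper for illustration only; surrogates are not rejected.
std::string encode_utf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (plain ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// True if any byte of the encoded form is 0x00.
bool has_zero_byte(const std::string& bytes) {
    return bytes.find('\0') != std::string::npos;
}

// Walk every code point except U+0000 and confirm that none of them
// produces a zero byte when encoded.
bool only_nul_encodes_to_zero() {
    for (std::uint32_t cp = 1; cp <= 0x10FFFF; ++cp) {
        if (has_zero_byte(encode_utf8(cp))) {
            return false;
        }
    }
    return true;
}
```

Every lead byte starts with `11` and every continuation byte starts with `10`, so the only way to get a 0x00 byte is to encode U+0000 itself.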
Very unlikely: all the bytes in a UTF-8 multi-byte sequence have the high bit set to 1, so a zero byte can only come from an actual NUL character (U+0000) in the text.