Reading a file with Cyrillic characters
I have to open a file containing Cyrillic characters. I've encoded the file as UTF-8. Here is an example:
en: Couldn't your family afford a costume for you
ru: Не ваша семья позволить себе костюм для вас
This is how I open the file:
ifstream readFile(fileData.c_str());
while (!readFile.eof())
{
    std::getline(readFile, buffer);
    ...
}
The first problem: there are some extra symbols before the text 'en' (I saw this in the debugger):
"ï»¿en: least"
The other problem is the Cyrillic characters:
" ru: Ð½Ð°Ð¸Ð¼ÐµÐ½ÑŒÑˆÐ¸Ð¹"
What's wrong?
Comments (4)
The "ï»¿" before "en" is a faux-BOM, the result of encoding a U+FEFF BYTE ORDER MARK character into UTF-8.
Since UTF-8 is an encoding that does not have a byte order, the faux-BOM shouldn't ever be used, but unfortunately quite a bit of existing software (especially in the MS world) writes it nonetheless. Load the messages file into a text editor and save it again as UTF-8, choosing a “UTF-8 without BOM” encoding if one is specifically listed.
The "Ð½Ð°Ð¸Ð¼ÐµÐ½ÑŒÑˆÐ¸Ð¹" text is what you get when you have a UTF-8 byte string (representing наименьший) and display it as if it were a Code Page 1252 (Windows Western European) byte string. It's not an input problem; you have read the string in correctly and have a UTF-8 byte string. But then, in code you haven't quoted, it gets output as cp1252.
If you're just printing it to the console, this is to be expected, as the console uses a legacy system code page by default (on a Western Windows install, 1252 is the ANSI code page, while the console itself normally defaults to an OEM code page such as 437 or 850), not UTF-8. If you need to send Unicode to the console, you'll have to convert the bytes to native-Unicode wchar_t strings and write them from there. I don't know what the final destination for your strings is, though. If you're just going to write them to another file, you could keep them as bytes and not care what encoding they're in.
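To make the "convert the bytes to Unicode" step concrete, here is a rough, portable sketch of decoding UTF-8 bytes into code points. It is an illustration only: decodeUtf8 is a hypothetical name, the result is a std::u32string rather than Windows wchar_t, and the validation is minimal (overlong forms and surrogates are not rejected). On Windows the same step would normally be done with the MultiByteToWideChar API and CP_UTF8 instead of hand-written code.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes a UTF-8 byte string into Unicode code points.
// Throws std::runtime_error on malformed or truncated input.
std::u32string decodeUtf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes that follow

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + extra >= bytes.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and adds 6 bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Each Cyrillic letter in the example occupies two bytes in UTF-8, which is why every letter shows up as a pair of Western European characters when the bytes are read as cp1252.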
I suppose your OS is Windows. There are several simple ways:
Note: for console printing you must use WinAPI functions to convert UTF-8 to cp866 (my default Cyrillic Windows encoding is cp1251), because the Windows console uses a DOS (OEM) code page by default.
Note: when writing to a file, you need to know which encoding the file uses.
Use libiconv to convert the text to a usable encoding after reading.
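As a hedged illustration of that suggestion, the sketch below wraps the standard iconv_open / iconv / iconv_close calls in one function. The name convertEncoding is hypothetical, and the code assumes the glibc-style prototype in which the input pointer is char** (some iconv builds declare it as const char**). Which encoding names are accepted, for example "CP1251" or "CP866", depends on the iconv implementation installed.

```cpp
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` from encoding `from` to encoding `to`.
// Throws std::runtime_error if the conversion is unsupported or the input
// is invalid for the source encoding.
std::string convertEncoding(const std::string& input,
                            const char* from, const char* to)
{
    iconv_t cd = iconv_open(to, from);  // note: target encoding comes first
    if (cd == (iconv_t)(-1))
        throw std::runtime_error("iconv_open failed: unsupported conversion");

    std::string output;
    char chunk[256];
    char* in = const_cast<char*>(input.data());  // iconv does not modify it
    std::size_t inLeft = input.size();

    while (inLeft > 0) {
        char* out = chunk;
        std::size_t outLeft = sizeof chunk;
        const std::size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
        const int err = errno;  // save before any other call can change it
        output.append(chunk, sizeof chunk - outLeft);
        // E2BIG only means the chunk is full; loop again to continue.
        if (rc == static_cast<std::size_t>(-1) && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or incomplete input");
        }
    }
    // Stateful target encodings would additionally need a final flush call;
    // that is omitted here because the common single-byte and UTF targets
    // are stateless.
    iconv_close(cd);
    return output;
}
```

A call such as convertEncoding(buffer, "UTF-8", "CP1251") would then produce bytes in the Windows Cyrillic code page, provided the installed iconv recognises that name.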
Use ICU to convert the text.