Reading Russian characters (Unicode) with basic_ifstream
Is this even possible? I've been trying to read a simple file that contains Russian text, and it clearly isn't working.
I've called file.imbue(loc), and at that point loc is correct (Russian_Russia.1251).
buf is of type basic_string<wchar_t>.
I'm using basic_ifstream<wchar_t> because this code is a template (so technically it's basic_ifstream<T>, but in this case T = wchar_t).
This all works perfectly with English characters...
    while (file >> ch)
    {
        if (isalnum(ch, loc))
        {
            buf += ch;
        }
        else if (!buf.empty())
        {
            // Do stuff with buf.
            buf.clear();
        }
    }
I don't see why I'm getting garbage when reading Russian characters. For example, if the file contains хеы хеы хеы, I get "яюE", 5(square), K(square), etc.
Code page 1251 isn't a Unicode encoding; it's the single-byte Windows Cyrillic code page (related to, but not the same as, ISO 8859-5). Unfortunately, chances are that your iostream implementation doesn't support UTF-16 "out of the box." This is a bit strange, since doing so would just involve passing the data through unchanged, but most still don't support it. For what it's worth, if I recall correctly, C++0x is supposed to add this.
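The garbage shown in the question looks like UTF-16 little-endian data being decoded as code page 1251. In that code page the bytes 0xFF 0xFE are я and ю, and 0xFF 0xFE is also the UTF-16LE byte order mark; х (U+0445) is stored as 0x45 0x04, which shows up as "E" followed by an unprintable character. A quick way to check such a guess is to look at the first raw bytes of the file. The following is a minimal sketch, not part of the original code; the names Bom and detect_bom are made up for illustration, and the input is assumed to be the file contents read in binary mode.

```cpp
#include <cstddef>
#include <string>

// Encodings that can be recognised from a byte order mark (BOM).
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Report which BOM, if any, starts the given raw bytes.
// Assumption: `bytes` holds the file contents read in binary mode.
// Note: a UTF-32LE BOM (FF FE 00 00) would also be reported as Utf16LE here.
Bom detect_bom(const std::string& bytes) {
    auto at = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return Bom::Utf8;
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return Bom::Utf16LE;
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;
}
```

If detect_bom reports Utf16LE for the file in question, imbuing a 1251 locale cannot help, because the bytes on disk are not in that code page at all.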
There are still lots of STL implementations that don't have a std::codecvt that can handle Unicode encodings. Their wchar_t templated streams will default to the system code page, even though they are otherwise Unicode enabled for, say, the filename. If the file actually contains UTF-8, they'll produce junk. Maybe this will help.
Iostreams, by default, assume any data on disk is in a non-Unicode format, for compatibility with existing programs that do not handle Unicode. C++0x will fix this by allowing native Unicode support, but at this time there is a std::codecvt<wchar_t, char, mbstate_t> used by iostreams to convert the normal char data into wide characters for you. See cplusplus.com's description of std::codecvt.
If you want to use Unicode with iostreams, you need to specify a codecvt facet of the form std::codecvt<wchar_t, wchar_t, mbstate_t>, which just passes the data through unchanged.
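Since codecvt support for UTF-16 varies between standard library implementations, one alternative that avoids facets entirely is to read the file as raw bytes and decode it by hand. The sketch below assumes the file is UTF-16 little-endian, as the "яю" prefix in the question suggests; read_bytes and decode_utf16le are hypothetical helper names, not anything from the original code or from a library.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Read a whole file as raw bytes; binary mode prevents newline translation.
std::string read_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Decode UTF-16LE bytes into UTF-32 code points.
// A leading BOM (FF FE) is skipped, unpaired surrogates become U+FFFD,
// and a trailing odd byte is ignored.
std::u32string decode_utf16le(const std::string& bytes) {
    // Combine two bytes, low byte first, into one 16-bit code unit.
    auto unit = [&](std::size_t i) -> char32_t {
        return static_cast<char32_t>(
            static_cast<unsigned char>(bytes[i]) |
            (static_cast<unsigned char>(bytes[i + 1]) << 8));
    };
    std::u32string out;
    std::size_t i = 0;
    if (bytes.size() >= 2 && unit(0) == 0xFEFF) i = 2;  // skip the BOM
    for (; i + 1 < bytes.size(); i += 2) {
        char32_t u = unit(i);
        // A high surrogate followed by a low surrogate forms one code point.
        if (u >= 0xD800 && u <= 0xDBFF && i + 3 < bytes.size()) {
            char32_t lo = unit(i + 2);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                out.push_back(static_cast<char32_t>(
                    0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)));
                i += 2;  // the low surrogate has been consumed as well
                continue;
            }
        }
        if (u >= 0xD800 && u <= 0xDFFF) u = 0xFFFD;  // unpaired surrogate
        out.push_back(u);
    }
    return out;
}
```

With this approach, decode_utf16le(read_bytes(path)) yields code points that can be split into words afterwards. Classifying Cyrillic letters as alphanumeric would still need a locale or a Unicode-aware library, which this sketch does not cover.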
I am not sure, but you can try to call setlocale(LC_CTYPE, "");