Reading Russian characters (Unicode) using basic_ifstream<wchar_t>

Posted on 2024-08-25 16:37:01

Is this even possible? I've been trying to read a simple file that contains Russian, and it's clearly not working.

I've called file.imbue(loc), and at that point loc is correct (Russian_Russia.1251).
buf is of type basic_string<wchar_t>.

I'm using basic_ifstream<wchar_t> because this is a template (so technically it is basic_ifstream<T>, but in this case T=wchar_t).

This all works perfectly with English characters...

while (file >> ch)
{
    if(isalnum(ch, loc))
    {
        buf += ch;
    }
    else if(!buf.empty())
    {
        // Do stuff with buf.
        buf.clear();
    }
}

I don't see why I'm getting garbage when reading Russian characters. For example, if the file contains хеы хеы хеы, I get "яюE", 5(square), K(square), etc.
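The reported output is consistent with the file being stored as UTF-16 little-endian with a byte-order mark while the stream decodes it one byte at a time as code page 1251: the BOM bytes FF FE are "яю" in CP1251, and х (U+0445) is stored as 45 04, which reads as "E" followed by an unprintable byte. This is an inference from the symptoms, not something the question confirms. A minimal sketch of decoding such bytes by hand follows; the helper name decode_utf16le is hypothetical and not part of any library.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decode UTF-16LE bytes (optional FF FE BOM) into
// Unicode code points. Malformed input (unpaired surrogate, odd trailing
// byte) is replaced by U+FFFD.
std::u32string decode_utf16le(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;

    // Skip the little-endian byte-order mark if present.
    if (bytes.size() >= 2 &&
        static_cast<unsigned char>(bytes[0]) == 0xFF &&
        static_cast<unsigned char>(bytes[1]) == 0xFE)
    {
        i = 2;
    }

    // Combine two bytes (low byte first) into one 16-bit code unit.
    auto unit_at = [&bytes](std::size_t pos) -> std::uint32_t {
        std::uint32_t lo = static_cast<unsigned char>(bytes[pos]);
        std::uint32_t hi = static_cast<unsigned char>(bytes[pos + 1]);
        return lo | (hi << 8);
    };

    while (i + 1 < bytes.size())
    {
        std::uint32_t unit = unit_at(i);
        i += 2;

        // A high surrogate followed by a low surrogate forms one code point.
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < bytes.size())
        {
            std::uint32_t low = unit_at(i);
            if (low >= 0xDC00 && low <= 0xDFFF)
            {
                i += 2;
                out += static_cast<char32_t>(
                    0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
                continue;
            }
        }

        // Any surrogate that reaches this point is unpaired.
        if (unit >= 0xD800 && unit <= 0xDFFF)
            unit = 0xFFFD;
        out += static_cast<char32_t>(unit);
    }

    // An odd trailing byte cannot form a code unit.
    if (i < bytes.size())
        out += static_cast<char32_t>(0xFFFD);

    return out;
}
```

Fed the bytes FF FE 45 04 35 04 4B 04, this yields the three code points U+0445, U+0435, U+044B, i.e. хеы.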

Comments (4)

不可一世的女人 2024-09-01 16:37:01

Code page 1251 isn't a Unicode encoding; it's a single-byte Windows Cyrillic code page (similar in purpose to ISO 8859-5, but not the same encoding). Unfortunately, chances are that your iostream implementation doesn't support UTF-16 "out of the box." This is a bit strange, since with a 16-bit wchar_t doing so would mostly involve passing the data through unchanged, but most still don't support it. For what it's worth, if I recall correctly, C++0x is supposed to add this.

雪若未夕 2024-09-01 16:37:01

There are still lots of standard library implementations that don't ship a std::codecvt that can handle Unicode encodings. Their wchar_t streams default to the system code page for file contents, even though they are otherwise Unicode-enabled for, say, the filename. If the file actually contains UTF-8, they'll produce junk. Maybe this will help.
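Before choosing a conversion it helps to know what the file really contains. A rough sketch that classifies the first bytes of a file by byte-order mark is shown here; the names BomKind and detect_bom are hypothetical, and a missing BOM proves nothing about the encoding (UTF-8 files in particular often have none).

```cpp
#include <cstddef>
#include <string>

enum class BomKind { None, Utf8, Utf16LE, Utf16BE };

// Hypothetical helper: inspect the leading bytes of a file, read in
// binary mode, and report which byte-order mark they start with.
// Note: a UTF-32LE BOM (FF FE 00 00) also begins with FF FE and would
// be reported as Utf16LE by this simplified check.
BomKind detect_bom(const std::string& head)
{
    auto b = [&head](std::size_t i) {
        return static_cast<unsigned char>(head[i]);
    };

    if (head.size() >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF)
        return BomKind::Utf8;
    if (head.size() >= 2 && b(0) == 0xFF && b(1) == 0xFE)
        return BomKind::Utf16LE;
    if (head.size() >= 2 && b(0) == 0xFE && b(1) == 0xFF)
        return BomKind::Utf16BE;
    return BomKind::None;
}
```

In the question's case the first two characters read were "яю", which are the CP1251 renderings of FF FE, so this check would report Utf16LE.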

鸠魁 2024-09-01 16:37:01

Iostreams, by default, assume any data on disk is in a non-Unicode format, for compatibility with existing programs that do not handle Unicode. C++0x will fix this by allowing native Unicode support, but at this time iostreams use a std::codecvt<wchar_t, char, mbstate_t> to convert the normal char data into wide characters for you. See cplusplus.com's description of std::codecvt.

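If you want to use Unicode with iostreams, you need to imbue the stream with a locale whose codecvt facet matches the file's actual encoding. A facet of the form std::codecvt<wchar_t, wchar_t, mbstate_t>, which just passes data through unchanged, is sometimes suggested, but a wide file stream looks up std::codecvt<wchar_t, char, mbstate_t>, so the replacement facet has to be of that form.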
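One way to supply such a facet, assuming a C++11 standard library, is std::codecvt_utf16 from <codecvt>. That header was deprecated in C++17, so treat the following as a sketch for toolchains that still provide it, not as the recommended long-term approach. The function name read_utf16_file is hypothetical, and the little-endian fallback is an assumption based on the symptoms in the question.

```cpp
#include <codecvt>   // std::codecvt_utf16 (C++11, deprecated since C++17)
#include <fstream>
#include <locale>
#include <string>

// Sketch: read a whole UTF-16 file into a std::wstring.
// Where wchar_t is 16 bits (e.g. Windows) this facet produces UCS-2, so
// characters outside the Basic Multilingual Plane are not handled.
std::wstring read_utf16_file(const char* path)
{
    // consume_header: skip a leading BOM and take the byte order from it.
    // little_endian:  byte order to assume when there is no BOM.
    using Utf16Facet = std::codecvt_utf16<
        wchar_t, 0x10FFFF,
        static_cast<std::codecvt_mode>(std::consume_header |
                                       std::little_endian)>;

    std::wifstream file;
    // Imbue before opening; the locale takes ownership of the facet.
    file.imbue(std::locale(file.getloc(), new Utf16Facet));
    // Binary mode, so newline translation cannot split a 2-byte code unit.
    file.open(path, std::ios::binary);

    std::wstring text;
    wchar_t ch;
    while (file.get(ch))   // get() does not skip whitespace, unlike >>
        text += ch;
    return text;
}
```

Character classification such as isalnum(ch, loc) still depends on loc, so the Russian locale remains relevant for that step even after the decoding facet is replaced.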

怎会甘心 2024-09-01 16:37:01

I am not sure, but you can try to call setlocale(LC_CTYPE, "");
