Reading UTF-8 text and converting to UTF-16 using standard C++ wifstream

Posted on 2025-01-19 06:42:04

I'd like to read some text from a file that uses UTF-8 encoding and convert it to UTF-16, using std::wifstream, something like this:

//
// Read UTF-8 text and convert to UTF-16
//
std::wifstream src;
src.imbue(std::locale("???"));          // UTF-8 ???
src.open("some_text_file_using_utf8");
std::wstring line;                      // UTF-16 string
while (std::getline(src, line))
{
    // ... process the UTF-16 string ...
}

Is there a standard locale name for the UTF-8 conversion?
Is it possible to achieve that goal using std::locale?

I'm using Visual Studio 2013.


NOTE:

I know that I/O streams tend to be slow, and that it's possible to use Win32 memory-mapped files for faster reading, the MultiByteToWideChar() Win32 API for the conversion, etc.
But for this particular case I'd like a solution that only uses standard C++ and its standard library, without Boost.

If the C++ standard library just can't do that, the second option would be to use Boost; in this case, which Boost library should I use?

Comments (1)

℡寂寞咖啡 2025-01-26 06:42:04

This works on Windows with Visual Studio, I think as far back as VS2010:

#include <locale>  // std::locale
#include <codecvt> // std::codecvt_utf8_utf16, std::consume_header

// Imbue before opening the file; the locale takes ownership of the facet.
src.imbue(std::locale(
    src.getloc(),
    new std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header>));

Since Windows uses a 16-bit wchar_t and universally uses UTF-16 as the wide character encoding, this works well in that environment. (Because I'm assuming a Windows environment, my example includes std::consume_header to handle the Windows convention of prefixing UTF-8 data with a header, i.e. the byte order mark EF BB BF.)
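
As a rough end-to-end sketch of this approach, the facet can be imbued before the file is opened and the lines collected into wide strings. The function name `read_utf8_lines` is a hypothetical helper and not part of the original answer. The sketch assumes a toolchain that still ships `<codecvt>`, which has been deprecated since C++17, so newer compilers may emit warnings.

```cpp
#include <codecvt>  // std::codecvt_utf8_utf16, std::consume_header
#include <fstream>  // std::wifstream
#include <locale>   // std::locale
#include <string>
#include <vector>

// Hypothetical helper: reads a UTF-8 file and returns its lines as wide
// strings holding UTF-16 code units. Returns an empty vector if the file
// cannot be opened.
std::vector<std::wstring> read_utf8_lines(const std::string& path)
{
    std::wifstream src;

    // Imbue before open(); the locale owns the facet and deletes it.
    // consume_header makes the facet skip a leading UTF-8 BOM (EF BB BF).
    src.imbue(std::locale(
        src.getloc(),
        new std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header>));
    src.open(path);

    std::vector<std::wstring> lines;
    std::wstring line;
    while (std::getline(src, line))
        lines.push_back(line);
    return lines;
}
```

Called as `read_utf8_lines("some_text_file_using_utf8")`, this returns one `std::wstring` per line. Error handling is deliberately minimal here: a file that cannot be opened simply yields no lines.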

On other platforms wchar_t is generally 32-bit and, while you can store UTF-16 code unit values in such 32-bit code units, no other code will be written to expect that. On a platform with 32-bit wchar_t you might prefer std::codecvt_utf8<wchar_t>, which produces UTF-32 wide strings.


For portability, ideally what you'd want is a codecvt facet that knows how to convert from UTF-8 to either the locale's wchar_t encoding or the wide execution encoding. The problem with that, however, is that there's no requirement for any wide encoding to support the entire range of characters representable in UTF-8. The bottom line is that wchar_t, as specified, isn't particularly useful for portable code.

However, one trick that might be useful, if you're sticking to platforms that use UTF-16 or UTF-32 depending on the size of wchar_t, is:

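#include <climits> // CHAR_BIT
#include <codecvt> // std::codecvt_utf8, std::codecvt_utf8_utf16
#include <locale>  // std::locale

// Pick a UTF-8 conversion facet based on the width of wchar_t in bits.
template <int N> struct get_codecvt_utf8_wchar_impl;
// 16-bit wchar_t: convert UTF-8 to UTF-16.
template <> struct get_codecvt_utf8_wchar_impl<16> {
  using type = std::codecvt_utf8_utf16<wchar_t>;
};
// 32-bit wchar_t: convert UTF-8 to UTF-32.
template <> struct get_codecvt_utf8_wchar_impl<32> {
  using type = std::codecvt_utf8<wchar_t>;
};

using codecvt_utf8_wchar = get_codecvt_utf8_wchar_impl<
    sizeof(wchar_t) * CHAR_BIT>::type;

src.imbue(std::locale(src.getloc(), new codecvt_utf8_wchar));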
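
The same selection could also be written with `std::conditional` from `<type_traits>`. The sketch below is an alternative formulation, not the answer's own code, and the names `utf8_wchar_facet` and `first_line_utf8` are made up for illustration. It assumes `wchar_t` is either 16 or 32 bits wide and that `<codecvt>` (deprecated since C++17) is available.

```cpp
#include <climits>      // CHAR_BIT
#include <codecvt>      // std::codecvt_utf8, std::codecvt_utf8_utf16
#include <fstream>      // std::wifstream
#include <locale>       // std::locale
#include <string>
#include <type_traits>  // std::conditional

static_assert(sizeof(wchar_t) * CHAR_BIT == 16 ||
              sizeof(wchar_t) * CHAR_BIT == 32,
              "this sketch assumes a 16-bit or 32-bit wchar_t");

// 16-bit wchar_t selects the UTF-16 facet; 32-bit selects the UTF-32 one.
using utf8_wchar_facet = std::conditional<
    sizeof(wchar_t) * CHAR_BIT == 16,
    std::codecvt_utf8_utf16<wchar_t>,
    std::codecvt_utf8<wchar_t>>::type;

// Hypothetical helper: returns the first line of a UTF-8 file as a wide
// string (empty if the file cannot be opened or has no content).
std::wstring first_line_utf8(const std::string& path)
{
    std::wifstream src;
    src.imbue(std::locale(src.getloc(), new utf8_wchar_facet));
    src.open(path);

    std::wstring line;
    std::getline(src, line);
    return line;
}
```

With this selection, a character outside the Basic Multilingual Plane such as U+1F600 should come back as two `wchar_t` code units (a surrogate pair) where `wchar_t` is 16-bit, and as a single code unit where it is 32-bit.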

You can also use char16_t and char32_t, which would lend themselves to portable code; however, the standard is missing a few bits needed to make iostreams usable with these character types, and implementations don't fully support what is specified.

Visual Studio, I think, still implements char16_t and char32_t as typedefs, so the template specializations using them don't work (the specializations do exist if you look in the headers; they're just ifdef'd out because the compiler can't handle them). libstdc++ doesn't implement the template specializations yet, even though it supports char16_t and char32_t as real types. The most complete implementation I know of is libc++ with a suitable compiler (gcc or clang), but even that is still missing the <cuchar> header.

Since implementation support is limited, that more or less prevents portable code from doing much with these types besides using them as a consistent representation in user code across platforms (though that is useful even on its own).
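
One way to use char16_t as that consistent representation, without relying on char16_t iostreams, might be to read the raw bytes with a narrow stream and convert them in memory. The sketch below uses `std::wstring_convert` (C++11, deprecated since C++17) and assumes a standard library that provides `<codecvt>` with char16_t support; whether it builds with Visual Studio is not something the answer confirms. `utf8_to_utf16` is a hypothetical helper name.

```cpp
#include <codecvt>  // std::codecvt_utf8_utf16
#include <locale>   // std::wstring_convert
#include <string>

// Hypothetical helper: converts a UTF-8 byte string to UTF-16.
// from_bytes() throws std::range_error if the conversion fails.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}
```

The input string could be filled from a plain `std::ifstream`, which keeps the file I/O independent of the width of wchar_t on the platform.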
