Reading a Unicode UTF-8 file into wstring

Posted on 2024-10-13 13:45:22, 72 words, 4 views, 0 comments

How can I read a Unicode (UTF-8) file into wstring(s) on the Windows platform?

Comments (7)

左秋 2024-10-20 13:45:22

With C++11 support, you can use the std::codecvt_utf8 facet, which encapsulates conversion between a UTF-8 encoded byte string and a UCS-2 or UCS-4 character string, and which can be used to read and write UTF-8 files, both text and binary.

To use a facet, you usually create a locale object, which encapsulates culture-specific information as a set of facets that collectively define a specific localized environment. Once you have a locale object, you can imbue your stream with it:

#include <sstream>
#include <fstream>
#include <codecvt>

std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename);
    wif.imbue(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
    std::wstringstream wss;
    wss << wif.rdbuf();
    return wss.str();
}

which can be used like this:

std::wstring wstr = readFile("a.txt");

Alternatively, you can set the global C++ locale before you create any streams. All future calls to the std::locale default constructor then return a copy of the global C++ locale, so you no longer need to imbue each stream explicitly:

std::locale::global(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
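
As a variation on the code above, here is a minimal sketch that avoids the MSVC-specific std::locale::empty() by starting from a copy of the global locale, imbues the stream before opening the file, and passes std::consume_header so that a leading UTF-8 byte order mark is skipped. The function name readUtf8File is hypothetical. Keep in mind that <codecvt> is deprecated since C++17, and that with a 16-bit wchar_t (as on Windows) std::codecvt_utf8<wchar_t> produces UCS-2, so characters outside the Basic Multilingual Plane are not converted; std::codecvt_utf8_utf16<wchar_t> is the usual substitute in that case.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: reads a UTF-8 text file into a std::wstring.
// Returns an empty string if the file cannot be opened.
std::wstring readUtf8File(const char* filename)
{
    // UTF-8 facet that also skips a leading byte order mark, if present.
    typedef std::codecvt_utf8<wchar_t, 0x10ffff, std::consume_header> Utf8Facet;

    std::wifstream wif;
    // The locale takes ownership of the facet and deletes it when done.
    wif.imbue(std::locale(std::locale(), new Utf8Facet));
    wif.open(filename);
    if (!wif.is_open())
        return std::wstring();

    // Copy the whole (already converted) stream into a wide string.
    std::wstringstream wss;
    wss << wif.rdbuf();
    return wss.str();
}
```

It is called the same way as before: std::wstring wstr = readUtf8File("a.txt");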
悲凉≈ 2024-10-20 13:45:22

According to a comment by @Hans Passant, the simplest way is to use _wfopen_s. Open the file with the mode string "rt, ccs=UTF-8".

Here is another pure C++ solution that works at least with VC++ 2010:

#include <locale>
#include <codecvt>
#include <string>
#include <fstream>
#include <cstdlib>

int main() {
    const std::locale empty_locale = std::locale::empty();
    typedef std::codecvt_utf8<wchar_t> converter_type;
    const converter_type* converter = new converter_type;
    const std::locale utf8_locale = std::locale(empty_locale, converter);
    std::wifstream stream(L"test.txt");
    stream.imbue(utf8_locale);
    std::wstring line;
    std::getline(stream, line);
    std::system("pause");
}

Except for locale::empty() (here locale::global() might work as well) and the wchar_t* overload of the basic_ifstream constructor, this should even be pretty standard-compliant (where “standard” means C++0x, of course).

何必那么矫情 2024-10-20 13:45:22

Here's a platform-specific function for Windows only:

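#include <cstdio>      // _wfopen, fread, fclose
#include <string>
#include <sys/stat.h>  // _wstat, struct _stat

// Returns the file size in bytes, or 0 if the file cannot be queried.
size_t GetSizeOfFile(const std::wstring& path)
{
    struct _stat fileinfo;
    if (_wstat(path.c_str(), &fileinfo) != 0)
        return 0;
    return static_cast<size_t>(fileinfo.st_size);
}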

std::wstring LoadUtf8FileToString(const std::wstring& filename)
{
    std::wstring buffer;            // stores file contents
    FILE* f = _wfopen(filename.c_str(), L"rtS, ccs=UTF-8");

    // Failed to open file
    if (f == NULL)
    {
        // ...handle some error...
        return buffer;
    }

    size_t filesize = GetSizeOfFile(filename);

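    // Read the entire file into memory. The size in bytes is an upper bound
    // on the number of wchar_t values produced by the UTF-8 conversion.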
    if (filesize > 0)
    {
        buffer.resize(filesize);
        size_t wchars_read = fread(&(buffer.front()), sizeof(wchar_t), filesize, f);
        buffer.resize(wchars_read);
        buffer.shrink_to_fit();
    }

    fclose(f);

    return buffer;
}

Use like so:

std::wstring mytext = LoadUtf8FileToString(L"C:\\MyUtf8File.txt");

Note that the entire file is loaded into memory, so you might not want to use this for very large files.

╰つ倒转 2024-10-20 13:45:22
#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <cstdlib>

int main()
{
    std::wifstream wif("filename.txt");
    wif.imbue(std::locale("zh_CN.UTF-8"));

    std::wcout.imbue(std::locale("zh_CN.UTF-8"));
    std::wcout << wif.rdbuf();
}
清风疏影 2024-10-20 13:45:22

I recently had to deal with all these encodings and solved it this way. It is better to use std::u32string, since it has a stable element size on all platforms, and most fonts work with the UTF-32 format. (The file itself should still be UTF-8.)

std::u32string readFile(std::string filename) {
    std::basic_ifstream<char32_t> fin(filename);
    std::u32string str{};
    std::getline(fin, str, U'\0');
    return str;
}

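Feel free to use standard stream functions other than gcount, and store the result of tellg only in a pos_type. Also, be sure to pass a delimiter to std::getline. If you don't, the function throws std::bad_cast, because the default delimiter comes from widen('\n'), which needs a std::ctype<char32_t> facet that the stream's locale does not provide.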

滥情哥ㄟ 2024-10-20 13:45:22

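This question was addressed in "Confused about C++'s std::wstring, UTF-16, UTF-8 and displaying strings in a windows GUI". In sum, on Windows wchar_t is 16 bits wide, so wstring is based on UCS-2, the strictly two-byte predecessor of UTF-16. In practice the Windows API treats the contents as UTF-16, where characters outside the Basic Multilingual Plane take two code units. Either way, I believe this covers Arabic.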
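
To make the two-byte point concrete, here is a small sketch (the function name countCodePoints is hypothetical) that counts code points in a UTF-16 string. An Arabic letter such as U+0645 fits in a single 16-bit unit, while a character outside the Basic Multilingual Plane, such as U+1F600, occupies a surrogate pair, which is exactly the case UCS-2 cannot represent.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-16 string. A high surrogate
// (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF) is one
// code point stored in two 16-bit units.
std::size_t countCodePoints(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        const bool high = u >= 0xD800 && u <= 0xDBFF;
        if (high && i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            ++i;  // skip the low half of the pair
        ++count;
    }
    return count;
}
```

For u"\u0645" both size() and countCodePoints return 1; for u"\U0001F600" size() is 2 but countCodePoints returns 1.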

梦在深巷 2024-10-20 13:45:22

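This is a bit raw, but how about reading the file as plain old bytes and then casting the byte buffer to wchar_t*? Note that a cast only reinterprets the bytes, so this works only when the file is already stored in wchar_t's encoding (UTF-16LE on Windows); a UTF-8 file still needs a real conversion step.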

Something like:

#include <fstream>
#include <string>

std::wstring ReadFileIntoWstring(const std::wstring& filepath)
{
    std::wstring wstr;
    // Opening a stream with a wchar_t* path is available on Windows (MSVC).
    std::ifstream file(filepath.c_str(), std::ios::in | std::ios::binary | std::ios::ate);
    if (!file)
        return wstr;
    size_t size = (size_t)file.tellg();
    file.seekg(0, std::ios::beg);
    char* buffer = new char[size];
    file.read(buffer, size);
    // The buffer is not null-terminated, so pass the length explicitly.
    wstr.assign((wchar_t*)buffer, size / sizeof(wchar_t));
    file.close();
    delete[] buffer;
    return wstr;
}
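
If the file really is UTF-8, the bytes have to be decoded rather than cast. Below is a minimal, portable sketch of that step that uses no platform API. The names appendCodePoint, utf8ToWstring, and readUtf8FileAsWstring are hypothetical, and the decoder only checks the lead/continuation byte structure: it does not reject overlong encodings or encoded surrogates. On Windows, MultiByteToWideChar with CP_UTF8 is the usual API for the same job.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Append one code point. With a 16-bit wchar_t (Windows), code points
// above U+FFFF are written as a UTF-16 surrogate pair.
inline void appendCodePoint(std::wstring& out, std::uint32_t cp)
{
    if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
        cp -= 0x10000;
        out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
    } else {
        out.push_back(static_cast<wchar_t>(cp));
    }
}

// Decode UTF-8 bytes into a wstring; throws std::runtime_error on
// structurally malformed input. A leading byte order mark is skipped.
std::wstring utf8ToWstring(const std::string& bytes)
{
    std::wstring out;
    std::size_t i = 0;
    if (bytes.size() >= 3 && bytes.compare(0, 3, "\xEF\xBB\xBF") == 0)
        i = 3;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (extra > bytes.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        appendCodePoint(out, cp);
        i += extra + 1;
    }
    return out;
}

// Read the whole file as raw bytes, then decode it.
std::wstring readUtf8FileAsWstring(const std::string& path)
{
    std::ifstream file(path.c_str(), std::ios::in | std::ios::binary);
    if (!file)
        throw std::runtime_error("cannot open file");
    const std::string bytes((std::istreambuf_iterator<char>(file)),
                            std::istreambuf_iterator<char>());
    return utf8ToWstring(bytes);
}
```

As with the other whole-file approaches in this thread, the entire file is held in memory twice for a moment (raw bytes plus the decoded string), so it is best suited to small and medium files.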