
Read a Unicode UTF-8 file into a wstring

Posted on 2024-10-13 13:45:22

How can I read a Unicode (UTF-8) file into wstring(s) on the Windows platform?

左秋 2024-10-20 13:45:22

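With C++11 support, you can use the std::codecvt_utf8 facet, which encapsulates conversion between a UTF-8 encoded byte string and a UCS-2 or UCS-4 character string, and which can be used to read and write UTF-8 files, both text and binary.

To use a facet, you usually create a locale object, which encapsulates culture-specific information as a set of facets that collectively define a specific localized environment. Once you have a locale object, you can imbue your stream buffer with it: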

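#include <sstream>
#include <fstream>
#include <codecvt>
#include <locale>
#include <string>

std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename);
    // std::locale::empty() is an MSVC extension; the locale takes ownership of the facet
    wif.imbue(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
    std::wstringstream wss;
    wss << wif.rdbuf();   // read the whole file through the UTF-8 converter
    return wss.str();
}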

It can be used like this:

std::wstring wstr = readFile("a.txt");

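Alternatively, you can set the global C++ locale before you work with string streams. All future calls to the std::locale default constructor then return a copy of the global C++ locale, so you don't need to imbue stream buffers with it explicitly: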

std::locale::global(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
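
Files saved by Windows editors often begin with a UTF-8 byte order mark (EF BB BF), which the function above would pass through as a leading U+FEFF character. As a minimal sketch, assuming the standard std::consume_header mode flag of std::codecvt_utf8, the variant below skips that mark. The name readFileSkippingBom is made up for illustration, and the default-constructed std::locale() stands in for the MSVC-specific std::locale::empty(). Note that <codecvt> is deprecated since C++17, and that where wchar_t is 16 bits wide (as on Windows) std::codecvt_utf8_utf16<wchar_t> may be the better fit, because std::codecvt_utf8<wchar_t> then only covers UCS-2.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: reads a UTF-8 file into a wstring, dropping a leading BOM.
std::wstring readFileSkippingBom(const char* filename)
{
    // consume_header tells the facet to swallow an initial EF BB BF sequence.
    typedef std::codecvt_utf8<wchar_t, 0x10ffff, std::consume_header> Utf8Facet;

    std::wifstream wif;
    // Imbue before opening so the conversion is in place from the first byte.
    wif.imbue(std::locale(std::locale(), new Utf8Facet));
    wif.open(filename, std::ios::binary);

    std::wstringstream wss;
    wss << wif.rdbuf();   // an unreadable file simply yields an empty string
    return wss.str();
}
```

A call such as readFileSkippingBom("a.txt") then returns the text without the mark, whether or not the file had one.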

悲凉≈ 2024-10-20 13:45:22

According to a comment by @Hans Passant, the simplest way is to use _wfopen_s. Open the file with the mode "rt, ccs=UTF-8".

Here is another pure C++ solution that works at least with VC++ 2010:

#include <locale>
#include <codecvt>
#include <string>
#include <fstream>
#include <cstdlib>

int main() {
    // locale::empty() and the wchar_t* ifstream constructor are MSVC extensions
    const std::locale empty_locale = std::locale::empty();
    typedef std::codecvt_utf8<wchar_t> converter_type;
    // The locale below takes ownership of the facet and deletes it
    const converter_type* converter = new converter_type;
    const std::locale utf8_locale = std::locale(empty_locale, converter);
    std::wifstream stream(L"test.txt");
    stream.imbue(utf8_locale);
    std::wstring line;
    std::getline(stream, line);   // reads the first line, decoded from UTF-8
    std::system("pause");
}

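Except for locale::empty() (locale::global() might work here as well) and the wchar_t* overload of the basic_ifstream constructor, this should even be pretty standard-compliant (where “standard” means C++0x, of course).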
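
For comparison, here is a sketch that avoids both extensions named above: it builds the locale from the default-constructed std::locale() (a copy of the current global locale) and opens the file through a narrow const char* path. The helper name readUtf8Lines is hypothetical, and the code assumes the toolchain still ships <codecvt>, which is deprecated since C++17.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: returns every line of a UTF-8 file as a wstring.
std::vector<std::wstring> readUtf8Lines(const char* filename)
{
    // std::locale() copies the current global locale; the new facet replaces
    // only its codecvt part. The locale owns the facet and deletes it later.
    const std::locale utf8_locale(std::locale(), new std::codecvt_utf8<wchar_t>);

    std::wifstream stream(filename);
    stream.imbue(utf8_locale);

    std::vector<std::wstring> lines;
    std::wstring line;
    while (std::getline(stream, line))   // stops at end of file or on error
        lines.push_back(line);
    return lines;
}
```

Looping on std::getline instead of reading a single line keeps the rest of the original approach unchanged.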

何必那么矫情 2024-10-20 13:45:22

Here's a platform-specific function for Windows only:

#include <cstdio>
#include <string>
#include <sys/types.h>
#include <sys/stat.h>
#include <wchar.h>

size_t GetSizeOfFile(const std::wstring& path)
{
    struct _stat fileinfo;
    // _wstat returns nonzero on failure; report such files as empty
    if (_wstat(path.c_str(), &fileinfo) != 0)
        return 0;
    return fileinfo.st_size;
}

std::wstring LoadUtf8FileToString(const std::wstring& filename)
{
    std::wstring buffer;            // stores file contents
    FILE* f = _wfopen(filename.c_str(), L"rtS, ccs=UTF-8");

    // Failed to open file
    if (f == NULL)
    {
        // ...handle some error...
        return buffer;
    }

    // Size in bytes: an upper bound on the number of wchar_t that will be read
    size_t filesize = GetSizeOfFile(filename);

    // Read entire file contents into memory
    if (filesize > 0)
    {
        buffer.resize(filesize);
        size_t wchars_read = fread(&(buffer.front()), sizeof(wchar_t), filesize, f);
        buffer.resize(wchars_read);
        buffer.shrink_to_fit();
    }

    fclose(f);

    return buffer;
}

Use it like so:

std::wstring mytext = LoadUtf8FileToString(L"C:\\MyUtf8File.txt");

Note that the entire file is loaded into memory, so you might not want to use this for very large files.

╰つ倒转 2024-10-20 13:45:22

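#include <iostream>
#include <fstream>
#include <locale>

int main()
{
    // Named locales are platform-dependent; the std::locale constructor
    // throws std::runtime_error if "zh_CN.UTF-8" is not available.
    const std::locale utf8_locale("zh_CN.UTF-8");

    std::wifstream wif("filename.txt");
    wif.imbue(utf8_locale);      // decode the file as UTF-8

    std::wcout.imbue(utf8_locale);
    std::wcout << wif.rdbuf();   // echo the decoded contents
}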

清风疏影 2024-10-20 13:45:22

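I recently had to deal with all of these encodings and solved it this way. It is better to use std::u32string, since its character size is the same on all platforms, and most fonts work with UTF-32. (The file itself should still be UTF-8.)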

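#include <fstream>
#include <string>

std::u32string readFile(const std::string& filename) {
    std::basic_ifstream<char32_t> fin(filename);
    std::u32string str{};
    // U'\0' as the delimiter makes getline read everything up to the first NUL
    std::getline(fin, str, U'\0');
    return str;
}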

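Feel free to use the standard functions other than gcount, and save the result of tellg only to a pos_type. Also, be sure to pass a separator to std::getline (if you don't, the function throws std::bad_cast).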
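
If a char32_t stream is not usable with a given standard library, the same result can be reached by reading the file as plain bytes and decoding them by hand. The following is a rough sketch of such a decoder, not the method used above. The name decodeUtf8 is hypothetical, and for brevity it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 code points.
// Throws std::runtime_error on malformed or truncated input.
// Simplification: overlong forms and surrogate code points are not rejected.
std::u32string decodeUtf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra;   // number of continuation bytes that must follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + extra >= bytes.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);   // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Each code point lands in exactly one char32_t, which is the fixed-size property the answer is after.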

滥情哥ㄟ 2024-10-20 13:45:22

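This question was addressed in "Confused about C++'s std::wstring, UTF-16, UTF-8 and displaying strings in a Windows GUI". In short, on Windows wstring holds 16-bit units. It was designed around UCS-2, the strictly two-byte predecessor of UTF-16, so characters outside the Basic Multilingual Plane need a surrogate pair in UTF-16. I believe Arabic fits within the two-byte range.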
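
To make the difference between UCS-2 and UTF-16 concrete, here is a small sketch of how one code point maps to 16-bit units. The helper toUtf16 is hypothetical and uses std::u16string rather than std::wstring so that the unit size does not depend on the platform. It assumes its argument is a valid code point.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-16 code units.
// Assumes cp is valid (at most 0x10FFFF and not itself a surrogate).
std::u16string toUtf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit, identical to UCS-2.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into a surrogate pair.
        const char32_t offset = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF)));
    }
    return out;
}
```

An Arabic letter such as U+0627 takes the first branch and stays a single unit, while a code point such as U+1F600 takes the second and becomes two.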

梦在深巷 2024-10-20 13:45:22

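This is a bit raw, but how about reading the file as plain old bytes and then casting the byte buffer to wchar_t*? Note that a cast only reinterprets the bytes, so this works only when the file already holds wchar_t code units (UTF-16LE on Windows); it does not convert UTF-8.

Something like: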

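#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Reinterprets the file's bytes as wchar_t code units; no UTF-8 decoding is done.
std::wstring ReadFileIntoWstring(const std::wstring& filepath)
{
    // The wide-character path overload of std::ifstream is an MSVC extension
    std::ifstream file(filepath.c_str(), std::ios::in | std::ios::binary | std::ios::ate);
    if (!file)
        return std::wstring();
    const size_t size = static_cast<size_t>(file.tellg());
    file.seekg(0, std::ios::beg);
    if (size < sizeof(wchar_t))
        return std::wstring();
    std::vector<char> buffer(size);   // freed automatically, unlike new[]
    file.read(buffer.data(), static_cast<std::streamsize>(size));
    // Copy with an explicit length: the raw buffer is not null-terminated
    std::wstring wstr(size / sizeof(wchar_t), L'\0');
    std::memcpy(&wstr[0], buffer.data(), wstr.size() * sizeof(wchar_t));
    return wstr;
}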
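
The reading half of this idea can also be written without any manual buffer management. The sketch below only fetches the raw bytes; turning UTF-8 bytes into wide characters remains a separate decoding step that a pointer cast does not perform. The name readFileBytes is made up for illustration.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: returns the raw, undecoded bytes of a file.
std::string readFileBytes(const char* filename)
{
    std::ifstream file(filename, std::ios::in | std::ios::binary);
    // istreambuf_iterator walks the stream buffer byte by byte until EOF;
    // a file that failed to open simply yields an empty string.
    return std::string(std::istreambuf_iterator<char>(file),
                       std::istreambuf_iterator<char>());
}
```

Binary mode matters here: it keeps the runtime from translating line endings, so the byte count matches the file on disk.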
