Converting a wstring to a UTF-8 encoded string

Posted on 2024-10-05 23:05:39

I need to convert between wstring and string. I figured out that a codecvt facet should do the trick, but it doesn't seem to work for a UTF-8 locale.

My idea is that when I read a UTF-8 encoded file into chars, one UTF-8 character can occupy more than one char (two for each of the Czech letters below), which is how UTF-8 works. I'd like to create this UTF-8 string from the wstring representation, for a library I use in my code.

Does anybody know how to do it?

I already tried this:

  locale mylocale("cs_CZ.utf-8");
  mbstate_t mystate;

  wstring mywstring = L"čřžýáí";

  const codecvt<wchar_t,char,mbstate_t>& myfacet =
    use_facet<codecvt<wchar_t,char,mbstate_t> >(mylocale);

  codecvt<wchar_t,char,mbstate_t>::result myresult;  

  size_t length = mywstring.length();
  char* pstr= new char [length+1];

  const wchar_t* pwc;
  char* pc;

  // translate characters:
  myresult = myfacet.out (mystate,
      mywstring.c_str(), mywstring.c_str()+length+1, pwc,
      pstr, pstr+length+1, pc);

  if ( myresult == codecvt<wchar_t,char,mbstate_t>::ok )
   cout << "Translation successful: " << pstr << endl;
  else cout << "failed" << endl;
  return 0;

This prints 'failed' for the cs_CZ.utf-8 locale and works correctly for the cs_CZ.iso8859-2 locale.


涫野音 2024-10-12 23:05:39

The code below might help you :)

#include <codecvt>
#include <locale>   // std::wstring_convert is declared here
#include <string>

// convert UTF-8 string to wstring
std::wstring utf8_to_wstring (const std::string& str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    return myconv.from_bytes(str);
}

// convert wstring to UTF-8 string
std::string wstring_to_utf8 (const std::wstring& str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    return myconv.to_bytes(str);
}
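
As a quick check of the two helpers above, a round trip could look like the sketch below. The function name round_trips and the sample text are illustrative assumptions, not part of the answer; the helpers are repeated only so the snippet compiles on its own. std::wstring_convert has been deprecated since C++17, so compilers may warn about it.

```cpp
#include <codecvt>
#include <locale>   // std::wstring_convert is declared here
#include <string>

// Helpers as given in the answer above.
std::wstring utf8_to_wstring(const std::string& str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    return myconv.from_bytes(str);
}

std::string wstring_to_utf8(const std::wstring& str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    return myconv.to_bytes(str);
}

// Hypothetical helper: true if the text survives wstring -> UTF-8 -> wstring.
bool round_trips(const std::wstring& text)
{
    return utf8_to_wstring(wstring_to_utf8(text)) == text;
}
```

For the question's sample, written with escapes as L"\u010D\u0159\u017E\u00FD\u00E1\u00ED", wstring_to_utf8 should yield 12 bytes, because each of those six letters takes two bytes in UTF-8.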


掀纱窥君容 2024-10-12 23:05:39

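What's your platform? Note that Windows has traditionally not supported UTF-8 locales, so this may explain why it fails.

To do this in a platform-dependent way, you can use MultiByteToWideChar/WideCharToMultiByte (http://msdn.microsoft.com/en-us/library/dd374130%28VS.85%29.aspx) on Windows and iconv on Linux. You may be able to use some Boost magic to do it in a platform-independent way, but I haven't tried that myself, so I can't say more about that option.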
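
On Linux, the iconv route mentioned above might look roughly like this. It is only a sketch that assumes glibc's iconv, which accepts "WCHAR_T" as an encoding name; the function name wstring_to_utf8_iconv is invented for this example, and other iconv implementations may need a different source encoding name or an extra link flag.

```cpp
#include <iconv.h>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a wstring to UTF-8 via iconv.
// Assumption: the platform's iconv understands the "WCHAR_T" encoding name (glibc does).
std::string wstring_to_utf8_iconv(const std::wstring& in)
{
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
    if (cd == (iconv_t)(-1)) {
        throw std::runtime_error("iconv_open failed");
    }

    // UTF-8 needs at most 4 bytes per wchar_t element.
    std::string out(in.size() * 4, '\0');

    // iconv takes non-const char pointers but does not modify the input.
    char* src = reinterpret_cast<char*>(const_cast<wchar_t*>(in.data()));
    std::size_t src_left = in.size() * sizeof(wchar_t);
    char* dst = &out[0];
    std::size_t dst_left = out.size();

    std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("iconv conversion failed");
    }

    // Drop the unused tail of the output buffer.
    out.resize(out.size() - dst_left);
    return out;
}
```

The output buffer is sized for the worst case up front, which avoids the too-small buffer that a one-char-per-wchar_t allocation would give for UTF-8.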


冬天旳寂寞 2024-10-12 23:05:39

On Windows you have to use std::codecvt_utf8_utf16<wchar_t>! Otherwise your conversion will fail on Unicode code points that need two 16-bit code units, such as 😉 (U+1F609).

#include <codecvt>
#include <locale>   // std::wstring_convert is declared here
#include <string>

// convert UTF-8 string to wstring
std::wstring utf8_to_wstring (const std::string& str)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> myconv;
    return myconv.from_bytes(str);
}

// convert wstring to UTF-8 string
std::string wstring_to_utf8 (const std::wstring& str)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> myconv;
    return myconv.to_bytes(str);
}
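
To see why the distinction matters, here is a small illustrative sketch (the helper name to_utf16_units is made up for this example) that splits a code point into UTF-16 code units. U+1F609 lies above U+FFFF, so with a 16-bit wchar_t it occupies two elements of the wstring, and the converter has to treat that pair as one character.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Assumes cp is a valid scalar value (at most 0x10FFFF and not itself a surrogate).
std::vector<std::uint16_t> to_utf16_units(std::uint32_t cp)
{
    if (cp < 0x10000) {
        // BMP code points fit in a single 16-bit unit.
        return { static_cast<std::uint16_t>(cp) };
    }
    // Everything above the BMP is split into a high and a low surrogate.
    cp -= 0x10000;
    return {
        static_cast<std::uint16_t>(0xD800 + (cp >> 10)),
        static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF))
    };
}
```

For U+1F609 this gives the pair 0xD83D, 0xDE09, while a BMP letter such as U+010D stays a single unit.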


无力看清 2024-10-12 23:05:39

You can use Boost's utf_to_utf converter (declared in <boost/locale/encoding_utf.hpp>) to get a char-based UTF-8 string that can be stored in a std::string.

std::string myresult = boost::locale::conv::utf_to_utf<char>(my_wstring);


夜无邪 2024-10-12 23:05:39

The currently most upvoted answer is not platform-independent. It breaks on non-BMP characters (e.g. emoji) where wchar_t is 16 bits wide, as on Windows. JWiesemann already pointed this out in their answer, but their code will only work on Windows.

So here's a correct platform-independent version:

#include <codecvt>
#include <locale>
#include <string>
#include <type_traits>

std::string wstring_to_utf8(std::wstring const& str)
{
  std::wstring_convert<std::conditional_t<
        sizeof(wchar_t) == 4,
        std::codecvt_utf8<wchar_t>,
        std::codecvt_utf8_utf16<wchar_t>>> converter;
  return converter.to_bytes(str);
}

std::wstring utf8_to_wstring(std::string const& str)
{
  std::wstring_convert<std::conditional_t<
        sizeof(wchar_t) == 4,
        std::codecvt_utf8<wchar_t>,
        std::codecvt_utf8_utf16<wchar_t>>> converter;
  return converter.from_bytes(str);
}

std::wstring_convert and the <codecvt> facets are deprecated since C++17, so on MSVC this might generate deprecation warnings. You can disable these by wrapping the functions in

#pragma warning(push)
#pragma warning(disable : 4996)
<the two functions>
#pragma warning(pop)

See this answer to another question as to why it's ok to disable that warning.
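
If the deprecated facilities are a concern, the same conversion can be written by hand. The following is only a sketch under stated assumptions, not the answer author's code: the name wstring_to_utf8_manual is invented here, the input is assumed to be well-formed UTF-16 (16-bit wchar_t) or UTF-32 (32-bit wchar_t), and unpaired surrogates are not diagnosed.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical dependency-free alternative to std::wstring_convert.
std::string wstring_to_utf8_manual(const std::wstring& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(in[i]);

        // With a 16-bit wchar_t, combine a surrogate pair into one code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            std::uint32_t low = static_cast<std::uint32_t>(in[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }

        // Emit 1 to 4 UTF-8 bytes depending on the code point's magnitude.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The sizeof(wchar_t) check plays the same role as the std::conditional_t in the answer: surrogate pairs are only combined where wchar_t is too narrow to hold a full code point.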


乄_柒ぐ汐 2024-10-12 23:05:39

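What a locale does is give the program information about the external encoding, while assuming that the internal encoding has not changed. If you want to output UTF-8, you need to do it from wchar_t, not from char*.

What you could do is output it as raw data (not as a string); it should then be interpreted correctly if the system locale is UTF-8.

Also, when using (w)cout/(w)cerr/(w)cin, you need to imbue the stream with the locale.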
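
One way to read the last point is to imbue a stream with a locale whose codecvt facet does the conversion, so that wide characters are written out as UTF-8 bytes. The sketch below does that for a file stream. The function name write_utf8_file and the use of std::codecvt_utf8 (deprecated since C++17) are choices made for this example, not something the answer prescribes; with a 16-bit wchar_t, std::codecvt_utf8_utf16 would be the counterpart.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: write wide text to `path` as UTF-8.
// Returns false if the file could not be opened or written.
bool write_utf8_file(const std::string& path, const std::wstring& text)
{
    std::wofstream out;
    // Imbue before opening so the file buffer uses the UTF-8 facet from the start.
    // The locale takes ownership of the facet and deletes it.
    out.imbue(std::locale(out.getloc(), new std::codecvt_utf8<wchar_t>));
    out.open(path, std::ios::binary);
    if (!out) {
        return false;
    }
    out << text;
    out.flush();
    return static_cast<bool>(out);
}
```

Reading the file back as plain bytes should then show the UTF-8 encoding of the text, independent of the system locale.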


恋竹姑娘 2024-10-12 23:05:39

The Lexertl library has an iterator that lets you do this:

std::string str;
str.assign(
  lexertl::basic_utf8_out_iterator<std::wstring::const_iterator>(wstr.begin()),
  lexertl::basic_utf8_out_iterator<std::wstring::const_iterator>(wstr.end()));
