How do I read UTF-16 strings from a file using POSIX methods in Linux?

Posted 2024-07-13 09:46:01

I have a file containing UTF-16 strings that I would like to read into a Linux program. The strings were written raw from Windows' internal WCHAR format. (Does Windows always use UTF-16? e.g. in Japanese versions)

I believe that I can read them using raw reads and then converting with wcstombs_l. However, I cannot figure out what locale to use. Running "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.

Is there a better way?

Update: the correct answer and the others below helped point me to libiconv. Here's a function I'm using to do the conversion. I currently have it inside a class that turns each conversion into a one-line piece of code.

#include <errno.h>
#include <iconv.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Function for converting a NUL-terminated UTF-16LE string to UTF-8.
// src is declared as uint16_t* rather than wchar_t* because wchar_t is
// 32 bits wide on Linux, so wcslen() would miscount Windows WCHAR data.
// It will allocate the space needed for dest. The caller is
// responsible for freeing the memory.
static int iwcstombs_alloc(char **dest, const uint16_t *src)
{
  const char from[] = "UTF-16LE";
  const char to[] = "UTF-8";

  iconv_t cd = iconv_open(to, from);
  if (cd == (iconv_t)-1)
  {
    fprintf(stderr, "iconv_open(\"%s\", \"%s\") failed: %s\n",
            to, from, strerror(errno));
    return -1;
  }

  // Count the 16-bit code units, including the terminating NUL.
  size_t units = 1;
  while (src[units - 1] != 0)
    units++;

  // How much space do we need?
  // One UTF-16 code unit becomes at most 3 bytes of UTF-8 (a surrogate
  // pair, two units, becomes 4), so 3 bytes per unit is always enough.
  // The alternative is a loop that starts small and reallocates
  // whenever iconv() fails with E2BIG.
  size_t len = units * sizeof(uint16_t);
  size_t destLen = units * 3;

  // Allocate space
  *dest = (char *)malloc(destLen);
  if (*dest == NULL)
  {
    iconv_close(cd);
    return -1;
  }

  // Convert
  size_t inBufBytesLeft = len;
  char *inBuf = (char *)src;
  size_t outBufBytesLeft = destLen;
  char *outBuf = *dest;

  // iconv() returns (size_t)-1 on error, so keep the result as a size_t.
  size_t rc = iconv(cd,
                    &inBuf,
                    &inBufBytesLeft,
                    &outBuf,
                    &outBufBytesLeft);
  if (rc == (size_t)-1)
  {
    fprintf(stderr, "iconv() failed: %s\n", strerror(errno));
    iconv_close(cd);
    free(*dest);
    *dest = NULL;
    return -1;
  }

  iconv_close(cd);

  return 0;
} // iwcstombs_alloc()
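
For comparison, here is a rough sketch of the retry loop that the comments in the function above mention: start with a small output buffer and grow it whenever iconv() reports E2BIG. The helper name utf16le_to_utf8_grow and the initial size guess are assumptions made for illustration, and the sketch takes a byte count instead of relying on a terminating NUL.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/* Hypothetical helper: convert in_bytes bytes of UTF-16LE into a
 * NUL-terminated UTF-8 string. Returns NULL on failure; caller frees. */
char *utf16le_to_utf8_grow(const char *in, size_t in_bytes)
{
    assert(in != NULL);

    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1)
        return NULL;

    size_t cap = in_bytes / 2 + 16;     /* deliberately small first guess */
    char *out = malloc(cap);
    char *src = (char *)in;             /* iconv() only reads the input */
    size_t src_left = in_bytes;
    size_t used = 0;

    while (out != NULL) {
        char *dst = out + used;
        size_t dst_left = cap - used - 1;   /* reserve a byte for the NUL */
        size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
        used = (size_t)(dst - out);
        if (rc != (size_t)-1) {             /* all input was consumed */
            out[used] = '\0';
            break;
        }
        if (errno != E2BIG) {               /* EILSEQ or EINVAL: give up */
            free(out);
            out = NULL;
            break;
        }
        cap *= 2;                           /* output full: grow, then resume */
        char *bigger = realloc(out, cap);
        if (bigger == NULL)
            free(out);
        out = bigger;
    }

    iconv_close(cd);
    return out;
}
```

Because iconv() advances both the input and output pointers before failing with E2BIG, the loop can resume where it stopped instead of restarting the whole conversion.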

只有一腔孤勇 2024-07-20 09:46:01

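The simplest way is to convert the file from UTF-16 to UTF-8, the native UNIX encoding, and then read it:

iconv -f utf16 -t utf8 file_in.txt -o file_out.txt

If the file has no byte order mark, it is safer to name the byte order explicitly, e.g. -f UTF-16LE for data written by Windows.

You can also use iconv(3) (see man 3 iconv) to convert strings in C. Most other languages have bindings to iconv as well.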

Then you can use any UTF-8 locale, such as en_US.UTF-8, which is usually the default on most Linux distros.
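
The iconv(3) route could look roughly like the sketch below, which does in C what the command line above does. The function name utf16le_stream_to_utf8 and the buffer sizes are assumptions for illustration. It also assumes the input is UTF-16LE; a leading byte order mark, if present, is passed through as U+FEFF and not stripped.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy a UTF-16LE stream to a UTF-8 stream.
 * Returns 0 on success, -1 on an I/O or conversion error. */
int utf16le_stream_to_utf8(FILE *in, FILE *out)
{
    assert(in != NULL && out != NULL);

    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1)
        return -1;

    char inbuf[4096];
    char outbuf[3 * sizeof inbuf / 2];  /* worst case: 3 bytes per 2-byte unit */
    size_t pending = 0;                 /* bytes of a character split across reads */
    int status = 0;

    for (;;) {
        size_t got = fread(inbuf + pending, 1, sizeof inbuf - pending, in);
        if (got == 0)
            break;

        char *src = inbuf;
        size_t src_left = pending + got;
        char *dst = outbuf;
        size_t dst_left = sizeof outbuf;
        size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);

        /* EINVAL only means the chunk ended mid-character; keep the tail. */
        if (rc == (size_t)-1 && errno != EINVAL) {
            status = -1;
            break;
        }
        size_t produced = (size_t)(dst - outbuf);
        if (fwrite(outbuf, 1, produced, out) != produced) {
            status = -1;
            break;
        }
        memmove(inbuf, src, src_left);
        pending = src_left;
    }

    if (ferror(in) || pending != 0)     /* read error or truncated input */
        status = -1;
    iconv_close(cd);
    return status;
}
```

A caller would open the input with fopen(path, "rb") and pass stdout or another FILE* as the destination. Carrying the unconsumed tail over to the next read is what keeps a surrogate pair that straddles a chunk boundary intact.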

昨迟人 2024-07-20 09:46:01

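(Does Windows always use UTF-16? e.g. in Japanese versions)

Yes, NT's WCHAR is always UTF-16LE.

(The "system code page", which for Japanese installs is indeed cp932/Shift-JIS, still exists in NT for the benefit of the many applications that aren't Unicode-native, FAT32 paths, and so on.)

However, wchar_t is not guaranteed to be 16 bits, and on Linux it won't be: UTF-32 (UCS-4) is used. So wcstombs_l is unlikely to be happy.

The right thing to do is to use a library like iconv to read it into whichever format you are using internally, presumably wchar_t. You could try to hack it yourself by poking bytes in, but you would probably get things like surrogates wrong.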
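
If wchar_t is the internal format, one way to follow this advice is to ask iconv for wide characters directly. This is a minimal sketch, assuming an iconv implementation that accepts the encoding name "WCHAR_T" (glibc and GNU libiconv do; others may not). The helper name utf16le_to_wcs is made up for illustration.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert `units` UTF-16LE code units into a
 * NUL-terminated wchar_t string. Returns NULL on failure; caller frees. */
wchar_t *utf16le_to_wcs(const void *in, size_t units)
{
    assert(in != NULL);

    /* Assumption: the platform's iconv knows the name "WCHAR_T". */
    iconv_t cd = iconv_open("WCHAR_T", "UTF-16LE");
    if (cd == (iconv_t)-1)
        return NULL;

    /* One wchar_t per input unit is always enough: with a 32-bit wchar_t
     * a surrogate pair (two units) collapses into one wide character.
     * calloc() zero-fills, so the result is already NUL-terminated. */
    wchar_t *out = calloc(units + 1, sizeof *out);
    if (out != NULL) {
        char *src = (char *)in;
        size_t src_left = units * 2;
        char *dst = (char *)out;
        size_t dst_left = units * sizeof *out;
        if (iconv(cd, &src, &src_left, &dst, &dst_left) == (size_t)-1) {
            free(out);
            out = NULL;
        }
    }

    iconv_close(cd);
    return out;
}
```

Letting the library do the conversion means surrogate pairs and invalid sequences are handled by code that already gets them right.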

Running "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.

Indeed, Linux can't use UTF-16 as a locale's default encoding, because of all the \0 bytes.
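
A tiny illustration of the zero-byte problem, using a hard-coded sample and a hypothetical function name: every ASCII character in UTF-16LE carries a zero high byte, so byte-oriented C APIs see the string end after the first character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "Hi" encoded as UTF-16LE, including the two-byte terminator. */
static const unsigned char hi_utf16le[] = { 'H', 0x00, 'i', 0x00, 0x00, 0x00 };

/* What a byte-oriented C API sees: the string stops at the first zero byte. */
size_t length_seen_by_c_apis(const unsigned char *bytes)
{
    assert(bytes != NULL);
    return strlen((const char *)bytes);
}
```

Calling length_seen_by_c_apis(hi_utf16le) returns 1, not 2 characters or 4 bytes, which is why file names, environment variables and other char* interfaces cannot carry UTF-16.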

时光是把杀猪刀 2024-07-20 09:46:01

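You can read the file as binary and then do your own quick conversion:
http://unicode.org/faq/utf_bom.html#utf16-3
But it is probably safer to use a library (like libiconv) that handles invalid sequences properly.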
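
For anyone who does want to try the do-it-yourself route, the surrogate arithmetic from the Unicode FAQ can be sketched like this. The function name and the convention of returning 0 on bad input are assumptions for illustration. A real program would still need a policy for what to do after an error, which is part of why a library is the safer choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from UTF-16LE bytes.
 * Stores it in *cp and returns the number of bytes consumed (2 or 4),
 * or 0 if the input is truncated or contains an unpaired surrogate. */
size_t utf16le_decode(const unsigned char *p, size_t len, uint32_t *cp)
{
    assert(p != NULL && cp != NULL);

    if (len < 2)
        return 0;
    uint32_t hi = p[0] | ((uint32_t)p[1] << 8);    /* little-endian unit */
    if (hi < 0xD800 || hi > 0xDFFF) {              /* ordinary BMP character */
        *cp = hi;
        return 2;
    }
    if (hi > 0xDBFF || len < 4)                    /* lone low surrogate or cut-off pair */
        return 0;
    uint32_t lo = p[2] | ((uint32_t)p[3] << 8);
    if (lo < 0xDC00 || lo > 0xDFFF)                /* high surrogate without its partner */
        return 0;
    *cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    return 4;
}
```

Assembling each unit from individual bytes keeps the decoder independent of the host's byte order and of the buffer's alignment.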

老子叫无熙 2024-07-20 09:46:01

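I would strongly recommend using a Unicode encoding as your program's internal representation. Use either UTF-16 or UTF-8. If you use UTF-16 internally, then obviously no translation is required. If you use UTF-8, you can use a locale with .UTF-8 in its name, such as en_US.UTF-8.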
