How do I read Unicode-16 strings from a file using POSIX methods in Linux?
I have a file containing UNICODE-16 strings that I would like to read into a Linux program. The strings were written raw from Windows' internal WCHAR format. (Does Windows always use UTF-16? e.g. in Japanese versions)
I believe that I can read them using raw reads and then converting with wcstombs_l. However, I cannot figure out what locale to use. Running "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.
Is there a better way?
Update: the correct answer and others below helped point me to using libiconv. Here's a function I'm using to do the conversion. I currently have it inside a class that makes the conversions into a one-line piece of code.
    #include <errno.h>
    #include <iconv.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    // Function for converting a NUL-terminated UTF-16LE string to UTF-8.
    // src holds 16-bit code units (Windows WCHAR), not wchar_t: on Linux
    // wchar_t is 32 bits wide, so wcslen() cannot be used on this data.
    // It will allocate the space needed for dest. The caller is
    // responsible for freeing the memory.
    static int iwcstombs_alloc(char **dest, const uint16_t *src)
    {
        const char from[] = "UTF-16LE";
        const char to[] = "UTF-8";
        iconv_t cd = iconv_open(to, from);
        if (cd == (iconv_t)-1)
        {
            fprintf(stderr, "iconv_open(\"%s\", \"%s\") failed: %s\n",
                    to, from, strerror(errno));
            return -1;
        }

        // Count the 16-bit code units, including the terminating NUL.
        size_t units = 0;
        while (src[units] != 0)
            units++;
        units++;

        // How much space do we need?
        // One UTF-16 code unit becomes at most 3 bytes of UTF-8 (a
        // surrogate pair, 2 units, becomes 4), so 3 bytes per unit
        // is always enough and no reallocation loop is needed.
        size_t destLen = units * 3;
        *dest = malloc(destLen);
        if (*dest == NULL)
        {
            iconv_close(cd);
            return -1;
        }

        // Convert
        size_t inBufBytesLeft = units * sizeof(uint16_t);
        char *inBuf = (char *)src;
        size_t outBufBytesLeft = destLen;
        char *outBuf = *dest;
        // iconv() returns size_t; failure is reported as (size_t)-1.
        size_t rc = iconv(cd, &inBuf, &inBufBytesLeft,
                          &outBuf, &outBufBytesLeft);
        iconv_close(cd);
        if (rc == (size_t)-1)
        {
            fprintf(stderr, "iconv() failed: %s\n", strerror(errno));
            free(*dest);
            *dest = NULL;
            return -1;
        }
        return 0;
    } // iwcstombs_alloc()
Comments (4)
The simplest way is to convert the file from UTF-16 to UTF-8, the native UNIX encoding, and then read it.
You can also use iconv(3) (see man 3 iconv) to convert the strings in C. Most other languages have bindings to iconv as well.
Then you can use any UTF-8 locale, such as en_US.UTF-8, which is usually the default on most Linux distros.
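As a rough illustration of the iconv(3) route, here is a minimal sketch that copies a UTF-16LE stream to a UTF-8 stream in chunks. The helper name utf16le_stream_to_utf8 and the buffer sizes are assumptions made for this example, not something from the answer. It relies on iconv() reporting EINVAL when a chunk ends in the middle of a character, so the leftover bytes can be carried into the next read.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy a UTF-16LE stream to a UTF-8 stream.
 * Returns 0 on success, -1 on invalid or truncated input or I/O error. */
static int utf16le_stream_to_utf8(FILE *in, FILE *out)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1)
        return -1;

    char inbuf[4096];
    char outbuf[8192];
    size_t pending = 0;   /* bytes carried over from the previous chunk */
    int status = 0;

    for (;;) {
        size_t got = fread(inbuf + pending, 1, sizeof inbuf - pending, in);
        if (got == 0)
            break;
        char *src = inbuf;
        size_t srcleft = pending + got;

        while (srcleft > 0) {
            char *dst = outbuf;
            size_t dstleft = sizeof outbuf;
            size_t rc = iconv(cd, &src, &srcleft, &dst, &dstleft);
            int err = errno;   /* save before fwrite can change it */
            size_t produced = sizeof outbuf - dstleft;
            if (fwrite(outbuf, 1, produced, out) != produced) {
                status = -1;
                goto done;
            }
            if (rc != (size_t)-1 || err == EINVAL)
                break;         /* chunk done, or incomplete tail to keep */
            if (err != E2BIG) {
                status = -1;   /* EILSEQ: invalid input sequence */
                goto done;
            }
            /* E2BIG: output buffer was full, convert the rest */
        }
        /* Keep any incomplete trailing bytes for the next read. */
        memmove(inbuf, src, srcleft);
        pending = srcleft;
    }
    if (ferror(in) || pending != 0)
        status = -1;           /* read error or input ended mid-character */
done:
    iconv_close(cd);
    return status;
}
```

Calling it with a file opened via fopen(path, "rb") as in and stdout as out would print the converted text. On the command line, iconv -f UTF-16LE -t UTF-8 does the same job without any code.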
Yes, NT's WCHAR is always UTF-16LE.
(The ‘system codepage’, which for Japanese installs is indeed cp932/Shift-JIS, still exists in NT for the benefit of the many, many applications that aren't Unicode-native, FAT32 paths, and so on.)
However, wchar_t is not guaranteed to be 16 bits, and on Linux it won't be: UTF-32 (UCS-4) is used. So wcstombs_l is unlikely to be happy.
The Right Thing would be to use a library like iconv to read it into whichever format you are using internally, presumably wchar_t. You could try to hack it yourself by poking bytes in, but you'd probably get things like the surrogates wrong.
Indeed, Linux can't use UTF-16 as a locale default encoding thanks to all the \0s.
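To make the "read it into wchar_t" suggestion concrete, here is a minimal sketch that converts raw UTF-16LE bytes into a wchar_t string with iconv. It assumes the iconv implementation accepts the encoding name "WCHAR_T" (glibc and GNU libiconv do) and that wchar_t is 32 bits wide, as it is on Linux. The helper name utf16le_to_wcs is made up for the example.

```c
#include <iconv.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert nbytes of UTF-16LE data into a newly
 * allocated, NUL-terminated wchar_t string. Returns NULL on failure.
 * The caller frees the result. */
static wchar_t *utf16le_to_wcs(const char *bytes, size_t nbytes)
{
    /* Assumption: this iconv knows the "WCHAR_T" encoding name. */
    iconv_t cd = iconv_open("WCHAR_T", "UTF-16LE");
    if (cd == (iconv_t)-1)
        return NULL;

    /* Each 16-bit unit yields at most one wchar_t (32-bit wchar_t
     * assumed); calloc zero-fills, so the result is NUL-terminated. */
    size_t units = nbytes / 2;
    wchar_t *dest = calloc(units + 1, sizeof *dest);
    if (dest == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in = (char *)bytes;
    size_t inleft = nbytes;
    char *out = (char *)dest;
    size_t outleft = units * sizeof *dest;
    size_t rc = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1) {    /* invalid or truncated input */
        free(dest);
        return NULL;
    }
    return dest;
}
```

With 32-bit wchar_t a surrogate pair collapses into a single element, so wcslen() on the result counts code points rather than UTF-16 units.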
You can read as binary, then do your own quick conversion:
http://unicode.org/faq/utf_bom.html#utf16-3
But it is probably safer to use a library (like libiconv) which handles invalid sequences properly.
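For reference, a hand-rolled conversion along the lines of that FAQ might look like the sketch below. The function names are invented for the example. It decodes one code point at a time from little-endian bytes, combines surrogate pairs, and rejects unpaired surrogates, which is the part that is easy to get wrong.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from UTF-16LE bytes. Returns the number of
 * bytes consumed (2 or 4), or 0 if the input is truncated or contains
 * an unpaired surrogate. */
static size_t utf16le_decode(const unsigned char *p, size_t n, uint32_t *cp)
{
    if (n < 2)
        return 0;
    uint32_t hi = p[0] | ((uint32_t)p[1] << 8);
    if (hi < 0xD800 || hi > 0xDFFF) {   /* ordinary BMP character */
        *cp = hi;
        return 2;
    }
    if (hi >= 0xDC00 || n < 4)          /* stray low surrogate, or cut off */
        return 0;
    uint32_t lo = p[2] | ((uint32_t)p[3] << 8);
    if (lo < 0xDC00 || lo > 0xDFFF)     /* high surrogate without a low one */
        return 0;
    *cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    return 4;
}

/* Encode a code point (at most U+10FFFF) as UTF-8 into out, which must
 * have room for 4 bytes. Returns the number of bytes written. */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

A caller would loop over the input buffer, advancing by the return value of utf16le_decode and stopping with an error when it returns 0.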
I would strongly recommend using a Unicode encoding as your program's internal representation. Use either UTF-16 or UTF-8. If you use UTF-16 internally, then obviously no translation is required. If you use UTF-8, you can use a locale with .UTF-8 in its name, such as en_US.UTF-8.