How do I open a file on Linux when its non-ASCII name is held in a wchar_t* string?

Posted on 2024-10-12 08:41:16

Environment: GCC/G++ on Linux

I have a file with a non-ASCII name in my file system and I want to open it.

I have the name as a wchar_t*, but I don't know how to open the file with it. (My trusty fopen only accepts a char* file name.)

Please help. Thanks a lot.

Comments (5)

倾城°AllureLove 2024-10-19 08:41:16

There are two possible answers:

If you want to make sure all Unicode filenames are representable, you can hard-code the assumption that the filesystem uses UTF-8 filenames. This is the "modern" Linux desktop-app approach. Just convert your strings from wchar_t (UTF-32 on Linux) to UTF-8 with library functions (iconv would work well) or your own implementation (but look up the specs so you don't get it horribly wrong like Shelwien did), then use fopen.
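A minimal sketch of the iconv route, assuming glibc (where the encoding name "WCHAR_T" refers to the platform's wchar_t representation). The helper names wcs_to_utf8 and fopen_utf8 are made up for illustration and are not part of any library.

```c
#include <assert.h>
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a wchar_t string to a newly allocated
   UTF-8 string with iconv. Returns NULL on failure; the caller frees. */
char *wcs_to_utf8(const wchar_t *ws)
{
    assert(ws != NULL);

    /* Assumption: the iconv implementation knows the name "WCHAR_T". */
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
    if (cd == (iconv_t)-1)
        return NULL;

    size_t in_left = wcslen(ws) * sizeof(wchar_t);
    size_t out_size = wcslen(ws) * 4 + 1;  /* at most 4 UTF-8 bytes per unit */
    char *out = malloc(out_size);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in_ptr = (char *)ws;             /* iconv takes a non-const char ** */
    char *out_ptr = out;
    size_t out_left = out_size - 1;
    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        free(out);
        return NULL;
    }
    *out_ptr = '\0';
    return out;
}

/* Hypothetical wrapper: open a file named by a wchar_t string, assuming
   the filesystem stores names as UTF-8. */
FILE *fopen_utf8(const wchar_t *wname, const char *mode)
{
    char *name = wcs_to_utf8(wname);
    if (name == NULL)
        return NULL;
    FILE *fp = fopen(name, mode);
    free(name);
    return fp;
}
```

Because the target encoding is fixed, this works regardless of the current locale, at the cost of failing on systems whose filenames are not UTF-8.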

If you want to do things the more standards-oriented way, you should use wcsrtombs to convert the wchar_t string to a multibyte char string in the locale's encoding (which hopefully is UTF-8 anyway on any modern system) and use fopen. Note that this requires that you previously set the locale with setlocale(LC_CTYPE, "") or setlocale(LC_ALL, "").
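The locale-based route could look roughly like this. It is a sketch rather than a definitive implementation: fopen_wcs is a hypothetical name, and the caller is assumed to have called setlocale(LC_CTYPE, "") once at startup.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert wname to the current locale's multibyte
   encoding and open it. Returns NULL if the name cannot be represented
   in that encoding or if fopen itself fails.
   Typical use (assumed):
       setlocale(LC_CTYPE, "");
       FILE *fp = fopen_wcs(name, "rb");                                  */
FILE *fopen_wcs(const wchar_t *wname, const char *mode)
{
    assert(wname != NULL && mode != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);

    /* A NULL destination makes wcsrtombs only measure the result. */
    const wchar_t *src = wname;
    size_t len = wcsrtombs(NULL, &src, 0, &state);
    if (len == (size_t)-1)
        return NULL;               /* not representable in this locale */

    char *name = malloc(len + 1);
    if (name == NULL)
        return NULL;

    src = wname;
    memset(&state, 0, sizeof state);
    wcsrtombs(name, &src, len + 1, &state);  /* len + 1 leaves room for '\0' */

    FILE *fp = fopen(name, mode);
    free(name);
    return fp;
}
```

If the program runs in a locale that cannot encode the name, the conversion fails before fopen is ever called, which is the trade-off of following the locale instead of assuming UTF-8.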

And finally, not exactly an answer but a recommendation:

Storing filenames as wchar_t strings is probably a horrible mistake. You should instead store filenames as abstract byte strings, and only convert those to wchar_t just-in-time for displaying them in the user interface (if it's even necessary for that; many UI toolkits use plain byte strings themselves and do the interpretation as characters for you). This way you eliminate a lot of possible nasty corner cases, and you never encounter a situation where some files are inaccessible due to their names.
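One way to picture this recommendation: keep the name as the bytes the system gave you, pass those bytes to fopen unchanged, and derive a wchar_t copy only when something has to be shown. The sketch below is illustrative only; filename_for_display is a hypothetical helper, and replacing undecodable bytes with '?' is just one possible fallback.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: build a displayable wchar_t copy of a filename
   that is stored as raw bytes. Decoding follows the current locale; if
   the bytes are not valid in that encoding, non-ASCII bytes are shown
   as '?'. The byte string is never modified, so it stays usable with
   fopen. Returns the number of wide characters written, excluding the
   terminator. */
size_t filename_for_display(const char *bytes, wchar_t *out, size_t out_len)
{
    assert(bytes != NULL && out != NULL && out_len > 0);

    size_t n = mbstowcs(out, bytes, out_len - 1);
    if (n != (size_t)-1) {
        out[n] = L'\0';            /* mbstowcs may not terminate when full */
        return n;
    }

    /* Fallback for undecodable names: lossy, for display only. */
    size_t i = 0;
    for (; bytes[i] != '\0' && i < out_len - 1; i++) {
        unsigned char b = (unsigned char)bytes[i];
        out[i] = (b < 0x80) ? (wchar_t)b : L'?';
    }
    out[i] = L'\0';
    return i;
}
```

The display copy can be lossy without consequence, because the file is always opened through the original bytes.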

倾听心声的旋律 2024-10-19 08:41:16

Linux is not UTF-8, but it's your only choice for filenames anyway

(Files can have anything you want inside them.)


With respect to filenames, Linux does not really have a string encoding to worry about. Filenames are byte strings that need to be null-terminated.

This doesn't precisely mean that Linux is UTF-8, but it does mean that wide-character strings cannot be passed as filenames directly, because they can contain zero bytes before the end of the string.

But UTF-8 preserves the no-nulls-except-at-the-end model, so I have to believe that the practical approach is "convert to UTF-8" for filenames.

The content of files is a matter for standards above the Linux kernel level, so here there isn't anything Linux-y that you can or want to do. The content of files will be solely the concern of the programs that read and write them. Linux just stores and returns the byte stream, and it can have all the embedded nuls you want.
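The point about zero bytes can be checked directly. The sketch below uses two hypothetical helpers: one counts zero bytes in a buffer, the other reports how many bytes of a wchar_t string a byte-oriented API would see before hitting the first zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the zero bytes in the first `size` bytes at `data`. */
size_t count_zero_bytes(const void *data, size_t size)
{
    assert(data != NULL);
    const unsigned char *p = data;
    size_t zeros = 0;
    for (size_t i = 0; i < size; i++)
        if (p[i] == 0)
            zeros++;
    return zeros;
}

/* How many bytes a char*-based API would read if it were handed the raw
   memory of a wchar_t string: it stops at the first zero byte. */
size_t bytes_seen_as_char_string(const wchar_t *ws)
{
    assert(ws != NULL);
    return strlen((const char *)ws);
}

/* The name "é.txt" in both forms (wide characters vs. UTF-8 bytes). */
const wchar_t wide_name[] = { 0x00E9, L'.', L't', L'x', L't', 0 };
const char utf8_name[] = "\xC3\xA9.txt";
```

With a 4-byte wchar_t, each of the five characters in wide_name carries three zero bytes, so a byte-oriented interface sees at most one byte of it, while the six bytes of utf8_name contain no zero at all.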

平定天下 2024-10-19 08:41:16

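Convert the wchar_t string to a UTF-8 char string, then use fopen.

#include <wchar.h>

typedef unsigned int   uint;
typedef unsigned char  byte;

/* Encode a null-terminated wchar_t string as UTF-8.
   Reads wchar_t units directly (32-bit UTF-32 on Linux) and also joins
   UTF-16 surrogate pairs, so the original function name still fits.
   s must have room for 4 bytes per input unit plus the terminator.
   Returns the number of bytes written, excluding the terminator. */
int UTF16to8( const wchar_t* w, char* s ) {
  uint  c, lo;
  byte* q = (byte*)s; byte* q0 = q;
  while( (c = (uint)*w++) != 0 ) {
    if( c>=0xD800 && c<0xDC00 ) {            /* high surrogate: try to pair */
      lo = (uint)*w;
      if( lo>=0xDC00 && lo<0xE000 ) { c = 0x10000+((c-0xD800)<<10)+(lo-0xDC00); w++; }
    }
    if( (c>=0xD800 && c<0xE000) || c>0x10FFFF ) c = 0xFFFD;  /* invalid: U+FFFD */
    if( c<0x80 ) *q++ = c; else
      if( c<0x800 ) *q++ = 0xC0+(c>>6), *q++ = 0x80+(c&63); else
        if( c<0x10000 ) *q++ = 0xE0+(c>>12), *q++ = 0x80+((c>>6)&63), *q++ = 0x80+(c&63); else
          *q++ = 0xF0+(c>>18), *q++ = 0x80+((c>>12)&63), *q++ = 0x80+((c>>6)&63), *q++ = 0x80+(c&63);
  }
  *q = 0;
  return (int)(q-q0);
}

/* Decode a null-terminated UTF-8 string into wchar_t code points.
   Assumes a 32-bit wchar_t (as on Linux) and mostly well-formed input:
   stray continuation bytes and truncated sequences are skipped.
   Returns the number of wchar_t units written, excluding the terminator. */
int UTF8to16( const char* s, wchar_t* w ) {
  uint  cache=0, wait=0, c;
  const byte* p = (const byte*)s;
  wchar_t* q = w; wchar_t* q0 = q;
  while( (c = *p++) != 0 ) {
    if( c<0x80 ) cache=c, wait=0; else               /* ASCII */
      if( c<0xC0 ) {                                 /* continuation byte */
        if( wait==0 ) continue;
        cache=(cache<<6)|(c&63), wait--;
      } else
        if( c<0xE0 ) cache=c&31, wait=1; else        /* 2-byte lead */
          if( c<0xF0 ) cache=c&15, wait=2; else      /* 3-byte lead */
            cache=c&7, wait=3;                       /* 4-byte lead */
    if( wait==0 ) *q++ = (wchar_t)cache;
  }
  *q = 0;
  return (int)(q-q0);
}
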
£噩梦荏苒 2024-10-19 08:41:16

Check out this document

http://www.firstobject.com/wchar_t-string-on-linux-osx-windows.htm

I think Linux follows the POSIX standard, which treats file names as plain byte strings; in practice they are almost always encoded as UTF-8.

海风掠过北极光 2024-10-19 08:41:16

I take it it's the name of the file that contains non-ascii characters, not the file itself, when you say "non-ascii file in file system". It doesn't really matter what the file contains.

You can do this with normal fopen, but you'll have to match the encoding the filesystem uses.

It depends on which version of Linux and which filesystem you're using and how you've set it up, but the filesystem most likely uses UTF-8. So take your wchar_t string (on Linux with GCC this is normally 32-bit UTF-32 rather than UTF-16), convert it to a char string encoded in UTF-8, and pass that to fopen.
