What encoding is used when calling fopen or open?
When we invoke a system call in Linux, like open, or a stdio function, like fopen, we must provide a const char *filename. My question is: what encoding is used here? Is it UTF-8, ASCII, or ISO 8859-x? Does it depend on the system or environment settings?

I know that in MS Windows there is _wopen, which accepts UTF-16.
Comments (6)
It's a byte string; the interpretation is up to the particular filesystem.
Filesystem calls on Linux are encoding-agnostic, i.e. they do not (need to) know about the particular encoding. As far as they are concerned, the byte-string pointed to by the filename argument is passed down to the filesystem as-is. The filesystem expects that filenames are in the correct encoding (usually UTF-8, as mentioned by Matthew Talbert).
This means that you often don't need to do anything (filenames are treated as opaque byte-strings), but it really depends on where you receive the filename from, and whether you need to manipulate the filename in any way.
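To make that concrete, here is a minimal sketch (my own illustration, not from the answer): the filename below is just the UTF-8 byte sequence for café.txt, and open() passes those bytes through untouched.

```c
/* Sketch: open() treats the filename as an opaque byte string.
 * "caf\xc3\xa9.txt" is simply the UTF-8 byte sequence for "café.txt";
 * the kernel stores exactly these bytes in the directory entry without
 * decoding or re-encoding them. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *name = "caf\xc3\xa9.txt";   /* opaque byte string */

    int fd = open(name, O_CREAT | O_WRONLY, 0644);
    if (fd == -1) {
        perror("open");
        return 1;
    }
    close(fd);
    return 0;
}
```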
It depends on the system locale. Look at the output of the "locale" command. If the variables end in UTF-8, then your locale is UTF-8. Most modern Linux systems use UTF-8. Although Andrew is correct that technically it's just a byte string, if you don't match the system locale some programs may not work correctly, and it will be impossible to get correct user input, etc. It's best to stick with UTF-8.
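For completeness, a program can query the same information as the "locale" command from C; this is a small sketch using the standard nl_langinfo(CODESET) call.

```c
/* Sketch: ask the current locale which character encoding it uses.
 * On most modern Linux systems this prints "UTF-8". */
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

int main(void)
{
    /* Adopt the locale from the environment (LC_ALL, LANG, ...). */
    setlocale(LC_ALL, "");

    /* nl_langinfo(CODESET) names the encoding of the current locale. */
    printf("locale codeset: %s\n", nl_langinfo(CODESET));
    return 0;
}
```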
The filename is the byte string; regardless of locale or any other conventions you're using about how filenames should be encoded, the string you must pass to fopen and to all functions taking filenames/pathnames is the exact byte string for how the file is named. For example, if you have a file named ö.txt in UTF-8 in NFC, and your locale is UTF-8 encoded and uses NFC, you can just write the name as ö.txt and pass that to fopen. If your locale is Latin-1 based, though, you can't pass the Latin-1 form of ö.txt ("\xf6.txt") to fopen and expect it to succeed; that's a different byte string and thus a different filename. You would need to pass "\xc3\xb6.txt" ("Ã¶.txt" if you interpret that as Latin-1), the same byte string as the actual name.

This situation is very different from Windows, which you seem to be familiar with, where the filename is a sequence of 16-bit units interpreted as UTF-16 (although AFAIK they need not actually be valid UTF-16), and filenames passed to fopen, etc. are interpreted according to the current locale as Unicode characters, which are then used to open/access the file based on its UTF-16 name.
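A short sketch of this point (the filenames are illustrative): only the exact byte string the file was created with will open it; the Latin-1 spelling of the same character is simply a different filename.

```c
/* Sketch: a file created as UTF-8 "ö.txt" ("\xc3\xb6.txt") can only be
 * opened by passing exactly those bytes; "\xf6.txt" (the Latin-1 form)
 * names a different file, which normally does not exist. */
#include <stdio.h>

int main(void)
{
    /* Create the file under its UTF-8 byte-string name. */
    FILE *f = fopen("\xc3\xb6.txt", "w");
    if (f) fclose(f);

    /* Reopening with the same bytes succeeds... */
    f = fopen("\xc3\xb6.txt", "r");
    printf("UTF-8 bytes:   %s\n", f ? "opened" : "failed");
    if (f) fclose(f);

    /* ...but the Latin-1 byte string is a different filename. */
    f = fopen("\xf6.txt", "r");
    printf("Latin-1 bytes: %s\n", f ? "opened" : "failed");
    if (f) fclose(f);

    return 0;
}
```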
As already mentioned above, this will be a byte string, and the interpretation will be open to the underlying system. More specifically, imagine two C functions, one in user space and one in kernel space, which take char * as their parameter. The encoding in user space will depend upon the execution character set of the user program (e.g. specified by -fexec-charset=charset in gcc). The encoding expected by the kernel function depends upon the execution charset used during kernel compilation (not sure where to get that information).
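To illustrate the user-space half of this, here is a hedged sketch; -fexec-charset and -finput-charset are real GCC options, everything else is just for demonstration. The bytes that actually reach fopen for a literal filename are whatever the compiler's execution character set produced.

```c
/* Sketch: the byte string a literal filename turns into depends on the
 * compiler's execution character set. Compiled e.g. with
 *   gcc -finput-charset=UTF-8 -fexec-charset=UTF-8 demo.c
 * the literal below becomes "\xc3\xb6.txt"; with
 * -fexec-charset=ISO-8859-1 it becomes "\xf6.txt". */
#include <stdio.h>

int main(void)
{
    const char *name = "ö.txt";      /* bytes chosen at compile time */

    /* Print the bytes fopen would actually receive. */
    for (const unsigned char *p = (const unsigned char *)name; *p; p++)
        printf("%02x ", *p);
    printf("\n");

    return 0;
}
```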
I did some further inquiries on this topic and came to the conclusion that there are two different ways in which filename encoding can be handled by unixoid file systems.

1. File names are encoded in the "system locale", which usually is, but need not be, the same as the current environment locale reflected by the locale command (it may instead be some preset in a global configuration file).
2. File names are encoded in UTF-8, independent of any locale settings.

GTK+ solves this mess by assuming UTF-8 and allowing it to be overridden either by the current locale encoding or by a user-supplied encoding.

Qt solves it by assuming the locale encoding (and that the system locale is reflected in the current locale) and allowing it to be overridden with a user-supplied conversion function.

So the bottom line is: use either UTF-8 or what LC_ALL or LANG tell you by default, and provide an override setting at least for the other alternative.
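As one way to illustrate that bottom line, here is a sketch of such a policy; the MYAPP_FILENAME_ENCODING variable is a made-up name for this example, not a standard one.

```c
/* Sketch of the suggested policy: treat filenames as the locale encoding
 * (taken from LC_ALL/LANG via the locale machinery) or as UTF-8, and
 * offer a user-supplied override. MYAPP_FILENAME_ENCODING is hypothetical. */
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

static const char *filename_encoding(void)
{
    const char *override = getenv("MYAPP_FILENAME_ENCODING");
    if (override && *override)
        return override;                         /* user override wins */

    const char *codeset = nl_langinfo(CODESET);  /* from the locale */
    if (codeset && *codeset)
        return codeset;

    return "UTF-8";                              /* fallback default */
}

int main(void)
{
    setlocale(LC_ALL, "");
    printf("treating filenames as: %s\n", filename_encoding());
    return 0;
}
```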