What is the encoding of argv?

Published 2024-10-25 08:37:13

It's not clear to me what encodings are used where in C's argv. In particular, I'm interested in the following scenario:

  • A user uses locale L1 to create a file whose name, N, contains non-ASCII characters
  • Later on, a user uses locale L2 to tab-complete the name of that file on the command line, which is fed into a program P as a command line argument

What sequence of bytes does P see on the command line?

I have observed that on Linux, creating a filename in the UTF-8 locale and then tab-completing it in (e.g.) the zw_TW.big5 locale seems to cause my program P to be fed UTF-8 rather than Big5. However, on OS X the same series of actions results in my program P getting a Big5 encoded filename.

Here is what I think is going on so far (long, and I'm probably wrong and need to be corrected):

Windows

File names are stored on disk in some Unicode format. So Windows takes the name N, converts from L1 (the current code page) to a Unicode version of N we will call N1, and stores N1 on disk.

What I then assume happens is that when tab-completing later on, the name N1 is converted to locale L2 (the new current code page) for display. With luck, this will yield the original name N -- but this won't be true if N contained characters unrepresentable in L2. We call the new name N2.

When the user actually presses enter to run P with that argument, the name N2 is converted back into Unicode, yielding N1 again. This N1 is now available to the program in UCS2 format via GetCommandLineW/wmain/tmain, but users of GetCommandLine/main will see the name N2 in the current locale (code page).

OS X

The disk-storage story is the same, as far as I know. OS X stores file names as Unicode.

With a Unicode terminal, I think what happens is that the terminal builds the command line in a Unicode buffer. So when you tab complete, it copies the file name as a Unicode file name to that buffer.

When you run the command, that Unicode buffer is converted to the current locale, L2, and fed to the program via argv, and the program can decode argv with the current locale into Unicode for display.

Linux

On Linux, everything is different and I'm extra-confused about what is going on. Linux stores file names as byte strings, not in Unicode. So if you create a file with name N in locale L1, then that N, as a byte string, is what is stored on disk.

When I later run the terminal and try and tab-complete the name, I'm not sure what happens. It looks to me like the command line is constructed as a byte buffer, and the name of the file as a byte string is just concatenated onto that buffer. I assume that when you type a standard character it is encoded on the fly to bytes that are appended to that buffer.

When you run a program, I think that buffer is sent directly to argv. Now, what encoding does argv have? It looks like any characters you typed in the command line while in locale L2 will be in the L2 encoding, but the file name will be in the L1 encoding. So argv contains a mixture of two encodings!

Question

I'd really like it if someone could let me know what is going on here. All I have at the moment is half-guesses and speculation, and it doesn't really fit together. What I'd really like to be true is for argv to be encoded in the current code page (Windows) or the current locale (Linux / OS X) but that doesn't seem to be the case...

Extras

Here is a simple candidate program P that lets you observe encodings for yourself:

#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        printf("Not enough arguments\n");
        return 1;
    }
    
    int len = 0;
    for (char *c = argv[1]; *c; c++, len++) {
        printf("%d ", (int)(*c));
    }
    
    printf("\nLength: %d\n", len);
    
    return 0;
}

You can use locale -a to see available locales, and use export LC_ALL=my_encoding to change your locale.


Comments (4)

极致的悲 2024-11-01 08:37:13

Thanks everyone for your responses. I have learnt quite a lot about this issue and have discovered the following things that have resolved my question:

  1. As discussed, on Windows argv is encoded using the current code page. However, you can retrieve the command line as UTF-16 using GetCommandLineW. Use of argv is not recommended for modern Windows apps with Unicode support because code pages are deprecated.

  2. On Unix, argv has no fixed encoding:

    a) File names inserted by tab-completion/globbing will occur in argv verbatim as exactly the byte sequences by which they are named on disk. This is true even if those byte sequences make no sense in the current locale.

    b) Input entered directly by the user using their IME will occur in argv in the locale encoding. (Ubuntu seems to use the locale to decide how to encode IME input, whereas OS X uses the Terminal.app encoding preference.)

This is annoying for languages such as Python, Haskell or Java, which want to treat command line arguments as strings. They need to decide how to decode argv into whatever encoding is used internally for a String (which is UTF-16 for those languages). However, if they just use the locale encoding to do this decoding, then valid filenames in the input may fail to decode, causing an exception.

The solution to this problem adopted by Python 3 is a surrogate-byte encoding scheme, the surrogateescape error handler of PEP 383 (http://www.python.org/dev/peps/pep-0383/), which represents any undecodable byte in argv as a special Unicode code point. When such a code point is encoded back to a byte stream, it just becomes the original byte again. This allows data from argv that is not valid in the current encoding (i.e. a filename named in something other than the current locale) to round-trip through the native Python string type and back to bytes with no loss of information.

As you can see, the situation is pretty messy :-)

瑾兮 2024-11-01 08:37:13

I can only speak about Windows for now. On Windows, code pages are only meant for legacy applications and are not used by the system or by modern applications. Windows uses UTF-16 (and has done so for ages) for everything: text display, file names, the terminal, the system API. Conversions between UTF-16 and the legacy code pages are only performed at the highest possible level, directly at the interface between the system and the application (technically, the older API functions are implemented twice: one function FunctionW that does the real work and expects UTF-16 strings, and one compatibility function FunctionA that simply converts input strings from the current (thread) code page to UTF-16, calls FunctionW, and converts the results back). Tab completion should always yield UTF-16 strings (it definitely does when using a TrueType font) because the console uses only UTF-16 as well. The tab-completed UTF-16 file name is then handed over to the application. If that application is a legacy application (i.e., it uses main instead of wmain/GetCommandLineW etc.), then the Microsoft C runtime (probably) uses GetCommandLineA to have the system convert the command line. So basically I think what you're saying about Windows is correct (except that there is probably no conversion involved while tab-completing): the argv array will always contain the arguments in the code page of the current application, because the information about which code page (L1) the original program used has been irreversibly lost during the intermediate UTF-16 stage.

The conclusion is as always on Windows: Avoid the legacy code pages; use the UTF-16 API wherever you can. If you have to use main instead of wmain (e.g., to be platform independent), use GetCommandLineW instead of the argv array.

痴情 2024-11-01 08:37:13

The output from your test app needs some modifications to make sense of it:
you need the hex codes, and you need to get rid of the negative values (plain char may be signed).
Otherwise you can't read things like UTF-8 multi-byte characters.

First, the modified program:

#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        printf("Not enough arguments\n");
        return 1;
    }

    int len = 0;
    for (unsigned char *c = (unsigned char *)argv[1]; *c; c++, len++) {
        printf("%x ", (*c));
    }

    printf("\nLength: %d\n", len);

    return 0;
}

Then on my Ubuntu box that is using UTF-8 I get this output.

> gcc -std=c99 argc.c -o argc
> ./argc 1ü
31 c3 bc 
Length: 3

And here you can see that in my case ü is encoded as 2 bytes,
and that the 1 is a single byte.
More or less exactly what you expect from a UTF-8 encoding.

And this actually matches what is in the LANG environment variable.

> env | grep LANG
LANG=en_US.utf8

Hope this clarifies the Linux case a little.

/Good luck

能否归途做我良人 2024-11-01 08:37:13

Yep, users have to be careful when mixing locales on Unix in general. GUI file managers that display and change filenames also have this problem. On Mac OS X the standard Unix encoding is UTF-8. In fact the HFS+ filesystem, when called via the Unix interfaces, enforces UTF-8 filenames, because it needs to convert them to UTF-16 for storage in the filesystem itself.
