printf field width: bytes or chars?
The printf/fprintf/sprintf family supports a width field in its format specifier. I have a doubt about the case of (non-wide) char array arguments:
Is the width field supposed to mean bytes or characters?
What is the (correct, or de facto) behaviour if the char array holds, say, a raw UTF-8 string? (I know that normally I should use some wide char type; that's not the point.)
For example, in
    char s[] = "ni\xc3\xb1o"; // UTF-8 encoded "niño"
    fprintf(f, "%5s", s);
is the function supposed to output just 5 bytes (plain C chars), leaving me responsible for misalignment or other problems when two bytes make up one textual character?
Or is it supposed to compute the length of the array in "textual characters", decoding it according to the current locale? (In the example, this would amount to finding out that the string has 4 Unicode chars, so it would add a space for padding.)
UPDATE: I agree with the answers; it is logical that the printf family doesn't distinguish plain C chars from bytes. The problem is that my glibc does not seem to fully respect this notion if the locale has been set previously and one has the (today most common) LANG/LC_CTYPE=en_US.UTF-8.
Case in point:
    #include <stdio.h>
    #include <locale.h>

    int main(void) {
        setlocale(LC_ALL, "");                        /* I have LC_CTYPE="en_US.UTF-8" */
        char s[] = {'n', 'i', (char)0xc3, (char)0xb1, 'o', 0}; /* "niño" in UTF-8: 5 bytes, 4 Unicode chars */
        printf("|%*s|\n", 6, s);                      /* this should pad a blank - works ok */
        printf("|%.*s|\n", 4, s);                     /* this should eat a char - works ok */
        char s3[] = {'A', (char)0xb1, 'B', 0};        /* this is not valid UTF-8 */
        printf("|%s|\n", s3);                         /* print raw chars - ok */
        printf("|%.*s|\n", 15, s3);                   /* panics (why???) */
        return 0;
    }
So, even when a non-POSIX/C locale has been set, printf still seems to have the right notion for counting width: bytes (plain C chars), not Unicode chars. That's fine. However, when given a char array that is not decodable in its locale, it silently panics (it aborts: nothing is printed after the first '|', and there is no error message), but only if it needs to count some width. I don't understand why it even tries to decode the string from UTF-8 when it doesn't need to. Is this a bug in glibc?
Tested with glibc 2.11.1 (Fedora 12) (also glibc 2.3.6).
Note: it's not related to terminal display issues; you can check the output by piping to od: $ ./a.out | od -t cx1
Here's my output:
    0000000   |       n   i 303 261   o   |  \n   |   n   i 303 261   |  \n
             7c  20  6e  69  c3  b1  6f  7c  0a  7c  6e  69  c3  b1  7c  0a
    0000020   |   A 261   B   |  \n   |
             7c  41  b1  42  7c  0a  7c
UPDATE 2 (May 2015): This questionable behaviour has been fixed in newer versions of glibc (from 2.17, it seems). With glibc-2.17-21.fc19 it works fine for me.
Comments (6)
It will result in five bytes being output. And five chars. In ISO C, there is no distinction between chars and bytes. Bytes are not necessarily 8 bits, instead being defined as the width of a char.
The ISO term for an 8-bit value is an octet.
Your "niño" string is actually five characters wide in terms of the C environment (sans the null terminator, of course). If only four symbols show up on your terminal, that's almost certainly a function of the terminal, not C's output functions.
I'm not saying a C implementation couldn't handle Unicode. It could quite easily do UTF-32 if CHAR_BIT were defined as 32. UTF-8 would be harder since it's a variable-length encoding, but there are ways around almost any problem :-)
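To make the byte/character distinction concrete, here is a minimal sketch. The helper name utf8_count_codepoints is hypothetical (it is not a standard function) and it assumes its input is valid UTF-8; C itself only knows the byte count that strlen reports.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>   /* strlen, for comparing against the byte count */

/* Hypothetical helper: counts UTF-8 code points by skipping
   continuation bytes (those of the form 10xxxxxx).
   Assumes s is valid UTF-8; it performs no validation. */
size_t utf8_count_codepoints(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the "ni\xc3\xb1o" string from the question, strlen gives 5 (the number printf works with), while this helper gives 4.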
Based on your update, it seems like you might have a problem. However, I'm not seeing your described behaviour in my setup with the same locale settings. In my case, I'm getting the same output in those last two printf statements.
If your setup is just stopping output after the first | (I assume that's what you mean by abort but, if you meant the whole program aborts, that's much more serious), I would raise the issue with GNU (try your particular distribution's bug procedures first). You've done all the important work, such as producing a minimal test case, so someone should even be happy to run that against the latest version if your distribution doesn't quite get there (most don't).
As an aside, I'm not sure what you meant by checking the od output. On my system, the od output shows that the output stream contains the UTF-8 bytes, meaning that it's the terminal program which must interpret this. C/glibc isn't modifying the output at all, so maybe I just misunderstood what you were trying to say.
Although I've just realised you may be saying that your od output has only the starting bar on that line as well (unlike mine, which appears not to have the problem), meaning that it is something wrong within C/glibc, not something wrong with the terminal silently dropping the characters. In all honesty, I would expect the terminal to drop either the whole line or just the offending character (i.e., output |A); the fact that you're just getting | seems to preclude a terminal problem. Please clarify that.
Bytes (chars). There is no built-in support for Unicode semantics. You can imagine it as resulting in at least 5 calls to fputc.
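A small sketch of what "bytes" means for the width field. The function name format_field is made up for illustration; it only wraps snprintf so the padding can be inspected in a buffer instead of on a terminal.

```c
#include <assert.h>
#include <stdio.h>

/* Formats s right-aligned in a field of `width` bytes, between bars.
   Returns what snprintf returns: the number of bytes the full result
   needs (excluding the terminating null), or a negative value on error. */
int format_field(char *out, size_t out_size, const char *s, int width)
{
    assert(out != NULL && s != NULL);
    return snprintf(out, out_size, "|%*s|", width, s);
}
```

With "ni\xc3\xb1o" and a width of 6, the buffer holds one space followed by the five bytes of the string, which matches the first line of the od output in the question. A width of 5 adds no padding at all, even though a terminal would show only four glyphs.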
What you've found is a bug in glibc. Unfortunately it's an intentional one which the developers refuse to fix. See here for a description:
http://www.kernel.org/pub/linux/libs/uclibc/Glibc_vs_uClibc_Differences.txt
The original question (bytes or chars?) was rightly answered by several people: according to both the spec and the glibc implementation, the width (or precision) in the C printf function counts bytes (or plain C chars, which are the same thing). So fprintf(f,"%5s",s), in my first example, definitely means "try to output at least 5 bytes (plain chars) from the array s; if there are not enough, pad with blanks".
It does not matter whether the string (in my example, of byte length 5) represents text encoded in, say, UTF-8, and in fact contains 4 "textual (Unicode) characters". To printf(), internally, it just has 5 (plain) C chars, and that's what counts.
OK, this seems crystal clear. But it doesn't explain my other problem, so we must be missing something.
Searching the glibc bug tracker, I found some related (rather old) issues; I was not the first one caught by this... feature:
http://sources.redhat.com/bugzilla/show_bug.cgi?id=6530
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=208308
http://sources.redhat.com/bugzilla/show_bug.cgi?id=649
This quote, from the last link, is especially relevant here:
Whether it is a bug (perhaps in interpretation or in the ISO spec itself) is debatable.
But what glibc is doing is clear now.
Recall my problematic statement: printf("|%.*s|\n",15,s3). Here, glibc must find out whether the length of s3 is greater than 15 and, if so, truncate it. Computing this length doesn't require messing with encodings at all. But if the string must be truncated, glibc strives to be careful: if it just kept the first 15 bytes, it could break a multibyte character in half and hence produce invalid text output (I'd be OK with that, but glibc sticks to its curious ISO C99 interpretation).
So, unfortunately, it needs to decode the char array, using the environment locale, to find out where the real character boundaries are. Hence, for example, if LC_CTYPE says UTF-8 and the array is not a valid UTF-8 byte sequence, it aborts (not so bad, because then printf returns -1; not so good, because it prints part of the string anyway, so it's difficult to recover cleanly).
Apparently only in this case, when a precision is specified for a string and there is a possibility of truncation, does glibc need to mix some Unicode semantics with the plain-chars/bytes semantics. Quite ugly, IMO, but so it is.
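If plain byte truncation is what you actually want, one way to sidestep the decoding step is not to go through printf's precision at all. The following is only a sketch under that assumption; write_prefix_bytes is a hypothetical helper, not something glibc provides.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes at most max_bytes bytes of s to out,
   stopping at the terminating null, without any locale decoding.
   Returns the number of bytes written, or -1 on a write error. */
long write_prefix_bytes(FILE *out, const char *s, size_t max_bytes)
{
    assert(out != NULL && s != NULL);
    /* memchr stops at the first null, so it never reads past the string. */
    const char *end = memchr(s, '\0', max_bytes);
    size_t n = (end != NULL) ? (size_t)(end - s) : max_bytes;
    if (fwrite(s, 1, n, out) != n)
        return -1;
    return (long)n;
}
```

Because fwrite treats its input as raw bytes, the s3 array from the question (which is not valid UTF-8) is written in full regardless of LC_CTYPE; the price is that a cut may land in the middle of a multibyte character.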
Update: Notice that this behaviour is relevant not only for invalid original encodings, but also for invalid codes after the truncation. For example:
This truncates the field to 2 bytes, not 3, because it refuses to output an invalid UTF-8 string:
UPDATE (May 2015): This (IMO) questionable behaviour has been changed (fixed) in newer versions of glibc. See the main question.
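The boundary-aware truncation described in this answer can be imitated by hand, independently of the locale and of the glibc version. This is a rough sketch, not glibc's actual code: utf8_prefix_len and format_prefix are hypothetical names, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: largest prefix length <= max_bytes that does not
   end in the middle of a UTF-8 sequence. Assumes s is valid UTF-8. */
size_t utf8_prefix_len(const char *s, size_t max_bytes)
{
    assert(s != NULL);
    size_t len = strlen(s);
    if (len <= max_bytes)
        return len;
    size_t cut = max_bytes;
    /* If the first excluded byte is a continuation byte (10xxxxxx),
       the cut splits a character: back up to that character's lead byte. */
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}

/* Formats the boundary-safe prefix of s between bars;
   returns snprintf's result. */
int format_prefix(char *out, size_t out_size, const char *s, size_t max_bytes)
{
    assert(out != NULL);
    return snprintf(out, out_size, "|%.*s|",
                    (int)utf8_prefix_len(s, max_bytes), s);
}
```

For "ni\xc3\xb1o" and a limit of 3 bytes, the helper returns 2, so only "ni" is emitted rather than "ni" plus half of the ñ; with a limit of 4 the whole ñ fits.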
To be portable, convert the string using mbstowcs and print it using printf( "%6ls", wchar_ptr ). %ls is the specifier for a wide string according to POSIX.
There is no "de facto" standard. Typically, I would expect stdout to accept UTF-8 if the OS and locale have been configured to treat it as a UTF-8 file, but I would expect printf to be ignorant of multibyte encoding because it isn't defined in those terms.
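A hedged sketch of the wide-character route. The helper names to_wide and format_wide_field are hypothetical. One caveat, as far as I understand the C standard: in the narrow printf family the field width still counts bytes of the converted output, even for %ls, so this sketch formats with swprintf, where the width is counted in wide characters.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: decodes s according to the current LC_CTYPE.
   Returns a malloc'ed wide string (caller frees), or NULL if s is not
   valid in that locale or memory runs out. */
wchar_t *to_wide(const char *s)
{
    assert(s != NULL);
    /* With a null destination, POSIX mbstowcs returns the required
       length, or (size_t)-1 if the input cannot be decoded. */
    size_t n = mbstowcs(NULL, s, 0);
    if (n == (size_t)-1)
        return NULL;
    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w != NULL)
        mbstowcs(w, s, n + 1);
    return w;
}

/* Formats w right-aligned in `width` wide characters, between bars.
   Returns swprintf's result (wide chars written, negative on error). */
int format_wide_field(wchar_t *out, size_t out_len, const wchar_t *w, int width)
{
    assert(out != NULL && w != NULL);
    return swprintf(out, out_len, L"|%*ls|", width, w);
}
```

To decode UTF-8 input, the program must first call setlocale(LC_ALL, "") (or similar) with a UTF-8 locale in effect. Once "niño" is a four-element wide string, a width of 6 yields two spaces of padding, which is the character-based alignment the question was asking about.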
Don't use mbstowcs unless you also make sure that wchar_t is at least 32 bits long. Otherwise you'll likely end up with UTF-16, which has all the disadvantages of UTF-8 and all the disadvantages of UTF-32.
I'm not saying avoid mbstowcs; I'm just saying don't let Windows programmers use it.
It might be simpler to use iconv to convert to UTF-32.
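A minimal sketch of the iconv route, assuming a POSIX iconv that knows the encoding names "UTF-8" and "UTF-32LE" (glibc does; other implementations may differ, and some platforms need -liconv at link time). The function name utf8_to_utf32 is hypothetical.

```c
#include <assert.h>
#include <iconv.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decodes the UTF-8 string s into UTF-32 code points.
   Returns the number of code points stored in out, or -1 on failure
   (invalid or incomplete input, output too small, or no converter). */
long utf8_to_utf32(const char *s, uint32_t *out, size_t out_cap)
{
    assert(s != NULL && out != NULL);
    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char *in = (char *)s;            /* iconv takes char **, but only reads */
    size_t in_left = strlen(s);
    char *dst = (char *)out;
    size_t out_left = out_cap * sizeof *out;

    size_t rc = iconv(cd, &in, &in_left, &dst, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    /* Turn each little-endian 4-byte group into a host-order value. */
    size_t n = out_cap - out_left / sizeof *out;
    for (size_t i = 0; i < n; i++) {
        const unsigned char *b = (const unsigned char *)&out[i];
        uint32_t cp = (uint32_t)b[0] | (uint32_t)b[1] << 8 |
                      (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
        out[i] = cp;
    }
    return (long)n;
}
```

This does not depend on the size of wchar_t or on the current locale: "ni\xc3\xb1o" decodes to four code points, with U+00F1 in the third slot, and a byte sequence that is not valid UTF-8 is reported as an error instead of being passed through.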