Changing the locale setting makes sed work correctly, but why?

Posted 11-23 20:25

The following is a bash script I wrote to convert all C++-style (//) comments in a C file to C-style (/* */) comments.

#!/bin/bash
# Save the current locale so it can be restored afterwards.
lang=$LANG
# It's necessary to change the locale setting. I don't know why.
export LANG=C
# Comment out the following statement if the dos2unix command is not available.
dos2unix -q "$1"
sed -i -e 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' "$1"
export LANG=$lang

It works, but I found a problem I cannot explain. By default, my locale setting is en_US.UTF-8, and my C code contains comments written in Chinese, such as

// some english 一些中文注释

If I don't change the locale setting, i.e., if I don't run the statement export LANG=C, I get

/* some english */一些中文注释

instead of

/* some english 一些中文注释*/

I don't know why. I only found the solution by trial and error.


After reading Jonathan Leffler's answer, I think I made a mistake that led to a misunderstanding. In the question, those Chinese words were typed in Google Chrome and are not the actual text in my C file. 一些中文注释 just means "some Chinese comments".

Now I typed // some english 一些中文注释 in Visual C++ 6.0 on Windows XP and copied the C file to Debian. Then I just ran sed -i -e 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' $1 and got

/* some english 一些 */中文注释

I think different character encodings (GB18030, GBK, UTF-8?) cause the different results.

The following are my results on Debian:

~/sandbox$ uname -a
Linux xyt-dev 2.6.30-1-686 #1 SMP Sat Aug 15 19:11:58 UTC 2009 i686 GNU/Linux
~/sandbox$ echo $LANG
en_US.UTF-8
~/sandbox$ cat tt.c | od -c -t x1
0000000   /   /       s   o   m   e       e   n   g   l   i   s   h    
         2f  2f  20  73  6f  6d  65  20  65  6e  67  6c  69  73  68  20
0000020 322 273 320 251 326 320 316 304 327 242 312 315
         d2  bb  d0  a9  d6  d0  ce  c4  d7  a2  ca  cd
0000034
~/sandbox$ ./convert_comment_style_cpp2c.sh tt.c
~/sandbox$ cat tt.c | od -c -t x1
0000000   /   *           s   o   m   e       e   n   g   l   i   s   h
         2f  2a  20  20  73  6f  6d  65  20  65  6e  67  6c  69  73  68
0000020     322 273 320 251       *   / 326 320 316 304 327 242 312 315
         20  d2  bb  d0  a9  20  2a  2f  d6  d0  ce  c4  d7  a2  ca  cd
0000040
~/sandbox$ 

I think these Chinese characters are encoded with 2 bytes each (Unicode).
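The twelve bytes after the English text in the first dump (d2 bb ... ca cd) come to two bytes for each of six characters, which looks like a GBK/GB2312-family encoding rather than UTF-8. Below is a minimal sketch for checking whether a file is well-formed UTF-8 before handing it to sed. `is_valid_utf8` is a hypothetical helper name, and the sketch assumes an `iconv` command is installed.

```shell
# Succeed (exit status 0) only if the file is a well-formed UTF-8 byte
# sequence. iconv stops with a non-zero status at the first byte sequence
# that is illegal in the declared source encoding.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}
```

If the check fails, something like `iconv -f GBK -t UTF-8 tt.c > tt.utf8.c` could convert the file first, assuming GBK really is the source encoding and the local iconv build supports it.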

Here is another example:

~/sandbox$ cat tt.c | od -c -t x1
0000000   /   /       I   n   W   i   n   d   o   w   :     250 250   ?
         2f  2f  20  49  6e  57  69  6e  64  6f  77  3a  20  a8  a8  3f
0000020   1   ?
         31  3f
0000022
~/sandbox$ ./convert_comment_style_cpp2c.sh tt.c
~/sandbox$ cat tt.c | od -c -t x1
0000000   /   *           I   n   W   i   n   d   o   w   :           *
         2f  2a  20  20  49  6e  57  69  6e  64  6f  77  3a  20  20  2a
0000020   / 250 250   ?   1   ?
         2f  a8  a8  3f  31  3f

扛刀软妹 2024-11-30 20:25:36

Which platform are you working on? Your sed script works fine on MacOS X without changing the locale. The Linux terminal was less happy with the Chinese characters, but it is not set up to use UTF-8. Moreover, a hex dump of the string that it did get contained a zero byte 0x00 where the Chinese started, which might lead to the confusion. (I note that your regex adds a space before the comment text if the comment starts with // followed by a space.)

MacOS X (10.6.8)

The 'odx' command used here is a hex-dump program.

$ echo "// some english 一些中文注释" > x3.utf8
$ odx x3.utf8
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20   // some english 
0x0010: E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 E9   ................
0x0020: 87 8A 0A                                          ...
0x0023:
$ utf8-unicode x3.utf8
0x2F = U+002F
0x2F = U+002F
0x20 = U+0020
0x73 = U+0073
0x6F = U+006F
0x6D = U+006D
0x65 = U+0065
0x20 = U+0020
0x65 = U+0065
0x6E = U+006E
0x67 = U+0067
0x6C = U+006C
0x69 = U+0069
0x73 = U+0073
0x68 = U+0068
0x20 = U+0020
0xE4 0xB8 0x80 = U+4E00
0xE4 0xBA 0x9B = U+4E9B
0xE4 0xB8 0xAD = U+4E2D
0xE6 0x96 0x87 = U+6587
0xE6 0xB3 0xA8 = U+6CE8
0xE9 0x87 0x8A = U+91CA
0x0A = U+000A
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8
/*  some english 一些中文注释 */
$

All of which looks clean and tidy.

Linux (RHEL 5)

I copied the x3.utf8 file to a Linux box, and dumped it. Then I ran the sed script on it, and all seemed OK:

$ odx x3.utf8
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20   // some english 
0x0010: E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 E9   ................
0x0020: 87 8A 0A                                          ...
0x0023:
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8 | odx
0x0000: 2F 2A 20 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68   /*  some english
0x0010: 20 E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8    ...............
0x0020: E9 87 8A 20 2A 2F 0A                              ... */.
0x0027:
$

So far, so good. I also tried:

$ echo $LANG
en_US.UTF-8
$ echo $LC_CTYPE

$ env | grep LC_
$ bash --version
GNU bash, version 3.2.25(1)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2005 Free Software Foundation, Inc.
$ cat x3.utf8
// some english 一些中文注释
$ echo $(<x3.utf8)
// some english 一些中文注释
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8      
/*  some english 一些中文注释 */
$

So, the terminal is nominally working in UTF-8 after all, and it certainly seems to display the data OK.

However, if I echo the string at the terminal, it gets into a tizzy. When I cut'n'pasted the string to the Linux terminal, it said:

$ echo "// some english d8d^G:
> "
// some english d8d:

$

and beeped.

$ echo "// some english d8d^G:
> " | odx
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20   // some english 
0x0010: 64 38 64 07 3A 0A 0A                              d8d.:..
0x0017:
$

I'm not quite sure what to make of that. I think it means that something in the input side of bash is having some problems, but I'm not quite sure. I also am getting slightly inconsistent results. The first time I tried it, I got:

$ cat > xxx
's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;'
// some english d8^@d:^[d8-f^Gf3(i^G

$ odx xxx
0x0000: 27 73 3B 5E 5C 28 5B 5B 3A 62 6C 61 6E 6B 3A 5D   's;^\([[:blank:]
0x0010: 5D 2A 5C 29 2F 2F 5C 28 2E 2A 5C 29 3B 5C 31 2F   ]*\)//\(.*\);\1/
0x0020: 2A 20 5C 32 20 2A 2F 3B 27 0A 2F 2F 20 73 6F 6D   * \2 */;'.// som
0x0030: 65 20 65 6E 67 6C 69 73 68 20 64 38 00 64 3A 1B   e english d8.d:.
0x0040: 64 38 2D 66 07 66 33 28 69 07 0A 0A               d8-f.f3(i...
0x004C:
$

And in that hex dump, you can see a 0x00 byte (offset 0x003C). That appears at the position where you got the end comment, and a null there could confuse sed; but the whole input is such a mess it is hard to know what to make of it.

甜心小果奶 2024-11-30 20:25:36

Okay, here's the correct answer...

The GNU regular expression library (regex) doesn't match everything when you put a . in your expression. Yup, I know how braindead that sounds.

The problem comes from the word "character". Reasonable people will say that everything in the input file for sed is characters, and even in your case they are perfectly correct. But regex has been programmed to require that the input be perfectly correctly formatted characters of the current locale's character set (UTF-8); if they're correctly formatted characters for the Windows character set (UTF-16), they're not "characters".

So, as . only matches "characters", it doesn't match your characters.
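One way to see this for yourself is to mark where .* stops matching. The sketch below rebuilds the question's sample line from its od dump and prints, as hex, the line with the matched comment text wrapped in square brackets. `sample_line` and `show_match` are names made up for illustration, and the UTF-8 locale name in the usage comment is an assumption about what is installed on your machine.

```shell
# The sample line from the question, rebuilt from the octal values in its
# od dump (the twelve bytes after "english " are not valid UTF-8 as a whole).
sample_line() {
    printf '// some english \322\273\320\251\326\320\316\304\327\242\312\315\n'
}

# Run the substitution under the locale named in $1, wrap whatever .* matched
# in [ ], and print the resulting line as one hex string.
show_match() {
    sample_line |
        LC_ALL="$1" sed 's;^[[:blank:]]*//\(.*\);[\1];' |
        od -An -tx1 | tr -d ' \n'
}

# Usage (the second locale name is an assumption):
#   show_match C
#   show_match en_US.UTF-8
```

In the C locale the closing bracket (5d) lands at the end of the line. In a UTF-8 locale with GNU sed, the question's dump suggests it would land right after d2 bb d0 a9: those two pairs happen to be valid two-byte UTF-8 sequences, while the following d6 d0 is not.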

If you used the regex //.*$, i.e., pinned it to the end of the line, it wouldn't match at all, because there's something that's not a "character" between the // and the end of the line.

And no, you can't do anything like //\(.\|[^.]\)*$; it's just impossible to match those characters without switching to the C locale.

This will also, sometimes, destroy 8-bit transparency; i.e., a binary piped through sed will get corrupted even if no changes are made.

Fortunately, the C locale still uses the reasonable interpretation, so anything that's not a perfectly correctly formatted ASCII-68 character is still a "character".
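Since the fix is to run sed in the C locale, the save-and-restore of LANG in the question's script can be avoided by setting the locale for the sed process alone. This is a sketch, assuming GNU sed (for -i); `cpp2c_comments` is an illustrative name, not something from the question. LC_ALL is used because it takes precedence over LANG and LC_CTYPE, so the override still holds if one of those is set elsewhere.

```shell
# Convert leading // comments to /* ... */ in place for each file given.
# The LC_ALL=C prefix applies to this one sed process only, so the calling
# shell's locale is left untouched and nothing needs restoring afterwards.
cpp2c_comments() {
    LC_ALL=C sed -i -e 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' "$@"
}
```

Called as `cpp2c_comments tt.c`, this treats every byte after // as part of the comment, whatever encoding the file is in.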
