Changing the locale setting makes sed work, but why?

Posted on 11-23 20:25


The following is a bash script I wrote to convert all C++-style (//) comments in a C file to C style (/* */).

#!/bin/bash
lang=$LANG
# It's necessary to change the locale setting. I don't know why.
export LANG=C
# Comment out the following statement if the dos2unix command is not available.
dos2unix -q "$1"
sed -i -e 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' "$1"
export LANG=$lang

It works, but I found a problem I cannot explain. By default, my locale setting is en_US.UTF-8. My C code contains comments written in Chinese, such as

// some english 一些中文注释

If I don't change the locale setting, i.e., if I do not run the statement export LANG=C, I get

/* some english */一些中文注释

instead of

/* some english 一些中文注释*/

I don't know why. I only found the workaround by trial and error.
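As a side note on the workaround itself, the locale can be overridden for the sed process alone instead of being exported and restored. The following is only a sketch: `cpp2c_comments` is a hypothetical helper name, and a temporary file is used in place of `sed -i` so it does not depend on one sed flavour. `LC_ALL` is used because it takes precedence over both `LC_CTYPE` and `LANG`, so it still applies if one of those is set in the environment.

```shell
#!/bin/sh
# cpp2c_comments (hypothetical helper): convert leading // comments in the
# file named by $1, with the C locale applied to the sed process only, so
# nothing has to be saved and restored afterwards.
cpp2c_comments() {
    tmp=$(mktemp) || return 1
    LC_ALL=C sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' "$1" > "$tmp" &&
        cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return $status
}

# Demo on a throwaway file. The octal escapes are the GBK bytes d2 bb d0 a9
# taken from the hex dump in the question.
demo=$(mktemp)
printf '    // some english \322\273\320\251\nint x; // trailing\n' > "$demo"
cpp2c_comments "$demo"
result=$(cat "$demo")
rm -f "$demo"
printf '%s\n' "$result"
```

The second demo line is left unchanged because the expression only matches comments that start a line (after optional blanks).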

After reading Jonathan Leffler's answer, I think I made a mistake that led to some misunderstanding. In the question, those Chinese words were typed into Google Chrome and are not the actual bytes in my C file. 一些中文注释 just means "some Chinese comments".

Now I typed // some english 一些中文注释 in Visual C++ 6.0 on Windows XP and copied the C file to Debian. Then I just ran sed -i -e 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' $1 and got

/* some english 一些 */中文注释

I think the different character encodings (GB18030, GBK, UTF-8?) cause the different results.

The following are the results I got on Debian:

~/sandbox$ uname -a
Linux xyt-dev 2.6.30-1-686 #1 SMP Sat Aug 15 19:11:58 UTC 2009 i686 GNU/Linux
~/sandbox$ echo $LANG
en_US.UTF-8
~/sandbox$ cat tt.c | od -c -t x1
0000000   /   /       s   o   m   e       e   n   g   l   i   s   h    
         2f  2f  20  73  6f  6d  65  20  65  6e  67  6c  69  73  68  20
0000020 322 273 320 251 326 320 316 304 327 242 312 315
         d2  bb  d0  a9  d6  d0  ce  c4  d7  a2  ca  cd
0000034
~/sandbox$ ./convert_comment_style_cpp2c.sh tt.c
~/sandbox$ cat tt.c | od -c -t x1
0000000   /   *           s   o   m   e       e   n   g   l   i   s   h
         2f  2a  20  20  73  6f  6d  65  20  65  6e  67  6c  69  73  68
0000020     322 273 320 251       *   / 326 320 316 304 327 242 312 315
         20  d2  bb  d0  a9  20  2a  2f  d6  d0  ce  c4  d7  a2  ca  cd
0000040
~/sandbox$

I think these Chinese characters are encoded with 2 bytes each (Unicode).
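If the file's encoding does not match the locale, one alternative to switching locales is to convert the file to UTF-8 first. This is a hedged sketch, not something tried in this thread: it assumes the bytes in the dump above are GBK (d2 bb d0 a9 decode as 一些 under that assumption) and that an `iconv` with GBK support is installed.

```shell
#!/bin/sh
# The sample line with the first four Chinese bytes from the dump above
# (d2 bb d0 a9), written as octal escapes.
gbk_line=$(printf '// some english \322\273\320\251')

if command -v iconv >/dev/null 2>&1; then
    # Assumption: the source encoding is GBK. Change -f if the file
    # actually uses another encoding.
    utf8_line=$(printf '%s\n' "$gbk_line" | iconv -f GBK -t UTF-8)
    # After conversion the line is valid UTF-8, so the expression can run
    # under the ambient locale without the LANG=C override.
    converted=$(printf '%s\n' "$utf8_line" |
        sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;')
    printf '%s\n' "$converted"
else
    echo "iconv not found; skipping the conversion demo" >&2
fi
```

Whether converting the source files is acceptable depends on what the Windows tools that edit them expect, so this is an option to weigh rather than a recommendation.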

Here is another example:

~/sandbox$ cat tt.c | od -c -t x1
0000000   /   /       I   n   W   i   n   d   o   w   :     250 250   ?
         2f  2f  20  49  6e  57  69  6e  64  6f  77  3a  20  a8  a8  3f
0000020   1   ?
         31  3f
0000022
~/sandbox$ ./convert_comment_style_cpp2c.sh tt.c
~/sandbox$ cat tt.c | od -c -t x1
0000000   /   *           I   n   W   i   n   d   o   w   :           *
         2f  2a  20  20  49  6e  57  69  6e  64  6f  77  3a  20  20  2a
0000020   / 250 250   ?   1   ?
         2f  a8  a8  3f  31  3f

扛刀软妹 2024-11-30 20:25:36

Which platform are you working on? Your sed script works fine on MacOS X without changing the locale. The Linux terminal was less happy with the Chinese characters, but it is not set up to use UTF-8. Moreover, a hex dump of the string that it did get contained a zero byte 0x00 where the Chinese started, which might lead to the confusion. (I note that your regex adds a space before the comment text if the text after // starts with a space.)

MacOS X (10.6.8)

The 'odx' command used is a hex-dump program.
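For readers who do not have odx, a rough stand-in can be built from the standard `od` utility. This is only a sketch: `hexbytes` is a hypothetical name, and its output is a flat list of hex pairs rather than odx's offset-plus-text layout.

```shell
#!/bin/sh
# hexbytes (hypothetical name): print the input bytes as space-separated
# hex pairs on one line. With no arguments it reads standard input.
hexbytes() {
    od -An -tx1 "$@" | tr '\n' ' ' | tr -s ' ' | sed 's/^ //; s/ $//'
}

# "// " followed by the UTF-8 encoding of U+4E00 (e4 b8 80).
dump=$(printf '// \344\270\200' | hexbytes)
printf '%s\n' "$dump"
# prints: 2f 2f 20 e4 b8 80
```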

$ echo "// some english 一些中文注释" > x3.utf8
$ odx x3.utf8
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20   // some english 
0x0010: E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 E9   ................
0x0020: 87 8A 0A                                          ...
0x0023:
$ utf8-unicode x3.utf8
0x2F = U+002F
0x2F = U+002F
0x20 = U+0020
0x73 = U+0073
0x6F = U+006F
0x6D = U+006D
0x65 = U+0065
0x20 = U+0020
0x65 = U+0065
0x6E = U+006E
0x67 = U+0067
0x6C = U+006C
0x69 = U+0069
0x73 = U+0073
0x68 = U+0068
0x20 = U+0020
0xE4 0xB8 0x80 = U+4E00
0xE4 0xBA 0x9B = U+4E9B
0xE4 0xB8 0xAD = U+4E2D
0xE6 0x96 0x87 = U+6587
0xE6 0xB3 0xA8 = U+6CE8
0xE9 0x87 0x8A = U+91CA
0x0A = U+000A
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8
/*  some english 一些中文注释 */
$

All of which looks clean and tidy.

Linux (RHEL 5)

I copied the x3.utf8 file to a Linux box, and dumped it. Then I ran the sed script on it, and all seemed OK:

$ odx x3.utf8
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20   // some english 
0x0010: E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 E9   ................
0x0020: 87 8A 0A                                          ...
0x0023:
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8 | odx
0x0000: 2F 2A 20 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68   /*  some english
0x0010: 20 E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8    ...............
0x0020: E9 87 8A 20 2A 2F 0A                              ... */.
0x0027:
$

So far, so good. I also tried:

$ echo $LANG
en_US.UTF-8
$ echo $LC_CTYPE

$ env | grep LC_
$ bash --version
GNU bash, version 3.2.25(1)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2005 Free Software Foundation, Inc.
$ cat x3.utf8
// some english 一些中文注释
$ echo $(<x3.utf8)
// some english 一些中文注释
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8      
/*  some english 一些中文注释 */
$

So, the terminal is nominally working in UTF-8 after all, and it certainly seems to display the data OK.

However, if I echo the string at the terminal, it gets into a tizzy. When I cut'n'pasted the string to the Linux terminal, it said:

$ echo "// some english d8d^G:
> "
// some english d8d:

$

and beeped.

$ echo "// some english d8d^G:
> " | odx
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20   // some english 
0x0010: 64 38 64 07 3A 0A 0A                              d8d.:..
0x0017:
$

I'm not quite sure what to make of that. I think it means that something in the input side of bash is having some problems, but I'm not quite sure. I also am getting slightly inconsistent results. The first time I tried it, I got:

$ cat > xxx
's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;'
// some english d8^@d:^[d8-f^Gf3(i^G

$ odx xxx
0x0000: 27 73 3B 5E 5C 28 5B 5B 3A 62 6C 61 6E 6B 3A 5D   's;^\([[:blank:]
0x0010: 5D 2A 5C 29 2F 2F 5C 28 2E 2A 5C 29 3B 5C 31 2F   ]*\)//\(.*\);\1/
0x0020: 2A 20 5C 32 20 2A 2F 3B 27 0A 2F 2F 20 73 6F 6D   * \2 */;'.// som
0x0030: 65 20 65 6E 67 6C 69 73 68 20 64 38 00 64 3A 1B   e english d8.d:.
0x0040: 64 38 2D 66 07 66 33 28 69 07 0A 0A               d8-f.f3(i...
0x004C:
$

And in that hex dump, you can see a 0x00 byte (offset 0x003C). That appears at the position where you got the end comment, and a null there could confuse sed; but the whole input is such a mess it is hard to know what to make of it.


甜心小果奶 2024-11-30 20:25:36

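OK, here is the correct answer...

The GNU regular expression library (regex) does not match everything when you put a . in an expression. Yes, I know how brain-dead that sounds.

The problem comes from the word "character". Sane people will say that everything in a file that is input to sed is a character, and even in your case they are perfectly right. But regex has been programmed to require that the input consist of perfectly well-formed characters of the current locale's character set (UTF-8). If the bytes are well-formed characters of a Windows character set instead (GBK, judging by the hex dump in the question), they are not "characters".

So, because . only matches "characters", it does not match your characters.

If you use the regex //.*$, i.e. anchor it to the end of the line, it does not match at all, because there is something between the // and the end of the line that is not a "character".

And no, you cannot do something like //\(.\|[^.]\)*$ either; it is simply impossible to match those bytes without switching to the C locale.

Sometimes this also breaks 8-bit transparency, i.e. a binary file piped through sed can be corrupted even when no changes are made.

Fortunately, the C locale still uses a sane interpretation, so anything that is not a perfectly well-formed ASCII-68 character is still a "character".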
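The effect can be reproduced with the bytes from the question's hex dump. In that dump, d2 bb and d0 a9 each happen to form a valid two-byte UTF-8 sequence, while d6 d0 does not (d0 is not a continuation byte), which is consistent with the match stopping after the fourth byte. The sketch below runs the same expression under two locales; `convert_under` is a hypothetical helper, and the UTF-8 result depends on the sed implementation and on whether the en_US.UTF-8 locale is installed, so no output is claimed for it.

```shell
#!/bin/sh
# Sample line from the question: ASCII text followed by the GBK bytes
# d2 bb d0 a9 d6 d0, written as octal escapes.
sample=$(printf '// some english \322\273\320\251\326\320')

# convert_under (hypothetical helper): run the question's sed expression
# with the locale named in $1 applied to that one sed process only.
convert_under() {
    printf '%s\n' "$sample" |
        LC_ALL=$1 sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' 2>/dev/null
}

c_result=$(convert_under C)
utf8_result=$(convert_under en_US.UTF-8)

# Dump both results byte by byte so they can be compared.
echo "C locale:"
printf '%s\n' "$c_result" | od -c
echo "en_US.UTF-8 (if installed):"
printf '%s\n' "$utf8_result" | od -c
```

In the C locale every byte counts as a character, so the closing */ lands at the end of the line.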