Does UTF-8 encoding mess up file globbing and grep?

Posted on 2024-09-24 13:10:45

I'm playing with bash and experimenting with UTF-8 encoding. I'm new to Unicode.
The following commands (well, their output) surprise me:

$ locale
LANG="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_CTYPE="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_ALL=
$ printf '1\né\n12\n123\n' | egrep '^(.|...)$'
1
é
12
$ touch 1 é 12 123
$ ls | egrep '^(.|...)$'
1
123

OK. Both egrep invocations keep only the lines that contain one or three characters. Their inputs are almost identical, but the outputs differ for the character é. Any explanation?

More details on my environment:

$ uname -a
Darwin macbook-pro-de-admin-6.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386
$ egrep -V
egrep (GNU grep) 2.5.1

Copyright 1988, 1992-1999, 2000, 2001 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Comments (1)

很快妥协 2024-10-01 13:10:45

Any variable-length encoding can confuse tools that are not aware of the encoding. Such a tool assumes that one byte equals one character, so a single-character wildcard matches one byte rather than one character. If you use literal characters, this doesn't matter for UTF-8, since the structure of UTF-8 prevents matches in the middle of a character (assuming proper encoding).
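To illustrate the difference between wildcards and literals, here is a minimal sketch. It assumes that running grep under LC_ALL=C makes it treat every byte as one character (true of recent GNU and BSD grep as far as I know). The names `e_acute` and `count_bytewise` are made up for this example.

```shell
# Precomposed é (U+00E9) written as its two UTF-8 bytes, C3 A9.
e_acute=$(printf '\303\251')

# Hypothetical helper: count the lines of $1 that match pattern $2 when grep
# is forced into byte-oriented matching with LC_ALL=C.
# "|| true" hides grep's non-zero exit status when nothing matches.
count_bytewise() {
  printf '%s\n' "$1" | LC_ALL=C grep -c "$2" || true
}

count_bytewise "$e_acute" '^.$'           # one dot cannot cover the two bytes of é
count_bytewise "$e_acute" '^..$'          # two dots do: one per byte
count_bytewise "caf$e_acute" "$e_acute"   # a literal é is still found byte for byte
```

With a byte-oriented grep, the three calls should print 0, 1 and 1: the wildcard miscounts, while the literal match is unaffected.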

At least some versions of grep are supposed to be UTF-8 aware. According to http://mailman.uib.no/public/corpora/2006-December/003760.html, GNU grep 2.5.1 and later are among them, as long as an appropriate LANG is set. If you use an older version, however, or something other than GNU grep, that would likely be the cause of your problem, since é is a two-byte character in UTF-8 (0xC3 0xA9).

EDIT: Based on your recent comment, your grep is probably Unicode-aware, but it does not perform any sort of Unicode normalization (and I wouldn't really expect it to, to be honest).

0x65 0xCC 0x81 is an e followed by COMBINING ACUTE ACCENT (U+0301). This is effectively two characters, but it is rendered as one due to the semantics of combining characters. grep therefore counts it as two characters: one for the e and one for the accent.
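The two spellings of é can be compared at the byte level with standard tools. This is only a sketch: `nfc`, `nfd`, `hex_bytes` and `codepoint_count` are names invented here, and the code-point count relies on the fact that UTF-8 continuation bytes always lie in the range 0x80-0xBF.

```shell
nfc=$(printf '\303\251')     # U+00E9, precomposed é
nfd=$(printf 'e\314\201')    # U+0065 U+0301, e + COMBINING ACUTE ACCENT

# Hex dump of a string with offsets, spaces and newlines stripped.
hex_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# Number of code points: drop UTF-8 continuation bytes (octal 200-277,
# i.e. 0x80-0xBF) and count the bytes that remain.
codepoint_count() {
  printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '
}

printf '%s -> %s code point(s)\n' "$(hex_bytes "$nfc")" "$(codepoint_count "$nfc")"
# c3a9 -> 1 code point(s)
printf '%s -> %s code point(s)\n' "$(hex_bytes "$nfd")" "$(codepoint_count "$nfd")"
# 65cc81 -> 2 code point(s)
```

A UTF-8-aware grep that does no normalization sees the same counts, which is why `.` matches the precomposed form while the decomposed form falls through both `.` and `...`.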

It seems likely that decomposed Unicode is how the file name is actually stored in your file system. Otherwise, you could store files that, for all intents and purposes, have the exact same name but differ only in their use of combining characters.
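One way to check what your file system actually stores is to create a file whose name is the precomposed é and dump the bytes that `ls` returns. The helper name `stored_name_hex` and the scratch-directory template are assumptions made for this sketch, and the result depends on the file system the scratch directory lives on.

```shell
# Hypothetical helper: create a file named with precomposed é (bytes C3 A9)
# in a scratch directory, then print the bytes of the name as the file system
# reports it back.
stored_name_hex() {
  dir=$(mktemp -d "${TMPDIR:-/tmp}/fsnorm.XXXXXX") || return 1
  touch "$dir/$(printf '\303\251')"
  ls "$dir" | tr -d '\n' | od -An -tx1 | tr -d ' \n'
  rm -rf "$dir"
}

stored_name_hex
```

If this prints c3a9, the name was stored as given. If it prints 65cc81, the file system decomposed the name, which would explain why `ls | egrep` sees two characters.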
