How to make grep [A-Z] independent of locale?

Posted on 2024-11-25 15:34:27

I was doing some everyday grepping and suddenly discovered that something seemingly trivial does not work:

$ echo T | grep [A-Z]

No match.

How come T is not within A-Z range?

I changed the regex a tiny bit:

$ echo T | grep [A-Y]

A match!

Whoa! How is T within A-Y but not within A-Z?

Apparently this is because my environment is set to Estonian locale where Y is at the end of the alphabet but Z is somewhere in the middle: ABCDEFGHIJKLMNOPQRSŠZŽTUVWÕÄÖÜXY

$ echo $LANG
et_EE.UTF-8

This all came as a bit of a shock to me. 99% of the time I grep computer code, not Estonian literature. Have I been using grep the wrong way all this time? What kinds of mistakes have I made because of this in the past?

After trying several things I arrived at the following solution:

$ echo T | LANG=C grep [A-Z]

Is this the recommended way to make grep locale-independent?

Furthermore, would it be safe to define an alias like this:

$ alias grep="LANG=C grep"

PS. I'm also wondering why character ranges like [A-Z] are locale-dependent in the first place, while \w seems to be unaffected by locale (the manual says \w is equivalent to [[:alnum:]], but I found that the latter depends on locale while \w does not).

Comments (1)

記柔刀 2024-12-02 15:34:27

POSIX regular expressions, which Linux and FreeBSD grep support naturally, and some others support on request, have a series of [:xxx:] patterns that honor locales. See the man page for details.

   grep '[[:upper:]]' 

As the []s are part of the pattern name you need the outer [] as well, regardless of how strange it looks.
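
As a small sketch of that pattern in use: the three-line sample input is made up for illustration, and LC_ALL=C is set here only so the result is reproducible on any machine. Quoting the pattern also keeps the shell from treating the brackets as a filename glob.

```shell
# Hypothetical sample input: an uppercase letter, a lowercase letter, a digit.
# LC_ALL=C is an assumption made for reproducibility; drop it to use your own locale.
matches=$(printf 'T\nt\n7\n' | LC_ALL=C grep '[[:upper:]]')

# Only the line containing an uppercase letter should be selected.
echo "$matches"
```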

With the advent of these [:xxx:] classes, the classic \w and similar escapes remain strictly in the C locale. Thus your choice of pattern determines whether grep uses the current locale or not.

[A-Z] should follow the locale, but you may need to set LC_ALL rather than LANG, especially if the system sets LC_ALL to a different value for you.
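
A minimal sketch of the LC_ALL variant, assuming a POSIX shell. When LC_ALL is set it takes precedence over LC_COLLATE and LANG, so prefixing it to a single command pins that one command to the C locale without touching the rest of the session. The variable name `result` is just for illustration.

```shell
# Set LC_ALL for this one grep invocation only; the session locale is unchanged.
# In the C locale, [A-Z] covers exactly the 26 ASCII uppercase letters.
result=$(printf 'T\n' | LC_ALL=C grep '[A-Z]')

# The line is matched regardless of what LANG is set to in the session.
echo "$result"
```

The pattern is quoted so that the shell passes it to grep unchanged even if a single-letter uppercase filename exists in the current directory.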
