Implications of LC_ALL=C for speeding up grep

Posted on 2024-12-15 19:14:02

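I just discovered that prefixing my grep commands with LC_ALL=C does wonders for speeding grep up.

But I am wondering about the implications.

Would a pattern that uses UTF-8 fail to match? What happens if the file being grepped is encoded in UTF-8?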

聊慰 2024-12-22 19:14:02

You don't necessarily need UTF-8 to run into trouble here. The locale determines the character classes, i.e. which characters count as a space, a letter, or a digit. Consider these two commands, which feed the single byte 0xE4 (ä in ISO 8859-1) to grep:

$ echo -e '\xe4' | LC_ALL=en_US.iso88591 grep '[[:alnum:]]' || echo false
ä
$ echo -e '\xe4' | LC_ALL=C grep '[[:alnum:]]' || echo false
false
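The character-class behaviour can also be checked without depending on an installed ISO 8859-1 locale. The following is a minimal sketch, assuming a POSIX shell and a grep that supports `-c`; the variable names `c_count` and `ascii_count` are illustrative only. It uses `printf` with an octal escape (`\344` is byte 0xE4) instead of `echo -e`, since `echo -e` is not portable across shells.

```shell
# Count lines whose content is classified as alphanumeric in the C locale.
# grep -c exits with status 1 when the count is 0, hence the "|| true".
c_count=$(printf '\344\n' | LC_ALL=C grep -c '[[:alnum:]]' || true)
ascii_count=$(printf 'a\n' | LC_ALL=C grep -c '[[:alnum:]]' || true)

echo "byte 0xE4 counted as alnum in C locale: $c_count"
echo "ASCII 'a' counted as alnum in C locale: $ascii_count"
```

In the C locale only ASCII letters and digits belong to `[[:alnum:]]`, so the first count should be 0 and the second 1.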

When matching an exact byte sequence against itself, however, the locale makes no difference:

$ echo -e '\xe4' | LC_ALL=en_US.iso88591 grep "$(echo -e '\xe4')" || echo false
ä
$ echo -e '\xe4' | LC_ALL=C grep "$(echo -e '\xe4')" || echo false
ä
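The same point can be sketched with counts rather than printed bytes, which avoids depending on how the terminal renders 0xE4. This assumes a POSIX shell; `needle`, `same_byte`, and `other_byte` are hypothetical names, and `-F` (fixed-string matching) is added here to make the byte-for-byte comparison explicit, which the original commands do not use.

```shell
# Build the one-byte pattern 0xE4 (octal 344) without relying on echo -e.
needle=$(printf '\344')

# Same byte in pattern and input: matches in the C locale.
same_byte=$(printf '\344\n' | LC_ALL=C grep -c -F -e "$needle" || true)

# A different high byte (0xE5) does not match.
other_byte=$(printf '\345\n' | LC_ALL=C grep -c -F -e "$needle" || true)

echo "0xE4 against 0xE4: $same_byte"
echo "0xE4 against 0xE5: $other_byte"
```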

I'm not sure how fully grep implements Unicode, or how well different code point sequences for the same character are matched to each other, but matching any subset of ASCII, or single characters that have no alternate binary representation, should work fine regardless of locale.
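To make the UTF-8 case concrete, here is a rough sketch of what LC_ALL=C means for UTF-8 input, assuming a POSIX shell and GNU grep (other implementations may differ). All variable names are illustrative. In the C locale grep works on bytes, so `.` matches one byte rather than one character, and two byte sequences that render as the same character are simply different patterns.

```shell
# "ä" precomposed (U+00E4) is the two UTF-8 bytes 0xC3 0xA4 (octal 303 244).
one_dot=$(printf '\303\244\n' | LC_ALL=C grep -c '^.$' || true)
two_dots=$(printf '\303\244\n' | LC_ALL=C grep -c '^..$' || true)

# "a" + U+0308 COMBINING DIAERESIS (0xCC 0x88, octal 314 210) looks the same
# on screen but is a different byte sequence.
pattern=$(printf '\303\244')
precomposed=$(printf '\303\244\n' | LC_ALL=C grep -c -F -e "$pattern" || true)
decomposed=$(printf 'a\314\210\n' | LC_ALL=C grep -c -F -e "$pattern" || true)

echo "'^.\$'  matches precomposed ä: $one_dot"
echo "'^..\$' matches precomposed ä: $two_dots"
echo "precomposed pattern vs precomposed input: $precomposed"
echo "precomposed pattern vs decomposed input:  $decomposed"
```

So literal UTF-8 patterns still match identical bytes under LC_ALL=C, while constructs that depend on the notion of a character (`.`, bracket expressions, character classes, case-insensitive matching of non-ASCII letters) are the ones to be careful with.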
