'git grep' and word boundaries on Mac OS X and BSD

Posted on 2024-12-07 05:50:47

I run git grep "\<blah\>" regularly on my Linux development server, but I just discovered that \< and \> do not work on Mac (Mac OS X 10.6.8): the search does not find anything. Is the regular expression syntax different on Mac?

I tried using git grep -E "\<blah\>" but to no avail! :-(

Comments (4)

临走之时 2024-12-14 05:50:48

After struggling with this, too, I found this very helpful post on a BSD mailing list. So here's the (albeit rather ugly) solution:

git grep "[[:<:]]blah[[:>:]]"

The -w flag of git-grep also works but sometimes you want to only match the beginning or end of a word.

Update: This has changed in OS X 10.9 "Mavericks". Now you can use \<, \>, and \b. [[:<:]] and [[:>:]] are no longer supported.
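
When only one side of the word needs anchoring and the boundary escapes differ between platforms, one workaround is to spell the boundary out with POSIX character classes. The following is a minimal sketch, not something taken from the mailing list post: it builds a throwaway repository (the file name sample.txt and its contents are made up) and matches blah only at the start of a word with -E, which should behave the same with GNU and BSD regex libraries.

```shell
#!/bin/sh
# Throwaway repository with hypothetical sample data.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
printf '%s\n' 'blah' 'blahblah' 'a blah b' 'xblah' > sample.txt
git add sample.txt

# Whole-word match via -w (no boundary escapes needed).
whole=$(git grep -h -w 'blah' | wc -l | tr -d ' ')

# Start-of-word only, without \< or [[:<:]]: the match must begin at the
# start of the line or right after a character that is not a word character.
start=$(git grep -h -E '(^|[^[:alnum:]_])blah' | wc -l | tr -d ' ')

echo "whole-word lines: $whole"
echo "start-of-word lines: $start"
```

Note that the emulated boundary consumes the preceding character, so it is suited to finding lines rather than to extracting the exact match.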

红尘作伴 2024-12-14 05:50:48

I guess it's caused by the BSD vs Linux grep library.

See if the -w (match pattern only at word boundary) option to git grep does it for you:

$ git grep -w blah

小帐篷 2024-12-14 05:50:48

You can compile git with PCRE support and use git grep -P "\bblah\b" for word boundaries.

Here's a guide on how to compile git using OSX Homebrew:
http://realultimateprogramming.blogspot.com/2012/01/how-to-enable-git-grep-p-on-os-x-using.html
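
Before relying on -P it may help to check whether the installed git was built with PCRE support at all. A rough probe, assuming that a build without it makes git grep -P exit with a status other than 0 or 1 (a fatal error); the variable name pcre_support and the probe file are only illustrative:

```shell
#!/bin/sh
# Probe in a throwaway repository so the result does not depend on the
# current directory. File name and contents are made up.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
printf 'blah\n' > probe.txt
git add probe.txt

git grep -q -P '\bblah\b' 2>/dev/null
status=$?

# 0 = matched, 1 = no match; anything else is treated as "-P unavailable".
case $status in
    0|1) pcre_support=yes ;;
    *)   pcre_support=no ;;
esac
echo "git grep -P usable: $pcre_support"
```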

吃→可爱长大的 2024-12-14 05:50:48

If you do use -P, make sure to use Git 2.40 (Q1 2023): "grep -P" learned to use Unicode Character Property to grok character classes when processing \b and \w etc.

See commit acabd20 (08 Jan 2023) by Carlo Marcelo Arenas Belón (carenas).
(Merged by Junio C Hamano -- gitster -- in commit 557d93a, 27 Jan 2023)

grep: correctly identify utf-8 characters with \{b,w} in -P

Signed-off-by: Carlo Marcelo Arenas Belón
Acked-by: Ævar Arnfjörð Bjarmason

When UTF is enabled for a PCRE match, the corresponding flags are added to the pcre2_compile() call, but PCRE2_UCP wasn't included.

This prevents extending the meaning of the character classes to include those new valid characters and therefore results in failed matches for expressions that rely on that extension, for example:

$ git grep -P '\bÆvar'

Add PCRE2_UCP so that \w will include Æ and therefore \b could correctly match the beginning of that word.

This has an impact on performance that has been estimated to be between 20% to 40% and that is shown through the added performance test.

That means those patterns will work, with any character:

'\bhow' 
'\bÆvar'
'\d+ \bÆvar'
'\bBelón\b'
'\w{12}\b'
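
To see which behaviour a given installation has, a small experiment along these lines could be used. It is only a sketch: the file contents are made up, a UTF-8 locale is assumed to be active, and the outcome depends on the Git version and on the PCRE2 library it was built against, so no particular result is promised.

```shell
#!/bin/sh
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
# Hypothetical sample line containing a non-ASCII word.
printf 'Acked-by: Ævar\n' > trailer.txt
git add trailer.txt

git --version
git grep -q -P '\bÆvar' 2>/dev/null
case $? in
    0) ucp_result='matched' ;;
    1) ucp_result='no match' ;;
    *) ucp_result='-P not available' ;;
esac
# printf avoids echo implementations that would interpret the backslash.
printf '%s: %s\n' '\bÆvar' "$ucp_result"
```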

With Git 2.41 (Q2 2023), a recent-ish change to allow Unicode character classes to be used with "grep -P" triggered a JIT bug in older pcre2 libraries.
The problematic change in Git built with these older libraries has been disabled to work around the bug.

See commit 14b9a04 (23 Mar 2023) by Mathias Krause (mathiaskrause).
(Merged by Junio C Hamano -- gitster -- in commit d35cd54, 30 Mar 2023)

grep: work around UTF-8 related JIT bug in PCRE2 <= 10.34

Reported-by: Stephane Odul
Signed-off-by: Mathias Krause

Stephane is reporting a regression introduced in Git v2.40.0 that leads to 'git grep' segfaulting in his CI pipeline.
It turns out, he's using an older version of libpcre2 that triggers a wild pointer dereference in the generated JIT code that was fixed in PCRE2 10.35.

Instead of completely disabling the JIT compiler for the buggy version, just mask out the Unicode property handling as we used to do prior to commit acabd20 ("grep: correctly identify utf-8 characters with \{b,w} in -P", 2023-01-08, Git v2.40.0-rc0 -- merge listed in batch #11).
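
To find out whether a machine falls into the affected range, the PCRE2 version can be compared against 10.35. The helper below is a hypothetical sketch (the function name pcre2_has_jit_fix is invented here); it assumes the version string looks like MAJOR.MINOR, as printed by pcre2-config --version, and that this is the library Git was actually linked against.

```shell
#!/bin/sh
# Succeeds (exit status 0) when the given PCRE2 version is 10.35 or newer.
pcre2_has_jit_fix() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%[!0-9]*}   # drop suffixes such as "-RC1"
    [ "$major" -gt 10 ] || { [ "$major" -eq 10 ] && [ "$minor" -ge 35 ]; }
}

# Only query the system when pcre2-config is installed.
if command -v pcre2-config >/dev/null 2>&1; then
    v=$(pcre2-config --version)
    if pcre2_has_jit_fix "$v"; then
        echo "PCRE2 $v: includes the JIT fix"
    else
        echo "PCRE2 $v: older than 10.35, in the affected range"
    fi
fi
```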


Git 2.48 (Q1 2025), batch 7, fixes another issue with 'git grep': a regression on macOS, fixed by disabling lookahead when encountering invalid UTF-8 byte sequences.

See commit ce025ae (20 Oct 2024) by René Scharfe (rscharfe).
(Merged by Taylor Blau -- ttaylorr -- in commit 43ac239, 01 Nov 2024)

grep: disable lookahead on error

Reported-by: David Gstir
Signed-off-by: René Scharfe
Tested-by: David Gstir
Signed-off-by: Taylor Blau

regexec(3) can fail. E.g. on macOS it fails if it is used with a UTF-8 locale to match a valid regex against a buffer containing invalid UTF-8 characters.

git grep has two ways to search for matches in a file:

  • Either it splits its contents into lines and matches them separately,
  • or it matches the whole content and figures out line boundaries later.
    The latter is done by look_ahead() and it's quicker in the common case where most files don't contain a match.

Fall back to line-by-line matching if look_ahead() encounters a regexec(3) error by propagating errors out of patmatch() and bailing out of look_ahead() if there is one.
This way we at least can find matches in lines that contain only valid characters.
That matches the behavior of grep(1) on macOS.

pcre2match() dies if pcre2_jit_match() or pcre2_match() fail, but since we use the flag PCRE2_MATCH_INVALID_UTF it handles invalid UTF-8 characters gracefully.
So implement the fall-back only for regexec(3) and leave the PCRE2 matching unchanged.
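
A small reproduction of the scenario, under some assumptions: the file name and contents are invented, and the byte \377 (0xFF) is used because it can never appear in valid UTF-8. On a Git with this fix, or on a platform whose regexec(3) does not fail on such input, the valid line should still be reported.

```shell
#!/bin/sh
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
# Line 1 is valid; line 2 contains a byte that is invalid in UTF-8.
printf 'needle on a valid line\n\377 broken bytes\n' > mixed.txt
git add mixed.txt

# An ERE with a metacharacter, on the assumption that a plain string
# might be handled by a different matcher than regexec(3).
matches=$(git grep -c -E 'ne+dle' -- mixed.txt)
echo "$matches"
```

An empty result here would suggest that the whole-file match failed and that no line-by-line fallback took place.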
