Double-anchoring a regular expression

Posted 2024-09-11 18:09:56

I want to accept an arbitrary regular expression from the user and anchor it on both sides to enforce a full match (^<user's-regex>$). However, I don't know whether I have to account for the possibility that the user has already anchored their regex.

It looks like Perl, C++, .NET and JavaScript all allow multiple anchoring.

"hello" =~ /^h/ # true
"hello" =~ /^^h/ # true
"hello" =~ /^^^h/ # true
"hello" =~ /e/ # true
"hello" =~ /^e/ # false
"hello" =~ /^^e/ # false

Does anyone know whether this is specified to work this way? Can I depend on this behaviour, or is it an accident that is liable to change in the future?


Edit: We need this because we're using VBScript's regular expressions (via COM). We're using match, but it returns all matches, so matching the string abc against .*a.* is much slower than matching it against ^.*a.*$. With the anchoring suggested by @Tim, matching long strings became more than 12 times faster.


Comments (1)

深海夜未眠 2024-09-18 18:09:56

You can depend on this behavior. The regex engine doesn't mind asserting the same thing once, twice, or a hundred times in a row.
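
As a quick check of this claim in one more flavor, here is a minimal sketch using Python's `re` module. Python is not mentioned in the question and is used only for illustration; the helper name `matches` is hypothetical.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if pattern matches anywhere in text (unanchored search)."""
    return re.search(pattern, text) is not None

# Repeating a zero-width anchor does not change the outcome.
for pattern in ("^h", "^^h", "^^^h", "o$", "o$$", "^^hello$$"):
    assert matches(pattern, "hello")

# An anchored pattern that fails once still fails when the anchor is doubled.
for pattern in ("^e", "^^e"):
    assert not matches(pattern, "hello")
```

This mirrors the Perl examples from the question: each anchor is a zero-width assertion, so asserting it again at the same position succeeds or fails exactly as the first one did.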

However, don't simply add anchors around the regex; also wrap it in a non-capturing group:

^(?: - user regex - )$ or preferably, if your regex flavor allows this: \A(?: - user regex - )\Z

Otherwise, alternation in the user's regex will break the anchoring, because | has lower precedence than the anchors. Compare:

user regex:         hello|bye
anchored regex:     ^hello|bye$      // wrong: parsed as (?:^hello)|(?:bye$)
correctly anchored: ^(?:hello|bye)$
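
The comparison above can be sketched in Python's `re` module, again chosen only for illustration. The function name `anchor_full_match` is hypothetical. One assumption to note: in Python `\Z` matches only at the very end of the string, whereas in Perl and .NET `\Z` also matches before a trailing newline and `\z` is the strict form.

```python
import re

def anchor_full_match(user_regex: str) -> re.Pattern:
    """Wrap user_regex in a non-capturing group and anchor both ends.

    Assumes user_regex is a valid pattern with balanced groups.
    """
    return re.compile(r"\A(?:" + user_regex + r")\Z")

user_regex = "hello|bye"

# Naive anchoring: the alternation splits the anchors between the branches.
naive = re.compile("^" + user_regex + "$")
# Grouped anchoring: both anchors apply to the whole alternation.
grouped = anchor_full_match(user_regex)

print(naive.search("hello world") is not None)    # True: "^hello" matches
print(grouped.search("hello world") is not None)  # False: not a full match
print(grouped.search("bye") is not None)          # True
```

In Python specifically, `re.fullmatch(user_regex, text)` gives the same full-match behavior without building a new pattern string; the wrapping approach is shown because it carries over to flavors that lack such a function.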