Is there a way to treat .* as .{0,1024} in a Perl RE?

Posted on 2024-12-21 06:01:47

We allow some user-supplied REs for the purpose of filtering email. Early on we ran into performance issues with REs that contained, for example, .*, when matching against arbitrarily large emails. We found that a simple solution was to apply s/\*/{0,1024}/ to the user-supplied RE. However, this is not a perfect solution, as it breaks the following pattern:

/[*]/

Rather than coming up with some convoluted recipe to account for every possible variation of user-supplied RE input, I'd like to simply limit Perl's interpretation of the * and + quantifiers to a maximum length of 1024 characters.

Is there any way to do this?
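
To make the failure concrete, here is a minimal sketch of the naive rewrite described above. The helper name naive_bound is hypothetical, and the /g flag is an assumption (the substitution as quoted would only replace the first *). It shows that a * inside a character class is corrupted into a class of the characters {, 0, 1, 2, 4, the comma and }.

```perl
use strict;
use warnings;

# The naive rewrite from the question; /g is assumed so every * is replaced.
sub naive_bound {
    my ($re) = @_;
    (my $out = $re) =~ s/\*/{0,1024}/g;
    return $out;
}

my $ok     = naive_bound('foo.*bar');   # a real quantifier: rewritten as intended
my $broken = naive_bound('[*]');        # a literal * in a class: corrupted

print "$ok\n";
print "$broken\n";

# The rewritten class no longer matches a literal asterisk.
print "original matches '*':  ", ('*' =~ /[*]/       ? 'yes' : 'no'), "\n";
print "rewritten matches '*': ", ('*' =~ qr/$broken/ ? 'yes' : 'no'), "\n";
```

An escaped \* is damaged by this rewrite in the same way, since the backslash ends up escaping the inserted brace instead.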

Comments (4)

音盲 2024-12-28 06:01:47

This does not really answer your question, but you should be aware of other issues with user-supplied regular expressions; see, for example, the summary at OWASP. Depending on your exact situation, it might be better to write or find a simple custom pattern-matching library.

小瓶盖 2024-12-28 06:01:47

Update

Added a (?<!\\) before the quantifiers, because an escaped * or + should not be matched. The replacement will still fail on \\* (a literal \ matched 0 or more times).

An improvement would be this:

s/(?<!\\)\*(?!(?<!\\)[^[]*?(?<!\\)\])/{0,1024}/g
s/(?<!\\)\+(?!(?<!\\)[^[]*?(?<!\\)\])/{1,1024}/g

See it here on Regexr

That means: match a * or + only if it is not followed by a closing ] with no [ in between, i.e. only if it does not appear to sit inside a character class. The closing bracket must not itself be preceded by a \ (the (?<!\\) part).

(?! ... ) is a negative lookahead

(?<! ... ) is a negative lookbehind

See perlretut for details

Update 2: account for possessive quantifiers

s/(?<!(?<!\\)[\\+*?])\+(?!(?<!\\)[^[]*?(?<!\\)\])/{1,1024}/g   # for +
s/(?<!\\)\*(?!(?<!\\)[^[]*?(?<!\\)\])/{0,1024}/g    # for *

See it here on Regexr

It seems to be working, but it's getting really complicated now!
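
As a rough illustration, the two substitutions from Update 2 can be wrapped in a helper and tried on a few sample patterns. The helper name bound_quantifiers and the sample patterns are hypothetical, and the /g flag is an assumption so that every quantifier in a pattern is rewritten. The + substitution runs first, so that the possessive marker in something like .*+ is still sitting next to the original * when it is examined.

```perl
use strict;
use warnings;

# Sketch: apply the two substitutions from Update 2 to a user-supplied pattern.
sub bound_quantifiers {
    my ($re) = @_;
    # Handle + first: a + directly after an unescaped \, +, * or ? is left alone,
    # which keeps possessive markers (as in .*+ or a++) intact.
    $re =~ s/(?<!(?<!\\)[\\+*?])\+(?!(?<!\\)[^[]*?(?<!\\)\])/{1,1024}/g;
    # Then handle *: skip it when escaped or when it appears to sit inside [...].
    $re =~ s/(?<!\\)\*(?!(?<!\\)[^[]*?(?<!\\)\])/{0,1024}/g;
    return $re;
}

for my $pattern ('foo.*bar', '.+', '[*]', '\*', 'a++', '.*+') {
    printf "%-10s => %s\n", $pattern, bound_quantifiers($pattern);
}
```

This is only a demonstration of the substitutions as written. The \\* case mentioned above is still mishandled, and other unusual inputs may be too.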

风和你 2024-12-28 06:01:47

Get a parse tree using Regexp::Parser and modify the regex as you want, or provide a GUI interface to Regexp::English.

迷爱 2024-12-28 06:01:47

You mean other than patching the source?

  1. You can break the input text into shorter chunks and match only against those. But then you won't find matches that span a chunk boundary.
  2. You can split the regex: search only for its first character, load the next 1024 characters of text, and then match the whole regex against that window (obviously, this doesn't work for a regex that starts with .).
  3. Find the first character of the regex that is not one of .*+()\, search for it in the text, load 1024 characters before and after it, and then match the whole regex against that string (complicated, and prone to errors with strange, unforeseen regexes).
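
The first option can be sketched roughly as follows. The function name match_in_chunks, the default chunk size of 1024 and the sample text are all assumptions made for illustration. The second call shows the drawback noted in that option: a match that straddles a chunk boundary is missed.

```perl
use strict;
use warnings;

# Sketch of option 1: match a regex against fixed-size chunks of the text
# instead of the whole text. Returns 1 on the first chunk that matches, else 0.
sub match_in_chunks {
    my ($re, $text, $size) = @_;
    $size //= 1024;    # assumed default chunk size
    for (my $pos = 0; $pos < length $text; $pos += $size) {
        return 1 if substr($text, $pos, $size) =~ $re;
    }
    return 0;
}

# "needle" starts at offset 1020, so it straddles the boundary at offset 1024.
my $text = ('a' x 1020) . 'needle' . ('b' x 100);

print match_in_chunks(qr/needle/, 'xx needle xx'), "\n";   # fits within one chunk
print match_in_chunks(qr/needle/, $text), "\n";            # split across two chunks
print match_in_chunks(qr/needle/, $text, 2048), "\n";      # a larger chunk holds it whole
```

Overlapping the chunks would reduce the number of missed matches, at the cost of scanning part of the text twice.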