How can I check the relevance of regular expressions?

Posted 2024-07-27 04:17:31

Let's say we have two regular expressions:

1234.*

and

.*

Input:

1234567

Obviously they both match, but 1234.* matches better since it is more specific, i.e. it is more relevant. Is there a standard way to check which one is more relevant?

edit:

Some clarification. I want to make decisions by checking which regexp matches the input best. In this case I am only matching numbers.

Example with telephone numbers:

Input:

31882481337

We have a rule for each of the following regexps:

31.*
.*

In this scenario I would like to use the rule bound to 31.*, because it is more specific for the given input. If I were not using regexps this would be easy, because I could use a scoring mechanism to check how closely each rule matches. However, these rules may contain more advanced regexps, such as:

31[89].*
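For comparison, the non-regex scoring mentioned above might look roughly like the following Python sketch. It assumes the rules are plain literal prefixes; the function name `best_prefix_rule` and the rule names are hypothetical and only serve as illustration.

```python
def best_prefix_rule(number, rules):
    """Return the rule whose literal prefix matches `number` most specifically.

    `rules` maps a literal prefix (no regex syntax) to a rule name.
    The empty prefix "" plays the role of the catch-all `.*`.
    """
    best = None
    for prefix, rule in rules.items():
        if number.startswith(prefix):
            # Score = number of characters the prefix pins down.
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, rule)
    return None if best is None else best[1]


# Hypothetical rule names, mirroring the two rules in the question.
rules = {"31": "rule-31", "": "rule-default"}
print(best_prefix_rule("31882481337", rules))
```

This only works because the length of a literal prefix is an obvious score. A pattern such as 31[89].* has no equally obvious length, which is the difficulty the question is about.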


Comments (4)

我三岁 2024-08-03 04:17:31

I think there is no easy way to do this. If you look at more complex examples, you will soon realize that it is quite hard to define "more relevant" precisely at all. Things like assertions and backreferences all come into play.

I can think of two ways to roughly estimate the "relevance".

  1. Randomly modify the input and compare how many modifications cause each expression to fail.

  2. Analyze the expressions themselves. Count and compare the number of terminal symbols vs. wildcards, the number of assertions, and whatever else you like.
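A very rough Python sketch of the second idea follows. The scoring weights (1 for a literal, 0.5 for a character class, 0 for a wildcard or anything made optional by `*` or `?`) and the name `static_specificity` are assumptions made for illustration. The scanner is deliberately naive: it ignores `{m,n}` quantifiers, groups and assertions.

```python
def static_specificity(pattern):
    """Crude score: how many characters does the pattern pin down?"""
    score = 0.0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # \d, \w, \s and friends are classes; other escapes are literals.
            value = 0.5 if pattern[i + 1] in "dDwWsS" else 1.0
            i += 2
        elif ch == "[":
            # Skip the whole character class and count it once.
            end = pattern.find("]", i + 1)
            if end == -1:
                end = len(pattern) - 1
            value = 0.5
            i = end + 1
        elif ch == ".":
            value = 0.0
            i += 1
        elif ch in "*+?|()^$":
            # Metacharacters carry no score of their own in this sketch.
            i += 1
            continue
        else:
            value = 1.0
            i += 1
        # An item followed by * or ? may match nothing, so it pins down nothing.
        if i < len(pattern) and pattern[i] in "*?":
            value = 0.0
        score += value
    return score


for p in ["31.*", ".*", "31[89].*"]:
    print(p, static_specificity(p))
```

Note that this score looks only at the pattern text and never at the input, which is exactly its weakness.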

Especially with the second solution, you have to be aware that alternatives which are not used by the actual match might render the result meaningless.

h.*|verylongtext|anotherverylongtext

hell.*|v.*

When matching "hello", the second expression is "more relevant", but the first contains many more terminal symbols and might be ranked much higher by the second solution. For matching "verylongtext", however, the first is "more relevant". This shows that "relevance" depends heavily on the actual input, and you would have to analyze the actual matching path, something the first solution does implicitly. But randomly modifying the input is quite a hard task, because the space of possible inputs is very large. I do not think that would work very well either.
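To make the first idea concrete, here is a small Python sketch. As a simplification it does not sample random modifications but tries every single-character substitution, which keeps the result reproducible. The name `mutation_sensitivity` and the choice of alphabet are assumptions, not part of any standard approach.

```python
import re
import string


def mutation_sensitivity(pattern, text, alphabet="0123456789"):
    """Fraction of single-character substitutions of `text` that `pattern` rejects."""
    regex = re.compile(pattern)
    rejected = total = 0
    for i, original in enumerate(text):
        for ch in alphabet:
            if ch == original:
                continue
            mutated = text[:i] + ch + text[i + 1:]
            total += 1
            # fullmatch: the whole mutated string must satisfy the pattern.
            if regex.fullmatch(mutated) is None:
                rejected += 1
    return rejected / total if total else 0.0


first = "h.*|verylongtext|anotherverylongtext"
second = "hell.*|v.*"
letters = string.ascii_lowercase
for word in ["hello", "verylongtext"]:
    print(word,
          mutation_sensitivity(first, word, letters),
          mutation_sensitivity(second, word, letters))
```

A higher fraction means the expression depends on more of the input. With this measure the ranking of the two expressions flips between "hello" and "verylongtext", in line with the observation that relevance depends on the input. Single substitutions are only a tiny slice of the possible modifications, so this remains a rough estimate.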

小巷里的女流氓 2024-08-03 04:17:31

One factor I can think of is whether a language is finite or infinite. A finite language is definitely more relevant than an infinite one, since it contains only a finite number of acceptable words.

If you measure infinite languages like those in your examples, both go on forever; you can keep counting the words in each language until you're blue in the face and never reach a conclusion.

That is, until you consider that the first regex's language is a proper subset of the second's. Then you might say that one is more relevant.

I'm not aware of any standard way to measure regex relevancy, though.

To expound on the idea of proper subsets: you may ask what your language is, and whether your regex accepts words outside of it. Your expression might still work, but it accepts a wider range of words than you intended. Of course, this may not matter if your input is controlled, but it is one way you could measure relevance: is the regex accepting exactly my language?

Yours is a good example. Perhaps you want to accept numbers starting with 1234. 1234.* works like a charm, but that isn't the language you specified. 1234\d* is more specific and matches your language exactly as specified, and is thus more relevant.

But this is all from a purely theoretical standpoint and probably won't help you much in programmatically determining whether one regex is better than another.
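If the alphabet and the maximum input length are small, the proper-subset idea can at least be approximated by brute force. The Python sketch below enumerates every word up to a length bound and compares the accepted sets. The helper names `accepted` and `is_proper_subset_bounded` are hypothetical, and a bounded check like this is not a proof of language inclusion.

```python
import re
from itertools import product


def accepted(pattern, alphabet="0123456789", max_len=4):
    """All words over `alphabet` up to `max_len` that `pattern` fully matches."""
    regex = re.compile(pattern)
    words = set()
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            word = "".join(chars)
            if regex.fullmatch(word):
                words.add(word)
    return words


def is_proper_subset_bounded(narrow, wide, alphabet="0123456789", max_len=4):
    """True if, within the bound, `narrow` accepts strictly fewer words than `wide`
    and everything it accepts is also accepted by `wide`."""
    return accepted(narrow, alphabet, max_len) < accepted(wide, alphabet, max_len)


print(is_proper_subset_bounded("31.*", ".*"))
print(is_proper_subset_bounded("31[89].*", "31.*"))
# 1234\d* vs 1234.* only differ once the alphabet contains a non-digit.
print(is_proper_subset_bounded(r"1234\d*", "1234.*", alphabet="1234a", max_len=5))
```

The cost grows exponentially with `max_len`, so this is only usable for short inputs. An exact answer would require comparing the automata behind the expressions, which Python's `re` module does not expose.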

那一片橙海, 2024-08-03 04:17:31

It has been a long time since I asked this question, but I wanted to let you know what I came up with in the end. I went for a far simpler approach: I just added a weight factor to my regular expressions. So you could say I defined the relevance of each regular expression myself instead of trying to derive it from the expression:

Expression      Relevance
31.*            1
.*              0
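
In code, the weighted table could be used along these lines. This is a minimal Python sketch; the rule names and the function `pick_rule` are made up for illustration, and only the two patterns and their weights come from the table above.

```python
import re

# (compiled pattern, hand-assigned relevance weight, hypothetical rule name)
RULES = [
    (re.compile(r"31.*"), 1, "rule-31"),
    (re.compile(r".*"), 0, "rule-default"),
]


def pick_rule(text, rules=RULES):
    """Return the name of the matching rule with the highest weight, or None."""
    matching = [(weight, name) for regex, weight, name in rules
                if regex.fullmatch(text)]
    if not matching:
        return None
    return max(matching, key=lambda m: m[0])[1]


print(pick_rule("31882481337"))
print(pick_rule("4412345"))
```

The trade-off is that the weights have to be maintained by hand whenever a rule is added, but the selection logic stays trivial.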

秋风の叶未落 2024-08-03 04:17:31

I don't know whether "relevancy" is the real issue. Each is relevant, and each will match "1234567", as you suggest. As you also say, however, one ("1234.*") is more specific. With regular expressions, specificity is great (in a simple case like this), and sometimes you can home in on it so far that you realize you didn't need a regex after all. Rule #1 of regular expressions: don't use them if you don't have to. For example, to match "1234567", I'd go with:

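$source = '1234567';
// strpos() returns 0 only when '1234' sits at the very start of $source;
// the strict === keeps position 0 from being confused with false (not found).
if ( strpos( $source, '1234' ) === 0 ) {
  $foo = substr( $source, 4 );
  // $source began with '1234' and $foo holds the rest
} else {
  // it didn't begin with '1234'
}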

That's a PHP example, but the idea is that, since you have narrowed your accepted values down so tightly, you don't even need PCRE anymore. "Relevancy" won't really tell you much about a regular expression (how would you define "relevancy" in this context?). I think specificity is a more objective measurement, and being able to use non-regex string functions instead is about as measurably specific as it gets (in fact, it's boolean: is there a regular expression or not?).

Apart from eliminating the regex altogether: to measure the specificity of a given regular expression, simply compare (heuristically, if necessary) how many different values would satisfy the expression. The expression with the lowest count in this test is the most specific.
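One heuristic way to do that counting is to sample candidate inputs and compare the share each expression accepts. The Python sketch below assumes fixed-length digit strings, like the phone number in the question, and a fixed seed so every expression is tested against the same samples. The name `match_fraction` and all defaults are illustrative choices.

```python
import random
import re


def match_fraction(pattern, length=11, samples=10_000, seed=0,
                   alphabet="0123456789"):
    """Estimate the share of random `length`-character strings that `pattern` accepts."""
    rng = random.Random(seed)  # same seed -> same samples for every pattern
    regex = re.compile(pattern)
    hits = 0
    for _ in range(samples):
        candidate = "".join(rng.choice(alphabet) for _ in range(length))
        if regex.fullmatch(candidate):
            hits += 1
    return hits / samples


for p in [".*", "31.*", "31[89].*"]:
    print(p, match_fraction(p))
```

The lower the fraction, the more specific the expression under this measure. The estimate depends entirely on the assumed input distribution, so it says little about inputs that look different from the samples.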
