Distance between regular expressions

Posted on 2024-08-19 01:43:43

Can we compute a sort of distance between regular expressions?

The idea is to measure how similar two regular expressions are.

Comments (6)

绮烟 2024-08-26 01:43:43

You can build deterministic finite-state machines for both regular expressions and compare their transitions. The difference between the two sets of transitions can then be used to measure the distance between the regular expressions.
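One way to turn the comparison of two machines into a number is sketched below. This is only an illustration under stated assumptions, not the answerer's exact method: the DFAs are written by hand as `(start, accepting, transitions)` tuples rather than compiled from a regex, and the hypothetical helper `count_disagreements` counts the strings up to a given length that exactly one of the two machines accepts, by walking both machines in lockstep.

```python
def count_disagreements(dfa_a, dfa_b, alphabet, max_len):
    """Count strings of length <= max_len accepted by exactly one DFA.

    Each DFA is (start_state, accepting_states, transitions), where
    transitions maps (state, symbol) -> state and is defined for every
    state and every symbol of the alphabet.
    """
    start_a, accept_a, delta_a = dfa_a
    start_b, accept_b, delta_b = dfa_b
    # counts[(p, q)] = number of strings of the current length that drive
    # machine A into state p and machine B into state q.
    counts = {(start_a, start_b): 1}
    total = 0
    for _ in range(max_len + 1):
        # Strings ending in a pair where only one side accepts are disagreements.
        for (p, q), n in counts.items():
            if (p in accept_a) != (q in accept_b):
                total += n
        # Advance every pair of states by one symbol.
        nxt = {}
        for (p, q), n in counts.items():
            for symbol in alphabet:
                key = (delta_a[(p, symbol)], delta_b[(q, symbol)])
                nxt[key] = nxt.get(key, 0) + n
        counts = nxt
    return total


# Hand-built DFA for a* over {a, b}: state 0 accepts, state 1 is a dead state.
A_STAR = (0, {0}, {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 1})
# Hand-built DFA for a+ over {a, b}: state 1 accepts, state 2 is a dead state.
A_PLUS = (0, {1}, {(0, "a"): 1, (0, "b"): 2, (1, "a"): 1, (1, "b"): 2,
                   (2, "a"): 2, (2, "b"): 2})

# The two languages differ only on the empty string.
print(count_disagreements(A_STAR, A_PLUS, "ab", 3))  # prints 1
```

Counting disagreements on the language, rather than diffing the transition tables directly, avoids depending on how the states happen to be numbered.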

眸中客 2024-08-26 01:43:43

There are a few metrics you could use:

  1. The length of a valid match. Some regexes have a fixed size, some an upper limit and some a lower limit. Compare how similar their lengths or possible lengths are.

  2. The characters that match. Any regex will have a set of characters a match can contain (maybe all characters). Compare the set of included characters.

  3. Use a large document and see how many matches each regex makes and how many of those are identical.
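The third metric can be sketched roughly as follows. The function name `match_set_distance` and the choice of Jaccard distance over match positions are assumptions made for this illustration; the answer only says to count matches and see how many are identical.

```python
import re


def match_set_distance(pattern_a, pattern_b, document):
    """Jaccard distance between the sets of match spans of two regexes."""
    # A match is identified by its (start, end) position in the document.
    spans_a = {m.span() for m in re.finditer(pattern_a, document)}
    spans_b = {m.span() for m in re.finditer(pattern_b, document)}
    union = spans_a | spans_b
    if not union:
        # Neither regex matched anything: treat them as indistinguishable here.
        return 0.0
    return 1.0 - len(spans_a & spans_b) / len(union)


document = "cat 42 dog 7 bird 1234"
# \d+ finds three numbers, \d{2,} finds two of the same three.
print(match_set_distance(r"\d+", r"\d{2,}", document))
```

The result is 0 when both regexes find exactly the same matches and 1 when they share none, so it depends heavily on how representative the document is.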

Are you looking for strict equivalence?

缱绻入梦 2024-08-26 01:43:43

I suppose you could compute a Levenshtein Distance between the actual Regular Expression strings. That's certainly one way of measuring a "distance" between two different Regular Expression strings.

Of course, I think it's possible that regular expressions are not required here at all, and that computing the Levenshtein Distance of the actual "value" strings that the Regular Expressions would otherwise be applied to may yield a better result.
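For reference, a minimal sketch of the Levenshtein distance applied to the pattern strings themselves. The helper `levenshtein` is written here for illustration with the standard dynamic-programming recurrence; it treats the regexes as plain text and knows nothing about what they match.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) between two strings."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


print(levenshtein("a[0-9]*", "a[0-7]*"))  # prints 1
print(levenshtein("a+", "aa*"))           # prints 2
```

The second call shows the limitation of this approach: `a+` and `aa*` match exactly the same strings, yet their textual distance is not zero.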

初见你 2024-08-26 01:43:43

If you have two regular expressions and a set of example inputs, you could try matching every input against each regex. For each input:

  • If they both match or both don't match, score 0.
  • If one matches and the other doesn't, score 1.

Sum this score over all inputs, and this will give you a 'distance' between the regular expressions. It gives you an idea of how often the two regular expressions differ on typical input. It will be very slow to calculate if your sample input set is large. It won't work at all if both regexes fail to match almost all random strings and your expected input is entirely random. For example, the regex 'sgjlkwren' and the regex 'ueuenwbkaalf' would probably both never match anything if tested on random input, so this metric would say the distance between them is zero. That might or might not be what you want (probably not).
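The scoring rule above can be sketched in a few lines. The function name `disagreement_score` is made up for this example, and it assumes that "match" means the whole input must match (`re.fullmatch`); swap in `re.search` if a partial match should count.

```python
import re


def disagreement_score(pattern_a, pattern_b, inputs):
    """Number of inputs on which exactly one of the two regexes matches."""
    rx_a = re.compile(pattern_a)
    rx_b = re.compile(pattern_b)
    # Score 1 when the two verdicts differ, 0 when they agree.
    return sum(
        (rx_a.fullmatch(s) is not None) != (rx_b.fullmatch(s) is not None)
        for s in inputs
    )


samples = ["foo1", "foo9", "foo", "bar"]
# "foo9" matches only the first pattern, "foo" only the second.
print(disagreement_score(r"foo[0-9]", r"foo[0-7]?", samples))  # prints 2
```

Dividing the result by `len(inputs)` would give a value between 0 and 1 that is easier to compare across sample sets of different sizes.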

You might be able to analyze the structure of the regexes and use biased random sampling to deliberately hit strings that match more frequently than completely random input would. For example, if both regexes require that the string starts with 'foo', you could make sure that your test inputs also always start with 'foo', to avoid wasting time testing strings that you know will fail for both.
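A very small sketch of that kind of biased sampling, assuming the shared prefix has already been identified by hand. The helper `biased_samples` and its parameters are hypothetical; it only prepends the known prefix to random tails and does no analysis of the regexes itself.

```python
import random


def biased_samples(prefix, alphabet, max_tail_len, count, seed=0):
    """Generate random test inputs that all start with a known shared prefix."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    samples = []
    for _ in range(count):
        tail_len = rng.randint(0, max_tail_len)
        tail = "".join(rng.choice(alphabet) for _ in range(tail_len))
        samples.append(prefix + tail)
    return samples


# Every generated input starts with "foo", so no sample is wasted on
# strings that both regexes would reject because of the prefix alone.
for s in biased_samples("foo", "0123456789", 3, 5):
    print(s)
```

The generated inputs can then be fed to whatever scoring function is used to compare the two regexes.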

So in conclusion: unless you have a very specific situation with a restricted input set and/or a restricted regular expression language, I'd say it's not possible. If you do have some restrictions on your input and on the regular expressions, it might be possible. Please specify what these restrictions are and maybe I can come up with something better.

亢潮 2024-08-26 01:43:43

There's an answer hidden in an earlier question here on SO: Generating strings from regexes. You can calculate an (asymmetric) distance measure by generating strings using one regex and checking how many of those match the other regex.

This can be optimized by stripping out shared prefixes/suffixes. E.g. a[0-9]* and a[0-7]* share the a prefix, so you can calculate the distance between [0-9]* and [0-7]* instead.
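A rough sketch of the asymmetric measure, using the stripped-down patterns from the example. As a stand-in for a real string generator, this illustration simply enumerates every string over a small alphabet up to a length limit and keeps those that match the first regex; the names `strings_matching` and `asymmetric_distance` are assumptions, and brute-force enumeration only works for tiny alphabets and lengths.

```python
import re
from itertools import product


def strings_matching(pattern, alphabet, max_len):
    """Yield every string over alphabet, up to max_len, that fully matches pattern."""
    rx = re.compile(pattern)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if rx.fullmatch(s):
                yield s


def asymmetric_distance(pattern_a, pattern_b, alphabet, max_len):
    """Fraction of strings generated from pattern_a that pattern_b rejects."""
    rx_b = re.compile(pattern_b)
    generated = list(strings_matching(pattern_a, alphabet, max_len))
    if not generated:
        return 0.0
    missed = sum(1 for s in generated if not rx_b.fullmatch(s))
    return missed / len(generated)


digits = "0123456789"
# Some strings of [0-9]* contain 8 or 9, so [0-7]* rejects them.
print(asymmetric_distance(r"[0-9]*", r"[0-7]*", digits, 2))
# Every string of [0-7]* is also matched by [0-9]*.
print(asymmetric_distance(r"[0-7]*", r"[0-9]*", digits, 2))  # prints 0.0
```

The two calls give different results, which is what makes the measure asymmetric; averaging or taking the maximum of both directions is one way to symmetrize it.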

贱人配狗天长地久 2024-08-26 01:43:43

I think you first need to work out for yourself what you consider a "difference" between two expressions. Basically, define a distance metric.

In the general case, this would be quite difficult to do. Depending on what you need, you may see allowing one different character in some position as a big difference. In another case, allowing any number of consecutive identical characters may not make much of a difference.

I'd like to emphasize as well that when people talk about distance functions, they normally apply them to..., well, let's call them tokens. In our case, character sequences. What you want to do is apply this method not to those tokens, but to the rules that a multitude of tokens will match. I'm not quite sure it even makes sense.

Still, I believe we could think of something, not in general, but for one particular and quite restricted case. Do you have some sort of example to show us?
