确定正则表达式的特异性

发布于 2024-09-16 22:11:15 字数 532 浏览 7 评论 0原文

给定以下正则表达式：

 - alice@[a-z]+\.[a-z]+
 - [a-z]+@[a-z]+\.[a-z]+
 - .*

字符串 [email protected]< /code> 显然会匹配所有三个正则表达式。在我正在开发的应用程序中，我们只对“最具体”的匹配感兴趣。在这种情况下，这显然是第一个。不幸的是，似乎没有办法做到这一点。我们正在使用 PCRE，但我没有找到方法来做到这一点，并且在互联网上搜索也没有结果。一种可能的方法是保持正则表达式按特异性降序排序，然后简单地获取第一个匹配项。当然，下一个问题是如何对正则表达式数组进行排序。不能将确保数组排序的责任交给最终用户。所以我希望你们能帮助我......

谢谢！

保罗

原文

Given the following regular expressions:

 - alice@[a-z]+\.[a-z]+
 - [a-z]+@[a-z]+\.[a-z]+
 - .*

The string [email protected] will obviously match all three regular expressions. In the application I am developing, we are only interested in the 'most specific' match. In this case this is obviously the first one.
Unfortunately there seems no way to do this. We are using PCRE and I did not find a way to do this and a search on the Internet was also not fruitful.
A possible way would be to keep the regular expressions sorted on descending specificity and then simply take the first match. Of course then the next question would be how to sort the array of regular expressions. It is not an option to give the responsability to the end-user to ensure that the array is sorted.
So I hope you guys could help me out here...

Thanks !!

Paul

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

⊕婉儿 2024-09-23 22:11:15

以下是我根据 Donald Miner 的研究论文开发的针对此问题的解决方案，用 Python 实现，用于应用于 MAC 地址的规则。

基本上，最具体的匹配来自不是任何其他匹配模式的超集的模式。对于特定的问题域，您创建一系列测试（函数）来比较两个 RE，并返回哪个是超集，或者它们是否正交。这可以让您构建匹配树。对于特定的输入字符串，您可以遍历根模式并找到任何匹配项。然后浏览它们的子模式。如果在任何时候正交模式匹配，就会引发错误。

设置

import re

class RegexElement:
    def __init__(self, string,index):
        self.string=string
        self.supersets = []
        self.subsets = []
        self.disjoints = []
        self.intersects = []
        self.maybes = []
        self.precompilation = {}
        self.compiled = re.compile(string,re.IGNORECASE)
        self.index = index

SUPERSET  = object()
SUBSET    = object()
INTERSECT = object()
DISJOINT  = object()
EQUAL     = object()

测试

每个测试采用 2 个字符串（a 和 b）并尝试确定它们之间的相关性。如果测试无法确定关系，则返回 None。

SUPERSET 表示a 是b 的超集。 b 的所有匹配项都将匹配 a。

SUBSET 表示 b 是 a 的超集。

INTERSECT 表示 a 的某些匹配将与 b 匹配，但某些不会，而 b 的某些匹配将匹配' t 匹配a。

DISJOINT 表示 a 的任何匹配项都不会与 b 匹配。

EQUAL 表示 a 的所有匹配项都将匹配 b，并且 b 的所有匹配项都将匹配 a代码>.

    def equal_test(a, b):  
        if a == b: return EQUAL

图表

  class SubsetGraph(object):
    def __init__(self, tests):
        self.regexps = []
        self.tests = tests
        self._dirty = True
        self._roots = None

    @property
    def roots(self):
        if self._dirty:
            r = self._roots = [i for i in self.regexps if not i.supersets]
            return r
        return self._roots

    def add_regex(self, new_regex):
        roots = self.roots
        new_re = RegexElement(new_regex)
        for element in roots:
            self.process(new_re, element)
        self.regexps.append(new_re)

    def process(self, new_re, element):
        relationship = self.compare(new_re, element)
        if relationship:
            getattr(self, 'add_' + relationship)(new_re, element)

    def add_SUPERSET(self, new_re, element):
        for i in element.subsets:
            i.supersets.add(new_re)
            new_re.subsets.add(i)

        element.supersets.add(new_re)
        new_re.subsets.add(element)

    def add_SUBSET(self, new_re, element):
        for i in element.subsets:
            self.process(new_re, i)

        element.subsets.add(new_re)
        new_re.supersets.add(element)

    def add_DISJOINT(self, new_re, element):
        for i in element.subsets:
            i.disjoints.add(new_re)
            new_re.disjoints.add(i)

        new_re.disjoints.add(element)
        element.disjoints.add(new_re)

    def add_INTERSECT(self, new_re, element):
        for i in element.subsets:
            self.process(new_re, i)

        new_re.intersects.add(element)
        element.intersects.add(new_re)

    def add_EQUAL(self, new_re, element):
        new_re.supersets = element.supersets.copy()
        new_re.subsets = element.subsets.copy()
        new_re.disjoints = element.disjoints.copy()
        new_re.intersects = element.intersects.copy()

    def compare(self, a, b):
        for test in self.tests:
            result = test(a.string, b.string)
            if result:
                return result

    def match(self, text, strict=True):
        matches = set()
        self._match(text, self.roots, matches)
        out = []
        for e in matches:
            for s in e.subsets:
                if s in matches:
                    break
            else:
                out.append(e)
        if strict and len(out) > 1:
            for i in out:
                print(i.string)
            raise Exception("Multiple equally specific matches found for " + text)
        return out

    def _match(self, text, elements, matches):
        new_elements = []
        for element in elements:
            m = element.compiled.match(text)
            if m:
                matches.add(element)
                new_elements.extend(element.subsets)
        if new_elements:
            self._match(text, new_elements, matches)

用法

graph = SubsetGraph([equal_test, test_2, test_3, ...])
graph.add_regex("00:11:22:..:..:..")
graph.add_regex("..(:..){5,5}"
graph.match("00:de:ad:be:ef:00")

完整的可用版本位于此处< /a>.

The following is the solution to this problem I developed based on Donald Miner's research paper, implemented in Python, for rules applied to MAC addresses.

Basically, the most specific match is from the pattern that is not a superset of any other matching pattern. For a particular problem domain, you create a series of tests (functions) which compare two REs and return which is the superset, or if they are orthogonal. This lets you build a tree of matches. For a particular input string, you go through the root patterns and find any matches. Then go through their subpatterns. If at any point, orthogonal patterns match, an error is raised.

Setup

import re

class RegexElement:
    def __init__(self, string,index):
        self.string=string
        self.supersets = []
        self.subsets = []
        self.disjoints = []
        self.intersects = []
        self.maybes = []
        self.precompilation = {}
        self.compiled = re.compile(string,re.IGNORECASE)
        self.index = index

SUPERSET  = object()
SUBSET    = object()
INTERSECT = object()
DISJOINT  = object()
EQUAL     = object()

The Tests

Each test takes 2 strings (a and b) and tries to determine how they are related. If the test cannot determine the relation, None is returned.

SUPERSET means a is a superset of b. All matches of b will match a.

SUBSET means b is a superset of a.

INTERSECT means some matches of a will match b, but some won't and some matches of b won't match a.

DISJOINT means no matches of a will match b.

EQUAL means all matches of a will match b and all matches of b will match a.

    def equal_test(a, b):  
        if a == b: return EQUAL

The graph

  class SubsetGraph(object):
    def __init__(self, tests):
        self.regexps = []
        self.tests = tests
        self._dirty = True
        self._roots = None

    @property
    def roots(self):
        if self._dirty:
            r = self._roots = [i for i in self.regexps if not i.supersets]
            return r
        return self._roots

    def add_regex(self, new_regex):
        roots = self.roots
        new_re = RegexElement(new_regex)
        for element in roots:
            self.process(new_re, element)
        self.regexps.append(new_re)

    def process(self, new_re, element):
        relationship = self.compare(new_re, element)
        if relationship:
            getattr(self, 'add_' + relationship)(new_re, element)

    def add_SUPERSET(self, new_re, element):
        for i in element.subsets:
            i.supersets.add(new_re)
            new_re.subsets.add(i)

        element.supersets.add(new_re)
        new_re.subsets.add(element)

    def add_SUBSET(self, new_re, element):
        for i in element.subsets:
            self.process(new_re, i)

        element.subsets.add(new_re)
        new_re.supersets.add(element)

    def add_DISJOINT(self, new_re, element):
        for i in element.subsets:
            i.disjoints.add(new_re)
            new_re.disjoints.add(i)

        new_re.disjoints.add(element)
        element.disjoints.add(new_re)

    def add_INTERSECT(self, new_re, element):
        for i in element.subsets:
            self.process(new_re, i)

        new_re.intersects.add(element)
        element.intersects.add(new_re)

    def add_EQUAL(self, new_re, element):
        new_re.supersets = element.supersets.copy()
        new_re.subsets = element.subsets.copy()
        new_re.disjoints = element.disjoints.copy()
        new_re.intersects = element.intersects.copy()

    def compare(self, a, b):
        for test in self.tests:
            result = test(a.string, b.string)
            if result:
                return result

    def match(self, text, strict=True):
        matches = set()
        self._match(text, self.roots, matches)
        out = []
        for e in matches:
            for s in e.subsets:
                if s in matches:
                    break
            else:
                out.append(e)
        if strict and len(out) > 1:
            for i in out:
                print(i.string)
            raise Exception("Multiple equally specific matches found for " + text)
        return out

    def _match(self, text, elements, matches):
        new_elements = []
        for element in elements:
            m = element.compiled.match(text)
            if m:
                matches.add(element)
                new_elements.extend(element.subsets)
        if new_elements:
            self._match(text, new_elements, matches)

Usage

graph = SubsetGraph([equal_test, test_2, test_3, ...])
graph.add_regex("00:11:22:..:..:..")
graph.add_regex("..(:..){5,5}"
graph.match("00:de:ad:be:ef:00")

A complete usable version is here.

回复收藏 0 原文

深居我梦 2024-09-23 22:11:15

我的直觉告诉我，这不仅是一个难题，无论是在计算成本还是实现难度方面，而且在任何现实情况下都可能无法解决。考虑以下两个正则表达式来接受字符串 [email protected]

    alice@[a-z]+\.[a-z]+ 
    [a-z][email protected]

其中哪一项更具体？

My gut instinct says that not only is this a hard problem, both in terms of computational cost and implementation difficulty, but it may be unsolvable in any realistic fashion. Consider the two following regular expressions to accept the string [email protected]

    alice@[a-z]+\.[a-z]+ 
    [a-z][email protected]

Which one of these is more specific?

回复收藏 0 原文

天气好吗我好吗 2024-09-23 22:11:15

这有点像黑客，但它可以为近 10 年前提出的这个问题提供实用的解决方案。

正如@torak 所指出的，定义一个正则表达式比另一个正则表达式更具体意味着什么是很困难的。

我的建议是查看正则表达式相对于与其匹配的字符串的稳定性。研究稳定性的常用方法是对输入进行微小的更改，看看是否仍然得到相同的结果。

例如，字符串 [电子邮件受保护]与正则表达式 /alice@myprovider\.com/ 匹配，但如果您对字符串进行任何更改，它将不匹配。所以这个正则表达式非常不稳定。但正则表达式 /.*/ 非常稳定，因为您可以对字符串进行任何更改，它仍然匹配。

因此，在寻找最具体的正则表达式时，我们正在寻找相对于与其匹配的字符串而言最不稳定的正则表达式。

为了实现此稳定性测试，我们需要定义如何选择对与正则表达式匹配的字符串进行较小的更改。这是另一堆蠕虫。例如，我们可以选择将字符串的每个字符更改为随机字符，并针对正则表达式或任何其他可能的选择进行测试。为简单起见，我建议一次从字符串中删除一个字符，然后进行测试。

因此，如果匹配的字符串长度为 N 个字符，则我们需要进行 N 个测试。让我们看一下从字符串中一次删除一个字符 [email protected]< /a>，匹配下表中的所有正则表达式。它有 12 个字符长，因此有 12 个测试。在下表中，

0 表示正则表达式不匹配（不稳定），
1 表示匹配（稳定）

              /alice@[a-z]+\.[a-z]+/    /[a-z]+@[a-z]+\.[a-z]+/     /.*/
  
[email protected]           0                           1                  1
[email protected]           0                           1                  1
[email protected]           0                           1                  1
[email protected]           0                           1                  1
[email protected]           0                           1                  1
alicefoo.com           0                           0                  1
[email protected]           1                           1                  1
[email protected]           1                           1                  1
[email protected]           1                           1                  1
alice@foocom           0                           0                  1 
[email protected]           1                           1                  1
[email protected]           1                           1                  1
                      ---                         ---                ---  
total score:           5                          10                 12

得分最低的正则表达式最具体。当然，一般来说，可能有多个正则表达式具有相同的分数，这反映了这样一个事实：存在通过任何合理的测量特异性的方式来衡量的正则表达式彼此是一样具体的。尽管它也可能对正则表达式产生相同的分数，但人们很容易认为它们彼此不那么具体（如果您能想到一个例子，请发表评论）。

但回到@torak 提出的问题，哪个更具体：

alice@[a-z]+\.[a-z]+ 
[a-z][email protected]

我们可以认为第二个更具体，因为它限制了更多字符，并且上述测试将同意该观点。

正如我所说，我们选择对匹配多个正则表达式的字符串进行微小更改的方式是一罐蠕虫，上述方法产生的答案可能取决于该选择。但正如我所说，这是一个很容易实现的黑客 - 它并不严格。

当然，如果匹配的字符串为空，该方法就会中断。测试的有用性会随着字符串长度的增加而增加。对于非常短的字符串，它更有可能为特异性明显不同的正则表达式产生相同的分数。

This is a bit of a hack, but it could provide a practical solution to this question asked nearly 10 years ago.

As pointed out by @torak, there are difficulties in defining what it means for one regular expression to be more specific than another.

My suggestion is to look at how stable the regular expression is with respect to a string that matches it. The usual way to investigate stability is to make minor changes to the inputs, and see if you still get the same result.

For example, the string [email protected] matches the regex /alice@myprovider\.com/, but if you make any change to the string, it will not match. So this regex is very unstable. But the regex /.*/ is very stable, because you can make any change to the string, and it still matches.

So, in looking for the most specific regex, we are looking for the least stable one with respect to a string that matches it.

In order to implement this test for stability, we need to define how we choose a minor change to the string that matches the regex. This is another can of worms. We could for example, choose to change each character of the string to something random and test that against the regex, or any number of other possible choices. For simplicity, I suggest deleting one character at a time from the string, and testing that.

So, if the string that matches is N characters long, we have N tests to make. Lets's look at deleting one character at a time from the string [email protected], which matches all of the regular expressions in the table below. It's 12 characters long, so there are 12 tests. In the table below,

0 means the regex does not match (unstable),
1 means it matches (stable)

              /alice@[a-z]+\.[a-z]+/    /[a-z]+@[a-z]+\.[a-z]+/     /.*/
  
[email protected]           0                           1                  1
[email protected]           0                           1                  1
[email protected]           0                           1                  1
[email protected]           0                           1                  1
[email protected]           0                           1                  1
alicefoo.com           0                           0                  1
[email protected]           1                           1                  1
[email protected]           1                           1                  1
[email protected]           1                           1                  1
alice@foocom           0                           0                  1 
[email protected]           1                           1                  1
[email protected]           1                           1                  1
                      ---                         ---                ---  
total score:           5                          10                 12

The regex with the lowest score is the most specific. Of course, in general, there may be more than one regex with the same score, which reflects the fact there are regular expressions which by any reasonable way of measuring specificity are as specific as one another. Although it may also yield the same score for regular expressions that one can easily argue are not as specific as each other (if you can think of an example, please comment).

But coming back to the question asked by @torak, which of these is more specific:

alice@[a-z]+\.[a-z]+ 
[a-z][email protected]

We could argue that the second is more specific because it constrains more characters, and the above test will agree with that view.

As I said, the way we choose to make minor changes to the string that matches more than one regex is a can of worms, and the answer that the above method yields may depend on that choice. But as I said, this is an easily implementable hack - it is not rigourous.

And, of course the method breaks if the string that matches is empty. The usefulness if the test will increase as the length of the string increases. With very short strings, it is more likely produce equal scores for regular expressions that are clearly different in their specificity.

回复收藏 0 原文

蓝眼睛不忧郁 2024-09-23 22:11:15

我正在考虑 PHP 项目路由解析器的类似问题。在阅读了这里的其他答案和评论，并考虑了所涉及的成本之后，我可能会完全走向另一个方向。

然而，一个解决方案是简单地按照字符串长度的顺序对正则表达式列表进行排序。

它并不完美，但只需删除 [] 组，它就会更接近。在问题中的第一个示例中，它将列出以下内容：

- alice@[a-z]+\.[a-z]+
- [a-z]+@[a-z]+\.[a-z]+
- .*

对此，删除任何 []-组的内容后：

- alice@+\.+
- +@+\.+
- .*

另一个答案中的第二个示例也是如此，[]-组完全删除并按长度排序，这：

alice@[a-z]+\.[a-z]+ 
[a-z][email protected]

将排序为：

[email protected]
alice@+\.+

如果我选择使用它，至少对我来说这是一个足够好的解决方案。缺点是在排序之前删除所有 [] 组并将排序应用于未修改的正则表达式列表的开销，但是嘿 - 你无法获得所有内容。

I'm thinking of a similar problem for a PHP projects route parser. After reading the other answers and comments here, and also thinking about the cost involved I might go in another direction altogether.

A solution however, would be to simply sort the regular expression list in order of it's string length.

It's not perfect, but simply by removing the []-groups it would be much closer. On the first example in the question it would this list:

- alice@[a-z]+\.[a-z]+
- [a-z]+@[a-z]+\.[a-z]+
- .*

To this, after removing content of any []-group:

- alice@+\.+
- +@+\.+
- .*

Same thing goes for the second example in another answer, with the []-groups completely removed and sorted by length, this:

alice@[a-z]+\.[a-z]+ 
[a-z][email protected]

Would become sorted as:

[email protected]
alice@+\.+

This is a good enough solution at least for me, if I choose to use it. Downside would be the overhead of removing all groups of [] before sorting and applying the sort on the unmodified list of regexes, but hey - you can't get everything.

回复收藏 0 原文

~没有更多了~