Determining whether a regular expression matches only fixed-length strings

Posted on 2024-09-17 04:55:57

Is there a way to determine whether a regular expression matches only fixed-length strings?
My idea is to scan for *, + and ?. Beyond that, some more intelligent logic would be needed to look for {m,n} where m != n.
It is not necessary to take the | operator into account.
Small example: ^\d{4} is fixed-length; ^\d{4,5} and ^\d+ are variable-length.

I am using PCRE.

Thanks.

Paul Praet

Comments (3)

却一份温柔 2024-09-24 04:55:57

Well, you could make use of the fact that Python's regex engine only allows fixed-length regular expressions in lookbehind assertions:

import re
regexes = [r".x{2}(abc|def)", # fixed
           r"a|bc",           # variable/finite
           r"(.)\1",          # fixed
           r".{0,3}",         # variable/finite
           r".*"]             # variable/infinite

for regex in regexes:
    try:
        # A lookbehind only compiles if its contents have a fixed width.
        re.compile("(?<=" + regex + ")")
    except re.error:
        print("Not fixed length: {}".format(regex))
    else:
        print("Fixed length: {}".format(regex))

will output

Fixed length: .x{2}(abc|def)
Not fixed length: a|bc
Fixed length: (.)\1
Not fixed length: .{0,3}
Not fixed length: .*

I'm assuming that the regex itself is valid.

Now, how does Python know whether a regex is fixed-length? Just read the source: in sre_parse.py there is a method called getwidth() that returns a tuple of the lowest and the highest possible match length. If the two are not equal inside a lookbehind assertion, re.compile() raises an error. The getwidth() method walks through the parsed regex recursively:

def getwidth(self):
    # determine the width (min, max) for this subpattern
    if self.width:
        return self.width
    lo = hi = 0
    UNITCODES = (ANY, RANGE, IN, LITERAL, NOT_LITERAL, CATEGORY)
    REPEATCODES = (MIN_REPEAT, MAX_REPEAT)
    for op, av in self.data:
        if op is BRANCH:
            i = sys.maxsize
            j = 0
            for av in av[1]:
                l, h = av.getwidth()
                i = min(i, l)
                j = max(j, h)
            lo = lo + i
            hi = hi + j
        elif op is CALL:
            i, j = av.getwidth()
            lo = lo + i
            hi = hi + j
        elif op is SUBPATTERN:
            i, j = av[1].getwidth()
            lo = lo + i
            hi = hi + j
        elif op in REPEATCODES:
            i, j = av[2].getwidth()
            lo = lo + int(i) * av[0]
            hi = hi + int(j) * av[1]
        elif op in UNITCODES:
            lo = lo + 1
            hi = hi + 1
        elif op == SUCCESS:
            break
    self.width = int(min(lo, sys.maxsize)), int(min(hi, sys.maxsize))
    return self.width
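
As a rough sketch, the same width can be queried directly instead of going through a lookbehind. This leans on an undocumented internal API (sre_parse, which lives at re._parser on Python 3.11 and later), so it may change between versions; the helper name pattern_width is made up for this illustration.

```python
try:
    # Python 3.11+ keeps the parser in a private submodule of re.
    from re import _parser as sre_parse
except ImportError:
    # Older versions expose it as a top-level module.
    import sre_parse


def pattern_width(pattern):
    """Return the (min, max) match length that getwidth() computes.

    Assumption: this uses the undocumented parser API, not public re calls.
    """
    lo, hi = sre_parse.parse(pattern).getwidth()
    return int(lo), int(hi)


for regex in [r"^\d{4}", r"^\d{4,5}", r"^\d+", r"a|bc"]:
    lo, hi = pattern_width(regex)
    kind = "fixed" if lo == hi else "variable"
    print("{}: min={}, {}".format(regex, lo, kind))
```

A pattern is fixed-length exactly when the two values are equal; for an unbounded repeat such as \d+ the maximum is simply a very large sentinel value.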

罪#恶を代价 2024-09-24 04:55:57

Just for fun.

Assuming the regex being tested only supports +, *, ?, {m,n}, {n} and [...] (and excluding some odd syntax such as []] and [^]]), the regex is fixed-length only if it follows this grammar:

 REGEX     -> ELEMENT *
 ELEMENT   -> CHARACTER ( '{' ( \d+ ) ( ',' \1 )? '}' )?
 CHARACTER -> [^+*?\\\[{] | '\\' . | '[' ( '\\' . | [^\\\]] )+ ']'

which can be rewritten in PCRE as:

^(?:(?:[^+*?\\\[{]|\\.|\[(?:\\.|[^\\\]])+\])(?:\{(\d+)(?:,\1)?\})?)*$
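
A minimal sketch of running this check from Python, assuming Python's re module treats the pattern the same way PCRE does for these inputs (it only uses non-capturing groups, character classes and a backreference). The name looks_fixed_length is hypothetical, and the check inherits the restricted-syntax assumption above, so groups and | are not understood.

```python
import re

# The pattern from above, unchanged. Group 1 captures m in {m} or {m,n},
# and the backreference \1 only accepts {m,n} when n equals m.
FIXED_LENGTH_RE = re.compile(
    r"^(?:(?:[^+*?\\\[{]|\\.|\[(?:\\.|[^\\\]])+\])(?:\{(\d+)(?:,\1)?\})?)*$"
)


def looks_fixed_length(regex):
    """Return True if the regex text fits the fixed-length grammar."""
    return FIXED_LENGTH_RE.match(regex) is not None


# The three examples from the question.
for regex in [r"^\d{4}", r"^\d{4,5}", r"^\d+"]:
    print(regex, looks_fixed_length(regex))
```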

尴尬癌患者 2024-09-24 04:55:57

According to regular-expressions.info, the PCRE engine supports only fixed-length regexes, and alternations of them, inside lookbehinds.

So if you have a valid regex, surround it with (?<= and ) and see whether it still compiles. If it does, you know that it is either fixed-size or an alternation of fixed-size regexes.

I'm not sure about something like a(b|cd)e: this is definitely not fixed-size, but it might still compile. You'd need to try it out (I don't have C/PCRE installed).
