Counting overlapping regex matches, again

Posted on 2025-01-06 11:43:37

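How can I get the number of overlapping regex matches with Python?

I have read and tried the suggestions from this, that, and a few other questions, but none of them fits my scenario. Here it is:

Input example string: akka
Search pattern: a.*k

A correct function should yield 2 as the number of matches, since there are two possible end positions (the k letters).

The pattern may also be more complicated; for example, a.*k.*a should also match twice in akka (since there are two k's in the middle).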
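
The counting rule described above can be made concrete for a restricted case. The sketch below assumes the pattern consists only of non-empty literal pieces separated by `.*` and that the text contains no newlines; `count_placements` is a hypothetical name, not something from the question. It counts the ways the pieces can be placed left to right without overlapping, which gives 2 for both examples.

```python
def count_placements(pattern, text):
    """Count the ways to place the literal pieces of `pattern` in `text`.

    Assumption: `pattern` is made of non-empty literals joined by '.*'.
    """
    pieces = pattern.split(".*")
    n = len(text)
    # ends[e] = number of ways to place the pieces handled so far
    # with the most recent piece ending exactly at index e.
    ends = None
    for piece in pieces:
        new_ends = [0] * (n + 1)
        reachable = 0  # ways for the previous piece to end at or before `start`
        for start in range(n - len(piece) + 1):
            if ends is None:
                reachable = 1  # the first piece may start anywhere
            else:
                reachable += ends[start]
            if text.startswith(piece, start):
                new_ends[start + len(piece)] = reachable
        ends = new_ends
    return sum(ends)


print(count_placements("a.*k", "akka"))
print(count_placements("a.*k.*a", "akka"))
```

This is not a general regex engine: anything beyond literals and `.*` would need a different approach.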


忆梦 2025-01-13 11:43:37

I think that what you're looking for is probably better done with a parsing library like lepl:

>>> from lepl import *
>>> parser = Literal('a') + Any()[:] + Literal('k')
>>> parser.config.no_full_first_match()
>>> list(parser.parse_all('akka'))
[['akk'], ['ak']]
>>> parser = Literal('a') + Any()[:] + Literal('k') + Any()[:] + Literal('a')
>>> list(parser.parse_all('akka'))
[['akka'], ['akka']]

I believe that the length of the output from parser.parse_all is what you're looking for.

Note that you need to use parser.config.no_full_first_match() to avoid errors if your pattern doesn't match the whole string.

Edit: Based on the comment from @Shamanu4, I see you want matching results starting from any position, you can do that as follows:

>>> import itertools
>>> text = 'bboo'
>>> parser = Literal('b') + Any()[:] + Literal('o')
>>> parser.config.no_full_first_match()
>>> substrings = [text[i:] for i in range(len(text))]
>>> matches = [list(parser.parse_all(substring)) for substring in substrings]
>>> matches = filter(None, matches) # Remove empty matches
>>> matches = list(itertools.chain.from_iterable(matches)) # Flatten results
>>> matches = list(itertools.chain.from_iterable(matches)) # Flatten results (again)
>>> matches
['bboo', 'bbo', 'boo', 'bo']
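
The "try every starting position" idea can also be sketched with the standard library alone. This is not the lepl approach from the answer but a brute-force alternative: it tests every non-empty slice of the text with `fullmatch` (Python 3.4+). `overlapping_spans` is a hypothetical name. It collects distinct spans, so it agrees with the a.*k and b.*o examples, but it reports a.*k.*a in akka only once, because both ways of matching cover the same span.

```python
import re


def overlapping_spans(pattern, text):
    """Return every non-empty substring of `text` that `pattern` fully matches."""
    regex = re.compile(pattern)
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            # pos/endpos restrict the match to text[start:end]
            if regex.fullmatch(text, start, end):
                found.append(text[start:end])
    return found


print(overlapping_spans("b.*o", "bboo"))
print(len(overlapping_spans("a.*k", "akka")))
```

The number of slices grows quadratically with the text length, so this only suits short inputs.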


北城孤痞 2025-01-13 11:43:37

Yes, it is ugly and unoptimized, but it seems to work. It simply tries all possible variants and keeps only the unique ones. Note that the pattern syntax here uses a bare * as the wildcard (a*k rather than a.*k).

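import re


def myregex(pattern, text, direction=0):
    # Yield the first match, then retry on the matched text with one
    # character trimmed from the front of the 'suffix' group and, unless
    # this call came from a front trim (direction=1), from its back as well.
    m = re.search(pattern, text)
    if m:
        yield m.group(0)
        if len(m.group('suffix')):
            shorter = "%s%s%s" % (m.group('prefix'), m.group('suffix')[1:], m.group('end'))
            for r in myregex(pattern, shorter, 1):
                yield r
            if direction < 1:
                shorter = "%s%s%s" % (m.group('prefix'), m.group('suffix')[:-1], m.group('end'))
                for r in myregex(pattern, shorter, -1):
                    yield r


def myprocess(pattern, text):
    # Split the pattern on '*' and, for each wildcard i, build a regex whose
    # 'suffix' group spans that wildcard greedily ('.*') while every other
    # wildcard is lazy ('.*?').
    parts = pattern.split("*")
    for i in range(0, len(parts) - 1):
        res = ""
        for j in range(0, len(parts)):
            if j == 0:
                res += "(?P<prefix>"
            if j == i:
                res += ")(?P<suffix>"
            res += parts[j]
            if j == i + 1:
                res += ")(?P<end>"
            if j < len(parts) - 1:
                if j == i:
                    res += ".*"
                else:
                    res += ".*?"
            else:
                res += ")"
        for r in myregex(res, text):
            yield r


def mycount(pattern, text):
    # Returns the set of distinct matches; take len() of it for the count.
    return set(myprocess(pattern, text))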

Test:

>>> mycount('a*b*c','abc')
set(['abc'])
>>> mycount('a*k','akka')
set(['akk', 'ak'])
>>> mycount('b*o','bboo')
set(['bbo', 'bboo', 'bo', 'boo'])
>>> mycount('b*o','bb123oo')
set(['b123o', 'bb123oo', 'bb123o', 'b123oo'])
>>> mycount('b*o','ffbfbfffofoff')
set(['bfbfffofo', 'bfbfffo', 'bfffofo', 'bfffo'])
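
To make the construction in myprocess easier to follow, the sketch below rebuilds only the regex strings it generates, one per wildcard. `build_patterns` is a hypothetical helper written for illustration and is not part of the answer; it is meant to produce the same strings as the nested loop above.

```python
def build_patterns(pattern):
    """Yield the named-group regexes that myprocess would build for `pattern`."""
    parts = pattern.split("*")
    for i in range(len(parts) - 1):
        # Everything before wildcard i, joined lazily.
        prefix = ".*?".join(parts[:i]) + (".*?" if i else "")
        # The two pieces around wildcard i, joined greedily.
        suffix = parts[i] + ".*" + parts[i + 1]
        # Everything after, joined lazily.
        end = "".join(".*?" + p for p in parts[i + 2:])
        yield "(?P<prefix>%s)(?P<suffix>%s)(?P<end>%s)" % (prefix, suffix, end)


for regex in build_patterns("a*b*c"):
    print(regex)
```

For a*b*c this prints one regex with the greedy part between a and b and another with it between b and c, which is why myregex can then shrink each 'suffix' group independently.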
