How to make Python re work like grep for repeating groups?
I have the following string:
seq = 'MNRYLNRQRLYNMYRNKYRGVMEPMSRMTMDFQGRYMDSQGRMVDPRYYDHYGRMHDYDRYYGRSMFNQGHSMDSQRYGGWMDNPERYMDMSGYQMDMQGRWMDAQGRYNNPFSQMWHSRQGH'
also saved in a file called seq.dat. If I use the following grep command
grep '\([MF]D.\{4,6\}\)\{3,10\}' seq.dat
I get the following matching string:
MDNPERYMDMSGYQMDMQGRWMDAQGRYN
which is what I want. In words, what I want to match is as many consecutive repeats of [MF]D.{4,6} as the string has. I don't want to match cases where it has fewer than 3 consecutive repeats, but I want it to be able to capture up to 6.
Now, I'm trying to do this with Python. I have
p = re.compile("(?:[MF]D.{4,6}){3,10}")
Trying search() returns
MDNPERYMDMSGYQMDMQGRWM
It is close to the answer I seek, but it is still missing the last MDAQGRYN. I'm guessing this is because .{4,6} matches the M, which in turn prevents {3,10} from capturing this 4th occurrence of ([MF]D.{4,6}), but since I asked for at least 3, it's happy and it stops.
How do I make Python regex behave like grep does?
Comments (1)
There is a fundamental difference between POSIX ("text-directed") and NFA ("regex-directed") engines. A POSIX engine (grep uses the POSIX BRE regex flavor by default) parses the input text, applying the regex to it, and returns the longest match possible. An NFA engine (Python's re engine is one) does not re-consume (backtrack) when the subsequent pattern parts match. See the reference on regex-directed and text-directed engines.
The last sentence of that reference says "in most cases", not in all cases, and yours is a good illustration that discrepancies may occur.
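The discrepancy can be reproduced with a short script. This is only a sketch: it uses the string and pattern from the question, and the variable name grep_match is made up here to hold the grep output quoted in the question for comparison.

```python
import re

# String from the question.
seq = ('MNRYLNRQRLYNMYRNKYRGVMEPMSRMTMDFQGRYMDSQGRMVDPRYYDHYGRMHDYDRYYGRS'
       'MFNQGHSMDSQRYGGWMDNPERYMDMSGYQMDMQGRWMDAQGRYNNPFSQMWHSRQGH')

# Pattern from the question: [MF]D followed by 4 to 6 arbitrary chars,
# repeated 3 to 10 times.
p = re.compile(r"(?:[MF]D.{4,6}){3,10}")
m = p.search(seq)
print(m.group())  # MDNPERYMDMSGYQMDMQGRWM

# grep output quoted in the question (hypothetical variable name).
grep_match = "MDNPERYMDMSGYQMDMQGRWMDAQGRYN"

# Python's match is a proper prefix of grep's longer match.
print(grep_match.startswith(m.group()))  # True
print(len(m.group()), len(grep_match))   # 22 29
```

In the third repeat, .{4,6} takes the M of the fourth MD, so the group cannot start a fourth time, and the engine accepts the three repeats it already has instead of trying a shorter third repeat.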
To avoid consuming an M or F that is immediately followed by D, I'd suggest using
(?:[MF]D(?:(?![MF]D).){4,6}){3,10}
Details:
(?: - start of an outer non-capturing container group:
  [MF]D - M or F and then D
  (?:(?![MF]D).){4,6} - any char (other than a line break), repeated four to six times, that does not start an MD or FD char sequence
){3,10} - end of the outer group, repeated 3 to 10 times.
By the way, if you only want to match uppercase ASCII letters, replace the . with [A-Z].
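Putting the parts listed above together gives (?:[MF]D(?:(?![MF]D).){4,6}){3,10}. The sketch below tries that assembled pattern, along with the [A-Z] variant, on the string from the question. The variable names tempered and letters_only are made up for illustration.

```python
import re

# String from the question.
seq = ('MNRYLNRQRLYNMYRNKYRGVMEPMSRMTMDFQGRYMDSQGRMVDPRYYDHYGRMHDYDRYYGRS'
       'MFNQGHSMDSQRYGGWMDNPERYMDMSGYQMDMQGRWMDAQGRYNNPFSQMWHSRQGH')

# Each filler char is matched only if it does not begin another MD or FD,
# so a repeat can never swallow the start of the next repeat.
tempered = re.compile(r"(?:[MF]D(?:(?![MF]D).){4,6}){3,10}")
m = tempered.search(seq)
print(m.group())  # MDNPERYMDMSGYQMDMQGRWMDAQGRYN

# Variant restricted to uppercase ASCII letters.
letters_only = re.compile(r"(?:[MF]D(?:(?![MF]D)[A-Z]){4,6}){3,10}")
print(letters_only.search(seq).group() == m.group())  # True
```

On this input both patterns return the same string that grep printed in the question. That result is specific to this input and does not mean the two engines will agree on every input.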