How do I make Python's re behave like grep for repeating groups?

Posted on 2025-01-14 08:19:09

I have the following string:

seq = 'MNRYLNRQRLYNMYRNKYRGVMEPMSRMTMDFQGRYMDSQGRMVDPRYYDHYGRMHDYDRYYGRSMFNQGHSMDSQRYGGWMDNPERYMDMSGYQMDMQGRWMDAQGRYNNPFSQMWHSRQGH'

also saved in a file called seq.dat. If I use the following grep command

grep '\([MF]D.\{4,6\}\)\{3,10\}' seq.dat

I get the following matching string:

MDNPERYMDMSGYQMDMQGRWMDAQGRYN

which is what I want. In words, what I want to match is as many consecutive repeats of [MF]D.{4,6} as the string has. I don't want to match cases with fewer than 3 consecutive repeats, but I want it to be able to capture up to 6.

Now, I'm trying to do this with python. I have

p = re.compile("(?:[MF]D.{4,6}){3,10}")

Trying search() returns

MDNPERYMDMSGYQMDMQGRWM

It is close to the answer I seek, but it is still missing the last MDAQGRYN. I'm guessing this is because .{4,6} matches the M, which in turn prevents {3,10} from capturing the 4th occurrence of ([MF]D.{4,6}), but since I asked for at least 3, it's happy and it stops.
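The behavior described above can be reproduced with a short sketch. The second pattern is not from the question: it is a hypothetical diagnostic that spells the repeated unit out three times, each in its own capturing group, so that what each repeat consumed becomes visible.

```python
import re

seq = 'MNRYLNRQRLYNMYRNKYRGVMEPMSRMTMDFQGRYMDSQGRMVDPRYYDHYGRMHDYDRYYGRSMFNQGHSMDSQRYGGWMDNPERYMDMSGYQMDMQGRWMDAQGRYNNPFSQMWHSRQGH'

# Pattern from the question, written as a raw string.
p = re.compile(r"(?:[MF]D.{4,6}){3,10}")
match = p.search(seq)
print(match.group(0))  # MDNPERYMDMSGYQMDMQGRWM

# Hypothetical diagnostic: the same unit three times, one capturing group each,
# to show how much text every repeat swallowed.
unit = r"([MF]D.{4,6})"
pieces = re.search(unit * 3, seq).groups()
print(pieces)  # ('MDNPERY', 'MDMSGYQ', 'MDMQGRWM')
```

The third piece ends in M: the greedy .{4,6} took the M that would have started the fourth repeat, which is consistent with the guess in the question.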

How do I make Python regex behave like grep does?

Comments (1)

念三年u 2025-01-21 08:19:09

There is a fundamental difference between POSIX ("text-directed") engines and backtracking NFA ("regex-directed") engines. A POSIX engine (grep uses the POSIX BRE regex flavor by default) scans the input text and returns the leftmost, longest possible match. A regex-directed NFA engine, such as Python's re, returns the first match it finds: once the rest of the pattern has succeeded, it does not backtrack to try alternatives that would yield a longer match.

See reference on regex-directed and text-directed engines:

A regex-directed engine walks through the regex, attempting to match the next token in the regex to the next character. If a match is found, the engine advances through the regex and the subject string. If a token fails to match, the engine backtracks to a previous position in the regex and the subject string where it can try a different path through the regex... Modern regex flavors using regex-directed engines have lots of features such as atomic grouping and possessive quantifiers that allow you to control this backtracking.

A text-directed engine walks through the subject string, attempting all permutations of the regex before advancing to the next character in the string. A text-directed engine never backtracks. Thus, there isn’t much to discuss about the matching process of a text-directed engine. In most cases, a text-directed engine finds the same matches as a regex-directed engine.

The last sentence says "in most cases", not in all cases, and yours is a good illustration that discrepancies may occur.
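To make the difference concrete, here is a minimal sketch that emulates leftmost-longest matching with Python's re by brute force. The helper name leftmost_longest is hypothetical, and this is not how grep is implemented; it only shows that the unmodified pattern from the question does admit the longer match when the engine is forced to consider every possible end position.

```python
import re

seq = 'MNRYLNRQRLYNMYRNKYRGVMEPMSRMTMDFQGRYMDSQGRMVDPRYYDHYGRMHDYDRYYGRSMFNQGHSMDSQRYGGWMDNPERYMDMSGYQMDMQGRWMDAQGRYNNPFSQMWHSRQGH'

# The original, unmodified pattern from the question.
pattern = re.compile(r"(?:[MF]D.{4,6}){3,10}")

def leftmost_longest(pattern, text):
    """Hypothetical helper: brute-force leftmost-longest search."""
    for start in range(len(text)):
        # Try the longest candidate first, so the first hit is the longest one.
        for end in range(len(text), start, -1):
            # fullmatch must cover text[start:end] exactly, which forces
            # the engine to backtrack until it fits or fails.
            if pattern.fullmatch(text, start, end):
                return text[start:end]
    return None

result = leftmost_longest(pattern, seq)
print(result)  # MDNPERYMDMSGYQMDMQGRWMDAQGRYN
```

This is quadratic in the length of the input, so it is only suitable as an illustration on short strings.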

To avoid consuming an M or F that is immediately followed by D, I'd suggest using

(?:[MF]D(?:(?![MF]D).){4,6}){3,10}

See the regex demo. Details:

  • (?: - start of an outer non-capturing container group:
    • [MF]D - M or F and then D
    • (?:(?![MF]D).){4,6} - any char (other than a line break) repeated four to six times, that does not start an MD or FD char sequence
  • ){3,10} - end of the outer group, repeat 3 to 10 times.

By the way, if you only want to match uppercase ASCII letters, replace the . with [A-Z].
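As a quick check, the suggested pattern can be run against the string from the question. The variable names and the short digit-containing test string below are assumptions made for illustration only.

```python
import re

seq = 'MNRYLNRQRLYNMYRNKYRGVMEPMSRMTMDFQGRYMDSQGRMVDPRYYDHYGRMHDYDRYYGRSMFNQGHSMDSQRYGGWMDNPERYMDMSGYQMDMQGRWMDAQGRYNNPFSQMWHSRQGH'

# Suggested pattern: the filler characters may not start an MD or FD sequence.
fixed = re.compile(r"(?:[MF]D(?:(?![MF]D).){4,6}){3,10}")
print(fixed.search(seq).group(0))  # MDNPERYMDMSGYQMDMQGRWMDAQGRYN

# Variant restricted to uppercase ASCII letters.
letters_only = re.compile(r"(?:[MF]D(?:(?![MF]D)[A-Z]){4,6}){3,10}")
print(letters_only.search(seq).group(0))  # MDNPERYMDMSGYQMDMQGRWMDAQGRYN

# Made-up input with digits: "." accepts them, [A-Z] does not.
sample = "MD1234MD1234MD1234"
print(fixed.search(sample) is not None)  # True
print(letters_only.search(sample))       # None
```

Both patterns return the same text as the grep command for this input, because the sequence contains only uppercase letters.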
