Greedy versus non-greedy matching in Python re

Posted 2024-09-12 12:23:54

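Please help me discover whether this is a bug in Python (2.6.5), in my competence at writing regexes, or in my understanding of pattern matching.

(I accept that a possible answer is "Upgrade your Python".)

I'm trying to parse a Yubikey token, allowing for the optional extras.

When I use this regex to match a token without any optional extras (that is, containing only the text that matches the two capture groups), the match fails: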

r'^\t?[^a-z0-9]?([cbdefghijklnrtuv1-8]{0,32})\t?([cbdefghijklnrtuv1-8]{32})\t?\r?\n?$'

However, if I make the first group non-greedy:

r'^\t?[^a-z0-9]?([cbdefghijklnrtuv1-8]{0,32}?)\t?([cbdefghijklnrtuv1-8]{32})\t?\r?\n?$'

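it succeeds.

So, OK, it's working, but I would have thought that the only difference in end result between these two regexes would be performance.

Both Expresso and Regex Coach like both patterns.

What have I missed?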


Here are some of the strings I'm testing with.

No optional extras (the kind that can fail):

"vvbrentlnccnhgfgrtetilbvckjcegblehfvbihrdcui"

With optional extras (haven't failed so far; actual tabs are shown here as "_"):

"_!_8R5Gkruvfgheufhcnhllchgrfiutujfh_"
"_!1U4Knivdgvkfthrd_brvejhudrdnbunellrjjkkccfnggbdng_"

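I've tried to reproduce it using the suggestion from Alex Martelli, and it doesn't fail in the raw Python environment, so I'm going to revisit my code (I'm actually hacking on yubikey-python); I'll report back in a day or so.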
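For anyone who wants to repeat that check outside the original code, here is a minimal sketch. It assumes a current Python 3 interpreter rather than the 2.6.5 from the question, and the names GREEDY, LAZY and NO_EXTRAS are mine. Both patterns and the sample string are copied from the question.

```python
import re

# Both patterns come from the question; only the quantifier on the
# first group differs ({0,32} versus {0,32}?).
GREEDY = re.compile(
    r'^\t?[^a-z0-9]?([cbdefghijklnrtuv1-8]{0,32})\t?'
    r'([cbdefghijklnrtuv1-8]{32})\t?\r?\n?$'
)
LAZY = re.compile(
    r'^\t?[^a-z0-9]?([cbdefghijklnrtuv1-8]{0,32}?)\t?'
    r'([cbdefghijklnrtuv1-8]{32})\t?\r?\n?$'
)

# The "no optional extras" sample from the question (44 characters).
NO_EXTRAS = "vvbrentlnccnhgfgrtetilbvckjcegblehfvbihrdcui"

greedy_match = GREEDY.match(NO_EXTRAS)
lazy_match = LAZY.match(NO_EXTRAS)

# Report whether each pattern matched, then whether the captures agree.
print(greedy_match is not None, lazy_match is not None)
print(greedy_match.groups() == lazy_match.groups())
```

Because the pattern is anchored at both ends and the second group must be exactly 32 characters, the split between the two groups is forced: the first group can only hold the leading 12 characters, whichever quantifier is used.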


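My apologies to everyone. I cannot reproduce the problem. When it occurred, I was reading input via getpass; I suspect that an accidental stray keystroke got in the way.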
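One way to catch that kind of stray character is to look at repr() of the raw input before matching. The sketch below is only an illustration: find_unexpected and EXPECTED are hypothetical names, not part of yubikey-python, and the allowed set is an assumption taken from the character class in the question plus the tab, CR and LF separators.

```python
import re

# Characters the question's pattern accepts inside its capture groups,
# plus the tab, CR and LF separators. This set is an assumption for
# illustration; it ignores the one optional leading non-alphanumeric.
EXPECTED = re.compile(r'[cbdefghijklnrtuv1-8\t\r\n]')


def find_unexpected(raw):
    """Return (index, character) pairs for characters outside EXPECTED."""
    return [(i, ch) for i, ch in enumerate(raw) if not EXPECTED.match(ch)]


# A token fragment with an escape character typed into the middle of it.
sample = "vvbrent\x1blnccn"
print(repr(sample))             # repr() makes the invisible character visible
print(find_unexpected(sample))
```

Input read through getpass is not echoed, so a check like this on the returned string may be the only place an accidental keystroke shows up.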

I am going to close the question. If whoever upvoted the question wishes to remove their vote, that is fair.

Very sorry.




Comments (2)

痴者 2024-09-19 12:23:54

I'd recommend using yubikey-python for Python interfacing to yubikey, but that's a side (and strictly pragmatic) issue ;-).

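In theory, there should be no cases where a choice between greedy and non-greedy causes a RE to match in one case and fail in another. It should only affect what gets matched (and, as you mention, performance), not whether the match succeeds at all, since REs are supposed to backtrack for that purpose.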
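A small illustration of that point, written for a current Python 3 and using toy patterns that are not from the question: with nothing forcing the split, greedy and lazy quantifiers capture different text, and once anchors and a fixed-width group force the split, backtracking brings both variants to the same result.

```python
import re

text = "aaaa"

# Nothing constrains the split: both variants match, but the first group
# takes everything when greedy and nothing when lazy.
greedy = re.match(r'(a*)(a*)', text)
lazy = re.match(r'(a*?)(a*)', text)
print(greedy.groups())  # ('aaaa', '')
print(lazy.groups())    # ('', 'aaaa')

# Anchors plus a fixed-width second group force the split: the greedy
# variant backtracks from 3 characters down to 2, the lazy variant grows
# from 0 up to 2, and both end with the same captures.
greedy_fixed = re.match(r'^(a{0,3})(a{2})$', text)
lazy_fixed = re.match(r'^(a{0,3}?)(a{2})$', text)
print(greedy_fixed.groups())  # ('aa', 'aa')
print(lazy_fixed.groups())    # ('aa', 'aa')
```

The second pair has the same shape as the patterns in the question: a bounded optional group followed by a fixed-length group, anchored at both ends.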

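Problem is, I cannot reproduce the problem: I don't have a yubikey at hand, and the tests in this file show no differences between the two REs' match/no-match behavior.

Could you please post a couple of failing examples (where one matches and the other one doesn't), ideally by editing your question, so I can reproduce the problem and try to cut it down to its minimum? Sounds like there may be a RE bug, but without reproducible cases I can't check if and when it's been fixed, already reported, or what. Thanks!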

Edit: the OP has now posted one failing example, but I still can't reproduce:

$ py26
Python 2.6.5 (r265:79359, Mar 24 2010, 01:32:55) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> r1 = re.compile(r'^\t?[^a-z0-9]?([cbdefghijklnrtuv1-8]{0,32})\t?([cbdefghijklnrtuv1-8]{32})\t?\r?\n?$'
... )
>>> r2 = re.compile(r'^\t?[^a-z0-9]?([cbdefghijklnrtuv1-8]{0,32}?)\t?([cbdefghijklnrtuv1-8]{32})\t?\r?\n?$'
... )
>>> nox="vvbrentlnccnhgfgrtetilbvckjcegblehfvbihrdcui"
>>> r1.match(nox)
<_sre.SRE_Match object at 0xcc458>
>>> r2.match(nox)
<_sre.SRE_Match object at 0xcc920>
>>> 

i.e., the match succeeds in both cases, as it should, and that's exactly the same 2.6.5 Python version as the OP is using. OP, please show the results of this simple sequence of commands on your platform and tell us exactly what the platform is, since it looks like a weird platform-dependent bug... thanks!


小猫一只 2024-09-19 12:23:54

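You're right: simply switching from greedy to non-greedy quantifiers should not cause a regex to stop working. It can change how quickly the regex matches (or fails to match), how much it matches, and which parts get captured in which groups; that's all.

(The following "solution" is not applicable, but the question still doesn't indicate that a case-insensitive match is being performed, so I'll leave it.)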

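Your problem is that the strings with the optional extras also have uppercase letters in them, and your regex only allows for lowercase letters. Stick a (?i) on the front of the regex and it works just fine.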
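A quick sketch of that effect, assuming a current Python 3 and using the greedy pattern and the two "with optional extras" samples from the question (the "_" placeholders are written back as real tabs). It passes re.IGNORECASE to re.compile, which has the same effect as a leading (?i); the variable names are mine.

```python
import re

# The greedy pattern from the question, split across two raw strings.
PATTERN = (
    r'^\t?[^a-z0-9]?([cbdefghijklnrtuv1-8]{0,32})\t?'
    r'([cbdefghijklnrtuv1-8]{32})\t?\r?\n?$'
)

# Samples with optional extras; "_" in the question stood for a tab.
WITH_EXTRAS = [
    "\t!\t8R5Gkruvfgheufhcnhllchgrfiutujfh\t",
    "\t!1U4Knivdgvkfthrd\tbrvejhudrdnbunellrjjkkccfnggbdng\t",
]

case_sensitive = re.compile(PATTERN)
case_insensitive = re.compile(PATTERN, re.IGNORECASE)

# For each sample, record (matched case-sensitively, matched ignoring case).
results = [
    (case_sensitive.match(s) is not None, case_insensitive.match(s) is not None)
    for s in WITH_EXTRAS
]
print(results)  # [(False, True), (False, True)]
```

Since the question reports that these strings had not failed, the original code was presumably already matching case-insensitively, which is why the answer marks this suggestion as not applicable.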

