Phrase matching with regular expressions and Python

Posted on 2025-01-07 18:38:10

I have some short phrases that I want to match on. I used a regex as follows:

(^|)(piston|piston ring)( |$)

Using the above, regex.match("piston ring") matches on "piston". If I change the regex so that the longer phrase "piston ring" comes first, then it works as expected.

I was surprised by this behavior as I was assuming that the greedy nature of regex would try to match the longest string "for free."

What am I missing? Can somebody explain this? Thanks!

Comments (4)

就此别过 2025-01-14 18:38:10

When using alternation (|) in regular expressions, each option is attempted in order from left to right until a match can be found. So in your example since a match can be made with piston, piston ring will never be attempted.

A better way to write this regex would be something like this:

(^|)(piston( ring)?)( |$)

This will attempt to match 'piston', and then immediately attempt to match ' ring', with the ? making it optional. Alternatively, just make sure the longer options come first in the alternation.
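To see the ordering effect concretely, here is a small sketch using Python's re module with the patterns from the question and from this answer, copied as written. The variable names are only for illustration.

```python
import re

text = "piston ring"

# Shorter alternative first: "piston" succeeds, so "piston ring" is never tried.
short_first = re.compile(r"(^|)(piston|piston ring)( |$)")
# Longer alternative first.
long_first = re.compile(r"(^|)(piston ring|piston)( |$)")
# Optional suffix instead of a second alternative.
optional = re.compile(r"(^|)(piston( ring)?)( |$)")

for name, pattern in [("short_first", short_first),
                      ("long_first", long_first),
                      ("optional", optional)]:
    m = pattern.match(text)
    # group(2) is the phrase captured by the middle group
    print(name, repr(m.group(2)))
```

With the shorter alternative first, the trailing ( |$) is satisfied by the space after "piston", so the engine has no reason to backtrack into the second alternative.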

You may also want to consider using a word boundary, \b, instead of (^|) and ( |$).

再见回来 2025-01-14 18:38:10

From http://www.regular-expressions.info/alternation.html (first Google result):

The regex engine is eager. It will stop searching as soon as it finds a valid match. The consequence is that in certain situations, the order of the alternatives matters.

One exception:

The POSIX standard mandates that the longest match be returned, regardless if the regex engine is implemented using an NFA or DFA algorithm.

Possible solutions:

  • piston( ring)?
  • (piston ring|piston) (put the longest alternative first)
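When the phrase list is built at runtime, the "longest first" rule can be applied automatically. The following is only a sketch for Python's re module, which is not a POSIX leftmost-longest engine. build_phrase_pattern is a hypothetical helper name, and wrapping the alternation in \b is an assumption about what counts as a phrase boundary, not something taken from the quoted page.

```python
import re

def build_phrase_pattern(phrases):
    """Hypothetical helper: one alternation with the longest phrases first."""
    ordered = sorted(phrases, key=len, reverse=True)
    # re.escape keeps any regex metacharacters in the phrases literal
    alternation = "|".join(re.escape(p) for p in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b")

pattern = build_phrase_pattern(["piston", "piston ring"])

for sample in ["piston ring", "piston", "piston rod"]:
    m = pattern.search(sample)
    print(repr(sample), "->", m.group(0) if m else None)
```

Sorting by length is enough here because a phrase can only be shadowed by an alternative that is a prefix of it, and a prefix is always shorter.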

沫雨熙 2025-01-14 18:38:10

That's the behaviour of alternation. It tries to match the first alternative, "piston", and if that succeeds it is done.

That means it will not try all alternatives; it finishes with the first one that matches.

You can find more details on regular-expressions.info.

Word boundaries, \b, could also be interesting for you. I think what you are looking for is

\bpiston(?: ring)?\b
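A quick check of that pattern against a few inputs, as a sketch in Python; the sample strings are made up for illustration.

```python
import re

pattern = re.compile(r"\bpiston(?: ring)?\b")

samples = ["piston", "piston ring", "piston rings", "pistons", "spiston ring"]

results = {}
for s in samples:
    m = pattern.search(s)
    # store the matched text, or None when the pattern does not match
    results[s] = m.group(0) if m else None
    print(repr(s), "->", results[s])
```

Note the "piston rings" case: the trailing \b fails after "ring" because "s" follows, so the engine backtracks, drops the optional group, and matches just "piston".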

ゝ杯具 2025-01-14 18:38:10
Edit2: It wasn't clear whether your test data contained pipes or not. I saw the pipes in the regex and assumed you were searching pipe-delimited text. Oh well, not sure if the below helps.

Using regex to match text that's pipe delimited will need more alternations to pick up the beginning and ending columns.

What about another approach?

import re

text = 'start piston|xxx|piston ring|xxx|piston cast|xxx|piston|xxx|stock piston|piston end'
j = re.split(r'\|', text)  # split into columns on the literal pipe

# columns that contain 'piston' anywhere
k = [x for x in j if x.find('piston') >= 0]
# ['start piston', 'piston ring', 'piston cast', 'piston', 'stock piston', 'piston end']

# columns that start with 'piston'
k = [x for x in j if x.startswith('piston')]
# ['piston ring', 'piston cast', 'piston', 'piston end']

# columns that are exactly 'piston'
k = [x for x in j if x == 'piston']
# ['piston']

# exact membership test for a whole phrase
if 'piston ring' in j:
    print(True)
# True

Edit: To clarify - take this example:

text2='piston1|xxx|spiston2|xxx|piston ring|xxx|piston3'

I add '.' (match any one character) after the word to show which items are matched:

re.findall(r'piston.', text2)
['piston1', 'piston2', 'piston ', 'piston3']

To make it more accurate, you need a look-behind assertion.
It guarantees that you match '|piston' without including the pipe in the result:

re.findall(r'(?<=\|)piston.', text2)
['piston ', 'piston3']

Switch from greedy matching to matching up to the first stop character: .*?<stop char>
Add grouping parens to keep the trailing pipe out of the result. The lazy .*? expands only until whatever follows it in the pattern can match, here the escaped pipe after the group; the parens only mark what is captured. This seems to work, but it ignores the last column.

re.findall(r'(?<=\|)(piston.*?)\|', text2)
['piston ring']
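To show why the lazy quantifier matters here, the sketch below runs the same pattern with a greedy .* and with the lazy .*? on the text2 string from above. The variable names greedy and lazy are only for illustration.

```python
import re

text2 = 'piston1|xxx|spiston2|xxx|piston ring|xxx|piston3'

# Greedy: .* runs to the end of the string, then backtracks to the LAST pipe.
greedy = re.findall(r'(?<=\|)(piston.*)\|', text2)
# Lazy: .*? stops at the FIRST pipe after 'piston'.
lazy = re.findall(r'(?<=\|)(piston.*?)\|', text2)

print(greedy)
print(lazy)
```

The greedy version swallows the following column as well, while the lazy version stays inside one column.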

Once the group is there, the look-behind is no longer needed. The pattern can simply start with an escaped pipe, because findall returns only the group:

re.findall(r'\|(piston.*?)\|', text2)
['piston ring']

To search the last column as well, add the non-capturing group (?:\||$), which matches either a pipe (it needs to be escaped) or the end of the string ($).
A non-capturing group (?:x1|x2) is not included in the result, and as a bonus the engine does not have to store a capture for it.

re.findall(r'\|(piston.*?)(?:\||$)', text2)
['piston ring', 'piston3']

Finally, to handle the beginning of the string, add another alternation, much like the previous one for the end of the string:

re.findall(r'(?:\||^)(piston.*?)(?:\||$)', text2)
['piston1', 'piston ring', 'piston3']
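One caveat with the final pattern is that it consumes the closing pipe, so a matching column that immediately follows another matching column is skipped. A possible variant, sketched here as an assumption rather than as part of the original answer, uses look-arounds so the delimiters are checked but not consumed. text3 is made-up data with two adjacent piston columns.

```python
import re

# Hypothetical data: 'piston1' and 'piston ring' sit in adjacent columns.
text3 = 'piston1|piston ring|xxx|piston3'

# Pattern from the answer: the trailing pipe is consumed by each match.
consuming = re.findall(r'(?:\||^)(piston.*?)(?:\||$)', text3)

# Look-behind / look-ahead only assert the delimiters, leaving them in place.
lookaround = re.findall(r'(?:(?<=\|)|^)(piston.*?)(?=\||$)', text3)

# The split-based approach for comparison.
by_split = [col for col in text3.split('|') if col.startswith('piston')]

print(consuming)
print(lookaround)
print(by_split)
```

For plain pipe-delimited data the split version is probably the easiest to read; the look-around pattern is mainly useful when the match has to stay a single regex.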

Hope it helps. :)
