I am currently trying to match specific sentences based on the words they contain and their order. I am doing this mostly with a lookahead assertion based on this structure:
[^>.\]]*(?="The desired Words)[^<.\]]*
For example, I am looking for sentences that talk about a vacation in the Maledives. To match the sentence "I'll book a vacation to the Maledives.", I could look for sentences that contain the word "vacation" followed later by "Maledives".
The expression
[^>.\]]*(?=([Vv]acation.*[Mm]aledives))[^<.\]]*
using .* causes problems because the word "Maledives" can also appear in a later sentence (Example Wrong).
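To make the (Example Wrong) case concrete, here is a minimal Python sketch. The sample text is made up for illustration, and the pattern is written with [Vv]acation spelled out in full:

```python
import re

# Unbounded version from the question: .* is free to run past the end of the sentence.
pattern = re.compile(r"[^>.\]]*(?=([Vv]acation.*[Mm]aledives))[^<.\]]*")

# Hypothetical sample: "Maledives" only appears in the second sentence.
text = "I will book a vacation. Later we fly to the Maledives."

match = pattern.search(text)
wrong_hit = match.group(0) if match else None

# The first sentence is reported although it never mentions the Maledives.
print(wrong_hit)
```

The lookahead succeeds because .* happily crosses the period, while the surrounding character classes stop at it.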
My solution was to use the expression
([,'`´() \\]*\w+){0,X}\s*
instead of .* to indicate that "Maledives" has to follow "vacation" within the same sentence, with a maximum of X words between them, changing the structure to:
[^>.\]]*(?=([Vv]acation([,'`´() \\]*\w+){0,X}\s*[Mm]aledives))[^<.\]]*
(Example Correct)
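As a rough illustration of the bounded version, the sketch below plugs in X = 5. That value and both sample texts are assumptions picked only for the example:

```python
import re

X = 5  # assumed word limit, chosen only for this example

# Up to X "words" (optional separators followed by word characters) between the two terms.
gap = r"([,'`´() \\]*\w+){0,%d}\s*" % X
pattern = re.compile(r"[^>.\]]*(?=([Vv]acation" + gap + r"[Mm]aledives))[^<.\]]*")

# Hypothetical samples.
same_sentence = "I will book a vacation to the Maledives."
two_sentences = "I will book a vacation. Later we fly to the Maledives."

hit = pattern.search(same_sentence)
miss = pattern.search(two_sentences)

print(hit.group(0) if hit else None)
print(miss)
```

Since a period is neither a listed separator nor a word character, the gap can no longer reach into the next sentence.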
Unfortunately, this expression is quite computationally intensive and leads to catastrophic backtracking if the range {0,X} is set too high.
Do you have any other suggestions for how to look for sentences containing specific words in order?
Comments (2)
You can try something like this.
See demo.
https://regex101.com/r/97UvCH/1
There is no need for a lookahead. It will slow things down.
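The pattern from the linked demo is not reproduced here, so the following is only one possible lookahead-free sketch and not necessarily what the demo uses. It assumes that a negated character class excluding the period is enough to keep both words inside one sentence. The sample texts are made up:

```python
import re

# No lookahead: everything between the two words must be a non-period character.
pattern = re.compile(r"[^<>.\]]*[Vv]acation\b[^<>.\]]*[Mm]aledives[^<.\]]*")

# Hypothetical samples.
same_sentence = "I will book a vacation to the Maledives."
two_sentences = "I will book a vacation. Later we fly to the Maledives."

hit = pattern.search(same_sentence)
miss = pattern.search(two_sentences)

print(hit.group(0) if hit else None)
print(miss)
```

Note that this variant drops the limit on the number of words in between. It only enforces the sentence boundary.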
Your pattern is prone to catastrophic backtracking because there are nested quantifiers in the {0,3} repeating part, and there are also leading optional quantifiers at the start of the pattern.
Python re (before version 3.11) does not support possessive quantifiers or atomic groups, but you can mimic that by using a capture group inside a lookahead assertion and then, once the assertion is true, matching the backreference to that group for the first part. This reduces the backtracking a bit.
But the second part with the quantifier {0,200} should not be atomic, because you want to allow backtracking to fit a variable number of words before matching maledives. So the higher the upper bound of the quantifier, the more possible paths there are to explore.
The pattern matches:
- (?<!\S)  Assert a whitespace boundary to the left
- (?=  Positive lookahead assertion, assert what is to the right is
- (  Capture group 1
- [^<>.\]]*[Vv]acation\b  Match optional chars other than the listed and then match vacation followed by a word boundary
- )  Close group 1
- )  Close the lookahead
- \1  Match a backreference to group 1 (that is matched in the lookahead)
- (?:[,'`´() \\]+\w+){0,200}  Repeat 0 to 200 times one or more chars from the character class and then 1+ word characters
- \s*[Mm]aledives  Match optional whitespace chars and then maledives
- [^<.\]]*  Optionally match any character other than those listed in the character class
See a regex demo.
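Put together, the parts listed above can be tried out in Python re roughly like this. The sample texts are assumptions made up for the example:

```python
import re

# Group 1 is captured inside a lookahead and then consumed via \1, so the
# engine cannot backtrack into that first part once it has matched.
pattern = re.compile(
    r"(?<!\S)(?=([^<>.\]]*[Vv]acation\b))\1"
    r"(?:[,'`´() \\]+\w+){0,200}\s*[Mm]aledives[^<.\]]*"
)

# Hypothetical samples.
same_sentence = "I will book a vacation to the Maledives."
two_sentences = "I will book a vacation. Later we fly to the Maledives."

hit = pattern.search(same_sentence)
miss = pattern.search(two_sentences)

print(hit.group(0) if hit else None)
print(miss)
```

The {0,200} part is still free to backtrack, so very long sentences can remain expensive to check.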
Another option could be using the PyPI regex module with an atomic group (?>...) for the first part. See a Python demo.
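A minimal sketch of the atomic-group variant could look like the following. It assumes the third-party regex module is installed and otherwise falls back to the standard re module, which accepts (?>...) only from Python 3.11 on. The alias name engine and the sample texts are made up for the example:

```python
try:
    import regex as engine  # third-party module from PyPI (pip install regex)
except ImportError:
    import re as engine  # fallback: stdlib re accepts (?>...) only on Python 3.11+

# The atomic group locks in the first part once it has matched, which replaces
# the lookahead-plus-backreference workaround.
pattern = engine.compile(
    r"(?<!\S)(?>[^<>.\]]*[Vv]acation\b)"
    r"(?:[,'`´() \\]+\w+){0,200}\s*[Mm]aledives[^<.\]]*"
)

# Hypothetical samples.
same_sentence = "I will book a vacation to the Maledives."
two_sentences = "I will book a vacation. Later we fly to the Maledives."

hit = pattern.search(same_sentence)
miss = pattern.search(two_sentences)

print(hit.group(0) if hit else None)
print(miss)
```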