Extract three words before the first occurrence of a particular word
I have been trying to extract the three words that come before the first occurrence of a particular word.
For example:
Input: Kerala High Court Jurisdiction.
Known word: Jurisdiction.
Output: Kerala High Court
I tried the following regular expression, but it didn't work.
m = re.search("((?:\S+\s+){3,}\JURISDICTION\b\s*(?:\S+\b\s*){3,})",contents)
print(m)
Comments (4)

Here are multiple ways to do so:
(){3}: capturing group, repeated 3 times.
\w+: matches a word character between one and unlimited times.
\W+: matches any character other than a word character between one and unlimited times.
(?=): positive lookahead.
Jurisdiction: matches Jurisdiction.
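The pieces listed above can be assembled into a single pattern. The following is only a sketch of how they might fit together: the helper name words_before_lookahead and the use of re.escape are assumptions for illustration, not part of the original answer.

```python
import re

def words_before_lookahead(text, word="Jurisdiction"):
    """Return the three words right before `word`, or None if there is no match.

    Hypothetical helper: it combines (\\w+\\W+){3} with a positive lookahead.
    """
    # Three repetitions of "word characters, then non-word characters",
    # which must be followed by the known word (the lookahead consumes nothing).
    pattern = r"(\w+\W+){3}(?=" + re.escape(word) + r")"
    match = re.search(pattern, text)
    if match is None:
        return None
    # group(0) is the whole match; group(1) would hold only the last repetition.
    return match.group(0).strip()

result = words_before_lookahead("Kerala High Court Jurisdiction.")
print(result)  # Kerala High Court
```

Because the group is repeated, only the whole match (group 0) contains all three words, which is why the sketch reads group(0) and strips the trailing space.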

You can use re for this. The pattern could look like: ^([\w ]+)Jurisdiction
Explanation:
Matching the input against this pattern gives you
['Kerala High Court ']
Take the first element of the above list, strip the whitespace, and then split it at whitespace.
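The code for this answer did not survive on the page, so the following is a reconstruction under assumptions: it supposes re.findall was the call used, and the function name words_from_start is hypothetical. Note that the ^ anchor captures everything from the start of the string, so the sketch keeps only the last three words in case the input is longer.

```python
import re

def words_from_start(text, word="Jurisdiction"):
    """Hypothetical helper: capture the text before `word`, then split it into words."""
    # [\w ]+ matches word characters and spaces from the start of the string.
    found = re.findall(r"^([\w ]+)" + re.escape(word), text)
    if not found:
        return found, []
    # First element, stripped of surrounding whitespace, split at whitespace.
    words = found[0].strip().split()
    return found, words[-3:]

found, words = words_from_start("Kerala High Court Jurisdiction.")
print(found)  # ['Kerala High Court ']
print(words)  # ['Kerala', 'High', 'Court']
```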

The expression looks for three words before the word 'Jurisdiction'.
re.I makes it case insensitive. You're supposed to use a lookahead (?=...) to check whether the match precedes a pattern. You can remove ?= if you want to include the word Jurisdiction in your matches.
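The expression this answer refers to is missing from the page. As an illustration only, the sketch below assumes a pattern of the form (?:\S+\s+){3}(?=JURISDICTION) and compares it with the variant in which ?= has been removed; both patterns are assumptions, not the answerer's exact code.

```python
import re

text = "Kerala High Court Jurisdiction."

# With the lookahead, the known word is checked but not included in the match.
lookahead = re.search(r"(?:\S+\s+){3}(?=JURISDICTION)", text, re.I)

# With ?= removed, the group becomes a normal capture and the word is consumed.
inclusive = re.search(r"(?:\S+\s+){3}(JURISDICTION)", text, re.I)

before = lookahead.group(0).strip()
with_word = inclusive.group(0)
print(before)     # Kerala High Court
print(with_word)  # Kerala High Court Jurisdiction
```

The re.I flag is what lets the uppercase JURISDICTION in the pattern match the mixed-case word in the input.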

About the pattern that you tried:
{3,}: repeats 3 or more times instead of exactly 3.
\J: not a valid escape sequence in a Python regular expression; the J does not need a backslash.
\s*(?:\S+\b\s*){3,}: means that the repeating pattern must also be present after matching JURISDICTION.
To extract 3 words before the first occurrence, you can use re.search, and use a capture group instead of a lookahead.
The pattern matches:
(: capture group 1
\S+: match 1+ non-whitespace chars
(?:\s+\S+){2}: repeat 2 times matching 1+ whitespace chars and 1+ non-whitespace chars
): close group 1
\s+JURISDICTION\b: match 1+ whitespace chars, then JURISDICTION followed by a word boundary
re.I can be used for a case-insensitive match.
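The parts described above combine into the pattern (\S+(?:\s+\S+){2})\s+JURISDICTION\b. The original code and its output are missing from the page, so what follows is a minimal sketch; the function name three_words_before and the second sample string are assumptions for illustration.

```python
import re

# Group 1 holds exactly three whitespace-separated words before the known word.
PATTERN = re.compile(r"(\S+(?:\s+\S+){2})\s+JURISDICTION\b", re.I)

def three_words_before(text):
    """Hypothetical helper: return the three words before the first match, or None."""
    # re.search returns the leftmost match, so only the first occurrence
    # that has three words in front of it is used.
    match = PATTERN.search(text)
    return match.group(1) if match else None

first = three_words_before("Kerala High Court Jurisdiction.")
print(first)  # Kerala High Court

second = three_words_before(
    "Kerala High Court Jurisdiction and Delhi High Court Jurisdiction"
)
print(second)  # Kerala High Court
```

Using a capture group means the result can be read directly from group(1), with no trailing whitespace to strip.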