Extract three words before the first occurrence of a particular word

Posted on 2025-01-17 10:34:37 · 267 characters · 1 view · 0 comments

I have been trying to extract the three words that precede the first occurrence of a particular word.
For example:
Input: Kerala High Court Jurisdiction.
Known Word: Jurisdiction.
Output: Kerala High Court


I have tried the following regular expression, but it didn't work.

m = re.search("((?:\S+\s+){3,}\JURISDICTION\b\s*(?:\S+\b\s*){3,})",contents)
print(m)

Comments (4)

苏别ゝ 2025-01-24 10:34:37

Here are several ways to do it:

# Method 1
# Split the sentence into words and get the index of "Jurisdiction"
data = "Word Kerala High Court Jurisdiction"
words = data.split()
idx = words.index('Jurisdiction')  # look the index up once
new_data = words[idx - 3:idx]
print(new_data)  # ['Kerala', 'High', 'Court']

# Method 2
# Split the sentence at "Jurisdiction", then split the text before it into words
data = "Word Kerala High Court Jurisdiction"
new_data = data.split('Jurisdiction')[0].split()[-3:]
print(new_data)  # ['Kerala', 'High', 'Court']


# Method 3
# Using regex
import re

data = "Word Kerala High Court Jurisdiction"
new_data = re.search(r"(\w+\W+){3}(?=Jurisdiction)", data)
print(new_data.group())  # Kerala High Court (with a trailing space; call .strip() to drop it)

  • (){3}: capturing group, repeated exactly 3 times.
    • \w+: matches one or more word characters.
    • \W+: matches one or more non-word characters.
  • (?=): positive lookahead.
  • Jurisdiction: matches the literal text Jurisdiction.
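
The split-based methods above can be wrapped in a small helper. This is only a sketch: the function name `words_before` and its `n` parameter are hypothetical, and it assumes the known word appears as a standalone whitespace-separated token (so "Jurisdiction." with a trailing period would not be found). It also guards the case where fewer than three words precede the known word, where a negative slice start would otherwise return the wrong result.

```python
def words_before(text, word, n=3):
    """Return up to n words that precede the first occurrence of word."""
    words = text.split()
    try:
        idx = words.index(word)  # index of the first occurrence
    except ValueError:
        return []  # word is not present as a standalone token
    # Clamp the start at 0 so a negative index never wraps around.
    return words[max(idx - n, 0):idx]


print(words_before("Word Kerala High Court Jurisdiction", "Jurisdiction"))
# ['Kerala', 'High', 'Court']
print(words_before("High Court Jurisdiction", "Jurisdiction"))
# ['High', 'Court']
print(words_before("Kerala High Court", "Jurisdiction"))
# []
```
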
叫嚣ゝ 2025-01-24 10:34:37

You can use re for this; the pattern could look like: ^([\w ]+)Jurisdiction

import re
s = """Kerala High Court Jurisdiction."""
print(re.findall(r"^([\w ]+)Jurisdiction", s)[0].strip().split())
# ['Kerala', 'High', 'Court']

Explanation:

re.findall(r"^([\w ]+)Jurisdiction", s)

gives you ['Kerala High Court ']

[0].strip().split()

Takes the first element of the list above, strips the surrounding whitespace, and then splits it on whitespace.
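
Note that this pattern captures every word before Jurisdiction, not only three. It gives three words here because the sample input has exactly three. A minimal sketch of trimming the result to the last three words, using a made-up longer input string:

```python
import re

# Hypothetical input with more than three words before the known word.
s = "The Honourable Kerala High Court Jurisdiction."

m = re.search(r"^([\w ]+)Jurisdiction", s)
before = m.group(1).split() if m else []  # every word before the known word
last_three = before[-3:]                  # keep only the last three

print(before)      # ['The', 'Honourable', 'Kerala', 'High', 'Court']
print(last_three)  # ['Kerala', 'High', 'Court']
```
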

万劫不复 2025-01-24 10:34:37
import re

# contents is the input text from the question
matches = re.findall(r'(?:\b\w+\s+){3}(?=Jurisdiction)', contents, flags=re.I)
for match in matches:
    print(match)

The expression looks for three words before the word 'Jurisdiction'.

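re.I makes the match case-insensitive.

The positive lookahead (?=...) asserts that the three words are followed by the pattern without including it in the match. To include the word Jurisdiction in your matches, replace (?=Jurisdiction) with plain Jurisdiction; removing only the ?= would leave a capturing group, and re.findall would then return just that group.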
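
As a runnable illustration of the snippet above, here is a sketch with a made-up `contents` string containing two occurrences. `re.findall` returns every match, while `re.search` returns only the first, which is what the question asks for. Each match ends with trailing whitespace, so it is stripped.

```python
import re

# Hypothetical sample text; the question does not show the real contents.
contents = "Kerala High Court Jurisdiction and Madras High Court JURISDICTION"
pattern = r"(?:\b\w+\s+){3}(?=Jurisdiction)"

# All occurrences, with the trailing whitespace removed.
all_matches = [m.strip() for m in re.findall(pattern, contents, flags=re.I)]
print(all_matches)  # ['Kerala High Court', 'Madras High Court']

# Only the first occurrence.
first = re.search(pattern, contents, flags=re.I)
if first:
    print(first.group().strip())  # Kerala High Court
```
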

梦年海沫深 2025-01-24 10:34:37

About the pattern that you tried:

  • {3,} repeats 3 or more times instead of exactly 3
  • Do not escape the J; Python's re module treats unknown escapes of ASCII letters such as \J as errors
  • The pattern ends with \s*(?:\S+\b\s*){3,}, which requires the repeated group to also be present after JURISDICTION
  • You wrap the whole pattern in a capture group; instead, you can capture only the part that you want and just match what should come before (or after) it
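
To make the first point concrete, the following sketch compares {3,} with {3} on a made-up input that has five words before the known word. The patterns are simplified versions of the attempted one (J unescaped, trailing part dropped), not the original expression.

```python
import re

text = "In the Kerala High Court Jurisdiction"  # hypothetical input

# {3,} is greedy, so it takes as many words as it can from the first position.
at_least_three = re.search(r"((?:\S+\s+){3,})JURISDICTION\b", text, re.I)
# {3} takes exactly three words, so the match starts at "Kerala".
exactly_three = re.search(r"((?:\S+\s+){3})JURISDICTION\b", text, re.I)

print(at_least_three.group(1).strip())  # In the Kerala High Court
print(exactly_three.group(1).strip())   # Kerala High Court
```
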

To extract 3 words before the first occurrence, you can use re.search, and use a capture group instead of a lookahead.

(\S+(?:\s+\S+){2})\s+JURISDICTION\b

The pattern matches:

  • ( Capture group 1
    • \S+ Match 1+ non-whitespace chars
    • (?:\s+\S+){2} Repeat 2 times matching 1+ whitespace chars and 1+ non-whitespace chars
  • ) Close group 1
  • \s+JURISDICTION\b Match 1+ whitespace chars, JURISDICTION followed by a word boundary


For example, using re.I for a case insensitive match:

import re

pattern = r"(\S+(?:\s+\S+){2})\s+JURISDICTION\b"
s = "Kerala High Court Jurisdiction"

m = re.search(pattern, s, re.I)

if m:
    print(m.group(1))

Output

Kerala High Court
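
If the known word is not fixed, the same pattern can be built at run time. This is a sketch only: the helper name `three_words_before` is hypothetical, `re.escape` is used so that any special characters in the word are matched literally, and the trailing \b assumes the word ends in a word character.

```python
import re


def three_words_before(text, word):
    """Return the three words before the first occurrence of word, or None."""
    # {{2}} renders as {2} inside the f-string.
    pattern = rf"(\S+(?:\s+\S+){{2}})\s+{re.escape(word)}\b"
    m = re.search(pattern, text, re.I)
    return m.group(1) if m else None


print(three_words_before("Kerala High Court Jurisdiction.", "jurisdiction"))
# Kerala High Court
print(three_words_before("In the Kerala High Court Jurisdiction", "Jurisdiction"))
# Kerala High Court
print(three_words_before("Kerala High Court", "Jurisdiction"))
# None
```
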