Python: re.search / re.split for phrases that look like titles, i.e. that start with an upper-case letter

Posted on 2025-01-08 02:24:16

I have a list of phrases (entered by the user) that I'd like to locate in a text file, for example:

    titles = ['Blue Team', 'Final Match', 'Best Player']
    text = 'In today Final match, The Best player is Joe from the Blue Team and the second best player is Jack from the Red team.'

1./ I can find all the occurrences of these phrases like so:

    import re
    titre = re.compile(r'(?P<title>%s)' % '|'.join(titles), re.M)
    found = [t for t in titre.split(text) if titre.search(t)]

(For simplicity, I am assuming a perfect spacing.)

2./ Using re.I, I can also find variants of these phrases, e.g. 'Blue team', 'final Match', 'best player', if they ever appear in the text.

But I want to restrict the search to variants of the input phrases whose first letter is upper-cased, e.g. 'Blue team' in the text, regardless of how they were entered as input, e.g. 'bluE tEAm'.

Is it possible to write something that "blocks" the re.I flag for a portion of a phrase? In pseudo code, I imagine generating something like '[B]lue Team|[F]inal Match'.

Note: My primary goal is not, for example, to calculate the frequency of the input phrases in the text, but to extract and analyze the text fragments between or around them.

Comments (3)

悲欢浪云 2025-01-15 02:24:16

I would use re.I and change the list comprehension to:

    l = [t for t in titre.split(text) if titre.search(t) and t[0].isupper()]
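
A minimal, self-contained sketch of this suggestion, using the sample data from the question. The name `found` and the call to `re.escape` are additions for illustration and not part of the answer as posted. Note that the check only looks at the first character of the whole matched phrase, not at the first letter of each word.

```python
import re

titles = ['Blue Team', 'Final Match', 'Best Player']
text = ('In today Final match, The Best player is Joe from the Blue Team '
        'and the second best player is Jack from the Red team.')

# Case-insensitive alternation of the (escaped) input phrases.
titre = re.compile(r'(?P<title>%s)' % '|'.join(re.escape(t) for t in titles),
                   re.I)

# Because the pattern has a capturing group, split() returns the matched
# phrases as well as the fragments between them. Keep only the pieces that
# are phrases and whose first character is upper case.
found = [t for t in titre.split(text) if titre.search(t) and t[0].isupper()]

print(found)
```

With this sample text, the lower-case 'best player' is filtered out, while 'Final match', 'Best player' and 'Blue Team' are kept.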

来日方长 2025-01-15 02:24:16

I think regular expressions won't let you specify just a region where the ignore-case flag applies. However, you can generate a new version of the text in which every character has been lower-cased except the first one of each word:

    new_text = ' '.join([word[0] + word[1:].lower() for word in text.split()])

This way, a regular expression without the ignore-case flag will match while taking into account the case of only the first character of each word.
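
A small sketch of what this normalization does. The sample text here is hypothetical (it is not from the question) and was chosen only to make the effect visible.

```python
import re

# Hypothetical sample text, chosen to show the effect of the normalization.
text = 'the BLUE TEAM beat the bLUE tEAM'

# Keep the first character of every word as-is and lower-case the rest.
new_text = ' '.join(word[0] + word[1:].lower() for word in text.split())

# A case-sensitive pattern now depends only on the case of each word's
# first letter in the original text.
matches = re.findall(r'Blue Team', new_text)

print(new_text)
print(matches)
```

Two caveats under these assumptions: the phrases in the pattern have to be written in the same normalized form, so 'Blue Team' would not match a 'Blue team' in the text, and `' '.join(text.split())` collapses runs of whitespace, so the fragments extracted from `new_text` may differ slightly from the original text.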

撩心不撩汉 2025-01-15 02:24:16

How about modifying the input so that it is in the correct case before you use it in the regular expression?
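
One way this could look, as a sketch rather than a definitive implementation: upper-case the first character of each input phrase and let only the remainder match case-insensitively. This assumes Python 3.6 or later, where the `re` module supports scoped inline flags of the form `(?i:...)`. The helper name `build_pattern` is hypothetical.

```python
import re

def build_pattern(titles):
    """Compile a pattern in which each phrase must start with an upper-case
    letter, while the rest of the phrase is matched case-insensitively."""
    alternatives = []
    for title in titles:
        title = title.strip()
        if not title:
            continue  # skip empty input
        alt = re.escape(title[0].upper())
        if title[1:]:
            # (?i:...) turns on IGNORECASE for this group only (Python 3.6+).
            alt += '(?i:' + re.escape(title[1:]) + ')'
        alternatives.append(alt)
    # One capturing group, so split() also returns the matched phrases.
    return re.compile('(' + '|'.join(alternatives) + ')')

titles = ['bluE tEAm', 'final match', 'BEST PLAYER']
text = ('In today Final match, The Best player is Joe from the Blue Team '
        'and the second best player is Jack from the Red team.')

titre = build_pattern(titles)
found = titre.findall(text)   # the matched phrases
parts = titre.split(text)     # phrases and the fragments around them

print(found)
print(parts)
```

Since no global re.I flag is involved, the lower-case 'best player' is left inside its surrounding fragment, and `parts` alternates between text fragments and matched phrases, which fits the goal of analyzing the text between or around them.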
