Python: re.search / re.split for title-like phrases, i.e. phrases starting with an upper-case letter
I have a list of phrases (entered by the user) that I'd like to locate in a text file, for example:
titles = ['Blue Team', 'Final Match', 'Best Player',]
text = 'In today Final match, The Best player is Joe from the Blue Team and the second best player is Jack from the Red team.'
1./ I can find all the occurrences of these phrases like so:
import re
titre = re.compile(r'(?P<title>%s)' % '|'.join(titles), re.M)
found = [t for t in titre.split(text) if titre.search(t)]  # renamed from "list" to avoid shadowing the built-in
(For simplicity, I am assuming perfect spacing.)
2./ I can also find variants of these phrases, e.g. 'Blue team', 'final Match', 'best player' ..., using re.I, if they ever appear in the text.
But I want to restrict the search to variants of the input phrases whose first letter is upper-cased, e.g. 'Blue team' in the text, regardless of how they were entered as input, e.g. 'bluE tEAm'.
Is it possible to write something that "blocks" the re.I flag for a portion of a phrase? In pseudo-code, I imagine generating something like '[B]lue Team|[F]inal Match'.
Note: My primary goal is not, for example, to calculate the frequency of the input phrases in the text, but to extract and analyze the text fragments between or around them.
Comments (3)
I would use re.I and modify the list-comp to:
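The list comprehension from this answer did not survive on the page. As a rough sketch of what such a filter could look like (the first-character check and the use of re.escape and fullmatch are my assumptions, not the answerer's actual code), one could match case-insensitively and then keep only the hits that start with an upper-case letter:

```python
import re

titles = ['Blue Team', 'Final Match', 'Best Player']
text = ('In today Final match, The Best player is Joe from the Blue Team '
        'and the second best player is Jack from the Red team.')

# Match the phrases case-insensitively; re.escape guards against
# regex metacharacters in user-supplied phrases.
titre = re.compile(r'(?P<title>%s)' % '|'.join(map(re.escape, titles)), re.I)

# split() also returns the captured phrases. Keep only the pieces that
# are a phrase AND begin with an upper-case letter (hypothetical filter).
found = [t for t in titre.split(text)
         if titre.fullmatch(t) and t[0].isupper()]

print(found)  # ['Final match', 'Best player', 'Blue Team']
```

The lower-case 'best player' is still matched by the regex, so it still acts as a separator in split(); it is only dropped from the filtered list.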
I don't think regular expressions let you restrict the ignore-case flag to just one region of the pattern. However, you can generate a new version of the text in which every character is lower-cased except the first one of each word:
This way, a regular expression without the ignore flag will match taking into account the casing only for the first character of each word.
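The code for this normalization is not shown on the page either. A minimal sketch of the idea, assuming words are separated by whitespace (the helper name keep_initial_case is made up for illustration), might be:

```python
import re

def keep_initial_case(s):
    """Lower-case every character of each word except its first one."""
    return re.sub(r'\S+',
                  lambda m: m.group()[0] + m.group()[1:].lower(),
                  s)

titles = ['Blue Team', 'Final Match', 'Best Player']
text = 'In today Final match, The Best player is Joe from the BLUE TEAM.'

normalized = keep_initial_case(text)

# No re.I here: only the first letter of each word is case-sensitive,
# because everything else has already been lower-cased on both sides.
titre = re.compile('|'.join(re.escape(keep_initial_case(t)) for t in titles))

# For ASCII text the normalized string has the same length as the
# original, so match positions can be used to slice the original text.
hits = [text[m.start():m.end()] for m in titre.finditer(normalized)]

print(normalized)
print(hits)  # ['BLUE TEAM']
```

Note that this checks the first letter of every word, so 'Final match' and 'Best player' are not matched against 'Final Match' and 'Best Player'. That is stricter than the question, which only cares about the first letter of the whole phrase.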
How about modifying the input so that it is in the correct case before you use it in the regular expression?
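One way to act on this suggestion, sketched under the assumption of Python 3.6 or later (where the re module accepts scoped inline flags such as (?i:...)), is to upper-case the first letter of each input phrase and make only the remainder case-insensitive. The helper name title_pattern is hypothetical, and phrases are assumed to be non-empty:

```python
import re

def title_pattern(phrase):
    """Pattern: first letter must be upper case, the rest in any case."""
    phrase = phrase.strip()
    head, tail = phrase[0].upper(), phrase[1:]
    # (?i:...) switches on IGNORECASE only inside the group.
    return re.escape(head) + '(?i:' + re.escape(tail) + ')'

user_input = ['bluE tEAm', 'final match', 'BEST PLAYER']
titre = re.compile('(%s)' % '|'.join(title_pattern(p) for p in user_input))

text = ('In today Final match, The Best player is Joe from the Blue Team '
        'and the second best player is Jack from the Red team.')

matches = titre.findall(text)
fragments = titre.split(text)  # text between the phrases, plus the phrases

print(matches)  # ['Final match', 'Best player', 'Blue Team']
```

Because the lower-case 'best player' no longer matches at all, it stays inside the surrounding fragment returned by split(), which seems closer to the stated goal of analyzing the text between or around the phrases.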