Regex: how to match a sequence of key-value pairs at the end of a string
I am trying to match key-value pairs that appear at the end of (long) strings. The strings look like this (I replaced each "\n" with a real line break):
my_str = "lots of blah
key1: val1-words
key2: val2-words
key3: val3-words"
so I expect matches "key1: val1-words", "key2: val2-words" and "key3: val3-words".
- The set of possible key names is known.
- Not all possible keys appear in every string.
- At least two keys appear in every string (if that makes it easier to match).
- val-words can be several words.
- key-value pairs should only be matched at the end of string.
- I am using Python re module.
I was thinking
re.compile('(?:tag1|tag2|tag3):')
plus some look-ahead assertion stuff would be a solution. I can't get it right, though. How do I do this?
Thank you.
/David
Real example string:
my_str = u'ucourt métrage pour kino session volume 18\nThème: O sombres héros\nContraintes: sous titrés\nAuthor: nicoalabdou\nTags: wakatanka productions court métrage kino session humour cantat bertrand noir désir sombres héros mer medine marie trintignant femme droit des femmes nicoalabdou pute soumise\nPosted: 06 June 2009\nRating: 1.3\nVotes: 3'
EDIT:
Based on Mikel's solution I am now using the following:
import re
my_tags = [r'\S+'] # gets all tags
my_tags = ['Tags', 'Author', 'Posted'] # alternative: selected tags only
regex = re.compile(r'''
\n # all key-value pairs are on separate lines
( # start group to return
(?:{0}): # placeholder for the tags to detect; \S+ means all tags
\s # the space between ':' and value
.* # the value
) # end group to return
'''.format('|'.join(my_tags)), re.VERBOSE)
regex.sub('', my_str) # return my_str without the matching key-value lines
regex.findall(my_str) # return the matched key-value lines
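To make the behaviour of the edited pattern concrete, here is a small run of it. This is only a sketch: the input is a shortened copy of the real example string above (several lines dropped to keep it readable), and the names `found` and `rest` are introduced here for illustration.

```python
import re

# Shortened copy of the question's example string (assumption: lines dropped for brevity).
my_str = ('ucourt métrage pour kino session volume 18\n'
          'Author: nicoalabdou\n'
          'Posted: 06 June 2009\n'
          'Rating: 1.3\n'
          'Votes: 3')

my_tags = ['Tags', 'Author', 'Posted']  # selected tags; 'Tags' is absent from this excerpt

regex = re.compile(r'''
    \n            # every key-value pair starts on its own line
    (             # start of the group that findall returns
    (?:{0}):      # one of the selected tags, then a colon
    \s            # the whitespace between the colon and the value
    .*            # the value, up to the end of the line
    )             # end of the returned group
    '''.format('|'.join(my_tags)), re.VERBOSE)

found = regex.findall(my_str)   # the matched key-value lines
rest = regex.sub('', my_str)    # my_str with those lines removed

print(found)
print(rest)
```

Note that this pattern matches a selected tag on any line after the first, so it does not by itself enforce the "only at the end of the string" requirement.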
Comments (1)
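The negative zero-width lookahead is (?!pattern). It's mentioned part-way down the re module documentation page:
(?!...) matches if ... doesn't match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by 'Asimov'.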
So you could use it to match any number of words after a key, but not a key, using something like (?!\S+:)\S+. And the complete code would look like this:
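As a minimal sketch of that idea (not necessarily the exact pattern the answer used), the following treats any run of non-space characters followed by a colon as a key, and any following word that is not itself a key as part of the value. The name `pair_re` is an assumption made for this example.

```python
import re

my_str = "lots of blah\nkey1: val1-words\nkey2: val2-words\nkey3: val3-words"

pair_re = re.compile(r'''
    \S+:              # a key: non-space characters ending in a colon
    (?:
        \s            # one whitespace character (space or newline)
        (?!\S+:)\S+   # a value word, i.e. a word that is not itself a key
    )+                # one or more value words
    ''', re.VERBOSE)

matches = pair_re.findall(my_str)
print(matches)
```

This sketch accepts any key name and does not check that the pairs sit at the end of the string. A value word that contains a colon (for example a time such as 12:30) would be mistaken for a key and would end the value early.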
Which gives
If you print the key/values using:
It will print:
Or using your updated example, it would print:
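A rough illustration of printing the pairs, assuming the same lookahead idea with the key and value captured separately. The group names `key` and `value` and the `key = value` output format are choices made for this sketch, not taken from the answer.

```python
import re

my_str = "lots of blah\nkey1: val1-words\nkey2: val2-words\nkey3: val3-words"

pair_re = re.compile(r'''
    (?P<key>\S+):             # the key, captured without its colon
    \s                        # whitespace between the colon and the value
    (?P<value>
        (?!\S+:)\S+           # first value word (must not look like a key)
        (?:\s(?!\S+:)\S+)*    # any further value words
    )
    ''', re.VERBOSE)

lines = []
for m in pair_re.finditer(my_str):
    lines.append('{0} = {1}'.format(m.group('key'), m.group('value')))

print('\n'.join(lines))
```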
You could turn each key/value pair into a dictionary, which would make it easier to look up only the keys (and values) you want.
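One possible way to build such a dictionary is to give the pattern two capturing groups, so that `findall` returns `(key, value)` tuples that `dict()` accepts directly. This is a sketch under assumptions: the input is a shortened copy of the question's real example string, and `pairs` and `wanted` are names introduced here.

```python
import re

# Shortened copy of the question's example string (assumption: lines dropped for brevity).
my_str = ('ucourt métrage pour kino session volume 18\n'
          'Author: nicoalabdou\n'
          'Posted: 06 June 2009\n'
          'Rating: 1.3\n'
          'Votes: 3')

pair_re = re.compile(r'''
    (\S+):                    # group 1: the key, without its colon
    \s
    (                         # group 2: the value
        (?!\S+:)\S+           # first value word (must not look like a key)
        (?:\s(?!\S+:)\S+)*    # any further value words
    )
    ''', re.VERBOSE)

# With two groups, findall yields (key, value) tuples.
pairs = dict(pair_re.findall(my_str))

# Keep only the keys of interest, skipping any that are absent.
wanted = {k: pairs[k] for k in ('Tags', 'Author', 'Posted') if k in pairs}
print(wanted)
```

If the same key appeared twice, the later value would overwrite the earlier one.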