String mask and offset using regular expressions
I have a string on which I am trying to create a regex mask that will show N
words, given an offset. Let's say I have the following string:
"The quick, brown fox jumps over the lazy dog."
I want to show 3 words at a time:
offset 0: "The quick, brown"
offset 1: "quick, brown fox"
offset 2: "brown fox jumps"
offset 3: "fox jumps over"
offset 4: "jumps over the"
offset 5: "over the lazy"
offset 6: "the lazy dog."
I'm using Python and I've been using the following simple regex to detect 3 words:
>>> import re
>>> s = "The quick, brown fox jumps over the lazy dog."
>>> re.search(r'(\w+\W*){3}', s).group()
'The quick, brown '
But I can't figure out how to have a kind of mask to show the next 3 words and not the beginning ones. I need to keep punctuation.
The prefix-matching option
You can make this work with a variable-prefix regex that skips the first offset words and then captures the word triplet into a group.
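A minimal sketch of the variable-prefix idea, assuming Python 3 and the \w+\W* word pattern from the question. The helper name words_at and its parameters are hypothetical, chosen only for illustration:

```python
import re

s = "The quick, brown fox jumps over the lazy dog."

def words_at(text, offset, n=3):
    """Return n words starting at word index `offset`, or None if no match.

    Hypothetical helper: the prefix group skips `offset` words without
    capturing them, and group 1 captures the next `n` words.
    """
    pattern = r'(?:\w+\W*){%d}((?:\w+\W*){%d})' % (offset, n)
    match = re.search(pattern, text)
    return match.group(1) if match else None

print(repr(words_at(s, 0)))  # 'The quick, brown '
print(repr(words_at(s, 2)))  # 'brown fox jumps '
print(repr(words_at(s, 6)))  # 'the lazy dog.'
```

Each result keeps the separator that the final \W* consumed, so call .rstrip() on it if the trailing space is unwanted.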
Let's take a look at the pattern for an offset of 2:
(?:\w+\W*){2}((?:\w+\W*){3})
This does what it says: match 2 words, then, capturing into group 1, match 3 words. The (?:...) constructs are used to group the repetition, but they're non-capturing.
Note on "word" pattern
It should be said that \w+\W* is a poor choice for a "word" pattern. A string with fewer than 3 words can still match (\w+\W*){3}, because \W* allows for an empty string match, which lets the regex split a single word into several "words".
Perhaps a better pattern is a \w+ that is followed by either a \W+ or the end of the string $:
\w+(?:\W+|$)
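To illustrate the difference, here is a small sketch (Python 3; the variable names are illustrative only) that runs both word patterns against a one-word string:

```python
import re

# "Loose" word: word characters followed by zero or more non-word characters.
loose = r'(?:\w+\W*){3}'
# "Strict" word: word characters followed by a real separator or end of string.
strict = r'(?:\w+(?:\W+|$)){3}'

single = "nothing"
full = "The quick, brown fox"

# The loose pattern backtracks and splits one word into three pieces.
print(re.search(loose, single).group())        # nothing
# The strict pattern refuses to match, since there is only one word.
print(re.search(strict, single))               # None
# On real input, the strict pattern still finds three words.
print(repr(re.search(strict, full).group()))   # 'The quick, brown '
```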
The capturing lookahead option
As suggested by Kobi in a comment, this option is simpler in that you only have one static pattern. It uses findall to capture all matches (see on ideone.com).
It works by matching at each zero-width word boundary \b and using a lookahead to capture 3 "words" in group 1.
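A sketch of what the lookahead approach might look like in Python 3. The exact pattern is not shown here, so this one is an assumption: it combines \b, a capturing lookahead, and the stricter word pattern \w+(?:\W+|$).

```python
import re

s = "The quick, brown fox jumps over the lazy dog."

# \b anchors each attempt at a word boundary. The lookahead consumes nothing,
# so overlapping triplets are all found; group 1 holds the captured words.
# Assumed pattern, reconstructed from the description above.
triplets = re.findall(r'\b(?=((?:\w+(?:\W+|$)){3}))', s)

for offset, triplet in enumerate(triplets):
    print(offset, repr(triplet))
```

Because findall returns a list, the triplet for a given offset is simply triplets[offset]. Boundaries at the end of a word produce no match, since the lookahead needs a word character immediately after the boundary.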
One slant would be to split the string on whitespace and select slices of consecutive words.
This does, of course, assume that you either have only single spaces between words, or don't care if all whitespace sequences are folded into single spaces.
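The split-and-slice idea can be sketched as follows (Python 3; the function name window is hypothetical):

```python
s = "The quick, brown fox jumps over the lazy dog."

def window(text, offset, n=3):
    """Return n whitespace-separated words starting at word index `offset`."""
    words = text.split()  # splits on any run of whitespace
    return " ".join(words[offset:offset + n])

print(window(s, 0))  # The quick, brown
print(window(s, 6))  # the lazy dog.
```

Punctuation stays attached to its word because only whitespace is used as a separator.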
No need for regex
We have two orthogonal issues here:
1. How to split the string into words.
2. How to build groups of 3 consecutive words.
For 1, you could use regular expressions or, as others have pointed out, a simple str.split should suffice. For 2, note that what you want looks very similar to the pairwise abstraction in itertools's recipes: http://docs.python.org/library/itertools.html#recipes
So we write a generalized n-wise function, and we end up with simple, modularized code that produces the output you requested.
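A possible n-wise helper, modeled on the pairwise recipe from the itertools documentation. The name nwise and the use of str.split for tokenizing are assumptions, not necessarily the answer's original code:

```python
from itertools import islice, tee

def nwise(iterable, n=2):
    """Yield overlapping n-tuples: (x0, ..., x[n-1]), (x1, ..., x[n]), ..."""
    iterators = tee(iterable, n)
    # Advance the i-th copy by i items so the copies are staggered.
    shifted = (islice(it, i, None) for i, it in enumerate(iterators))
    return zip(*shifted)

s = "The quick, brown fox jumps over the lazy dog."

# Issue 1: split into words. Issue 2: group consecutive words.
groups = [" ".join(words) for words in nwise(s.split(), 3)]

for offset, group in enumerate(groups):
    print(offset, group)
```

Keeping the tokenizing step separate from the grouping step means either one can be swapped out, for example replacing s.split() with a regex-based tokenizer, without touching the other.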