String mask and offset using regular expressions

Posted on 2024-09-10 00:48:14

I have a string on which I try to create a regex mask that will show N number of words, given an offset. Let's say I have the following string:

"The quick, brown fox jumps over the lazy dog."

I want to show 3 words at a time:

offset 0: "The quick, brown"
offset 1: "quick, brown fox"
offset 2: "brown fox jumps"
offset 3: "fox jumps over"
offset 4: "jumps over the"
offset 5: "over the lazy"
offset 6: "the lazy dog."

I'm using Python and I've been using the following simple regex to detect 3 words:

>>> import re
>>> s = "The quick, brown fox jumps over the lazy dog."
>>> re.search(r'(\w+\W*){3}', s).group()
'The quick, brown '

But I can't figure out how to have a kind of mask to show the next 3 words and not the beginning ones. I need to keep punctuation.

Comments (4)

只想待在家 2024-09-17 00:48:14

The prefix-matching option

You can make this work by having a variable-prefix regex to skip the first offset words, and capturing the word triplet into a group.

So something like this:

import re
s = "The quick, brown fox jumps over the lazy dog."

print(re.search(r'(?:\w+\W*){0}((?:\w+\W*){3})', s).group(1))
# -> "The quick, brown " (note the trailing space)
print(re.search(r'(?:\w+\W*){1}((?:\w+\W*){3})', s).group(1))
# -> "quick, brown fox "
print(re.search(r'(?:\w+\W*){2}((?:\w+\W*){3})', s).group(1))
# -> "brown fox jumps "

Let's take a look at the pattern:

 _"word"_      _"word"_
/        \    /        \
(?:\w+\W*){2}((?:\w+\W*){3})
             \_____________/
                group 1

This does what it says: skip 2 words, then match 3 words and capture them into group 1.

The (?:...) constructs are used for grouping for the repetition, but they're non-capturing.
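
To avoid writing the pattern out by hand for every offset, the prefix count can be filled in programmatically. This is only a sketch: the helper name regex_window and its parameters are made up for illustration, and it keeps the simple \w+\W* "word" pattern from above, so it is only meaningful for offsets that leave at least n words.

```python
import re

WORD = r"\w+\W*"  # the simple "word" pattern used in the examples above

def regex_window(text, offset, n=3):
    """Skip `offset` words, then return the next `n` words (hypothetical helper)."""
    pattern = r"(?:%s){%d}((?:%s){%d})" % (WORD, offset, WORD, n)
    match = re.search(pattern, text)
    return match.group(1) if match else None

s = "The quick, brown fox jumps over the lazy dog."
for offset in range(7):
    # repr() makes the trailing whitespace of each window visible
    print(offset, repr(regex_window(s, offset)))
```

The trailing separator is part of each result; call .rstrip() on it if only the words and inner punctuation are wanted.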


Note on "word" pattern

It should be said that \w+\W* is a poor choice for a "word" pattern, as exhibited by the following example:

import re
s = "nothing"
print(re.search(r'(\w+\W*){3}', s).group())
# nothing

There aren't 3 words here, but the regex matches anyway: \W* can match the empty string, so the single word gets split into three "words".

Perhaps a better pattern is something like:

\w+(?:\W+|$)

That is, a \w+ that is followed by either a \W+ or the end of the string $.
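
A quick way to see the difference is to run both patterns against the one-word string. This is just a small check; the variable names loose and strict are mine.

```python
import re

s = "nothing"
loose = r"(?:\w+\W*){3}"         # \W* may match the empty string
strict = r"(?:\w+(?:\W+|$)){3}"  # each word must be followed by \W+ or end of string

loose_match = re.search(loose, s)
strict_match = re.search(strict, s)

print(loose_match.group())  # the single word, carved into three pieces
print(strict_match)         # no match: there is only one word
```

With the stricter pattern the search fails on a one-word string, so the caller can tell "not enough words" apart from a real triplet.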


The capturing lookahead option

As suggested by Kobi in a comment, this option is simpler in that you only have one static pattern. It uses findall to capture all matches (see on ideone.com):

import re
s = "The quick, brown fox jumps over the lazy dog."

triplets = re.findall(r"\b(?=((?:\w+(?:\W+|$)){3}))", s)

print(triplets)
# ['The quick, brown ', 'quick, brown fox ', 'brown fox jumps ',
#  'fox jumps over ', 'jumps over the ', 'over the lazy ', 'the lazy dog.']

print(triplets[3])
# -> "fox jumps over " (note the trailing space)

This works by matching at each zero-width word boundary \b and using a lookahead to capture the next 3 "words" in group 1. Because the lookahead consumes nothing, successive matches can overlap.

    ______lookahead______
   /      ___"word"__    \
  /      /           \    \
\b(?=((?:\w+(?:\W+|$)){3}))
     \___________________/
           group 1
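
To illustrate why the lookahead matters, here is a small comparison (a sketch; the variable names are mine) of the same "word" pattern used with and without it. Without the lookahead each match consumes its three words, so the next match can only start after them.

```python
import re

s = "The quick, brown fox jumps over the lazy dog."
word = r"\w+(?:\W+|$)"

# Consuming match: each triplet starts where the previous one ended.
consumed = re.findall(r"(?:%s){3}" % word, s)

# Zero-width lookahead: nothing is consumed, so a triplet is captured
# at the start of every word that still has two words after it.
overlapping = re.findall(r"\b(?=((?:%s){3}))" % word, s)

print(consumed)
print(len(overlapping))
```

The consuming version yields only three non-overlapping triplets, while the lookahead version yields one per valid offset, which is what the question asks for.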

如日中天 2024-09-17 00:48:14

One approach would be to split the string and select slices:

import re
s = "The quick, brown fox jumps over the lazy dog."

words = re.split(r"\s+", s)
for i in range(len(words) - 2):
    print(' '.join(words[i:i+3]))

This does, of course, assume that you either have only single spaces between words, or don't care if all whitespace sequences are folded into single spaces.
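
If the original spacing does matter, one possible variation (a sketch, not part of the approach above; slice_window is a made-up name) is to record where each whitespace-separated token starts and ends, and slice the original string between those positions instead of re-joining.

```python
import re

def slice_window(text, offset, n=3):
    """Return the substring covering tokens offset .. offset+n-1, spacing intact."""
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    chosen = spans[offset:offset + n]
    if len(chosen) < n:
        return None  # fewer than n tokens left at this offset
    start, end = chosen[0][0], chosen[-1][1]
    return text[start:end]

s = "The  quick,\tbrown fox"
print(repr(slice_window(s, 0)))  # double space and tab are preserved
print(repr(slice_window(s, 1)))
print(slice_window(s, 2))        # None: only two tokens remain
```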

jJeQQOZ5 2024-09-17 00:48:14

No need for a regex:

>>> s = "The quick, brown fox jumps over the lazy dog."
>>> for offset in range(7):
...     print('offset {0}: "{1}"'.format(offset, ' '.join(s.split()[offset:offset + 3])))
... 
offset 0: "The quick, brown"
offset 1: "quick, brown fox"
offset 2: "brown fox jumps"
offset 3: "fox jumps over"
offset 4: "jumps over the"
offset 5: "over the lazy"
offset 6: "the lazy dog."

简美 2024-09-17 00:48:14

We have two orthogonal issues here:

  1. How to split the string.
  2. How to build groups of 3 consecutive elements.

For 1 you could use regular expressions or, as others have pointed out, a simple str.split should suffice. For 2, note that what you want looks very similar to the pairwise abstraction in the itertools recipes:

http://docs.python.org/library/itertools.html#recipes

So we write our generalized n-wise function:

import itertools

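def nwise(iterable, n):
    """nwise(iter([1,2,3,4,5]), 3) -> (1,2,3), (2,3,4), (3,4,5)"""
    # One independent copy of the iterator per position in the window.
    iterables = itertools.tee(iterable, n)
    # Advance the idx-th copy by idx items so the copies are staggered.
    slices = (itertools.islice(it, idx, None) for (idx, it) in enumerate(iterables))
    # zip is lazy in Python 3 (use itertools.izip in Python 2).
    return zip(*slices)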

And we end up with simple, modular code:

>>> s = "The quick, brown fox jumps over the lazy dog."
>>> list(nwise(s.split(), 3))
[('The', 'quick,', 'brown'), ('quick,', 'brown', 'fox'), ('brown', 'fox', 'jumps'), ('fox', 'jumps', 'over'), ('jumps', 'over', 'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog.')]

Or, as you requested:

>>> # also: list(map(" ".join, nwise(s.split(), 3)))
>>> [" ".join(words) for words in nwise(s.split(), 3)]
['The quick, brown', 'quick, brown fox', 'brown fox jumps', 'fox jumps over', 'jumps over the', 'over the lazy', 'the lazy dog.']
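
For comparison, the same windows can be produced without tee by keeping the last n items in a bounded deque. This is only an alternative sketch (the name sliding is mine), not a replacement for nwise.

```python
from collections import deque

def sliding(iterable, n):
    """Yield successive overlapping n-tuples from any iterable."""
    window = deque(maxlen=n)  # the oldest item drops out automatically
    for item in iterable:
        window.append(item)
        if len(window) == n:
            yield tuple(window)

s = "The quick, brown fox jumps over the lazy dog."
phrases = [" ".join(words) for words in sliding(s.split(), 3)]
print(phrases[0])
print(phrases[-1])
```

Both versions work on arbitrary iterables; this one holds a single window of n items rather than n staggered iterators.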