String mask and offset using a regular expression

Posted on 2024-09-10 00:48:14

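I have a string for which I am trying to build a regex mask that shows N words, given an offset. Say I have the following string:

"The quick, brown fox jumps over the lazy dog."

I want to show 3 words at a time:

offset 0: "The quick, brown"
offset 1: "quick, brown fox"
offset 2: "brown fox jumps"
offset 3: "fox jumps over"
offset 4: "jumps over the"
offset 5: "over the lazy"
offset 6: "the lazy dog."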

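I'm using Python, and I have been using the following simple regex to detect 3 words:

>>> import re
>>> s = "The quick, brown fox jumps over the lazy dog."
>>> re.search(r'(\w+\W*){3}', s).group()
'The quick, brown '

But I can't figure out how to build a kind of mask that shows the next 3 words rather than the first ones. I need to keep the punctuation.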


只想待在家 2024-09-17 00:48:14

The prefix-matching option

You can make this work by having a variable-prefix regex to skip the first offset words, and capturing the word triplet into a group.

So something like this:

import re
s = "The quick, brown fox jumps over the lazy dog."

print(re.search(r'(?:\w+\W*){0}((?:\w+\W*){3})', s).group(1))
# The quick, brown 
print(re.search(r'(?:\w+\W*){1}((?:\w+\W*){3})', s).group(1))
# quick, brown fox 
print(re.search(r'(?:\w+\W*){2}((?:\w+\W*){3})', s).group(1))
# brown fox jumps 

Let's take a look at the pattern:

 _"word"_      _"word"_
/        \    /        \
(?:\w+\W*){2}((?:\w+\W*){3})
             \_____________/
                group 1

This does what it says: it matches 2 words without capturing them, then matches 3 more words and captures them into group 1.

The (?:...) constructs are used for grouping for the repetition, but they're non-capturing.
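
Since only the repetition count of the prefix changes with the offset, the pattern can be built on the fly. The following is a minimal sketch of that idea; the helper name word_window and its parameters are illustrative and not part of the original answer, and it inherits the loose \w+\W* word pattern used above.

```python
import re

def word_window(text, offset, n=3):
    """Return n consecutive "words" of text after skipping `offset` words, or None."""
    # Same shape as the patterns above: a non-capturing prefix repeated
    # `offset` times, followed by a captured group of n words.
    pattern = r'(?:\w+\W*){%d}((?:\w+\W*){%d})' % (offset, n)
    match = re.search(pattern, text)
    return match.group(1) if match else None

s = "The quick, brown fox jumps over the lazy dog."
for offset in range(3):
    print(repr(word_window(s, offset)))
```

Using repr makes the trailing separator that \W* picks up visible in the output.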

References

regular-expressions.info/Capturing Groups, Non-capturing Groups
- Repeating a Capturing Group vs Capturing a Repeated Group

Note on "word" pattern

It should be said that \w+\W* is a poor choice for a "word" pattern, as exhibited by the following example:

import re
s = "nothing"
print(re.search(r'(\w+\W*){3}', s).group())
# nothing

The string does not contain 3 words, but the regex matches anyway: because \W* can match the empty string, the engine backtracks and splits "nothing" into three shorter "words".

Perhaps a better pattern is something like:

\w+(?:\W+|$)

That is, a \w+ that is followed by either a \W+ or the end of the string $.
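
To see the difference between the two word patterns, here is a small comparison sketch. The names LOOSE and STRICT are just labels chosen for this example.

```python
import re

LOOSE = r'(?:\w+\W*){3}'          # \W* may match nothing, so one word can be split
STRICT = r'(?:\w+(?:\W+|$)){3}'   # each word must end at a separator or at the end

for text in ("nothing", "one two three"):
    loose = re.search(LOOSE, text)
    strict = re.search(STRICT, text)
    print(text, '->',
          loose.group() if loose else None, '|',
          strict.group() if strict else None)
```

With the stricter pattern, the single word "nothing" no longer counts as three words, while a genuine three-word string still matches.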

The capturing lookahead option

As suggested by Kobi in a comment, this option is simpler in that you only have one static pattern. It uses findall to capture all matches (see on ideone.com):

import re
s = "The quick, brown fox jumps over the lazy dog."

triplets = re.findall(r"\b(?=((?:\w+(?:\W+|$)){3}))", s)

print(triplets)
# ['The quick, brown ', 'quick, brown fox ', 'brown fox jumps ',
#  'fox jumps over ', 'jumps over the ', 'over the lazy ', 'the lazy dog.']

print(triplets[3])
# fox jumps over 

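This works by matching at each zero-width word boundary \b and using a lookahead to capture 3 "words" into group 1. Because the lookahead consumes no characters, the captured triplets are allowed to overlap. At a boundary that ends a word, the lookahead fails (the next character is not \w), so only word starts produce a match.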

    ______lookahead______
   /      ___"word"__    \
  /      /           \    \
\b(?=((?:\w+(?:\W+|$)){3}))
     \___________________/
           group 1
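
The same lookahead pattern can be parameterized by the window size. This is only a sketch under the assumptions already made above (the \w+(?:\W+|$) word pattern); the function name word_windows is made up for illustration.

```python
import re

def word_windows(text, n=3):
    """Return every run of n consecutive "words" in text, overlapping."""
    # The lookahead consumes nothing, so findall can report overlapping runs.
    pattern = r'\b(?=((?:\w+(?:\W+|$)){%d}))' % n
    return re.findall(pattern, text)

s = "The quick, brown fox jumps over the lazy dog."
windows = word_windows(s)
print(windows[6])                 # the window at offset 6
print(len(word_windows(s, n=2)))  # how many two-word windows exist
```

Indexing the returned list by offset then gives the requested mask directly.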

References

regular-expressions.info/Lookarounds


如日中天 2024-09-17 00:48:14

One approach is to split the string and select slices:

import re

s = "The quick, brown fox jumps over the lazy dog."
words = re.split(r"\s+", s)
for i in range(len(words) - 2):
    print(' '.join(words[i:i+3]))

This does, of course, assume that you either have only single spaces between words, or don't care if all whitespace sequences are folded into single spaces.
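
If the original whitespace does matter, one possible variation is to record where each token starts and ends and slice the original string instead of re-joining the pieces. The sketch below assumes tokens are runs of non-whitespace (what split would produce); the function name is hypothetical.

```python
import re

def windows_preserving_whitespace(text, n=3):
    """Yield n-token windows cut from text itself, so separators stay as written."""
    # Each token is a maximal run of non-whitespace characters.
    spans = [m.span() for m in re.finditer(r'\S+', text)]
    for i in range(len(spans) - n + 1):
        start = spans[i][0]          # start of the first token in the window
        end = spans[i + n - 1][1]    # end of the last token in the window
        yield text[start:end]

s = "The  quick,\tbrown fox"
print(list(windows_preserving_whitespace(s)))
```

Here the double space and the tab between tokens survive in the output rather than being folded into single spaces.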


jJeQQOZ5 2024-09-17 00:48:14

No need for regex

>>> s = "The quick, brown fox jumps over the lazy dog."
>>> for offset in range(7):
...     print('offset {0}: "{1}"'.format(offset, ' '.join(s.split()[offset:offset + 3])))
... 
offset 0: "The quick, brown"
offset 1: "quick, brown fox"
offset 2: "brown fox jumps"
offset 3: "fox jumps over"
offset 4: "jumps over the"
offset 5: "over the lazy"
offset 6: "the lazy dog."


简美 2024-09-17 00:48:14

We have two orthogonal issues here:

1. How to split the string.
2. How to build groups of 3 consecutive elements.

For 1, you could use regular expressions or, as others have pointed out, a simple str.split should suffice. For 2, note that what you want looks very similar to the pairwise abstraction in the itertools recipes:

http://docs.python.org/library/itertools.html#recipes

So we write our generalized n-wise function:

import itertools

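def nwise(iterable, n):
    """nwise(iter([1,2,3,4,5]), 3) -> (1,2,3), (2,3,4), (3,4,5)"""
    # Make n independent copies of the iterator, then advance copy k by k items.
    iterables = itertools.tee(iterable, n)
    slices = (itertools.islice(it, idx, None) for (idx, it) in enumerate(iterables))
    # zip is lazy in Python 3; on Python 2 use itertools.izip instead.
    return zip(*slices)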

And we end up with simple, modular code:

>>> s = "The quick, brown fox jumps over the lazy dog."
>>> list(nwise(s.split(), 3))
[('The', 'quick,', 'brown'), ('quick,', 'brown', 'fox'), ('brown', 'fox', 'jumps'), ('fox', 'jumps', 'over'), ('jumps', 'over', 'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog.')]

Or as you requested:

>>> # also: list(map(" ".join, nwise(s.split(), 3)))
>>> [" ".join(words) for words in nwise(s.split(), 3)]
['The quick, brown', 'quick, brown fox', 'brown fox jumps', 'fox jumps over', 'jumps over the', 'over the lazy', 'the lazy dog.']
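
A common alternative to the tee/islice approach keeps the current window in a bounded deque, similar in spirit to the sliding-window recipe in newer versions of the itertools documentation. The sketch below is one way to write it and is not the code from this answer; the name sliding_window is chosen for illustration.

```python
from collections import deque
from itertools import islice

def sliding_window(iterable, n):
    """Yield overlapping tuples of n consecutive items from iterable."""
    it = iter(iterable)
    # Pre-fill with the first n - 1 items; maxlen drops the oldest on append.
    window = deque(islice(it, n - 1), maxlen=n)
    for item in it:
        window.append(item)
        yield tuple(window)

s = "The quick, brown fox jumps over the lazy dog."
print([" ".join(words) for words in sliding_window(s.split(), 3)])
```

It makes a single pass over the input and yields nothing when there are fewer than n items.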
