How do I match the first word after an expression with a regex?

Posted on 2024-07-13 10:47:13

For example, in this text:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc eu tellus vel nunc pretium lacinia. Proin sed lorem. Cras sed ipsum. Nunc a libero quis risus sollicitudin imperdiet.

I want to match the word after "ipsum".

毁我热情 2024-07-20 10:47:13

This sounds like a job for lookbehinds, though you should be aware that not all regex flavors support them. In your example:

(?<=\bipsum\s)(\w+)

This will match any sequence of word characters (letters, digits, or underscores) that follows "ipsum" as a whole word plus a single whitespace character. It does not match "ipsum" itself, so you don't need to worry about reinserting it when doing, e.g., replacements.

As I said, though, some flavors don't support lookbehind at all (JavaScript, for example, had none before ECMAScript 2018). Many others (most, in fact) only support "fixed width" lookbehinds, so you could use this example but not any of the repetition operators. (In other words, (?<=\b\w+\s+)(\w+) wouldn't work in those flavors.)
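To make the "no need to reinsert" point concrete, here is a minimal sketch in Java (chosen only because another answer in this thread uses it). The class name, helper methods, and shortened sample text are illustrative assumptions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helpers are illustrative only.
class LookbehindDemo {
    static final String TEXT = "Lorem ipsum dolor sit amet, consectetur "
            + "adipiscing elit. Cras sed ipsum. Nunc a libero.";

    // Fixed-width lookbehind: "ipsum" as a whole word plus exactly one whitespace character.
    static final Pattern AFTER_IPSUM = Pattern.compile("(?<=\\bipsum\\s)(\\w+)");

    // Collects every word that directly follows "ipsum" and one whitespace character.
    static List<String> wordsAfterIpsum(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = AFTER_IPSUM.matcher(text);
        while (m.find()) {
            words.add(m.group()); // the match itself never includes "ipsum"
        }
        return words;
    }

    // Replaces only the following word; "ipsum" is kept because the lookbehind consumes nothing.
    static String replaceWordAfterIpsum(String text, String replacement) {
        return AFTER_IPSUM.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(wordsAfterIpsum(TEXT));
        System.out.println(replaceWordAfterIpsum(TEXT, "DOLOR"));
    }
}
```

With this pattern only "dolor" is found: the second "ipsum" is followed by a period rather than whitespace, so the lookbehind fails there.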


杯别 2024-07-20 10:47:13

Some of the other responders have suggested using a regex that doesn't depend on lookbehinds, but I think a complete, working example is needed to get the point across. The idea is that you match the whole sequence ("ipsum" plus the next word) in the normal way, then use a capturing group to isolate the part that interests you. For example:

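import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstWordAfter
{
  public static void main(String[] args)
  {
    String s = "Lorem ipsum dolor sit amet, consectetur " +
        "adipiscing elit. Nunc eu tellus vel nunc pretium " +
        "lacinia. Proin sed lorem. Cras sed ipsum. Nunc " +
        "a libero quis risus sollicitudin imperdiet.";

    // Match "ipsum", skip any non-word characters, capture the next word in group 1.
    Pattern p = Pattern.compile("ipsum\\W+(\\w+)");
    Matcher m = p.matcher(s);
    while (m.find())
    {
      System.out.println(m.group(1));
    }
  }
}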

Note that this prints both "dolor" and "Nunc". To do that with the lookbehind version, you would have to do something hackish like:

Pattern p = Pattern.compile("(?<=ipsum\\W{1,2})(\\w+)");

That's in Java, which requires the lookbehind to have an obvious maximum length. Some flavors don't have even that much flexibility, and of course, some don't support lookbehinds at all.
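As a rough sketch of why the bounded version is a hack, the following wraps that lookbehind in a small helper. The class name, method name, and sample strings are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class BoundedLookbehindDemo {
    // \W{1,2} caps the lookbehind at "ipsum" plus at most two non-word characters.
    static final Pattern AFTER_IPSUM = Pattern.compile("(?<=ipsum\\W{1,2})(\\w+)");

    // Collects every word preceded by "ipsum" and one or two non-word characters.
    static List<String> wordsAfterIpsum(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = AFTER_IPSUM.matcher(text);
        while (m.find()) {
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        // One separator (" ") and two separators (". ") both fit inside {1,2}.
        System.out.println(wordsAfterIpsum("Lorem ipsum dolor sit amet. Cras sed ipsum. Nunc a libero."));
        // Four separators ("... ") exceed the bound, so nothing is found.
        System.out.println(wordsAfterIpsum("ipsum... Nunc"));
    }
}
```

The upper bound has to be guessed in advance. Any run of separators longer than the bound is silently missed, which is the main argument for the capturing-group approach.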

However, the biggest problem people seem to be having in their examples is not with lookbehinds, but with word boundaries. Both David Kemp and ck seem to expect \b to match the space character following the 'm', but it doesn't; it matches the position (or boundary) between the 'm' and the space.

It's a common mistake, one I've even seen repeated in a few books and tutorials, but the word-boundary construct, \b, never matches any characters. It's a zero-width assertion, like lookarounds and anchors (^, $, \z, etc.), and what it matches is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.
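A small sketch can show the zero-width behavior. It assumes a pattern of the form ipsum\b(\w*), where \b is expected to swallow the space, and compares it with a variant that consumes the separator explicitly. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class WordBoundaryDemo {
    static final String TEXT = "Lorem ipsum dolor sit amet. Cras sed ipsum. Nunc a libero.";

    // Returns the contents of capturing group 1 for every match of the regex.
    static List<String> captures(String regex, String text) {
        List<String> groups = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            groups.add(m.group(1));
        }
        return groups;
    }

    public static void main(String[] args) {
        // \b consumes nothing, so \w* starts at the space or period and captures "".
        System.out.println(captures("ipsum\\b(\\w*)", TEXT));
        // Consuming the separators with \W+ lets the group reach the next word.
        System.out.println(captures("ipsum\\b\\W+(\\w+)", TEXT));
    }
}
```

The first call yields two empty strings, one per occurrence of "ipsum". The second yields "dolor" and "Nunc".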


摇划花蜜的午后 2024-07-20 10:47:13

ipsum\b(\w*)


烟酉 2024-07-20 10:47:13

(?<=\bipsum\s|\bipsum\.\s)(\w+)

/(?<=\bipsum\s|\bipsum\.\s)(\w+)/gm
Positive Lookbehind (?<=\bipsum\s|\bipsum\.\s)
Assert that the Regex below matches

1st Alternative \bipsum\s
\b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
ipsum matches the characters ipsum literally (case sensitive)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
2nd Alternative \bipsum\.\s
\b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
ipsum matches the characters ipsum literally (case sensitive)
\. matches the character . literally (case sensitive)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
1st Capturing Group (\w+)
\w+ matches any word character (equal to [a-zA-Z0-9_])
+ Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
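The breakdown above is written in /pattern/flags notation. As a hedged sketch, the same pattern can be tried in Java, which accepts alternatives of different fixed lengths inside a lookbehind. The class name, method name, and sample text are assumptions for illustration. A find() loop stands in for the g flag, and the m flag is omitted because the pattern contains no ^ or $.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class AlternationLookbehindDemo {
    // Two fixed-width alternatives: "ipsum" + whitespace, or "ipsum." + whitespace.
    static final Pattern AFTER_IPSUM =
            Pattern.compile("(?<=\\bipsum\\s|\\bipsum\\.\\s)(\\w+)");

    // Collects every word that follows "ipsum " or "ipsum. ".
    static List<String> wordsAfterIpsum(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = AFTER_IPSUM.matcher(text);
        while (m.find()) { // repeated find() plays the role of the g flag
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(wordsAfterIpsum(
                "Lorem ipsum dolor sit amet. Cras sed ipsum. Nunc a libero."));
    }
}
```

Each alternative covers exactly one separator shape, so other punctuation after "ipsum" (a comma, for instance) would need its own alternative.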
