How can I use a regular expression to match the first word after an expression?
For example, in this text:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc eu tellus vel nunc pretium lacinia. Proin sed lorem. Cras sed ipsum. Nunc a libero quis risus sollicitudin imperdiet.
I want to match the word after 'ipsum'.
This sounds like a job for lookbehinds, though you should be aware that not all regex flavors support them. In your example:
(?<=\bipsum\s)(\w+)
This will match any sequence of word characters that follows "ipsum" as a whole word followed by a single whitespace character. It does not match "ipsum" itself, so you don't need to worry about reinserting it in the case of, e.g., replacements.
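To make the behaviour concrete, here is a minimal Python sketch that applies a lookbehind of this shape to the question's text. The pattern `(?<=\bipsum\s)(\w+)` and the helper names are assumptions chosen for illustration.

```python
import re

TEXT = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "Nunc eu tellus vel nunc pretium lacinia. Proin sed lorem. "
    "Cras sed ipsum. Nunc a libero quis risus sollicitudin imperdiet."
)

# Lookbehind: require "ipsum" plus one whitespace character just before the word.
AFTER_IPSUM = re.compile(r"(?<=\bipsum\s)(\w+)")


def words_after_ipsum(text):
    """Return every word directly preceded by 'ipsum' and one whitespace character."""
    return AFTER_IPSUM.findall(text)


def replace_word_after_ipsum(text, replacement):
    """Replace only the following word; 'ipsum' is never part of the match."""
    return AFTER_IPSUM.sub(replacement, text)


print(words_after_ipsum(TEXT))                                  # ['dolor']
print(replace_word_after_ipsum("Lorem ipsum dolor sit", "X"))   # Lorem ipsum X sit
```

Only "dolor" is found: the second "ipsum" is followed by a period rather than whitespace, so the lookbehind does not succeed before "Nunc".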
As I said, though, some flavors don't support lookbehind at all (JavaScript, for example, did not until ES2018). Many others (most, in fact) only support "fixed width" lookbehinds, so you could use this example but not any of the repetition operators. (In other words,
(?<=\b\w+\s+)(\w+)
wouldn't work.)
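Python's `re` module is one of the fixed-width flavors, so it can serve as a quick check of this restriction. The sketch below uses a hypothetical helper, `compiles`, that only reports whether a pattern is accepted.

```python
import re


def compiles(pattern):
    """Report whether Python's `re` module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Fixed-width lookbehind: accepted.
print(compiles(r"(?<=\bipsum\s)(\w+)"))
# Lookbehind containing + quantifiers: rejected as not fixed-width.
print(compiles(r"(?<=\b\w+\s+)(\w+)"))
```

Other engines draw the line elsewhere, so the result applies to Python's `re` only.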
Some of the other responders have suggested using a regex that doesn't depend on lookbehinds, but I think a complete, working example is needed to get the point across. The idea is that you match the whole sequence ("ipsum" plus the next word) in the normal way, then use a capturing group to isolate the part that interests you. For example:
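A minimal sketch of the capture-group approach in Python; the pattern `ipsum\W+(\w+)` and the variable names are assumptions chosen for illustration rather than the responder's own code:

```python
import re

TEXT = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "Nunc eu tellus vel nunc pretium lacinia. Proin sed lorem. "
    "Cras sed ipsum. Nunc a libero quis risus sollicitudin imperdiet."
)

# Match "ipsum", skip one or more non-word characters, then capture the next word.
PATTERN = re.compile(r"ipsum\W+(\w+)")

# group(0) would be the whole sequence; group(1) isolates the word of interest.
following = [m.group(1) for m in PATTERN.finditer(TEXT)]

for word in following:
    print(word)
```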
Note that this prints both "dolor" and "Nunc". To do that with the lookbehind version, you would have to do something hackish like:
That's in Java, which requires the lookbehind to have an obvious maximum length. Some flavors don't have even that much flexibility, and of course, some don't support lookbehinds at all.
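Python's `re` module is one of the less flexible flavors: it rejects a bounded quantifier such as `\W{1,2}` inside a lookbehind. One hackish workaround, sketched here as an assumption rather than as the code the answer refers to, is to alternate two separate fixed-width lookbehinds:

```python
import re

TEXT = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "Nunc eu tellus vel nunc pretium lacinia. Proin sed lorem. "
    "Cras sed ipsum. Nunc a libero quis risus sollicitudin imperdiet."
)

# Each lookbehind has a fixed width on its own:
# "ipsum" + 1 non-word character, or "ipsum" + 2 non-word characters.
PATTERN = re.compile(r"(?:(?<=ipsum\W)|(?<=ipsum\W\W))(\w+)")

matches = PATTERN.findall(TEXT)
print(matches)  # ['dolor', 'Nunc']
```

Every extra separator length needs another alternative, which is why the capture-group version is usually the simpler choice.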
However, the biggest problem people seem to be having in their examples is not with lookbehinds, but with word boundaries. Both David Kemp and ck seem to expect \b to match the space character following the 'm', but it doesn't; it matches the position (or boundary) between the 'm' and the space.
It's a common mistake, one I've even seen repeated in a few books and tutorials, but the word-boundary construct, \b, never matches any characters. It's a zero-width assertion, like lookarounds and anchors (^, $, \z, etc.), and what it matches is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.
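The zero-width nature of \b can be checked directly. The following Python sketch uses a short sample string and variable names that are assumptions for illustration; it lists the boundary positions and shows why a group placed right after \b captures nothing.

```python
import re

sample = "ipsum dolor"

# Every \b match is empty: its start and end offsets are equal.
boundary_spans = [m.span() for m in re.finditer(r"\b", sample)]
print(boundary_spans)  # [(0, 0), (5, 5), (6, 6), (11, 11)]

# \b consumes nothing, so \w* must start at the space and captures "".
captured = re.findall(r"ipsum\b(\w*)", sample)
print(captured)  # ['']

# Consuming the separator explicitly lets the group reach the next word.
captured_fixed = re.findall(r"ipsum\b\W+(\w+)", sample)
print(captured_fixed)  # ['dolor']
```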
ipsum\b(\w*)
/(?<=\bipsum\s|\bipsum\.\s)(\w+)/gm
Positive Lookbehind (?<=\bipsum\s|\bipsum\.\s)
Assert that one of the alternatives below matches immediately before the current position:
1st Alternative \bipsum\s
  \b asserts position at a word boundary: (^\w|\w$|\W\w|\w\W)
  ipsum matches the characters ipsum literally (case sensitive)
  \s matches any whitespace character (equal to [\r\n\t\f\v ])
2nd Alternative \bipsum\.\s
  \b asserts position at a word boundary: (^\w|\w$|\W\w|\w\W)
  ipsum matches the characters ipsum literally (case sensitive)
  \. matches the character . literally (case sensitive)
  \s matches any whitespace character (equal to [\r\n\t\f\v ])
1st Capturing Group (\w+)
  \w+ matches one or more word characters (each equal to [a-zA-Z0-9_])
Global pattern flags
  g modifier: global. All matches (don't return after first match)
  m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
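This pattern relies on a lookbehind whose two alternatives differ in width (6 and 7 characters), which not every engine accepts. Python's `re` module, for instance, rejects it. A minimal sketch of an equivalent for Python, assuming each alternative is moved into its own lookbehind:

```python
import re

TEXT = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "Nunc eu tellus vel nunc pretium lacinia. Proin sed lorem. "
    "Cras sed ipsum. Nunc a libero quis risus sollicitudin imperdiet."
)

# One lookbehind per alternative, so each is fixed-width on its own.
# findall plays the role of the g flag; re.MULTILINE mirrors the m flag
# (it has no effect here because the pattern uses neither ^ nor $).
PATTERN = re.compile(r"(?:(?<=\bipsum\s)|(?<=\bipsum\.\s))(\w+)", re.MULTILINE)

matches = PATTERN.findall(TEXT)
print(matches)  # ['dolor', 'Nunc']
```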
With JavaScript you can use
(?=ipsum.*?(\w+))
This will get the second occurrence as well (Nunc).
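The same lookahead also works outside JavaScript. Here is a small Python sketch, with variable names assumed for illustration, showing that the overall match is empty while the group inside the lookahead still captures the following word:

```python
import re

TEXT = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "Nunc eu tellus vel nunc pretium lacinia. Proin sed lorem. "
    "Cras sed ipsum. Nunc a libero quis risus sollicitudin imperdiet."
)

# The lookahead matches an empty string at each "ipsum"; the lazy .*? skips
# as few characters as possible before the group captures a run of word characters.
PATTERN = re.compile(r"(?=ipsum.*?(\w+))")

matches = PATTERN.findall(TEXT)
print(matches)  # ['dolor', 'Nunc']
```

Because `.*?` may match nothing, an input such as "ipsumfoo" would capture "foo", the rest of the same word, so a separator such as `\W+` is safer when that matters.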
Example statement: "availebleLimit: Double?". If you want to find the words after the ':' character, the regex below can be used:
Regex => :.+$
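A short Python sketch of what this regex returns for the example statement. The second pattern, `:\s*(\w+)`, is an assumption added for comparison and isolates only the first word after the colon:

```python
import re

statement = "availebleLimit: Double?"

# :.+$ matches from the colon to the end of the line, colon included.
whole = re.search(r":.+$", statement).group(0)
print(repr(whole))  # ': Double?'

# Skipping the colon and whitespace, then capturing, yields just the word.
word = re.search(r":\s*(\w+)", statement).group(1)
print(word)  # Double
```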
ipsum\b(.*)\b
EDIT:
Depending on your regex implementation, though, this could be greedy and match all the words after ipsum.
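The greediness is easy to observe in Python. In this sketch the sample sentence and the tighter alternative `ipsum\W+(\w+)` are assumptions for illustration:

```python
import re

sample = "Lorem ipsum dolor sit amet."

# Greedy .* runs to the end of the string, then backtracks to the last word boundary.
greedy = re.search(r"ipsum\b(.*)\b", sample).group(1)
print(repr(greedy))  # ' dolor sit amet'

# Skipping the separator and capturing a single \w+ run stops after one word.
first_word = re.search(r"ipsum\W+(\w+)", sample).group(1)
print(first_word)  # dolor
```

Making the quantifier lazy, as in `(.*?)`, does not help here: the group would then capture an empty string, since a word boundary already sits right after "ipsum".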