Regex group capture in R with multiple capture groups
In R, is it possible to extract group captures from a regular expression match? As far as I can tell, none of grep, grepl, regexpr, gregexpr, sub, or gsub returns the group captures.
I need to extract key-value pairs from strings that are encoded thus:
\((.*?) :: (0\.[0-9]+)\)
I can always just do multiple full-match greps, or do some outside (non-R) processing, but I was hoping I could do it all within R. Is there a function, or a package that provides one, to do this?
Comments (9)
str_match(), from the stringr package, will do this. It returns a character matrix with one column for each group in the match (and one for the whole match):
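The original code for this answer did not survive on the page, so here is a minimal sketch of the str_match() approach applied to the pattern from the question. The input vector x and the variable name m are made up for illustration, and stringr is assumed to be installed.

```r
library(stringr)

# Illustrative input in the "(key :: value)" format from the question
x <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")

# Backslashes are doubled because the pattern is written as an R string
m <- str_match(x, "\\((.*?) :: (0\\.[0-9]+)\\)")

m[, 1]  # the whole match
m[, 2]  # first capture group (the key)
m[, 3]  # second capture group (the value)
```

Column 1 holds the full match, so the capture groups start at column 2.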
gsub does this, from your example:
You need to double-escape the backslashes inside the quoted pattern; then they work for the regex.
Hope this helps.
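The code that went with this answer is missing here, so the following is only a sketch of what the gsub approach could look like, assuming a single input string in the question's format. The names x, pat, key and value are illustrative.

```r
x <- "(sometext :: 0.1231313213)"

# Every backslash in the regex is doubled inside the R string literal
pat <- "\\((.*?) :: (0\\.[0-9]+)\\)"

# Replace the whole match with one back-reference at a time
key   <- gsub(pat, "\\1", x)
value <- gsub(pat, "\\2", x)
```

This works because the pattern matches the entire string, so nothing outside the groups is left behind after the replacement.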
Try regmatches() and regexec():
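A minimal sketch of the regmatches() and regexec() combination, using the pattern from the question. The input vector and the names m and kv are illustrative; only base R is used.

```r
x <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
pattern <- "\\((.*?) :: (0\\.[0-9]+)\\)"

# regexec() records the positions of the full match and of each group;
# regmatches() turns them into a list with one character vector per input
m <- regmatches(x, regexec(pattern, x))

# Drop the full match (element 1) and bind the groups into a matrix
kv <- do.call(rbind, lapply(m, function(v) v[2:3]))
colnames(kv) <- c("key", "value")
```

Each list element of m holds the full match first, followed by the capture groups in order.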
gsub() can do this and return only the capture groups.
However, for this to work, the pattern must explicitly match the text outside your capture groups as well, as mentioned in the gsub() help.
So if the text to be selected lies in the middle of some string, adding .* before and after the capture groups should let you return only that text.
gsub(".*\\((.*?) :: (0\\.[0-9]+)\\).*","\\1 \\2", "(sometext :: 0.1231313213)")
[1] "sometext 0.1231313213"
Solution with strcapture from the utils package:
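The answer's code is not preserved here; a sketch of how strcapture could be used for the question's pattern follows. The input vector and the column names key and value in the prototype are assumptions chosen for illustration.

```r
x <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")

# The prototype data frame names the columns and sets their types;
# it needs one column per capture group
proto <- data.frame(key = character(), value = numeric(),
                    stringsAsFactors = FALSE)

res <- utils::strcapture("\\((.*?) :: (0\\.[0-9]+)\\)", x, proto)
```

The result is a data frame with one row per input string, and the second group is already converted to numeric.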
This is how I ended up working around this problem. I used two separate regexes to match the first and second capture groups and ran two gregexpr calls, then pulled out the matched substrings:
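The original code is missing from the page. One way this workaround might look is sketched below; the two lookaround patterns and the use of regmatches() to pull out the substrings are my assumptions, not necessarily what the answer used.

```r
x <- "(sometext :: 0.1231313213)(moretext :: 0.111222)"

# One regex per group, using lookarounds so each match is only the group text
key_re   <- "(?<=\\().*?(?= :: )"
value_re <- "(?<= :: )0\\.[0-9]+"

# Lookarounds require perl = TRUE
keys   <- regmatches(x, gregexpr(key_re, x, perl = TRUE))[[1]]
values <- regmatches(x, gregexpr(value_re, x, perl = TRUE))[[1]]
```

The two result vectors line up by position, provided every pair in the input is well formed.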
I like Perl-compatible regular expressions. Probably someone else does too...
Here is a function that uses Perl-compatible regular expressions and matches the functionality of functions in other languages that I am used to:
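The function itself did not survive on this page. The sketch below is not the author's function, just one way to get capture groups out of regexpr() with perl = TRUE in base R. The name capture_groups is hypothetical.

```r
# Hypothetical helper: returns a character matrix with one row per input
# string and one column per capture group (NA where there is no match)
capture_groups <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  # With perl = TRUE, regexpr() attaches per-group start/length matrices
  starts <- attr(m, "capture.start")
  lens   <- attr(m, "capture.length")
  out <- matrix(substring(x, starts, starts + lens - 1L), nrow = length(x))
  out[as.vector(m) == -1L, ] <- NA_character_
  out
}

res <- capture_groups("\\((.*?) :: (0\\.[0-9]+)\\)",
                      c("(sometext :: 0.1231313213)", "no match here"))
```

Only the first match per string is returned; gregexpr() exposes the same attributes if all matches are needed.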
As suggested in the stringr package documentation, this can be achieved using either str_match() or str_extract(). Adapted from the manual:
Extracting and combining our groups:
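The code for this step is missing here, so this is a sketch assuming a phone-number pattern in the style of the stringr manual. The sample strings are a shortened, illustrative set.

```r
library(stringr)

strings <- c(" 219 733 8965", "329-293-8753 ", "banana")
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

# str_extract() returns the whole match, i.e. all groups combined,
# and NA where the pattern does not match
whole <- str_extract(strings, phone)
```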
Indicating groups with an output matrix (we're interested in columns 2+):
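Again a sketch rather than the original code, assuming a shortened set of sample strings and a phone-number pattern in the style of the stringr manual:

```r
library(stringr)

strings <- c(" 219 733 8965", "329-293-8753 ", "banana")
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

# Column 1 is the whole match; columns 2 to 4 are the three capture groups
groups <- str_match(strings, phone)
groups[, 2:4]
```

Strings that do not match produce a row of NA values.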
This can be done using the package unglue, taking the example from the selected answer:
Or starting from a data frame:
You can get the raw regex from the unglue pattern, optionally with named capture:
More info: https://github.com/moodymudskipper/unglue/blob/master/README.md
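The code blocks for this answer are missing, so what follows is only a sketch of the first and last steps, assuming unglue is installed and that unglue_data() and unglue_regex() behave as described in the README linked above. The input vector and the placeholder names key and value are made up for illustration.

```r
library(unglue)

x <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")

# Placeholders in curly braces become columns of the resulting data frame;
# as I understand it, the text outside the braces is matched literally
res <- unglue_data(x, "({key} :: {value})")

# The regex that unglue builds from the pattern, here with named groups
rx <- unglue_regex("({key} :: {value})", named_capture = TRUE)
```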