Using rapply to remove a character vector of suffixes from a character vector of names
I would like to remove a set of suffixes from a set of full names (both suffixes and full names are character vectors). This is pretty easy with two for() loops and gsub(), but it seems that there should be a more efficient approach (both in lines of code and clock cycles).
My first thought was rapply(), but I can't get it to work. Maybe the for() loop is the best approach, but at this point I'm interested in better understanding rapply().
Here's the for() loop version.
names.full <- c("tom inc", "dick co", "harry incorp", "larry inc incorp", "curly", "moe")
suffix <- c("inc", "incorp", "incorporated", "co", "company")
suffix <- paste(" ", suffix, "$", sep = "")
# with loops
names.abbr <- names.full
# two passes, so stacked suffixes such as "inc incorp" are both removed
for (k in seq(2)) {
  for (i in seq_along(names.abbr)) {
    for (j in seq_along(suffix)) {
      names.abbr[i] <- gsub(suffix[j], "", names.abbr[i])
    }
  }
}
And my failed rapply() version.
# with rapply
inner.fun <- function(y, x) {
  rapply(as.list(x), function(x) gsub(y, "", x), how = "replace")
}
names.abbr.fail <- unlist(rapply(as.list(suffix), inner.fun, x = names.full, how = replace))
Which gives the following error:
> names.abbr.fail <- unlist(rapply(as.list(suffix), inner.fun, x = names.full, how = replace))
Error in match.arg(how) : 'arg' must be NULL or a character vector
Comments (2)
In your example, you only end up removing everything but the first word. That is easily done with a single regular-expression substitution.
But a more general regex pattern is something like "(suffix1|suffix2)", which contains ALL your suffixes. Since you seem to want to remove multiple suffixes from one string, as in "larry inc incorp", you need something like "( suffix1| suffix2)+$". Then you can simply apply it to names.full (I changed "moe" into "moe money" to show a case where the "first word" solution fails).
And by the way, if you don't want to replace anything but the suffix, sub is probably a better fit than gsub (gsub is typically used to replace several instances of a pattern within a string).
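A minimal sketch of the alternation approach described above, assuming the suffix list from the question and the "moe money" variant mentioned in this answer. The way the pattern is assembled with paste0() is an illustration, not necessarily the answerer's exact code.

```r
names.full <- c("tom inc", "dick co", "harry incorp",
                "larry inc incorp", "curly", "moe money")
suffix <- c("inc", "incorp", "incorporated", "co", "company")

# "First word" approach: drop everything from the first space onward.
# This also truncates "moe money", which has no suffix at all.
first.word <- sub(" .*$", "", names.full)

# General approach: build "( inc| incorp| ...)+$" so that one or more
# trailing suffixes are removed in a single vectorised call.
pattern <- paste0("(", paste0(" ", suffix, collapse = "|"), ")+$")
names.abbr <- sub(pattern, "", names.full)

print(first.word)
print(names.abbr)
```

Because sub() is vectorised over names.full and the pattern already covers every suffix, no explicit loop or rapply() call is needed.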
Do you really need to use the for loops? I think you should be able to use back references in gsub to extract what you want.
\\w matches any word character (letters, digits, and the underscore). + matches the previous character 1 or more times. () allows us to refer back to whatever is inside it later in the regex or in the replacement.
. matches any character, and * matches the preceding character 0 or more times.
Putting all of the above together gives a pattern that captures the leading word with (\\w+), matches the rest of the string with .*, and replaces the whole match with the captured group.
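A minimal sketch of the back-reference idea, assuming the names.full vector from the question. The exact pattern is an assumption assembled from the pieces described in this answer. Note that it keeps only the first word, so it would also shorten a name such as "moe money".

```r
names.full <- c("tom inc", "dick co", "harry incorp",
                "larry inc incorp", "curly", "moe")

# (\\w+) captures the leading run of word characters,
# .* consumes the rest of the string,
# and "\\1" replaces the whole match with the captured group.
names.abbr <- gsub("^(\\w+).*$", "\\1", names.full)

print(names.abbr)
```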