Using rapply to strip a character vector of suffixes from a character vector of names

Posted 2024-11-25 02:13:47

I would like to remove a set of suffixes from a set of full names (both the suffixes and the full names are character vectors). This is easy enough with nested for() loops and gsub(), but it seems there should be a more efficient approach, both in lines of code and in clock cycles.

My first thought was rapply(), but I can't get it to work. Maybe the for() loops are the best approach, but at this point I'm interested in understanding rapply() better.

Here's the for() loop version.

names.full <- c("tom inc", "dick co", "harry incorp", "larry inc incorp", "curly", "moe")
suffix <- c("inc", "incorp", "incorporated", "co", "company")
suffix <- paste(" ", suffix, "$", sep = "")

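# with loops
names.abbr <- names.full
# two passes, so that stacked suffixes such as "inc incorp" are both removed
for (k in seq_len(2)) {
    for (i in seq_along(names.abbr)) {
        for (j in seq_along(suffix)) {
            # each suffix pattern is anchored with "$", so only a trailing match is removed
            names.abbr[i] <- gsub(suffix[j], "", names.abbr[i])
        }
    }
}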

And my failed rapply() version.

# with rapply
inner.fun <- function(y, x) {
    rapply(as.list(x), function(x) gsub(y, "", x), how = "replace")
}
names.abbr.fail <- unlist(rapply(as.list(suffix), inner.fun, x = names.full, how = replace))

Which gives the following error:

> names.abbr.fail <- unlist(rapply(as.list(suffix), inner.fun, x = names.full, how = replace))
Error in match.arg(how) : 'arg' must be NULL or a character vector
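
One possible reading of this error, offered as a sketch rather than a definitive diagnosis: rapply() checks its how argument with match.arg(), and the unquoted replace in the call above refers to the base function replace, not to the string "replace". The example below quotes the mode for a single suffix, then uses Reduce() to apply every suffix in turn. The names one.suffix, strip.once and names.abbr2 are illustrative and do not appear in the original post.

```r
names.full <- c("tom inc", "dick co", "harry incorp", "larry inc incorp", "curly", "moe")
suffix <- paste0(" ", c("inc", "incorp", "incorporated", "co", "company"), "$")

# Unquoted, `replace` is a base R function, not one of rapply()'s modes.
stopifnot(is.function(replace))

# rapply() with the mode given as a string: strip one suffix from every name.
one.suffix <- rapply(as.list(names.full),
                     function(x) sub(suffix[1], "", x),
                     how = "unlist")

# Fold all suffixes over the names. sub() is vectorised over the names,
# so no explicit loop over names is needed.
strip.once <- function(x) Reduce(function(acc, s) sub(s, "", acc), suffix, x)

# Two passes handle stacked suffixes, mirroring the outer k loop above.
names.abbr2 <- strip.once(strip.once(names.full))
print(names.abbr2)
```

Reduce() fits here because each suffix must be applied to the result of the previous one, whereas rapply() applies a function to each leaf of a list independently.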

征﹌骨岁月お 2024-12-02 02:13:48

In your example, you end up removing everything but the first word. That's easily done with

sub(" .*$", "", names.full)

But a more general regular-expression pattern is something like "(suffix1|suffix2)", an alternation that contains ALL your suffixes.

Since you seem to want to remove multiple suffixes from one string as in "larry inc incorp", you need something like "( suffix1| suffix2)+$".

Then you can simply apply it to names.full (I changed "moe" into "moe money" to show something where the "first word" solution fails). It would look something like this:

names.full <- c("tom inc", "dick co", "harry incorp",
  "larry inc incorp", "curly", "moe money")
suffix <- c("inc", "incorp", "incorporated", "co", "company")

pattern <- paste("(", paste(" ", suffix, collapse="|", sep=""), ")+$", sep="")    
sub(pattern, "", names.full)
[1] "tom"       "dick"      "harry"     "larry"     "curly"     "moe money"

And by the way, if you don't want to replace anything but the suffix, sub is probably a better fit than gsub: the pattern is anchored at the end of the string, so it can match at most once, and gsub is typically used to replace several instances of a pattern within a string.
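
The combined-pattern idea above can be wrapped in a small function. This is only a sketch: strip_suffixes is a hypothetical helper name, and it assumes the suffixes contain no regular-expression metacharacters and are separated from the name by a single space.

```r
# Hypothetical helper: build one anchored alternation from the suffixes
# and remove any run of them from the end of each string.
strip_suffixes <- function(x, suffixes) {
    # e.g. "( inc| incorp| co)+$"
    pattern <- paste0("( ", paste(suffixes, collapse = "| "), ")+$")
    sub(pattern, "", x)
}

suffix <- c("inc", "incorp", "incorporated", "co", "company")
test.names <- c("tom inc", "larry inc incorp", "moe money", "acme company co")
stripped <- strip_suffixes(test.names, suffix)
print(stripped)
```

Because the pattern ends in "+$", a run of several suffixes is removed in one call, so no second pass is needed.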

淡忘如思 2024-12-02 02:13:48

Do you really need to use the for loops? I think you should be able to use back references in gsub to extract what you want.

The \\w matches any word character: a digit (0 - 9), a letter (A - Z, a - z), or an underscore.
The + matches the preceding element 1 or more times.
The () capture whatever is inside them so that it can be referenced later (here as \\1 in the replacement).
The . matches any single character, and * matches the preceding element 0 or more times.

Putting all of the above together gives us:

gsub("(\\w+)(.*)", "\\1", names.full)

> gsub("(\\w+)(.*)", "\\1", names.full)
[1] "tom"   "dick"  "harry" "larry" "curly" "moe"
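
As a quick check of what the back-reference pattern does, the sketch below applies it to a vector that includes a two-word name. The vector names.test and the "moe money" entry are illustrative assumptions, not part of this answer.

```r
names.test <- c("tom inc", "larry inc incorp", "curly", "moe money")

# Keep only the first run of word characters; the second group swallows the rest.
first.word <- gsub("(\\w+)(.*)", "\\1", names.test)

# The pattern consumes the whole string in one match, so sub() gives the same result.
first.word.sub <- sub("(\\w+)(.*)", "\\1", names.test)

print(first.word)
```

Note that this keeps the first word regardless of what follows it, so a multi-word name such as "moe money" is shortened to "moe" even though "money" is not in the suffix list.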
