Using rapply to strip a character vector of suffixes from a character vector of names

Posted 2024-11-25 02:13:47

I would like to remove a set of suffixes from a set of full names (both suffixes and full names are character vectors). This is pretty easy with two for() loops and gsub(), but it seems that there should be a more efficient approach (both in lines of code and clock cycles).

My first thought was rapply(), but I can't get it to work. Maybe the for() loop is the best approach, but at this point I'm interested in better understanding rapply().

Here's the for() loop version.

names.full <- c("tom inc", "dick co", "harry incorp", "larry inc incorp", "curly", "moe")
suffix <- c("inc", "incorp", "incorporated", "co", "company")
suffix <- paste(" ", suffix, "$", sep = "")

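# with loops
names.abbr <- names.full
# two passes, so that stacked suffixes such as "inc incorp" are both removed
for (k in seq_len(2)) {
    for (i in seq_along(names.abbr)) {
        for (j in seq_along(suffix)) {
            # each suffix is a regex anchored with "$", so only a trailing match is removed
            names.abbr[i] <- gsub(suffix[j], "", names.abbr[i])
        }
    }
}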

And my failed rapply() version.

# with rapply
inner.fun <- function(y, x) {
    rapply(as.list(x), function(x) gsub(y, "", x), how = "replace")
}
names.abbr.fail <- unlist(rapply(as.list(suffix), inner.fun, x = names.full, how = replace))

Which gives the following error:

> names.abbr.fail <- unlist(rapply(as.list(suffix), inner.fun, x = names.full, how = replace))
Error in match.arg(how) : 'arg' must be NULL or a character vector
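
On the error itself: `how = replace` passes the base function `replace` rather than the string `"replace"`, and `match.arg()` only accepts NULL or a character vector. Below is a minimal sketch of one way to get a working rapply() call, assuming two passes are enough, as in the loop version. `strip_once` and `strip_twice` are hypothetical helper names, not part of the original post.

```r
names.full <- c("tom inc", "dick co", "harry incorp", "larry inc incorp", "curly", "moe")
suffix <- paste0(" ", c("inc", "incorp", "incorporated", "co", "company"), "$")

# Hypothetical helper: remove each suffix pattern in turn from x.
# sub() is vectorised over x, so no explicit loop over the names is needed.
strip_once <- function(x) Reduce(function(acc, s) sub(s, "", acc), suffix, x)

# Two passes, mirroring the outer loop of the for() version.
strip_twice <- function(x) strip_once(strip_once(x))

# 'how' must be a quoted string; how = "unlist" returns a plain character vector.
names.abbr <- rapply(as.list(names.full), strip_twice, how = "unlist")
print(names.abbr)
```

Because names.full is a flat vector, calling `strip_twice(names.full)` directly should give the same result. rapply() is mainly useful when the input is a nested list.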

Comments (2)

征﹌骨岁月お 2024-12-02 02:13:48

In your example, you end up removing everything except the first word. That's easily done with

sub(" .*$", "", names.full)

But a more general regular-expression pattern is something like "(suffix1|suffix2)", an alternation that lists ALL your suffixes.

Since you seem to want to remove multiple suffixes from one string, as in "larry inc incorp", you need something like "( suffix1| suffix2)+$".

Then you can simply apply it to names.full (I changed "moe" to "moe money" to show a case where the "first word" solution fails). It would look something like this:

names.full <- c("tom inc", "dick co", "harry incorp",
  "larry inc incorp", "curly", "moe money")
suffix <- c("inc", "incorp", "incorporated", "co", "company")

pattern <- paste("(", paste(" ", suffix, collapse="|", sep=""), ")+$", sep="")    
sub(pattern, "", names.full)
[1] "tom"       "dick"      "harry"     "larry"     "curly"     "moe money"

By the way, if you don't want to replace anything but the suffix, sub is probably a better fit than gsub (gsub is typically used to replace several instances of a pattern within a string).
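
To make the sub versus gsub point concrete, here is a small sketch. The pattern is written out by hand from the suffix list above, and the variable names are illustrative only.

```r
pattern <- "( inc| incorp| incorporated| co| company)+$"
x <- c("tom inc", "larry inc incorp", "moe money")

# The pattern is anchored at the end of the string, so it can match at most
# once per element: sub() and gsub() give the same result here.
anchored.sub  <- sub(pattern, "", x)
anchored.gsub <- gsub(pattern, "", x)
print(identical(anchored.sub, anchored.gsub))

# Without an anchor the two differ: sub() replaces the first match only,
# gsub() replaces every match.
first.only <- sub("o", "", "moe money")
every.one  <- gsub("o", "", "moe money")
print(c(first.only, every.one))
```

With an end-anchored pattern the choice is mostly about stating intent, since only one match per string is possible.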

淡忘如思 2024-12-02 02:13:48

Do you really need to use the for loops? I think you should be able to use backreferences in gsub to extract what you want.

  • \\w matches any word character: 0 - 9, A - Z, a - z and the underscore.
  • + matches the preceding item 1 or more times.
  • () lets us refer back to whatever it captured later, here as \\1 in
    the replacement.
  • . matches any single character, and * matches the
    preceding item 0 or more times.

Putting all of the above together gives us:

gsub("(\\w+)(.*)", "\\1", names.full)

> gsub("(\\w+)(.*)", "\\1", names.full)
[1] "tom"   "dick"  "harry" "larry" "curly" "moe"
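
One caveat worth checking before relying on this: the backreference pattern keeps only the first word, so it also trims names that have more than one word and no suffix. A short sketch comparing it with a suffix-only pattern, using a made-up three-element input for illustration:

```r
x <- c("tom inc", "larry inc incorp", "moe money")

# Backreference approach: keep only the first run of word characters.
first.word <- gsub("(\\w+)(.*)", "\\1", x)

# Suffix-pattern approach: remove only the listed suffixes at the end.
suffix <- c("inc", "incorp", "incorporated", "co", "company")
pattern <- paste0("(", paste0(" ", suffix, collapse = "|"), ")+$")
suffix.only <- sub(pattern, "", x)

print(first.word)
print(suffix.only)
```

The two agree on the first two names and differ on "moe money", which the first-word approach shortens to "moe".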