Looping: how to loop over the case_when function in R?

Posted on 2025-01-21 06:59:08

Here is my code, in which I try to create a variable by detecting words in a string and matching them against a reference vector. I use mutate() from the dplyr package in combination with case_when(). The problem is that I add each value manually, as you can see. How can I automate this with some kind of loop that matches the two?

city <- LETTERS #26 cities
district <- letters[10:20] #11 districts
streets <- paste0(district, district)
streets <- streets[-c(5:26)] #4 streets

df <- data.frame(x = c(1:5), 
           address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc"))

library(dplyr)
library(stringi)

df2 <- df %>%
  mutate(districts = case_when(
    stri_detect_fixed(address, "b") ~ "b",   #address[1]
                                             #address[2]
    stri_detect_fixed(address, "a") ~ "a",   #address[3]
                                             #address[4]
    stri_detect_fixed(address, "cc") ~ "cc"  #address[5]
))

The code scans address for the values in the district vector. I would like to do the same for the city and street variables, so I adapted code from another Stack Overflow question. It produces an error.

for (j in town_village2) {
trn_house3[,93] <- case_when(
      stri_detect_fixed(trn_house3[1:6469, 4], j) ~ j)
}

I seek to produce this result:

x    address      city     district   street
1    A, b, cc,      A        b          cc  
2    B, dd          B        NA         dd
3    a, dd          NA       a          dd
4    C              C        NA         NA
5    D, a, cc       D        a          cc

Comments (3)

抹茶夏天i‖ 2025-01-28 06:59:08

If you are going to add a loop, it makes no sense to use case_when(); you don't have to add all options into it if you can loop over them.

You can solve it with a for-loop:

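library(stringi)

# reference vectors matching the example data (districts 'a'..'k', not 'j'..'t')
city <- LETTERS
district <- letters[1:11]
streets <- paste0(district, district)[1:4]

df2 <- df
# start from NA so that rows without a match stay NA
df2$city <- df2$district <- df2$street <- NA_character_

# each loop overwrites the column wherever the current name is found in address
for (ct in city) df2$city[stri_detect_fixed(df2$address, ct)] <- ct
for (d in district) df2$district[stri_detect_fixed(df2$address, d)] <- d
for (s in streets) df2$street[stri_detect_fixed(df2$address, s)] <- s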

Note that your example code didn't work; the district names are 'a' and 'b' in your example dataset, but you generate names 'j' through 't'. I fixed that in my code above.

The loop will also silently give wrong results if the names of cities, districts and/or streets overlap. For instance, if a row is in district 'b' and on street 'cc', stri_detect_fixed will also find 'c' inside 'cc' and assign district 'c'. I propose a completely different method to overcome this:
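Before moving on, the overlap problem can be seen in a small sketch. It uses base R's grepl() with fixed = TRUE as a stand-in for stri_detect_fixed, and the single address is made up for illustration:

```r
# Hypothetical single address: district 'b', street 'cc'
address <- "A, b, cc"

# Substring matching: 'c' is found inside 'cc', so district 'c' looks present
substring_hit <- grepl("c", address, fixed = TRUE)

# Exact matching on the comma-separated parts does not have this problem
parts <- trimws(strsplit(address, ",", fixed = TRUE)[[1]])
exact_hit_c  <- "c" %in% parts
exact_hit_cc <- "cc" %in% parts

print(c(substring = substring_hit, exact_c = exact_hit_c, exact_cc = exact_hit_cc))
```

Only the exact comparison distinguishes the district 'c' from the street 'cc'.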

Alternative method

Given your example data, it makes most sense to first split each address on commas, then look for exact matches with your reference city/district/street names. We can look for those exact matches with intersect().

# example reference address parts
cities <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", 
          "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", 
          "Z")

districts <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k")

streets <- c("aa", "bb", "cc", "dd")

# example dataset
df <- data.frame(x = c(1:5), 
                 address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc"))

# split each address into a vector of elements
address_elems = strsplit(df$address, ',') # split by comma
address_elems = lapply(address_elems, trimws) # trim whitespace; lapply always returns a list

Compare df$address and the newly created address_elems:

> df$address
[1] "A, b, cc," "B, dd"     "a, dd"     "C"         "D, a, cc"

> address_elems
[[1]]
[1] "A"  "b"  "cc"
[[2]]
[1] "B"  "dd"
[[3]]
[1] "a"  "dd"
[[4]]
[1] "C"
[[5]]
[1] "D"  "a"  "cc"

We could find the matching cities for just the first vector in address_elems with intersect(cities, address_elems[[1]]).

Because we might get multiple matches, we only take the first element, with intersect(cities, address_elems[[1]])[1]. Indexing with [1] rather than [[1]] returns NA when there is no match instead of raising an error.
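A short sketch of what happens for an address without any city, assuming the same cities vector as above (no_city is a made-up example):

```r
cities <- LETTERS
no_city <- c("a", "dd")   # hypothetical address parts without a city

# intersect() returns an empty character vector when nothing matches
matches <- intersect(cities, no_city)

# [1] on an empty vector gives NA, which is what we want in the data frame
first_match <- matches[1]

# [[1]] on an empty vector raises an error instead
res <- tryCatch(matches[[1]], error = function(e) "error")

print(length(matches))
print(first_match)
print(res)
```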

To apply this to every vector in address_elems, we can use sapply() or lapply():

# intersect the respective reference lists with each list of
# address items, taking only the first element
df$cities = sapply(address_elems, function(x) intersect(cities, x)[1])

df$district = sapply(address_elems, function(x) intersect(districts, x)[1])

df$street = sapply(address_elems, function(x) intersect(streets, x)[1])

Putting it all together, we get:

# example reference address parts
cities <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", 
          "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", 
          "Z")

districts <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k")

streets <- c("aa", "bb", "cc", "dd")

# example dataset
df <- data.frame(x = c(1:5), 
                 address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc"))

# create list of address elements
address_elems = strsplit(df$address, ',') # split by comma
address_elems = lapply(address_elems, trimws) # trim whitespace; lapply always returns a list

# intersect the respective reference lists with each list of
# address items, take only the first element
df$cities = sapply(address_elems, function(x) intersect(cities, x)[1])

df$district = sapply(address_elems, function(x) intersect(districts, x)[1])

df$street = sapply(address_elems, function(x) intersect(streets, x)[1])

# cleanup
rm(address_elems)
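If the three near-identical sapply() lines feel repetitive, the same idea could be wrapped in a small helper and looped over a named list of reference vectors. This is only a sketch: first_match and refs are hypothetical names, not part of the code above.

```r
# Hypothetical helper: first exact match of `reference` within each address
first_match <- function(elems, reference) {
  vapply(elems, function(x) intersect(reference, x)[1], character(1))
}

df <- data.frame(x = 1:5,
                 address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc"),
                 stringsAsFactors = FALSE)

# split on commas and trim whitespace, keeping a list of character vectors
address_elems <- lapply(strsplit(df$address, ",", fixed = TRUE), trimws)

# one reference vector per output column (names are assumptions)
refs <- list(city = LETTERS,
             district = letters[1:11],
             street = c("aa", "bb", "cc", "dd"))

# add one column per reference vector
for (nm in names(refs)) df[[nm]] <- first_match(address_elems, refs[[nm]])

print(df)
```

vapply() is used here instead of sapply() so that the result is always a character vector, even for an empty input.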

无力看清 2025-01-28 06:59:08

This will separate the elements into vectors:

library(tidyverse)

df <- data.frame(
  x = c(1:5),
  address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc")
)

df3 <-
  df %>%
  separate_rows(address, sep = "[, ]+") %>%
  filter(nchar(address) > 0) %>%
  nest(data = address) %>%
  transmute(x, districts = data %>% map(~ .x[[1]]))
df3
#> # A tibble: 5 × 2
#>       x districts
#>   <int> <list>   
#> 1     1 <chr [3]>
#> 2     2 <chr [2]>
#> 3     3 <chr [2]>
#> 4     4 <chr [1]>
#> 5     5 <chr [3]>
df3$districts[[1]]
#> [1] "A"  "b"  "cc"

Created on 2022-04-14 by the reprex package (v2.0.0)
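Building on the same separate_rows() idea, one possible way to get all the way to the requested city/district/street columns is to classify each part with case_when() and %in%, then reshape with pivot_wider(). This is a sketch under assumptions: the reference vectors cities, districts and streets are defined to match the example data, and each address contains at most one part of each type.

```r
library(dplyr)
library(tidyr)

# assumed reference vectors matching the example data
cities    <- LETTERS
districts <- letters[1:11]
streets   <- c("aa", "bb", "cc", "dd")

df <- data.frame(
  x = 1:5,
  address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc")
)

result <- df %>%
  mutate(part = address) %>%                 # keep the original address column
  separate_rows(part, sep = "[, ]+") %>%     # one row per address part
  filter(nchar(part) > 0) %>%                # drop the empty part after a trailing comma
  mutate(type = case_when(                   # classify each part by exact membership
    part %in% cities    ~ "city",
    part %in% districts ~ "district",
    part %in% streets   ~ "street"
  )) %>%
  filter(!is.na(type)) %>%                   # ignore parts that match nothing
  pivot_wider(names_from = type, values_from = part)

print(result)
```

Here case_when() needs only one condition per column, not one per value, so no loop is required.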

眼角的笑意。 2025-01-28 06:59:08

A data.table approach

library(data.table)
DT <- data.table(city, streets, district)
# create a lookup table with all elements
lookup <- melt(DT, measure.vars = names(DT))
# set df to data.table format
setDT(df)
final <- df[, .(address = unlist(tstrsplit(address, ",[ ]*", perl = TRUE))), by = .(x)]
# now add elements
final[lookup, type := i.variable, on = .(address = value)]
# and dcast to wide
dcast(final, x ~ type, value.var = "address")
#    x city streets district
# 1: 1    A      cc        b
# 2: 2    B      dd     <NA>
# 3: 3 <NA>      dd        a
# 4: 4    C    <NA>     <NA>
# 5: 5    D      cc        a
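The code above assumes that city, streets and district already exist and match the example data. Because the three vectors have different lengths, it may be safer to stack them into a long lookup table directly instead of binding them as columns first. The following is a sketch of that variant; the reference vectors and the names lookup, long and wide are assumptions made for illustration.

```r
library(data.table)

# assumed reference vectors matching the example data, stacked into one lookup table
lookup <- rbindlist(list(
  data.table(type = "city",     value = LETTERS),
  data.table(type = "district", value = letters[1:11]),
  data.table(type = "street",   value = c("aa", "bb", "cc", "dd"))
))

df <- data.table(
  x = 1:5,
  address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc")
)

# one row per address part, whitespace trimmed
long <- df[, .(part = trimws(unlist(strsplit(address, ",", fixed = TRUE)))), by = x]

# update join: label each part with its type from the lookup table
long[lookup, type := i.type, on = .(part = value)]

# reshape to one column per type, dropping parts that matched nothing
wide <- dcast(long[!is.na(type)], x ~ type, value.var = "part")

print(wide)
```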