Conditionally splitting strings into columns

Posted on 2025-01-20 15:09:46


It is common in surveys to ask a question and then tell participants to "select all that apply". For example, "Which foods do you enjoy eating (Please select all that apply)?" a) Sushi, b) Pasta, c) Hamburger.

Assuming four (N=4) participants answered this question, the data could look like this.

food.df <- data.frame(id = c(1,2,3,4), food.choice = c("1,2", "", "1,2,3", "3"))

What I am trying to do is conditionally separate these into unique columns using a method that is flexible in both the number of individuals and the number of food choice attributes (e.g. Sushi, Pasta, Hamburger, ...). The final data would look something like this.

food.final <- data.frame(id= c(1,2,3,4), sushi = c(1,0,1,0), pasta = c(1,0,1,0), hamburger = c(0,0,1,1))

The more advanced version of this would allow for conditional groupings. You can think of this as grouping by food groups, location, etc. Assuming we were grouping by "selected foods that have protein" this could be coded to reflect total choices. This could look something like this.

food.group <- data.frame(id = c(1,2,3,4), protein = c(1,0,2,1), non.protein = c(1,0,1,0))

I have tried to use tidyr::separate, strsplit, and other column splitting functions but cannot seem to get the desired outcome. Appreciate the help on this and hopefully, the answer helps other users of R who do survey work.


撞了怀 2025-01-27 15:09:46


We may use fastDummies

library(fastDummies)
library(dplyr)
dummy_cols(food.df, 'food.choice', split = ",", 
    remove_selected_columns = TRUE) %>%
    setNames(c("id", "sushi", "pasta", "hamburger"))

-output

   id sushi pasta hamburger
1  1     1     1         0
2  2     0     0         0
3  3     1     1         1
4  4     0     0         1
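
For readers who would rather not add a package, the same indicator columns can be sketched in base R. This is only an illustration under assumptions: the `labels` mapping from codes to column names is hypothetical and written out by hand, and responses are assumed to be comma-separated codes.

```r
# Sample data from the question.
food.df <- data.frame(id = c(1, 2, 3, 4),
                      food.choice = c("1,2", "", "1,2,3", "3"))

# Assumed code-to-label mapping (codes as names, labels as values).
labels <- c("1" = "sushi", "2" = "pasta", "3" = "hamburger")

# Split each response on commas; an empty response yields character(0).
picked <- strsplit(food.df$food.choice, ",", fixed = TRUE)

# One row per respondent, one 0/1 column per known code.
ind <- t(vapply(picked,
                function(x) as.integer(names(labels) %in% trimws(x)),
                integer(length(labels))))
colnames(ind) <- unname(labels)

food.final <- cbind(food.df["id"], ind)
food.final
```

Because the columns are driven by `labels`, a code that no respondent selected still gets an all-zero column, and adding a food only means extending the mapping.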

If the renaming should be automatic, create a named vector and use str_replace_all

library(stringr)
nm1 <- setNames(c("sushi", "pasta", "hamburger"), 1:3)
dummy_cols(food.df, 'food.choice', split = ",", 
    remove_selected_columns = TRUE) %>% 
    rename_with(~ str_replace_all(str_remove(.x, 'food.choice_'), nm1), -id)

-output

  id sushi pasta hamburger
1  1     1     1         0
2  2     0     0         0
3  3     1     1         1
4  4     0     0         1

For the second case, we may use str_count

food.df %>%
    mutate(protein = str_count(food.choice, '[13]'), 
        non.protein = str_count(food.choice, '2'), .keep = 'unused')

-output

  id protein non.protein
1  1       1           1
2  2       0           0
3  3       2           1
4  4       1           0
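
One caveat worth illustrating: the pattern '[13]' counts single characters, so it would miscount once a code has more than one digit (for example "13" would match twice). A possible alternative, sketched here in base R, counts whole comma-separated tokens instead. The vectors `protein.codes` and `non.protein.codes` and the helper `count_in` are hypothetical names introduced for this example.

```r
# Sample data from the question.
food.df <- data.frame(id = c(1, 2, 3, 4),
                      food.choice = c("1,2", "", "1,2,3", "3"))

# Hypothetical grouping of codes (sushi = 1 and hamburger = 3 as protein).
protein.codes <- c("1", "3")
non.protein.codes <- c("2")

# Split once; an empty response yields character(0).
picked <- strsplit(food.df$food.choice, ",", fixed = TRUE)

# Count whole tokens per respondent rather than single characters.
count_in <- function(codes) {
  vapply(picked, function(x) sum(trimws(x) %in% codes), integer(1))
}

food.group <- data.frame(id = food.df$id,
                         protein = count_in(protein.codes),
                         non.protein = count_in(non.protein.codes))
food.group
```

For the sample data this gives the same totals as the str_count version; the difference only shows up with multi-digit codes.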


冷了相思 2025-01-27 15:09:46


You could create (or probably already have) a lookup matrix that holds the needed information, like foody here.

(foody <- matrix(c('sushi', 'pasta', 'hamburger', 
                  'protein', 'non_protein', 'protein',
                  '1', '2', '3'), nrow=3, ncol=3, 
                dimnames=list(NULL, c('food', 'protein', 'id'))))
#       food        protein       id 
# [1,] "sushi"     "protein"     "1"
# [2,] "pasta"     "non_protein" "2"
# [3,] "hamburger" "protein"     "3"

Then you can strsplit on the commas and match the resulting IDs against foody. tabulate turns the matched positions into a 0/1 vector of length nrow(foody); wrapping this in sapply and transposing gives a matrix mt with one row per respondent.

(mt <- t(sapply(strsplit(food.df$food.choice, ','), \(x) {
  tabulate(match(x, foody[, 'id']), nrow(foody))
})))
#      [,1] [,2] [,3]
# [1,]    1    1    0
# [2,]    0    0    0
# [3,]    1    1    1
# [4,]    0    0    1
# [5,]    1    0    1
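
To see what match and tabulate contribute, it may help to run them on a single response. The snippet below is a small illustration using a stand-alone `ids` vector in place of foody[, 'id'].

```r
ids <- c("1", "2", "3")

# One response, "1,3": split it, then find each code's position in ids.
x <- strsplit("1,3", ",")[[1]]
pos <- match(x, ids)
one <- tabulate(pos, nbins = length(ids))
one
# 1 0 1

# An empty response splits to character(0), so every bin stays at zero.
none <- tabulate(match(strsplit("", ",")[[1]], ids), nbins = length(ids))
none
# 0 0 0
```

Fixing nbins to the number of known IDs is what keeps every row the same length, so sapply can bind the results into a matrix.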

Finally, for each row of mt we look up the selected entries in the desired column of foody and tabulate them as a factor whose levels are that column's unique values. For convenience we wrap this into a function f.

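f <- \(v) {
  # asplit() always yields one list element per row of mt; apply() would
  # collapse to a matrix if every respondent picked the same number of items
  r <- lapply(asplit(mt, 1), \(i) foody[as.logical(i), v])
  cbind(food.df[1], t(sapply(r, \(x) 
                             table(factor(x, levels=unique(foody[, v]))))))
}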

f('food')
#   id sushi pasta hamburger
# 1  1     1     1         0
# 2  2     0     0         0
# 3  3     1     1         1
# 4  4     0     0         1
# 5  5     1     0         1

f('protein')
#   id protein non_protein
# 1  1       1           1
# 2  2       0           0
# 3  3       2           1
# 4  4       1           0
# 5  5       2           0
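
As a side note, the group totals can also be obtained with a matrix product, which is one way to see what f('protein') computes. The sketch below rebuilds foody and mt from this answer; `groups`, `membership`, and `totals` are names introduced only for this illustration.

```r
# Data and lookup matrix as used in this answer.
food.df <- data.frame(id = 1:5,
                      food.choice = c("1,2", "", "1,2,3", "3", "1,3"))
foody <- matrix(c('sushi', 'pasta', 'hamburger',
                  'protein', 'non_protein', 'protein',
                  '1', '2', '3'), nrow = 3, ncol = 3,
                dimnames = list(NULL, c('food', 'protein', 'id')))

# Respondent-by-food indicator matrix (same construction as mt above).
mt <- t(sapply(strsplit(food.df$food.choice, ','), function(x) {
  tabulate(match(x, foody[, 'id']), nrow(foody))
}))

# Food-by-group membership matrix: 1 where a food belongs to a group.
groups <- unique(foody[, 'protein'])
membership <- sapply(groups, function(g) as.integer(foody[, 'protein'] == g))

# Each entry sums the selected foods that fall into the group.
totals <- mt %*% membership
cbind(food.df['id'], totals)
```

The same product works for any other grouping column added to foody, since only the membership matrix changes.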

Note that the number strings should be sorted in ascending order, which they probably are anyway.

Data:

food.df <- structure(list(id = 1:5, food.choice = c("1,2", "", "1,2,3", 
           "3", "1,3")), class = "data.frame", row.names = c("1", "2", "3", 
           "4", "5"))
