将空格分隔的条目拆分为 R 中的新列

发布于 2024-09-11 09:58:01 字数 609 浏览 3 评论 0原文

我正在编写一项输出 .csv 文件的调查。在此 csv 中,我有一些以空格分隔的条目,它们代表多选问题(例如具有多个答案的问题)。最后,我想将这些空格分隔的条目解析到它们自己的列中,并为它们创建标题,这样我就知道它们来自哪里。

例如,我可以从这个开始(请注意,多选列后面有一个 _M):

Q1, Q2_M, Q3, Q4_M
6, 1 2 88, 3, 3 5 99
6, , 3, 1 2

我想转到这个:

Q1, Q2_M_1, Q2_M_2, Q2_M_88, Q3, Q4_M_1, Q4_M_2, Q4_M_3, Q4_M_5, Q4_M_99
6, 1, 1, 1, 3, 0, 0, 1, 1, 1
6,,,,3,1,1,0,0,0

我想这是一个相对常见的问题,但我无法在R 部分。导入 .csv 后,有什么想法如何在 R 中执行此操作吗?我的一般想法(这通常会导致程序效率低下)是我可以: (1)用grep()拉取带有特殊后缀的列号 (2) 循环(或使用应用)这些列中的每个条目并确定响应级别,然后相应地创建列 (3) 循环(或使用应用)并将指示器放置在适当的列中以指示选择的存在。

我感谢任何帮助,如果不清楚,请告诉我。

I am coding a survey that outputs a .csv file. Within this csv I have some entries that are space delimited, which represent multi-select questions (e.g. questions with more than one response). In the end I want to parse these space delimited entries into their own columns and create headers for them so i know where they came from.

For example I may start with this (note that the multiselect columns have an _M after them):

Q1, Q2_M, Q3, Q4_M
6, 1 2 88, 3, 3 5 99
6, , 3, 1 2

and I want to go to this:

Q1, Q2_M_1, Q2_M_2, Q2_M_88, Q3, Q4_M_1, Q4_M_2, Q4_M_3, Q4_M_5, Q4_M_99
6, 1, 1, 1, 3, 0, 0, 1, 1, 1
6,,,,3,1,1,0,0,0

I imagine this is a relatively common issue to deal with but I have not been able to find it in the R section. Any ideas how to do this in R after importing the .csv ? My general thoughts (which often lead to inefficient programs) are that I can:
(1) pull column numbers that have the special suffix with grep()
(2) loop through (or use an apply) each of the entries in these columns and determine the levels of responses and then create columns accordingly
(3) loop through (or use an apply) and place indicators in appropriate columns to indicate presence of selection

I appreciate any help and please let me know if this is not clear.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

沙沙粒小 2024-09-18 09:58:01

我同意 ran2 和 aL3Xa 的观点,您可能希望更改数据的格式,以便为每个可能的响应提供不同的列。但是,如果将数据集修改为更好的格式证明存在问题,则可以按照您的要求进行操作。

process_multichoice <- function(x) lapply(strsplit(x, " "), as.numeric)

q2 <- c("1 2 3 NA 4", "2 5")
processed_q2 <- process_multichoice(q2)
[[1]]
[1]  1  2  3 NA  4

[[2]]
[1] 2 5

建议针对不同的响应使用不同的列的原因是,尝试从这种形式的数据中检索任何统计信息仍然是相当不愉快的。尽管您可以做诸如

# Number of reponses given
sapply(processed_q2, length)

#Frequency of each response
table(unlist(processed_q2), useNA = "ifany")

编辑之类的事情:还有一条建议。将处理数据的代码与分析数据的代码分开。如果您创建任何图表,请再次将创建它们的代码分开。我一直在将各种东西混合在一起,但这并不漂亮。 (特别是当您六个月后回到代码时。)

I agree with ran2 and aL3Xa that you probably want to change the format of your data to have a different column for each possible reponse. However, if you munging your dataset to a better format proves problematic, it is possible to do what you asked.

process_multichoice <- function(x) lapply(strsplit(x, " "), as.numeric)

q2 <- c("1 2 3 NA 4", "2 5")
processed_q2 <- process_multichoice(q2)
[[1]]
[1]  1  2  3 NA  4

[[2]]
[1] 2 5

The reason different columns for different responses are suggested is because it is still quite unpleasant trying to retrieve any statistics from the data in this form. Although you can do things like

# Number of reponses given
sapply(processed_q2, length)

#Frequency of each response
table(unlist(processed_q2), useNA = "ifany")

EDIT: One more piece of advice. Keep the code that processes your data separate from the code that analyses it. If you create any graphs, keep the code for creating them separate again. I've been down the road of mixing things together, and it isn't pretty. (Especially when you come back to the code six months later.)

情魔剑神 2024-09-18 09:58:01

我不完全确定你想做什么以及你这样编码的原因是什么。因此,我的建议更为笼统——所以只要澄清一下,我会尽力给出更具体的回应。

1) 我说您正在自己编写调查编码,这很好,因为这意味着您对 .csv 文件有影响力。我永远不会在同一个 .csv 文件中使用不同类型的分隔。只需从一开始就进行命名,就像您在第二块中建议的那样。
否则,您可能会遇到复选框等麻烦。假设有人检查了 5 个可能答案中的 3 个,下一个只检查了 1 个(即“不知道”)。现在,创建电子表格 (data.frame) 类型的结果视图将比只需要重新编码的空字段(结果在 R 中是 NA)要困难得多。

2)另一个重要的问题是你是否打算进行小组调查(即一遍又一遍地询问相同参与者的纵向研究)。这(以及其他许多)是考虑将数据保存到 MySQL 数据库而不是 .csv 的一个很好的理由。 RMySQL 可以直接连接到数据库并访问其表,更重要的是其视图。
视图确实有助于调查数据,因为您可以根据许多不同的需求在不同的视图中重新排列数据。

3)除了所有的个人/观点和经验之外,这里还有一些(不那么偏见的)文献可以开始:
复杂调查:使用 R 进行分析的指南(调查方法中的 Wiley 系列)

本书相对简单,省略了小组调查,但提供了大量 R 代码和示例,这应该是一个实用的开始,

为了防止重新发明轮子,您可能需要检查 LimeSurvey,一个相当不错的调查指挥工具(不是说模板:))。除了我的 TYPO3 CMS 扩展 pbsurveyke_questionnaire (应该)也能很好地工作(仅测试了 pbsurvey )。

I am not entirely sure what you trying to do respectively what your reasons are for coding like this. Thus my advice is more general – so just feel to clarify and I will try to give a more concrete response.

1) I say that you are coding the survey on your own, which is great because it means you have influence on your .csv file. I would NEVER use different kinds of separation in the same .csv file. Just do the naming from the very beginning, just like you suggested in the second block.
Otherwise you might geht into trouble with checkboxes for example. Let's say someone checks 3 out of 5 possible answers, the next only checks 1 (i.e. "don't know") . Now it will be much harder to create a spreadsheet (data.frame) type of results view as opposed to having an empty field (which turns out to be an NA in R) that only needs to be recoded.

2) Another important question is whether you intend to do a panel survey(i.e longitudinal study asking the same participants over and over again) . That (among many others) would be a good reason to think about saving your data to a MySQL database instead of .csv . RMySQL can connect directly to the database and access its tables and more important its VIEWS.
Views really help with survey data since you can rearrange the data in different views, conditional on many different needs.

3) Besides all the personal / opinion and experience, here's some (less biased) literature to get started:
Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology

The book is comparatively simple and leaves out panel surveys but gives a lot of R Code and examples which should be a practical start.

To prevent re-inventing the wheel you might want to check LimeSurvey, a pretty decent (not speaking of the templates :) ) tool for survey conductors. Besides I TYPO3 CMS extensions pbsurvey and ke_questionnaire (should) work well too (only tested pbsurvey).

沒落の蓅哖 2024-09-18 09:58:01

多项选择项应始终编码为单独的变量。也就是说,如果你有5个选择和多项选择,你应该将它们编码为i1, i2, i3, i4, i5,即每一个都是一个二进制变量(0-1)。我看到第一个示例中 Q4_M 变量的值为 3 5 99 。这是否意味着某个商品有 99 种选择?哎哟...

首先,您应该继续为多项选择项中的每个选项创建单独的变量。也就是说,做:

# note that I follow your example with Q4_M variable
dtf_ins <- as.data.frame(matrix(0, nrow = nrow(<initial dataframe>), ncol = 99))
# name vars appropriately
names(dtf_ins) <- paste("Q4_M_", 1:99, sep = "")

现在你有一个带有 0 的 data.frame,所以你需要做的就是在适当的位置获取 1(这有点麻烦),一个函数将完成这项工作......

# first you gotta change spaces to commas and convert character variable to a numeric one
y <- paste("c(", gsub(" ", ", ", x), ")", sep = "")
z <- eval(parse(text = y))
# now you assing 1 according to indexes in z variable
dtf_ins[1, z] <- 1

这几乎是它...基本上,您想重新考虑使用 _M 变量创建 data.frame,这样您就可以编写一个自动执行此插入操作的函数。避免 for 循环!

或者,更好的是,创建一个带有逻辑的矩阵,然后执行 dtf[m] <- 1,其中 dtf 是您的多项选择 data.frame,并且 < em>m 是带有逻辑的矩阵。

我很想在这方面为您提供更多帮助,但经过漫长的一夜之后我正在康复! =)希望我能帮上一点忙! =)

Multiple choice items should always be coded as separate variables. That is, if you have 5 alternatives and multiple choice, you should code them as i1, i2, i3, i4, i5, i.e. each one is a binary variable (0-1). I see that you have values 3 5 99 for Q4_M variable in the first example. Does that mean that you have 99 alternatives in an item? Ouch...

First you should go on and create separate variables for each alternative in a multiple choice item. That is, do:

# note that I follow your example with Q4_M variable
dtf_ins <- as.data.frame(matrix(0, nrow = nrow(<initial dataframe>), ncol = 99))
# name vars appropriately
names(dtf_ins) <- paste("Q4_M_", 1:99, sep = "")

now you have a data.frame with 0s, so what you need to do is to get 1s in an appropriate position (this is a bit cumbersome), a function will do the job...

# first you gotta change spaces to commas and convert character variable to a numeric one
y <- paste("c(", gsub(" ", ", ", x), ")", sep = "")
z <- eval(parse(text = y))
# now you assing 1 according to indexes in z variable
dtf_ins[1, z] <- 1

And that's pretty much it... basically, you would like to reconsider creating a data.frame with _M variables, so you can write a function that does this insertion automatically. Avoid for loops!

Or, even better, create a matrix with logicals, and just do dtf[m] <- 1, where dtf is your multiple-choice data.frame, and m is matrix with logicals.

I would like to help you more on this one, but I'm recuperating after a looong night! =) Hope that I've helped a bit! =)

情泪▽动烟 2024-09-18 09:58:01

感谢您的所有回复。我同意你们大多数人的观点,这种格式有点愚蠢,但这是我必须使用的(调查已编码并将于下周使用)。这是我从所有回复中得出的结论。我确信这不是最优雅或最有效的方法,但我认为它应该有效。

colnums <- grep("_M",colnames(dat))
responses <- nrow(dat)

for (i in colnums) {
  vec <- as.vector(dat[,i]) #turn into vector
  b <- lapply(strsplit(vec," "),as.numeric) #split up and turn into numeric
  c <- sort(unique(unlist(b))) #which values were used
  newcolnames <- paste(colnames(dat[i]),"_",c,sep="") #column names
  e <- matrix(nrow=responses,ncol=length(c)) #create new matrix for indicators
  colnames(e) <- newcolnames 
#next loop looks for responses and puts indicators in the correct places
  for (i in 1:responses) {
  e[i,] <- ifelse(c %in% b[[i]],1,0)
  }
  dat <- cbind(dat,e)
}

欢迎提出改进建议。

Thanks for all the responses. I agree with most of you that this format is kind of silly but it is what I have to work with (survey is coded and going into use next week). This is what I came up with from all the responses. I am sure this is not the most elegant or efficient way to do it but I think it should work.

colnums <- grep("_M",colnames(dat))
responses <- nrow(dat)

for (i in colnums) {
  vec <- as.vector(dat[,i]) #turn into vector
  b <- lapply(strsplit(vec," "),as.numeric) #split up and turn into numeric
  c <- sort(unique(unlist(b))) #which values were used
  newcolnames <- paste(colnames(dat[i]),"_",c,sep="") #column names
  e <- matrix(nrow=responses,ncol=length(c)) #create new matrix for indicators
  colnames(e) <- newcolnames 
#next loop looks for responses and puts indicators in the correct places
  for (i in 1:responses) {
  e[i,] <- ifelse(c %in% b[[i]],1,0)
  }
  dat <- cbind(dat,e)
}

Suggestions for improvement are welcome.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文