Subset a data.frame by a list and apply a function to each part, by row
This may seem like a typical plyr problem, but I have something different in mind.
Here's the function that I want to optimize (by getting rid of the for loop).
# dummy data
set.seed(1985)
lst <- list(a = 1:10, b = 11:15, c = 16:20)
m <- matrix(round(runif(200, 1, 7)), 10)
m <- as.data.frame(m)

dfsub <- function(dt, lst, fun) {
  # check whether dt is a `data.frame`
  stopifnot(is.data.frame(dt))
  # check that the vectors in lst are "whole" numbers (integers);
  # vector elements should be column indexes
  is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
  # fail if there are any non-integers in the list
  idx <- rapply(lst, is.wholenumber)
  stopifnot(idx)
  # check that the list holds as many indexes as dt has columns
  stopifnot(ncol(dt) == length(idx))
  # subset the data
  subs <- list()
  for (i in seq_along(lst)) {
    # apply function on each part, by row
    # (drop = FALSE keeps a data.frame even for a one-column subscale)
    subs[[i]] <- apply(dt[, lst[[i]], drop = FALSE], 1, fun)
  }
  # preserve names
  names(subs) <- names(lst)
  # convert to data.frame
  subs <- as.data.frame(subs)
  # guess what =)
  return(subs)
}
And now a short demonstration... actually, let me first explain what I primarily intended to do. I wanted to subset a data.frame by the vectors gathered in a list object. Since this is part of the code of a function that accompanies data manipulation in psychological research, you can think of m as the results of a personality questionnaire (10 subjects, 20 variables). The vectors in the list hold column indexes that define questionnaire subscales (e.g. personality traits). Each subscale is defined by several items (columns in the data.frame). If we presuppose that the score on each subscale is nothing more than the sum (or some other function) of the row values (the results on that part of the questionnaire for each subject), you could run:
> dfsub(m, lst, sum)
    a  b  c
1  46 20 24
2  41 24 21
3  41 13 12
4  37 14 18
5  57 18 25
6  27 18 18
7  28 17 20
8  31 18 23
9  38 14 15
10 41 14 22
I took a glance at this function and I must admit that this little loop isn't spoiling the code at all... BUT, if there's an easier or more efficient way of doing this, please let me know!
I'd take a different approach and keep everything as data frames so that you can use merge and ddply. I think you'll find this approach is a little more general, and it's easier to check that each step is performed correctly.
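The code that accompanied this answer is not shown here, so the following is only a minimal sketch of the long-format idea, not the answerer's own implementation. It uses base R's merge() and aggregate() in place of ddply(), and the column names subject, item, value and scale are made up for illustration. With plyr loaded, the aggregation step could presumably be written with ddply() instead.

```r
# Dummy data from the question
set.seed(1985)
lst <- list(a = 1:10, b = 11:15, c = 16:20)
m <- as.data.frame(matrix(round(runif(200, 1, 7)), 10))

# Lookup table: one row per item, mapping a column name to its subscale
# (the names `item` and `scale` are assumptions made for this sketch)
key <- data.frame(
  item  = names(m)[unlist(lst)],
  scale = rep(names(lst), lengths(lst)),
  stringsAsFactors = FALSE
)

# Long format: one row per subject/item pair
long <- data.frame(
  subject = rep(seq_len(nrow(m)), times = ncol(m)),
  item    = rep(names(m), each = nrow(m)),
  value   = unlist(m, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Attach the subscale to every response, then aggregate per subject and scale
merged <- merge(long, key, by = "item")
scores <- aggregate(value ~ subject + scale, data = merged, FUN = sum)

# Optional: back to a subjects-by-subscales layout for comparison with dfsub()
wide <- xtabs(value ~ subject + scale, data = scores)
print(wide)
```

Because every intermediate object is a data frame, each step (the lookup table, the long data, the merged result) can be inspected on its own before the scores are computed.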
after loading the plyr package, replace
with
@Hadley, I've checked your response since it's quite straightforward and easy for bookkeeping (besides the fact that it's a more general-purpose solution). However, here's my not-so-long script that does the job and requires only the base package (which is a trivial point, since I install plyr and reshape right after installing R). Now, here's the source:
For your specific example, a one-line solution is
sapply(lst, function(x) rowSums(m[, x]))
(although you might add some more lines to check for valid input and put in the column names). Do you have other, more general, applications in mind? Or is this possibly a case of YAGNI?
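As a rough sketch of how the one-liner might be extended to an arbitrary row function with some input checks, assuming the list holds column indexes as in the question: the helper name dfsub2 is hypothetical and does not come from the thread.

```r
# Dummy data from the question
set.seed(1985)
lst <- list(a = 1:10, b = 11:15, c = 16:20)
m <- as.data.frame(matrix(round(runif(200, 1, 7)), 10))

# `dfsub2` is a made-up name for this sketch
dfsub2 <- function(dt, lst, fun, ...) {
  stopifnot(is.data.frame(dt), is.list(lst))
  cols <- unlist(lst, use.names = FALSE)
  # column indexes must be whole numbers that exist in dt
  stopifnot(is.numeric(cols), all(cols == round(cols)),
            all(cols >= 1), all(cols <= ncol(dt)))
  # one vector of row-wise results per subscale; lapply keeps the list names
  res <- lapply(lst, function(idx) {
    apply(dt[, idx, drop = FALSE], 1, fun, ...)
  })
  as.data.frame(res)
}

print(dfsub2(m, lst, sum))
print(dfsub2(m, lst, mean))
```

For sums and means specifically, rowSums() and rowMeans() are likely to be faster than apply(); the apply() call is kept here only so that any function of a row can be passed in.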