How to use ddply with different .variables?
I use ddply to summarize a data.frame by various categories, like this:
# with both group and size being factors / categorical
split.df <- ddply(mydata,.(group,size),summarize,
sumGroupSize = sum(someValue))
This works smoothly, but I often want to calculate ratios, which means I need to divide by the group's total. How can I calculate such a total within the same ddply call?
Let's say I'd like to have the share of observations in group A that are in size class 1. Obviously I have to calculate the sum of all observations in size class 1 first.
Sure, I could do this with two ddply calls, but a single call would be more convenient. Is there a way to do so?
EDIT:
I did not want to make the question overly specific, but I realize the vague version was bothering people here. So here's my specific problem. I do have an example that works, but I don't consider it particularly elegant. It also has a shortcoming that I need to overcome: it does not work correctly with apply.
library(plyr)
# make the dataset more "realistic"
mydata <- warpbreaks
names(mydata) <- c("someValue","group","size")
mydata$category <- c(1,2,3)
mydata$categoryA <- c("A","A","X","X","Z","Z")
# add some NA
mydata$category[c(8,10,19)] <- NA
mydata$categoryA[c(14,1,20)] <- NA
# someValue is summarized !
# note we have another, varying category, hence we need the parameter a
calcShares <- function(a, data) {
  # !is.na needs to be specific!
  tempres1 <- eval(substitute(ddply(data[!is.na(a), ], .(group, size, a), summarize,
                                    sumTest = sum(someValue, na.rm = TRUE))),
                   envir = data, enclos = parent.frame())
  tempres2 <- eval(substitute(ddply(data[!is.na(a), ], .(group, size), summarize,
                                    sumTestTotal = sum(someValue, na.rm = TRUE))),
                   envir = data, enclos = parent.frame())
  res <- merge(tempres1, tempres2, by = c("group", "size"))
  res$share <- res$sumTest / res$sumTestTotal
  return(res)
}
test <- calcShares(category,mydata)
test2 <- calcShares(categoryA,mydata)
head(test)
head(test2)
As you can see, I intend to run this over different categorical variables. In the example I have only two (category, categoryA), but in fact I have more, so using apply with my function would be really nice. Somehow it does not work correctly.
applytest <- head(apply(mydata[grep("^cat",
names(mydata),value=T)],2,calcShares,data=mydata))
This returns a warning message and a strange name (newX[, i]) for the category variable.
So how can I a) do this more elegantly and b) fix the apply issue?
Comments (1)
This seems simple, so I may be missing some aspect of your question.
First, define a function that calculates the values you want inside each level of group. Then, instead of using .(group, size) to split the data.frame, use .(group), and apply the newly defined function to each of the split pieces.
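A minimal sketch of that idea, using the data set from the question. The helper name shareWithin and its arguments outer and inner are hypothetical (they are not part of plyr); the sketch relies only on the fact that ddply accepts a character vector of column names as .variables, and it assumes inner names a single column. It may not cover every detail of the original use case.

```r
library(plyr)

# Data prepared as in the question (warpbreaks columns renamed).
mydata <- warpbreaks
names(mydata) <- c("someValue", "group", "size")
mydata$category  <- c(1, 2, 3)
mydata$categoryA <- c("A", "A", "X", "X", "Z", "Z")
mydata$category[c(8, 10, 19)] <- NA
mydata$categoryA[c(14, 1, 20)] <- NA

# Hypothetical helper (not part of plyr): split by `outer`, then inside each
# piece sum someValue per level of `inner` and divide by the piece total.
# `outer` is a character vector of column names, `inner` a single column name.
shareWithin <- function(data, outer, inner) {
  data <- data[!is.na(data[[inner]]), ]  # drop rows with a missing inner category
  ddply(data, outer, function(piece) {
    sums <- ddply(piece, inner, function(d) {
      data.frame(sumTest = sum(d$someValue, na.rm = TRUE))
    })
    sums$sumTestTotal <- sum(sums$sumTest)
    sums$share <- sums$sumTest / sums$sumTestTotal
    sums
  })
}

# The case described above: split by group only, share of each size class.
bySize <- shareWithin(mydata, "group", "size")

# Loop over column *names* with lapply. apply() converts the data.frame to a
# matrix and passes each column's values, so the column name is lost.
catNames <- grep("^cat", names(mydata), value = TRUE)
byCategory <- lapply(catNames, function(nm) {
  shareWithin(mydata, c("group", "size"), nm)
})
names(byCategory) <- catNames
```

Passing column names as strings avoids the eval(substitute(...)) construction, and that is probably also why apply produced the odd newX[, i] name: the function received an expression for the column's values instead of a column name. head(byCategory$categoryA) should then show the shares for one category variable.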