How can I use ddply with different .variables?

Posted on 2024-12-27 09:21:36

I use ddply to summarize a data.frame by various categories, like this:

# with both group and size being factors / categorical
split.df <- ddply(mydata,.(group,size),summarize,
                  sumGroupSize = sum(someValue))

This works smoothly, but often I like to calculate ratios which implies that I need to divide by the group's total. How can I calculate such a total within the same ddply call?

Let's say I'd like to have the share of observations in group A that are in size class 1. Obviously I have to calculate the sum of all observations in size class 1 first.
Sure, I could do this with two ddply calls, but doing it all in one call would be more convenient. Is there a way to do so?

EDIT:
I did not mean to ask an overly specific question, but I realize the general version was confusing people here. So here's my specific problem. In fact I do have an example that works, but I don't consider it particularly elegant. Plus it has a shortcoming that I need to overcome: it does not work correctly with apply.

library(plyr)

# make the dataset more "realistic"
mydata <- warpbreaks
names(mydata) <- c("someValue","group","size")
mydata$category <- c(1,2,3)
mydata$categoryA <- c("A","A","X","X","Z","Z")
# add some NA
mydata$category[c(8,10,19)] <- NA
mydata$categoryA[c(14,1,20)] <- NA


# someValue is summarized!
# note that we have another, varying category, hence the `a` parameter
calcShares <- function(a, data) {
    # the !is.na filter must refer to the specific category column
    tempres1 <- eval(substitute(ddply(data[!is.na(a), ], .(group, size, a), summarize,
                                      sumTest = sum(someValue, na.rm = TRUE))),
                     envir = data, enclos = parent.frame())
    tempres2 <- eval(substitute(ddply(data[!is.na(a), ], .(group, size), summarize,
                                      sumTestTotal = sum(someValue, na.rm = TRUE))),
                     envir = data, enclos = parent.frame())

    res <- merge(tempres1, tempres2, by = c("group", "size"))
    res$share <- res$sumTest / res$sumTestTotal
    return(res)
}

test <- calcShares(category,mydata)
test2 <- calcShares(categoryA,mydata)   
head(test)
head(test2)

As you can see, I intend to run this over different categorical variables. In the example I have only two (category, categoryA), but in fact I have more, so using apply with my function would be really nice. Somehow, though, it does not work correctly.

applytest <- head(apply(mydata[grep("^cat",
             names(mydata),value=T)],2,calcShares,data=mydata))

This returns a warning message and a strange name (newX[, i]) for the category variable.

So how can I do THIS a) more elegantly and b) fix the apply issue?
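
A note on the apply issue: apply() coerces the data frame to a matrix and hands calcShares the values of each column rather than its name, so substitute(a) most likely captures apply's internal expression newX[, i]. One possible workaround, sketched here under the assumption that passing the column name as a string is acceptable, is to loop over the names with lapply and give ddply a character vector for .variables. The names calcSharesByName, catVars and allShares are made up for this illustration.

```r
library(plyr)

# Same example data as in the question
mydata <- warpbreaks
names(mydata) <- c("someValue", "group", "size")
mydata$category <- c(1, 2, 3)
mydata$categoryA <- c("A", "A", "X", "X", "Z", "Z")
mydata$category[c(8, 10, 19)] <- NA
mydata$categoryA[c(14, 1, 20)] <- NA

# Hypothetical variant of calcShares(): `a` is a column NAME (a string),
# so no substitute()/eval() is needed
calcSharesByName <- function(a, data) {
    d <- data[!is.na(data[[a]]), ]          # drop rows where this category is NA
    part <- ddply(d, c("group", "size", a), summarize,
                  sumTest = sum(someValue, na.rm = TRUE))
    total <- ddply(d, c("group", "size"), summarize,
                   sumTestTotal = sum(someValue, na.rm = TRUE))
    res <- merge(part, total, by = c("group", "size"))
    res$share <- res$sumTest / res$sumTestTotal
    res
}

# Loop over the column names instead of the column values
catVars <- grep("^cat", names(mydata), value = TRUE)
allShares <- lapply(catVars, calcSharesByName, data = mydata)
names(allShares) <- catVars

head(allShares$category)
head(allShares$categoryA)
```

The result is a named list with one data frame per category variable, and each data frame keeps the real column name instead of newX[, i].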

夜夜流光相皎洁 2025-01-03 09:21:36

This seems simple, so I may be missing some aspect of your question.

First, define a function that calculates the values you want inside each level of group. Then, instead of using .(group, size) to split the data.frame, use .(group), and apply the newly defined function to each of the split pieces.

library(plyr)

# Create a dataset with the names in your example
mydata <- warpbreaks
names(mydata) <- c("someValue", "group", "size")

# A function that calculates the proportional contribution of each size class 
# to the sum of someValue within a level of group
getProps <- function(df) {
    with(df, ave(someValue, size, FUN=sum)/sum(someValue))
}

# The call to ddply()
res <- ddply(mydata, .(group), 
             .fun = function(X) transform(X, PROPS=getProps(X)))

head(res, 12)
#    someValue group size     PROPS
# 1         26     A    L 0.4785203
# 2         30     A    L 0.4785203
# 3         54     A    L 0.4785203
# 4         25     A    L 0.4785203
# 5         70     A    L 0.4785203
# 6         52     A    L 0.4785203
# 7         51     A    L 0.4785203
# 8         26     A    L 0.4785203
# 9         67     A    L 0.4785203
# 10        18     A    M 0.2577566
# 11        21     A    M 0.2577566
# 12        29     A    M 0.2577566
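
If one row per group and size is preferred over repeating the share on every observation, the same idea can be condensed into a single ddply call. This is only a sketch of one way to do it, not part of the answer above; the names shares and piece are chosen for illustration.

```r
library(plyr)

# Same dataset and column names as in the example above
mydata <- warpbreaks
names(mydata) <- c("someValue", "group", "size")

# Split by group only; inside each piece, sum by size and divide
# by the total of that group
shares <- ddply(mydata, .(group), function(piece) {
    sums <- tapply(piece$someValue, piece$size, sum)
    data.frame(size = names(sums),
               sumGroupSize = as.vector(sums),
               share = as.vector(sums) / sum(piece$someValue))
})

shares
```

Each group's shares sum to 1, and the share for group A, size L is the same value as the PROPS column shows for those rows.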
