plyr: transform does not return new variable
I'm trying to learn how to write functions in R/plyr. I am aware that there are easier ways to do what I show below, but that's not the point.
In the example that follows, plyr does not return a new variable to my new data frame:
library(plyr)
highab <- subset(baseball, ab >= 600)
testfunc1 <- function(x) {
  print(x)  # just to show me that the vector does get into the function. Works fine.
  medianAB <- median(x)
  print(medianAB)  # just to prove that medianAB was calculated correctly. Works fine.
}
baseball3 <- ddply(highab, .(id), transform, testfunc1(ab))
str(baseball3$medianAB)  # No medianAB
What obvious thing am I missing?
R version 2.12.2 (2011-02-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_CA.UTF-8 LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid splines stats graphics grDevices utils datasets methods base
other attached packages:
[1] foreign_0.8-42 ggplot2_0.8.9 proto_0.3-9.1 reshape_0.8.4 plyr_1.4.1 rms_3.3-0 Hmisc_3.8-3
[8] survival_2.36-5 stringr_0.4
loaded via a namespace (and not attached):
[1] cluster_1.13.3 lattice_0.19-23 tools_2.12.2
Comments (3)
Just make two changes
medianAB = testfunc1(ab)
as suggested by Joshua. You are done!
Here is the simplified code with the output
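A minimal sketch of what the corrected code might look like. Only the tag = value change is stated above; returning the median explicitly from testfunc1 is an assumption about the second change, made here for clarity.

```r
library(plyr)

# Keep only player-seasons with at least 600 at-bats
highab <- subset(baseball, ab >= 600)

# Assumed second change: return the median directly instead of
# relying on the invisible return value of print()
testfunc1 <- function(x) {
  median(x)
}

# Naming the argument (medianAB = ...) is what lets transform()
# create a column called medianAB in each per-id piece
baseball3 <- ddply(highab, .(id), transform, medianAB = testfunc1(ab))

str(baseball3$medianAB)
```

Each row of baseball3 then carries the median ab of its own id group, because the single value returned by testfunc1 is recycled across the rows of that group.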
Sorry. I misunderstood the question.
See ?transform. You need to specify the new variables you want as tag=value pairs. So you need something like medianAB = testfunc1(ab).
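To see why the tag matters, here is a small sketch using base transform() on its own. The data frame df and its values are made up for illustration and are not from the question.

```r
# Small hypothetical data frame, purely for illustration
df <- data.frame(id = c("a", "a", "b"), ab = c(600, 650, 700))

# Unnamed expression: there is no tag to name a new column after
unnamed <- transform(df, median(ab))

# tag = value pair: the tag becomes the name of the new column
named <- transform(df, medianAB = median(ab))

names(unnamed)
names(named)
```

The same rule applies when transform is passed to ddply(): the extra arguments are handed on to transform() for each piece, so they need to be tag = value pairs there as well.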
At first I liked the idiom to add derived columns to a data.frame, but I find the usage of transform() unacceptably slow for large sets. Would it be better to use a lambda form in ddply() and a subsequent call to merge()? Timing it, it looks like it's worth it.
Three tenths of a second may not seem like much, but it halves the execution time. The improvement is even bigger when selecting the whole baseball dataset.
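A rough sketch of how the two approaches might be compared. The names medians, via_transform and via_merge are illustrative, and the timings will vary by machine and plyr version, so no particular numbers are implied.

```r
library(plyr)

highab <- subset(baseball, ab >= 600)

# Approach 1: transform() inside ddply() adds the column piece by piece
time_transform <- system.time(
  via_transform <- ddply(highab, .(id), transform, medianAB = median(ab))
)

# Approach 2: an anonymous function returns one summary row per id,
# and merge() joins that summary back onto the original rows
time_merge <- system.time({
  medians <- ddply(highab, .(id), function(df) data.frame(medianAB = median(df$ab)))
  via_merge <- merge(highab, medians, by = "id")
})

print(time_transform)
print(time_merge)
```

Replacing highab with the full baseball data frame in both calls gives the larger comparison mentioned above.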