plyr transform does not return the new variable

Posted on 2024-11-01 01:20:41

I'm trying to learn how to write functions in R/plyr. I am aware that there are easier ways to do what I show below, but that's not the point.

In the example that follows, plyr does not return a new variable to my new data frame:

library(plyr)
highab <-subset(baseball, ab >= 600)

testfunc1 <-function(x) {
    print(x) #just to show me that the vector does get into the function. Works fine.
    medianAB <- median(x)
    print(medianAB) #just to prove that medianAB was calculated correctly. Works fine   
}


baseball3 <-ddply(highab, .(id), transform, testfunc1(ab))
str(baseball3$medianAB) #No medianAB

What obvious thing am I missing?

R version 2.12.2 (2011-02-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=C              LC_MESSAGES=en_CA.UTF-8    LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grid      splines   stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] foreign_0.8-42  ggplot2_0.8.9   proto_0.3-9.1   reshape_0.8.4   plyr_1.4.1      rms_3.3-0       Hmisc_3.8-3    
[8] survival_2.36-5 stringr_0.4    

loaded via a namespace (and not attached):
[1] cluster_1.13.3  lattice_0.19-23 tools_2.12.2

放低过去 2024-11-08 01:20:41

Just make two changes:

Remove the print commands inside the function, so that the median is returned.
Add medianAB = testfunc1(ab), as suggested by Joshua.

You are done!

Here is the simplified code with the output

library(plyr)
highab <-subset(baseball, ab >= 600)
baseball3 <-ddply(highab, .(id), transform, medianAB = median(ab))
summary(baseball3$medianAB)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  600.0   612.0   621.5   623.1   631.5   677.0
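
If the goal is to keep a custom function rather than call median() directly, a minimal sketch might look like the one below. It assumes a small made-up data frame (`toy`, which is not from the question) so the result can be checked by hand; `testfunc1` is the asker's function with the debugging prints dropped. Note that print() returns its argument invisibly, so the original function most likely already returned the median; naming the argument passed to transform is the change that makes the column appear.

```r
library(plyr)

# Hypothetical toy data standing in for the baseball subset
toy <- data.frame(id = c("a", "a", "a", "b", "b"),
                  ab = c(600, 610, 650, 620, 640))

# The asker's function without the debugging prints
testfunc1 <- function(x) {
  median(x)
}

# Name the new column explicitly so that transform() keeps it
toy2 <- ddply(toy, .(id), transform, medianAB = testfunc1(ab))
toy2$medianAB
```

Each row should carry the median of its own id group, so the three "a" rows share one value and the two "b" rows share another.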

旧时浪漫 2024-11-08 01:20:41

Sorry, I misunderstood the question.

See ?transform. You need to specify the new variables you want as tag=value pairs, so you need something like:

baseball3 <- ddply(highab, .(id), transform, medianAB=testfunc1(ab))
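
To see the tag=value point in isolation, here is a small sketch using base transform() only, without plyr. The data frame `toy` and the names `unnamed` and `named` are made up for illustration and do not come from the question.

```r
# Hypothetical toy data, not from the question
toy <- data.frame(id = c("a", "a", "b"), ab = c(600, 620, 640))

# Unnamed argument: the value is computed, but no medianAB column is created
unnamed <- transform(toy, median(ab))

# tag = value pair: the result is stored in a column called medianAB
named <- transform(toy, medianAB = median(ab))

"medianAB" %in% names(unnamed)  # FALSE
"medianAB" %in% names(named)    # TRUE
named$medianAB                  # the single median, recycled to every row
```

This matches what the question shows: the function runs and prints, yet the result has nowhere to go unless it is given a name.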

失退 2024-11-08 01:20:41

At first I liked the idiom for adding derived columns to a data.frame, but I find transform() unacceptably slow for large data sets.

Would it be better to use a lambda form in ddply() followed by a call to merge()? Judging by the timings, it looks worth it:

    > library(plyr)
    > highab <-subset(baseball, ab >= 600)
    > 
    > system.time( 
    +   baseball3.lambda <-merge(highab, 
    +     ddply(highab, .(id), 
    +       function(u) data.frame(medianAB = median(u$ab)))), FALSE)
       user  system elapsed 
      0.336   0.000   0.336 
    > 
    > system.time( 
    +   baseball3.orig <- ddply(highab, .(id), 
    +     transform, medianAB = median(ab)), FALSE)
       user  system elapsed 
      0.640   0.000   0.641 
    > 
    > summary(baseball3.lambda$medianAB)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      600.0   612.0   621.5   623.1   631.5   677.0 
    > summary(baseball3.orig$medianAB)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      600.0   612.0   621.5   623.1   631.5   677.0

Three tenths of a second may not seem like much, but it halves the execution time. The improvement is even bigger when the whole baseball dataset is used.
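
The two-step approach above can be sketched on a tiny data frame to show what each step produces. The data frame `toy` and the names `per_id`, `merged` and `via_transform` are assumptions made for illustration; no timing is claimed here, only that both routes yield the same medianAB values.

```r
library(plyr)

# Hypothetical toy data standing in for the baseball subset
toy <- data.frame(id = c("a", "a", "a", "b", "b"),
                  ab = c(600, 610, 650, 620, 640))

# Step 1: one row per id, holding that group's median
per_id <- ddply(toy, .(id), function(u) data.frame(medianAB = median(u$ab)))

# Step 2: merge() joins on the shared column (id), repeating the median on each row
merged <- merge(toy, per_id)

# The transform() form from the other answers, for comparison
via_transform <- ddply(toy, .(id), transform, medianAB = median(ab))
all.equal(merged$medianAB, via_transform$medianAB)
```

One thing to keep in mind with this design is that merge() sorts the result by the key column by default, so the row order may differ from the input.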
