Classify or cut a data frame by a list of class ranges, and summarize it with ddply

Posted 2024-09-26 16:31:41

I have a question about ddply and subset.

I have a data frame df like this:

df <- read.table(textConnection(
"   id v_idn v_seed v_time v_pop v_rank v_perco 
    1  15    125648 0      150   1      15      
    2  17    125648 0      120   2      5       
    3  18    125648 0      100   3      6       
    4  52    125648 0      25    4      1       

    5  17    125648 10     220   1      5      
    6  15    125648 10     160   2      15       
    7  18    125648 10     110   3      6      
    8  52    125648 10     50    4      1       

    9  56   -11152  0      250   1      17      
    10 15   -11152  0      180   2      15      
    11 18   -11152  0      110   3      6       
    12 22   -11152  0      5     4      14      

    13 56   -11152  10     250   1      17      
    14 15   -11152  10     180   2      15      
    15 22   -11152  10     125   3      14      
    16 18   -11152  10     120   4      6 "), header=TRUE)

STEP ONE:

I have a set of equal-width intervals built with cut_interval, like this:

myinterval <- cut_interval(c(15,5,6,1,17,14), length=10)

So I have two levels here: [0,10) and (10,20]
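For comparison, here is a minimal sketch of the same binning with base R's cut(). The break points at 0, 10 and 20 and the include.lowest setting are assumptions chosen for illustration, not something stated in the post, and the level labels that cut() prints may differ slightly from the ones quoted above.

```r
# Percentage values taken from the v_perco column of the example data
v_perco <- c(15, 5, 6, 1, 17, 14)

# Explicit breaks give 10-unit-wide classes; include.lowest = TRUE closes
# the first interval on the left so that a value of exactly 0 is not NA
v_cut <- cut(v_perco, breaks = c(0, 10, 20), include.lowest = TRUE)

levels(v_cut)   # the two class labels
table(v_cut)    # how many values fall in each class
```

Fixing the breaks by hand keeps the classes stable even if the range of the data changes, which is not guaranteed when the breaks are derived from the data.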

STEP TWO:

I want each row assigned to a group/class defined by these two levels, in a new column v_cut, like this:

id v_idn v_seed v_time v_pop v_rank v_perco v_cut
1  15    125648 0      150   1      15      (10,20]
2  17    125648 0      120   2      5       [0,10)
3  18    125648 0      100   3      6       [0,10)
4  52    125648 0      25    4      1       [0,10)

5  17    125648 10     220   1      5       [0,10)
6  15    125648 10     160   2      15      (10,20] 
7  18    125648 10     110   3      6       [0,10)
8  52    125648 10     50    4      1       [0,10)

9  56   -11152  0      250   1      17      (10,20]
10 15   -11152  0      180   2      15      (10,20]
11 18   -11152  0      110   3      6       [0,10)
12 22   -11152  0      5     4      14      (10,20]

13 56   -11152  10     250   1      17      (10,20]
14 15   -11152  10     180   2      15      (10,20]
15 22   -11152  10     125   3      14      (10,20]
16 18   -11152  10     120   4      6       [0,10)

STEP THREE:

I want to look at the variability of v_rank (x axis) over time (y axis) for each v_cut group, so I need to compute the min, mean, max and sd of v_rank with something like:

ddply(df, .(v_cut,v_time), summarize ,mean = mean(v_rank), min = min(v_rank), max = max(v_rank), sd = sd(v_rank))

RESULT WANTED:

id  v_time MEAN.v_rank ... v_cut
1   0      2.25            (10,20]
2   0      2.42            [0,10)
3   10     2.25            [0,10)
4   10     2.42            (10,20]

MY PROBLEM

I don't know how to get from step 1 to step 2.

And is it possible to group by v_cut, as in my example in step 3?

Is it possible to do the same thing with the "subset" option of ddply?

Once again, thanks a lot for your help, great R gurus!

UPDATE 1:

I have an answer for going from step 1 to step 2:

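## length=10 gives 10-unit-wide classes as in step 1 (n=10 would ask for ten classes instead)
df$v_cut <- cut_interval(df$v_perco, length=10)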

I'm using plyr, but perhaps there is a better answer in this case?

Is there an answer for going from step 2 to step 3?
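One way to go from step 2 to step 3 can be sketched as follows, assuming the plyr package is installed. The data frame is rebuilt here in compact form with only the columns the summary needs, and v_cut is created with base cut() using assumed breaks at 0, 10 and 20; both are illustrative choices rather than the asker's exact setup.

```r
library(plyr)

# Compact copy of the example data (only the columns used below)
df <- data.frame(
  v_idn   = c(15, 17, 18, 52, 17, 15, 18, 52, 56, 15, 18, 22, 56, 15, 22, 18),
  v_time  = rep(c(0, 10, 0, 10), each = 4),
  v_rank  = rep(1:4, times = 4),
  v_perco = c(15, 5, 6, 1, 5, 15, 6, 1, 17, 15, 6, 14, 17, 15, 14, 6)
)

# Step 2: attach the class of each row as a new column
df$v_cut <- cut(df$v_perco, breaks = c(0, 10, 20), include.lowest = TRUE)

# Step 3: one summary row per (v_cut, v_time) combination
sumData <- ddply(df, .(v_cut, v_time), summarise,
                 min  = min(v_rank),
                 mean = mean(v_rank),
                 max  = max(v_rank),
                 sd   = sd(v_rank))
sumData
```

Once v_cut is an ordinary column of df, ddply treats it like any other grouping variable, so no special option is needed.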

UPDATE 2:

Brandon Bertelsen gave me a good answer with melt + cast, but now (to understand) I want to do the same operation with plyr and ddply, with a different result:

id  v_idn v_time MEAN.v_rank ... v_cut
    1   15   0      2.25            (10,20]
    2   15   10     2.45            (10,20]
    2   17   0      1.52            [0,10)
    2   17   10     2.42            [0,10)
    etc.

I'm trying something like this:

r('sumData <- ddply(df, .(v_idn,v_time), summarize,min = min(v_rank),mean =  mean(v_rank), max = max(v_rank), sd=sd(v_rank))')

But I want to have v_cut in my sumData data frame. How can I do that with ddply? Is there an option for this? Or is merging with the initial df on the key v_idn, to add the v_cut column to sumData, the only good answer?
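A possible way to keep v_cut without a merge is to add it to the grouping variables. The sketch below assumes plyr is installed and that v_cut is constant within each v_idn (true for the example data, where every v_idn always has the same v_perco); the compact data frame and the cut() breaks are illustrative assumptions.

```r
library(plyr)

# Compact copy of the example data (only the columns used below)
df <- data.frame(
  v_idn   = c(15, 17, 18, 52, 17, 15, 18, 52, 56, 15, 18, 22, 56, 15, 22, 18),
  v_time  = rep(c(0, 10, 0, 10), each = 4),
  v_rank  = rep(1:4, times = 4),
  v_perco = c(15, 5, 6, 1, 5, 15, 6, 1, 17, 15, 6, 14, 17, 15, 14, 6)
)
df$v_cut <- cut(df$v_perco, breaks = c(0, 10, 20), include.lowest = TRUE)

# Listing v_cut as a grouping variable carries it into the result.
# Because v_cut does not vary within a v_idn here, the groups are the
# same as with .(v_idn, v_time) alone.
sumData <- ddply(df, .(v_idn, v_time, v_cut), summarise,
                 min  = min(v_rank),
                 mean = mean(v_rank),
                 max  = max(v_rank),
                 sd   = sd(v_rank))
head(sumData)
```

If v_cut could vary within a v_idn, this would split those groups further, so it is worth checking that assumption on real data first. Groups with a single row get an sd of NA.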

给不了的爱 2024-10-03 16:31:41

You don't really need plyr for this; you can use reshape:

## melt() and cast() come from the reshape package
library(reshape)
## Pull what you need
dfx <- df[c("v_seed", "v_time","v_rank","v_perco")]
## Bring in your cuts
dfx <- data.frame(dfx, ifelse(df$v_perco > 10,"(10,20]", "[0,10)"))
## Rename v_cut
colnames(dfx)[ncol(dfx)] <- "v_cut"
## Melt it.
dfx <- melt(dfx, id=c("v_cut", "v_seed", "v_time"))
## Cast it.
dfx <- cast(dfx, v_cut + v_time + v_seed ~ variable, c(mean,min,max,sd))

If you only want the mean, replace the last line with:

dfx <- cast(dfx, v_cut + v_time + v_seed ~ variable, mean)

Type dfx and you'll see a data frame with what you asked for.
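If installing reshape is not an option, a similar per-group table can be sketched with base R's aggregate(). This is an alternative technique, not the melt + cast approach itself; the compact data frame and the cut() breaks are assumptions for illustration, and only v_rank is summarized.

```r
# Compact copy of the example data (only the columns used below)
df <- data.frame(
  v_seed  = rep(c(125648, -11152), each = 8),
  v_time  = rep(c(0, 10, 0, 10), each = 4),
  v_rank  = rep(1:4, times = 4),
  v_perco = c(15, 5, 6, 1, 5, 15, 6, 1, 17, 15, 6, 14, 17, 15, 14, 6)
)
df$v_cut <- cut(df$v_perco, breaks = c(0, 10, 20), include.lowest = TRUE)

# One row per (v_cut, v_time, v_seed) group; the four statistics are
# returned together as a matrix column named v_rank
res <- aggregate(v_rank ~ v_cut + v_time + v_seed, data = df,
                 FUN = function(x) c(mean = mean(x), min = min(x),
                                     max = max(x), sd = sd(x)))
res
```

The statistics end up in a single matrix column; do.call(data.frame, res) flattens it into separate columns if that is more convenient.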


乱世争霸 2024-10-03 16:31:41

You're just having a problem with syntax, that's all:

## ddply() comes from the plyr package
library(plyr)
## Add your cut
df.new <- data.frame(df, ifelse(df$v_perco > 10,"(10,20]", "[0,10)"))
## Rename v_cut
colnames(df.new)[ncol(df.new)] <- "v_cut"

## Careful here, read the note below
df.new <- ddply(df.new, .(v_idn, v_time), function(x) unique(data.frame(
  mean  = mean(x$v_rank),
  v_cut = x$v_cut
)))

Alternatively:

ddply(df.new, .(v_idn, v_time), summarise, mean=mean(v_rank))

With ".(v_idn, v_time)" you're telling ddply that for each combination of v_idn and v_time, you want it to calculate the mean of v_rank.

About the author: 蓝海似她心
