Classify or cut a data frame by a list of class intervals and summarize it with ddply
I have a question about ddply and subset.
I have a data frame df like this:
df <- read.table(textConnection(
" id v_idn v_seed v_time v_pop v_rank v_perco
1 15 125648 0 150 1 15
2 17 125648 0 120 2 5
3 18 125648 0 100 3 6
4 52 125648 0 25 4 1
5 17 125648 10 220 1 5
6 15 125648 10 160 2 15
7 18 125648 10 110 3 6
8 52 125648 10 50 4 1
9 56 -11152 0 250 1 17
10 15 -11152 0 180 2 15
11 18 -11152 0 110 3 6
12 22 -11152 0 5 4 14
13 56 -11152 10 250 1 17
14 15 -11152 10 180 2 15
15 22 -11152 10 125 3 14
16 18 -11152 10 120 4 6 "), header=TRUE)
STEP ONE:
I have a list of equal-width intervals built with cut_interval, like this:
myinterval <- cut_interval(c(15,5,6,1,17,14), length=10)
So I have two levels here: [0,10) and (10,20]
STEP TWO:
I want each group/class to be defined by my two levels in v_cut, like this:
id v_idn v_seed v_time v_pop v_rank v_perco v_cut
1 15 125648 0 150 1 15 (10,20]
2 17 125648 0 120 2 5 [0,10)
3 18 125648 0 100 3 6 [0,10)
4 52 125648 0 25 4 1 [0,10)
5 17 125648 10 220 1 5 [0,10)
6 15 125648 10 160 2 15 (10,20]
7 18 125648 10 110 3 6 [0,10)
8 52 125648 10 50 4 1 [0,10)
9 56 -11152 0 250 1 17 (10,20]
10 15 -11152 0 180 2 15 (10,20]
11 18 -11152 0 110 3 6 [0,10)
12 22 -11152 0 5 4 14 (10,20]
13 56 -11152 10 250 1 17 (10,20]
14 15 -11152 10 180 2 15 (10,20]
15 22 -11152 10 125 3 14 (10,20]
16 18 -11152 10 120 4 6 [0,10)
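One way to get from step 1 to step 2 can be sketched with base R's `cut()` and explicit break points. This is only a sketch: the breaks `c(0, 10, 20)` are an assumption chosen to match the two levels above, and only the columns needed here are rebuilt from the sample data. Note that with these arguments base `cut()` labels the first bin `[0,10]` rather than `[0,10)`.

```r
# Reduced copy of the sample data: only the columns needed for this step.
df <- data.frame(
  v_idn   = c(15, 17, 18, 52, 17, 15, 18, 52, 56, 15, 18, 22, 56, 15, 22, 18),
  v_perco = c(15, 5, 6, 1, 5, 15, 6, 1, 17, 15, 6, 14, 17, 15, 14, 6)
)

# Assumed break points: two bins of width 10 covering 0..20.
# include.lowest = TRUE keeps a value equal to the lowest break in the first bin.
df$v_cut <- cut(df$v_perco, breaks = c(0, 10, 20), include.lowest = TRUE)

levels(df$v_cut)  # the two class labels
table(df$v_cut)   # number of rows falling in each class
```

Each row now carries its class in `v_cut`, so the column can be used as a grouping variable like any other.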
STEP THREE:
I want to look at the variability of v_rank (x axis) against time (y axis) for each v_cut group, so I need to compute the min, mean, max, and sd of v_rank with something like:
ddply(df, .(v_cut,v_time), summarize ,mean = mean(v_rank), min = min(v_rank), max = max(v_rank), sd = sd(v_rank))
*RESULT WANTED:*
id v_time MEAN.v_rank ... v_cut
1 0 2.25 (10,20]
2 0 2.42 [0,10)
3 10 2.25 [0,10)
4 10 2.42 (10,20]
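Once v_cut exists as a column, the step 3 call should work as written. A minimal sketch, assuming the plyr package is installed and that v_cut is built with base `cut()` on assumed breaks `c(0, 10, 20)`; the numbers it produces come from the sample data and will not match the illustrative values in the table above.

```r
library(plyr)

# Reduced copy of the sample data: only the columns used here.
df <- data.frame(
  v_time  = rep(rep(c(0, 10), each = 4), times = 2),
  v_rank  = rep(1:4, times = 4),
  v_perco = c(15, 5, 6, 1, 5, 15, 6, 1, 17, 15, 6, 14, 17, 15, 14, 6)
)
df$v_cut <- cut(df$v_perco, breaks = c(0, 10, 20), include.lowest = TRUE)

# One output row per (v_cut, v_time) combination present in the data.
sumData <- ddply(df, .(v_cut, v_time), summarize,
                 mean = mean(v_rank),
                 min  = min(v_rank),
                 max  = max(v_rank),
                 sd   = sd(v_rank))
sumData
```

The result has the grouping columns first, followed by one column per summary expression.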
MY PROBLEM
I don't know how to get from step 1 to step 2.
Is it possible to group by v_cut as in my step 3 example?
Is it possible to do the same thing with the "subset" option of ddply?
Once again, thanks a lot for your help, great R gurus!
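One possible reading of the "subset" question is to filter the rows with base `subset()` first and then summarize what is left with ddply. The sketch below assumes that reading, assumes plyr is installed, and again uses the assumed breaks `c(0, 10, 20)`; the name `sumHigh` is made up for the example.

```r
library(plyr)

# Reduced copy of the sample data: only the columns used here.
df <- data.frame(
  v_time  = rep(rep(c(0, 10), each = 4), times = 2),
  v_rank  = rep(1:4, times = 4),
  v_perco = c(15, 5, 6, 1, 5, 15, 6, 1, 17, 15, 6, 14, 17, 15, 14, 6)
)
df$v_cut <- cut(df$v_perco, breaks = c(0, 10, 20), include.lowest = TRUE)

# Keep only one class, then summarize v_rank per time step.
high <- subset(df, v_cut == "(10,20]")
sumHigh <- ddply(high, .(v_time), summarize,
                 mean = mean(v_rank),
                 min  = min(v_rank),
                 max  = max(v_rank),
                 sd   = sd(v_rank))
sumHigh
```

Grouping by v_cut directly gives the same figures for every class in one call, so filtering first is mainly useful when only one class is of interest.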
UPDATE 1:
I have an answer for going from step 1 to step 2:
df$v_cut <- cut_interval(df$v_perco,n=10)
I'm using plyr, but perhaps there is a better answer in this case?
Is there an answer for going from step 2 to step 3?
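A detail worth checking in the update: in `cut_interval`, `n` is the number of intervals while `length` is the width of each interval, so `n=10` asks for ten classes rather than the two classes of width 10 from step 1. A small sketch of the difference, assuming the ggplot2 package (which provides `cut_interval`) is installed; the names `by_n` and `by_length` are made up for the example.

```r
library(ggplot2)

# The v_perco column from the sample data.
v_perco <- c(15, 5, 6, 1, 5, 15, 6, 1, 17, 15, 6, 14, 17, 15, 14, 6)

by_n      <- cut_interval(v_perco, n = 10)       # ten equal-width classes
by_length <- cut_interval(v_perco, length = 10)  # classes of width 10

nlevels(by_n)       # number of classes when n = 10
nlevels(by_length)  # number of classes when length = 10
```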
UPDATE 2:
Brandon Bertelsen gave me a good answer with melt + cast, but now (to understand) I want to do the same operation with plyr and ddply, with a different result:
id v_idn v_time MEAN.v_rank ... v_cut
1 15 0 2.25 (10,20]
2 15 10 2.45 (10,20]
2 17 0 1.52 [0,10)
2 17 10 2.42 [0,10)
etc.
I'm trying something like this:
r('sumData <- ddply(df, .(v_idn,v_time), summarize,min = min(v_rank),mean = mean(v_rank), max = max(v_rank), sd=sd(v_rank))')
But I want to have v_cut in my sumData data frame. How can I do that with ddply? Is there an option for this? Or is merging with the initial df on key = v_idn, to add the v_cut column to sumData, the only good answer?
Comments (2)
You don't really need plyr for this; you can use reshape.
If you only want the mean, then replace the last line with:
Type "dfx" and you'll see a data frame with what you asked for.
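The code from this answer is not preserved here. As a rough sketch of what a melt + cast approach could look like (not the answerer's actual code), the example below assumes the reshape2 package, the successor to reshape, is installed, and builds v_cut with base `cut()` on assumed breaks `c(0, 10, 20)`. The name `dfx` is borrowed from the answer; `long` is made up for the example.

```r
library(reshape2)

# Reduced copy of the sample data: only the columns used here.
df <- data.frame(
  v_time  = rep(rep(c(0, 10), each = 4), times = 2),
  v_rank  = rep(1:4, times = 4),
  v_perco = c(15, 5, 6, 1, 5, 15, 6, 1, 17, 15, 6, 14, 17, 15, 14, 6)
)
df$v_cut <- cut(df$v_perco, breaks = c(0, 10, 20), include.lowest = TRUE)

# Long format: one row per (v_cut, v_time, variable) observation.
long <- melt(df, id.vars = c("v_cut", "v_time"), measure.vars = "v_rank")

# Cast back to wide, aggregating v_rank with the mean in each group.
dfx <- dcast(long, v_cut + v_time ~ variable,
             fun.aggregate = mean, value.var = "value")
dfx
```

Swapping `mean` for another function such as `sd` or `max` changes the statistic; getting several statistics at once is where ddply tends to be more convenient.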
You're just having a problem with syntax, that's all:
Alternatively:
With ".(v_idn, v_time)" you're telling ddply that, for each combination of v_idn and v_time, you want it to calculate the mean of v_rank.
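The code from this answer is not preserved here either. One way to keep v_cut in the summary, sketched under the assumption that plyr is installed and that v_cut is built with base `cut()` on assumed breaks `c(0, 10, 20)`, is to add it to the grouping variables. In the sample data v_perco is constant within each v_idn, so v_cut is too, and adding it does not split any group further.

```r
library(plyr)

# Reduced copy of the sample data: only the columns used here.
df <- data.frame(
  v_idn   = c(15, 17, 18, 52, 17, 15, 18, 52, 56, 15, 18, 22, 56, 15, 22, 18),
  v_time  = rep(rep(c(0, 10), each = 4), times = 2),
  v_rank  = rep(1:4, times = 4),
  v_perco = c(15, 5, 6, 1, 5, 15, 6, 1, 17, 15, 6, 14, 17, 15, 14, 6)
)
df$v_cut <- cut(df$v_perco, breaks = c(0, 10, 20), include.lowest = TRUE)

# Grouping columns are carried into the result, so v_cut appears in sumData.
sumData <- ddply(df, .(v_idn, v_time, v_cut), summarize,
                 min  = min(v_rank),
                 mean = mean(v_rank),
                 max  = max(v_rank),
                 sd   = sd(v_rank))
sumData
```

Groups that contain a single row get `NA` for `sd`, since a standard deviation needs at least two values. If v_cut could vary within a v_idn in other data, this grouping would produce extra rows, and a merge on v_idn would no longer be well defined either.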