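Select the row with the maximum value of a variable within each group in R
a.2 <- sample(1:10, 100, replace = TRUE)
b.2 <- sample(1:100, 100, replace = TRUE)
a.3 <- data.frame(a.2, b.2)
r <- sapply(split(a.3, a.2), function(x) which.max(x$b.2))
a.3[r, ]
This returns the index within each list element (each group), not the index for the entire data.frame.
I'm trying to return the largest value of b.2 for each subgroup of a.2. How can I do this efficiently?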
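To make the index problem concrete, here is a minimal sketch on a tiny made-up data frame (the name `df` and its values are illustrative assumptions, not part of the question). It shows that `which.max` inside `split` reports positions within each group, and one possible way to keep positions in the full data frame by splitting the row numbers instead.

```r
# Tiny deterministic stand-in for a.3 (values are made up for illustration)
df <- data.frame(a.2 = c(1, 2, 1, 2, 1), b.2 = c(10, 40, 30, 20, 5))

# which.max() runs on each split piece, so it reports positions inside that piece
r <- sapply(split(df, df$a.2), function(x) which.max(x$b.2))
print(r)

# Splitting the row numbers instead keeps positions in the full data frame
idx <- sapply(split(seq_len(nrow(df)), df$a.2),
              function(i) i[which.max(df$b.2[i])])
print(df[idx, ])
```

Here `df[r, ]` would pick rows 2 and 1, which are not the group maxima, while `df[idx, ]` picks rows 3 and 2.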
Comments (6)
The `ddply` and `ave` approaches are both fairly resource-intensive, I think. `ave` fails by running out of memory for my current problem (67,608 rows, with four columns defining the unique keys). `tapply` is a handy choice, but what I generally need to do is select the entire rows holding the most extreme value (largest, smallest, and so on) of some column for each unique key (usually defined by more than one column). The best solution I've found is to do a sort and then use the negation of `duplicated` to select only the first row for each unique key. For the simple example here, that means ordering by a.2 and then by descending b.2, and keeping the rows where `duplicated` on a.2 is FALSE.
I think the performance gains over `ave` or `ddply`, at least, are substantial. It is slightly more complicated for multi-column keys, but `order` accepts any number of columns to sort on and `duplicated` works on data frames, so it's possible to continue using this approach.
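The sort-and-`duplicated` approach described above could look roughly like this. It is a sketch under assumptions: the seed and the names `sorted` and `highs` are illustrative, and `key1`/`key2` in the final comment are hypothetical column names.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable
a.2 <- sample(1:10, 100, replace = TRUE)
b.2 <- sample(1:100, 100, replace = TRUE)
a.3 <- data.frame(a.2, b.2)

# Sort by group, then by b.2 descending, so each group's maximum comes first
sorted <- a.3[order(a.3$a.2, -a.3$b.2), ]

# duplicated() is FALSE for the first row of each key; negating it keeps that row
highs <- sorted[!duplicated(sorted$a.2), ]
print(highs)

# Multi-column keys (hypothetical columns key1, key2) would follow the same pattern:
# sorted <- d[order(d$key1, d$key2, -d$value), ]
# highs  <- sorted[!duplicated(sorted[, c("key1", "key2")]), ]
```

When a group's maximum is tied, this keeps only one row per key, namely whichever tied row sorts first.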
The answer by Jonathan Chang gets you what you explicitly asked for, but I'm guessing that you want the actual row from the data frame.
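One way to pull the actual rows is sketched here with `ave`. This is an assumption and may not be what this answer originally used. The idea is to compare each b.2 against its group's maximum and keep the rows where they match; the names `sel` and `top_rows` are illustrative.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
a.2 <- sample(1:10, 100, replace = TRUE)
b.2 <- sample(1:100, 100, replace = TRUE)
a.3 <- data.frame(a.2, b.2)

# ave() returns a vector as long as b.2 holding each row's group maximum
sel <- ave(a.3$b.2, a.3$a.2, FUN = max) == a.3$b.2

# Logical indexing keeps the original rows (and row names) of a.3
top_rows <- a.3[sel, ]
print(top_rows[order(top_rows$a.2), ])
```

Unlike a sort followed by `duplicated`, this keeps every tied row when a group's maximum occurs more than once.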
This does the trick, albeit in a somewhat cumbersome way, but it lets me grab the rows with the largest value in each group. Any other ideas?
With `aggregate`, you can get the maximum for each group in one line. The result has one row per group, holding the group's a.2 value and its maximum b.2.
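A minimal sketch of the `aggregate` call, assuming the formula interface and an arbitrary seed (the name `group_max` is illustrative):

```r
set.seed(7)  # arbitrary seed so the sketch is repeatable
a.2 <- sample(1:10, 100, replace = TRUE)
b.2 <- sample(1:100, 100, replace = TRUE)
a.3 <- data.frame(a.2, b.2)

# One row per a.2 group, with b.2 holding that group's maximum
group_max <- aggregate(b.2 ~ a.2, data = a.3, FUN = max)
print(group_max)
```

This returns the group maxima only. If the data frame had further columns, the full rows would still need to be recovered, for example by merging `group_max` back onto `a.3`.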