Applying an aggregate function on multiple different slices
I have a data array that contains some information about people and projects as such:
person_id | project_id | action | time
--------------------------------------
1 | 1 | w | 1
1 | 2 | w | 2
1 | 3 | w | 2
1 | 3 | r | 3
1 | 3 | w | 4
1 | 4 | w | 4
2 | 2 | r | 2
2 | 2 | w | 3
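For reproducibility, here is the same table as a data frame (the exact column types are assumed):

# reconstruct the example data shown above
data <- data.frame(
  person_id  = c(1, 1, 1, 1, 1, 1, 2, 2),
  project_id = c(1, 2, 3, 3, 3, 4, 2, 2),
  action     = c("w", "w", "w", "r", "w", "w", "r", "w"),
  time       = c(1, 2, 2, 3, 4, 4, 2, 3)
)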
I'd like to augment this data with a couple more fields called "first_time" and "first_time_project" that collectively identify the first time any action by that person was seen and the first time that developer saw any action on the project. In the end, the data should look like this:
person_id | project_id | action | time | first_time | first_time_project
------------------------------------------------------------------------
1 | 1 | w | 1 | 1 | 1
1 | 2 | w | 2 | 1 | 2
1 | 3 | w | 2 | 1 | 2
1 | 3 | r | 3 | 1 | 2
1 | 3 | w | 4 | 1 | 2
1 | 4 | w | 4 | 1 | 4
2 | 2 | r | 2 | 2 | 2
2 | 2 | w | 3 | 2 | 2
My naive way of doing this is to write a couple of loops:
for (pid in unique(data$person_id)) {
  # first time this person did anything at all
  data[data$person_id == pid, "first_time"] <- min(data[data$person_id == pid, "time"])
  for (projid in unique(data[data$person_id == pid, "project_id"])) {
    # first time this person did anything on this particular project
    data[data$person_id == pid & data$project_id == projid, "first_time_project"] <-
      min(data[data$person_id == pid & data$project_id == projid, "time"])
  }
}
Now, it doesn't take a genius to see that this is going to be glacially slow with the doubly nested loops. However, I can't figure out a better way to handle this in R; I'm essentially trying to emulate SQL's GROUP BY. I know that by might be able to help, but I can't figure out how to do multiple slices.
Any hints on how to take my code from glacially slow to something a bit faster? I'd be happy with a snail right now.
5 Answers
Try ave:
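A sketch of the ave approach, assuming the data frame data reconstructed in the question:

# ave() applies FUN within each group and returns a vector of the same length and order as its input
data$first_time <- ave(data$time, data$person_id, FUN = min)

# grouping by both person and project gives the per-project first time
data$first_time_project <- ave(data$time, data$person_id, data$project_id, FUN = min)

Because ave preserves the row order, nothing needs to be merged back afterwards.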
The combination of Hadley's plyr and transform() is powerful. If I correctly understand your question, then:
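A sketch of what that could look like, assuming the data frame data from the question; ddply() splits the data by the grouping columns and transform() adds a column within each piece:

library(plyr)

# per-person first time
data <- ddply(data, .(person_id), transform, first_time = min(time))

# per-person, per-project first time
data <- ddply(data, .(person_id, project_id), transform, first_time_project = min(time))

Note that ddply() returns the rows ordered by the grouping columns, which may differ from the original order.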
If speed is what you are looking for, then data.table is the way to go.
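A sketch using data.table, again assuming the data frame data from the question; := adds the new columns by reference, and by does the grouping:

library(data.table)

DT <- as.data.table(data)
# per-person first time
DT[, first_time := min(time), by = person_id]
# per-person, per-project first time
DT[, first_time_project := min(time), by = list(person_id, project_id)]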
Quick and dirty solution with no loops:
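One loop-free possibility (a sketch with base R tapply() and match(), assuming the data frame data from the question):

# per-person minimum time, looked up back onto every row
first_person <- tapply(data$time, data$person_id, min)
data$first_time <- unname(first_person[match(data$person_id, names(first_person))])

# per person/project minimum time, using a combined key
key <- paste(data$person_id, data$project_id, sep = ".")
first_project <- tapply(data$time, key, min)
data$first_time_project <- unname(first_project[match(key, names(first_project))])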
A quick way I can think of:
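A sketch based on aggregate() and merge(), assuming the data frame data from the question (note that merge() does not preserve the original row order):

# per-person first time as a small summary table
ft <- aggregate(time ~ person_id, data = data, FUN = min)
names(ft)[names(ft) == "time"] <- "first_time"

# per person/project first time
ftp <- aggregate(time ~ person_id + project_id, data = data, FUN = min)
names(ftp)[names(ftp) == "time"] <- "first_time_project"

# merge both summaries back onto the original rows
data <- merge(data, ft, by = "person_id")
data <- merge(data, ftp, by = c("person_id", "project_id"))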