Applying aggregate functions over multiple different slices

Posted on 2024-10-17 13:03:58

I have a data frame that contains some information about people and projects:

person_id | project_id | action | time
--------------------------------------
        1 |          1 |      w |    1
        1 |          2 |      w |    2
        1 |          3 |      w |    2
        1 |          3 |      r |    3
        1 |          3 |      w |    4
        1 |          4 |      w |    4
        2 |          2 |      r |    2
        2 |          2 |      w |    3

I'd like to augment this data with two more fields, "first_time" and "first_time_project", which identify, respectively, the first time any action by that person was seen and the first time that person saw any action on the given project. In the end, the data should look like this:

person_id | project_id | action | time | first_time | first_time_project
------------------------------------------------------------------------
        1 |          1 |      w |    1 |          1 |                  1
        1 |          2 |      w |    2 |          1 |                  2
        1 |          3 |      w |    2 |          1 |                  2
        1 |          3 |      r |    3 |          1 |                  2
        1 |          3 |      w |    4 |          1 |                  2
        1 |          4 |      w |    4 |          1 |                  4
        2 |          2 |      r |    2 |          2 |                  2
        2 |          2 |      w |    3 |          2 |                  2

My naive way of doing this is to write a couple of loops:

for (pid in unique(data$person_id)) {
    # rows belonging to this person
    person_rows <- data$person_id == pid
    data[person_rows, "first_time"] <- min(data[person_rows, "time"])
    for (projid in unique(data[person_rows, "project_id"])) {
        # rows for this person on this project
        rows <- person_rows & data$project_id == projid
        data[rows, "first_time_project"] <- min(data[rows, "time"])
    }
}

Now, it doesn't take a genius to see that this is going to be glacially slow with the nested loops. However, I can't figure out a better way to handle this in R. I'm essentially emulating SQL's GROUP BY. I know that by might be able to help, but I can't figure out how to slice on more than one grouping at a time.

Any hints on how to take my code from glacially slow to something a bit faster? I'd be happy with a snail right now.

薄凉少年不暖心 2024-10-24 13:03:58

Try ave, which applies a function within groups and returns a vector as long as its input:

data <- transform(data,
   first_time = ave(time, person_id, FUN = min),
   first_time_project = ave(time, person_id, project_id, drop = TRUE, FUN = min)
)
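
As a quick check, the sketch below rebuilds the sample table from the question and confirms that the two ave calls reproduce both new columns. The literal values are copied from the question, the expected vectors are read off its desired output, and the name result is only illustrative. The drop = TRUE argument is passed through to interaction(), so min is not called on person/project combinations that never occur.

```r
# Sample rows copied from the question
data <- data.frame(
  person_id  = c(1, 1, 1, 1, 1, 1, 2, 2),
  project_id = c(1, 2, 3, 3, 3, 4, 2, 2),
  action     = c("w", "w", "w", "r", "w", "w", "r", "w"),
  time       = c(1, 2, 2, 3, 4, 4, 2, 3)
)

# Grouped minimum per person, and per person/project pair
result <- transform(data,
  first_time         = ave(time, person_id, FUN = min),
  first_time_project = ave(time, person_id, project_id, drop = TRUE, FUN = min)
)

# Compare against the desired output listed in the question
stopifnot(
  all(result$first_time         == c(1, 1, 1, 1, 1, 1, 2, 2)),
  all(result$first_time_project == c(1, 2, 2, 2, 2, 4, 2, 2))
)
print(result)
```

Because ave returns one value per input row, the original row order is preserved and no join step is needed.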

∞梦里开花 2024-10-24 13:03:58

The combination of Hadley's plyr and transform() is powerful. If I understand your question correctly, then:

library(plyr)

foo <- ddply(foo, .(person_id), transform, first_time=min(time))
foo <- ddply(foo, .(person_id, project_id), transform, 
  first_time_project=min(time))

旧伤还要旧人安 2024-10-24 13:03:58

If speed is what you are looking for, then data.table is the way to go.

library(data.table)
DT <- data.table(foo)
DT[, first_time := min(time), by = person_id]
DT[, first_time_project := min(time), by = list(person_id, project_id)]
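
To see how much speed matters on your own data, a rough timing harness could look like the sketch below. The synthetic data and its sizes are arbitrary assumptions, and the names n, ref, base_time and dt_time are illustrative. It uses base R's ave as the reference result and only times data.table if that package happens to be installed, so no timings are claimed here.

```r
# Synthetic data; the sizes are arbitrary assumptions for illustration
set.seed(1)
n <- 20000
foo <- data.frame(
  person_id  = sample(1:200, n, replace = TRUE),
  project_id = sample(1:50, n, replace = TRUE),
  time       = sample(1:1000, n, replace = TRUE)
)

# Base R grouped minimum, used as the reference result
base_time <- system.time(
  ref <- ave(foo$time, foo$person_id, foo$project_id, drop = TRUE, FUN = min)
)
print(base_time)

# Time the data.table version only when the package is available
if (requireNamespace("data.table", quietly = TRUE)) {
  DT <- data.table::as.data.table(foo)
  dt_time <- system.time(
    DT[, first_time_project := min(time), by = list(person_id, project_id)]
  )
  print(dt_time)
  # Both approaches should agree row for row
  stopifnot(all(DT$first_time_project == ref))
}
```

Note that := adds the column to DT by reference, so there is no need to assign the result back.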

亢潮 2024-10-24 13:03:58

Quick and dirty solution with no loops:

library(plyr)

# function to get first time by any person/project
fp <- function(dat) {
    dat$first_time <- min(dat$time)
    ftp <- function(d) {
        d$first_time_project <- min(d$time)
        d
    }
    ddply(dat, .(project_id), ftp)
}

# this single call should give you the result you want
result <- ddply(data, .(person_id), fp)

歌入人心 2024-10-24 13:03:58

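A quick way I can think of, using aggregate and match:

foo <- data.frame(
       person_id=rep(1:5,each=6),
       project_id=sample(1:5,30,TRUE),
       time=sample(1:30))

# earliest time per person, looked up for every row
first_time <- aggregate(foo$time, list(foo$person_id), min)
foo$first_time <- first_time[match(foo$person_id, first_time[,1]), 2]

# earliest time per person/project pair, looked up via a combined key
first_proj <- aggregate(foo$time, list(foo$person_id, foo$project_id), min)
foo$first_time_project <- first_proj[
       match(paste(foo$person_id, foo$project_id),
             paste(first_proj[,1], first_proj[,2])), 3]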
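
The same group-then-look-up idea can also be written with aggregate and merge, which maps closely onto SQL's GROUP BY followed by a JOIN. The following is a minimal sketch on the sample rows from the question; the names by_person, by_project and out are illustrative and not part of the original answer.

```r
# Sample rows copied from the question
foo <- data.frame(
  person_id  = c(1, 1, 1, 1, 1, 1, 2, 2),
  project_id = c(1, 2, 3, 3, 3, 4, 2, 2),
  action     = c("w", "w", "w", "r", "w", "w", "r", "w"),
  time       = c(1, 2, 2, 3, 4, 4, 2, 3)
)

# GROUP BY person_id: one row per person with the earliest time
by_person <- aggregate(time ~ person_id, data = foo, FUN = min)
names(by_person)[names(by_person) == "time"] <- "first_time"

# GROUP BY person_id, project_id: one row per pair with the earliest time
by_project <- aggregate(time ~ person_id + project_id, data = foo, FUN = min)
names(by_project)[names(by_project) == "time"] <- "first_time_project"

# JOIN the two summaries back onto the detail rows
out <- merge(foo, by_person, by = "person_id")
out <- merge(out, by_project, by = c("person_id", "project_id"))

# merge() may reorder rows, so sort explicitly before inspecting
out <- out[order(out$person_id, out$project_id, out$time), ]
print(out)
```

Unlike ave, merge does not promise to keep the original row order, which is why the explicit sort is there.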
