How to efficiently aggregate sparse data

I have a large dataset with 1,008,412 observations. The columns are customer_id (int), visit_date (Date, format: "2010-04-04"), and visit_spend (float).

This date function maps the weeks of interest onto the range 13-65, by taking the week of the year and adding a 52-week offset for each year after 2010:

weekofperiod <- function(dt) {
    as.numeric(format(as.Date(dt), "%W")) + 52 * (as.numeric(format(as.Date(dt), "%Y"))-2010)
}
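
For example (a quick illustration using two of the sample dates below, not part of the original post):

weekofperiod("2010-04-07")  # 14: week 14 of 2010, no offset
weekofperiod("2011-03-15")  # 63: week 11 of 2011, plus the 52-week offset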

Each customer_id has a variable number of visits over the 53-week period.
For each customer_id, I want the aggregate spend_per_week, grouped by weekofperiod().
The code below is functionally correct but very slow - any suggestions to make it faster?
Also, aggregate() produces sparse output in which weeks without visits are missing, so I initialize spend_per_week to 0 and then manually assign the non-zero results from aggregate() row by row, to make sure the result always has 53 rows. Surely that can be done better?

Sample dataset lines look like:

   customer_id visit_date visit_spend 
72          40 2011-03-15       18.38 
73          40 2011-03-20       23.45  
74          79 2010-04-07      150.87 
75          79 2010-04-17      101.90 
76          79 2010-05-02      111.90 

and here's the code with the aggregate() call and the adjustment for empty weeks:

for (cid in all_tt_cids) {
  print_pnq('Getting statistics for cid', cid)

  # Get row indices of the selected subset, for just this cid's records
  I <- which(tt$customer_id==cid & tt$visit_date<="2011-03-31")

  # (other code to compute other per-cid statistics)

  # spend_per_week (mode;mean;sd)
  # Aggregate spend_per_week, but beware it should be 0 for weeks with no visits
  spend_per_week <- data.frame(c(list('weekofperiod'=13:65), list('spendperweek'=0)) )
  nonzero_spends_per_week <- aggregate(tt$visit_spend[I], list('weekofperiod'=weekofperiod(tt$visit_date[I])), FUN="sum")
  for (i in 1:nrow(nonzero_spends_per_week)) {
    spend_per_week[spend_per_week$weekofperiod==nonzero_spends_per_week[i,1],2] <- nonzero_spends_per_week[i,2]
  }
  colnames(spend_per_week)[2] <- 'spend_per_week'

  # (code to compute and store per-cid statistics on spend_per_week)

}

Answer from 夜夜流光相皎洁 (2024-12-20 01:32:10):

Your biggest speed-up will come from replacing the for loops. I can't quite tell from your example, since you overwrite spend_per_week for each customer in the loop, but here's one way to do it if you want to keep the results for all customers.

For testing, first define functions for the original method, and a new method without loops:

weekofperiod <- function(dt) {
  as.numeric(format(as.Date(dt), "%W")) + 52 * (as.numeric(format(as.Date(dt), "%Y"))-2010)
}

FastMethod <- function(tt) {  
  tt$week = weekofperiod(tt$visit_date)
  # Sum spend per (customer_id, week) in one shot; tapply leaves NA where a
  # customer had no visits in a given week, so zero those out to match the spec
  spend_per_week.tmp = as.data.frame(tapply(tt$visit_spend, tt[,c(1,4)], sum))
  spend_per_week.tmp[is.na(spend_per_week.tmp)] = 0
  # Zero-filled frame with one row per customer and one column per week 13-65
  spend_per_week = data.frame(matrix(0, nrow=nrow(spend_per_week.tmp), ncol=length(13:65)))
  colnames(spend_per_week) = 13:65
  rownames(spend_per_week) = rownames(spend_per_week.tmp)
  spend_per_week[, colnames(spend_per_week.tmp)] = spend_per_week.tmp
  spend_per_week
}

OrigMethod <- function(tt) {
  all_tt_cids = unique(tt$customer_id)

  for (cid in all_tt_cids) {
    # Get row indices of the selected subset, for just this cid's records
    I <- which(tt$customer_id==cid & tt$visit_date<="2011-03-31")

    # Aggregate spend_per_week, but beware it should be 0 for weeks with no visits
    spend_per_week <- data.frame(c(list('weekofperiod'=13:65), list('spendperweek'=0)))
    nonzero_spends_per_week <- aggregate(tt$visit_spend[I], list('weekofperiod'=weekofperiod(tt$visit_date[I])), FUN="sum")
    for (i in 1:nrow(nonzero_spends_per_week)) {
      spend_per_week[spend_per_week$weekofperiod==nonzero_spends_per_week[i,1],2] <- nonzero_spends_per_week[i,2]
    }
    colnames(spend_per_week)[2] <- 'spend_per_week'
  }
  spend_per_week
}

Now simulate a larger dataset so it's easier to compare:

set.seed(1)  # make the simulated data reproducible
n.row  = 10^4
n.cust = 10^3

customer_id = 1:n.cust
dates = seq(as.Date('2010-04-01'), as.Date('2011-03-31'), by=1)
visit_date = sample(dates, n.row, replace=TRUE)
visit_spend = runif(n.row, 0, 200)

tt = data.frame(customer_id, visit_date, visit_spend)
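
As a quick sanity check (hypothetical usage, not part of the original answer), the fast version returns one row per customer and one zero-filled column per week:

res = FastMethod(tt)
dim(res)       # one row per customer in tt, 53 week columns ("13" to "65")
res["1", 1:5]  # weekly spend for customer 1 in weeks 13-17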

Finally, compare the two methods:

> system.time(FastMethod(tt))
   user  system elapsed 
  0.082   0.001   0.083 
> system.time(OrigMethod(tt))
   user  system elapsed 
  4.505   0.007   4.514 

This is already 50x faster, and I bet you can make it even better with more optimization. Good luck!
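
One such optimization (a sketch of my own, not from the original answer) is xtabs(), which builds the zero-filled customer-by-week table in a single call:

FastMethodXtabs <- function(tt) {
  # Fix the factor levels so all 53 weeks appear as columns, visits or not
  tt$week = factor(weekofperiod(tt$visit_date), levels=13:65)
  # xtabs sums visit_spend over each customer_id x week cell; empty cells are 0
  as.data.frame.matrix(xtabs(visit_spend ~ customer_id + week, data=tt))
}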

Answer from 萌无敌 (2024-12-20 01:32:10):

Here is a faster method using data.table, which is also easier to read.

FasterMethod <- function(tt){
  # LOAD LIBRARIES
  require(reshape2)
  require(data.table)
  tt <- transform(tt, week_of_period = weekofperiod(visit_date))

  # AGGREGATE SPEND BY CUSTOMER AND WEEK OF PERIOD
  tt <- data.table(tt)
  ans <- tt[,list(spend = sum(visit_spend)), 'customer_id, week_of_period']

  # RESHAPE TO CUSTOMER ID VS. WEEK OF PERIOD (fill = 0 for weeks with no visits)
  dcast(ans, customer_id ~ week_of_period, value.var = 'spend', fill = 0)
}

We can benchmark this against FastMethod and OrigMethod using rbenchmark, and see that we gain a 1.3x speedup over FastMethod and an overall speedup of about 70x:

library(rbenchmark)
benchmark(FastMethod(tt), FasterMethod(tt), replications = 40)

              test elapsed relative
    FastMethod(tt)   5.594 1.346654
  FasterMethod(tt)   4.154 1.000000

You can speed it up even further (2.5x compared to FastMethod) if you do not care about reshaping the final output to customer_id vs. week_of_period.
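
Presumably that means stopping after the grouped sum and skipping the dcast step entirely; a sketch of that variant (mine, using the newer by = .() syntax, with the result left in long format):

FastestMethod <- function(tt) {
  require(data.table)
  dt <- as.data.table(tt)
  dt[, week_of_period := weekofperiod(visit_date)]
  # One row per (customer, week) with at least one visit; no reshape step
  dt[, .(spend = sum(visit_spend)), by = .(customer_id, week_of_period)]
}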
