在另一个数据中查找数据。

发布于 2025-02-09 10:19:12 字数 1299 浏览 1 评论 0 原文

当使用大型数据集并应用使用多个数据列的功能时,我一直遇到一个计算时间问题。这是一个简化的示例:

require(data.table)
main_dt   <- data.table(Start=(1:2), End=c(2,2))
lookup_dt <- data.table(Year = 1:3, Amount = c(10,20,30))

so:

> main_dt
Start End
1:     1   2
2:     2   2

我想要的

> lookup_dt
   Year Amount
1:    1     10
2:    2     20
3:    3     30

是在data.table main_dt 中添加一列,其中包括 start 之间的年数量。结束。所以类似的东西,但是有一个引用错误。

main_dt[, Amount := lookup_dt[ Year >= start & Year <= end, sum(Amount)]]
Warning messages:
1: In Year >= Start :  longer object length is not a multiple of shorter object length
2: In Year <= End :  longer object length is not a multiple of shorter object length

如果金额仅取决于一个变量,我可以做:

main_dt[, GreaterAmount := lapply(Start, function(x) return (lookup_dt[Year >= x, sum(Amount)]))]

> main_dt
Start End Amount GreaterAmount
1:     1   2     30            60
2:     2   2     30            50

:可以,但是

  • 我需要应用一个实际上取决于 start 的函数 end
  • 为每个行应用一个函数,会迅速减慢计算过程。矢量化解决方案将不胜感激。

期待任何建议!

谢谢你! 马克斯

I have been running into a computation time issue when working with large datasets and applying functions that use multiple columns of the data.table as inputs. Here is a simplified example:

require(data.table)
main_dt   <- data.table(Start=(1:2), End=c(2,2))
lookup_dt <- data.table(Year = 1:3, Amount = c(10,20,30))

So:

> main_dt
Start End
1:     1   2
2:     2   2

And

> lookup_dt
   Year Amount
1:    1     10
2:    2     20
3:    3     30

What I want is to add a column in data.table main_dt that includes the sum of Amounts of Years between Start and End. So something like this, but there is a referencing error.

main_dt[, Amount := lookup_dt[ Year >= start & Year <= end, sum(Amount)]]
Warning messages:
1: In Year >= Start :  longer object length is not a multiple of shorter object length
2: In Year <= End :  longer object length is not a multiple of shorter object length

If the Amount would depend only on one variable, I could do:

main_dt[, GreaterAmount := lapply(Start, function(x) return (lookup_dt[Year >= x, sum(Amount)]))]

to get:

> main_dt
Start End Amount GreaterAmount
1:     1   2     30            60
2:     2   2     30            50

Which would be ok, but

  • I need to apply a function that actually depends on both Start and End
  • Applying a function for every row individually slows down computation process rapidly. A vectorized solution would be highly appreciated.

Looking forward to any suggestions!

Thank you!
Markus

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

酷炫老祖宗 2025-02-16 10:19:12

首先,加入列应该是同一类,因此我们可以将 main_dt $ end 转换为整数,或 main_df $ start lookup_dt $年到数字。我将选择第一个:

main_dt[, End := as.integer(End)]
main_dt
#    Start   End
#    <int> <int>
# 1:     1     2
# 2:     2     2

从这里,我们可以进行加入分配:

main_dt[, Amount := lookup_dt[.SD, sum(Amount), on = .(Year >= Start, Year <= End), by = .EACHI]$V1 ]
main_dt
#    Start   End Amount
#    <int> <int>  <num>
# 1:     1     2     30
# 2:     2     2     20

如果您对 data.table 有点熟悉,请注意, .sd 实际上是 main_dt 的内容,因此 lookup_dt [.sd,...] 有效地是“ main_dt fuld of lookup_dt” 。从那里, on = 应该是正常的, sum(量)是您要汇总的。这里唯一介绍的新事物是使用 by = .easti ,这可能会令人困惑。为此链接:

First, the join columns should be the same class, so we can either convert main_dt$End to integer, or main_df$Start and lookup_dt$Year to numeric. I'll choose the first:

main_dt[, End := as.integer(End)]
main_dt
#    Start   End
#    <int> <int>
# 1:     1     2
# 2:     2     2

From here, we can do a joining-assignment:

main_dt[, Amount := lookup_dt[.SD, sum(Amount), on = .(Year >= Start, Year <= End), by = .EACHI]$V1 ]
main_dt
#    Start   End Amount
#    <int> <int>  <num>
# 1:     1     2     30
# 2:     2     2     20

If you're somewhat familiar with data.table, note that .SD referenced is actually the contents of main_dt, so lookup_dt[.SD,...] is effectively "main_dt left join lookup_dt". From there, the on= should be normal, and sum(Amount) is what you want to aggregate. The only new thing introduced here is the use of by=.EACHI, which can be confusing; some links for that:

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文