当使用大型数据集并应用使用多个数据列的功能时,我一直遇到一个计算时间问题。这是一个简化的示例:
require(data.table)
main_dt <- data.table(Start=(1:2), End=c(2,2))
lookup_dt <- data.table(Year = 1:3, Amount = c(10,20,30))
so:
> main_dt
Start End
1: 1 2
2: 2 2
我想要的
> lookup_dt
Year Amount
1: 1 10
2: 2 20
3: 3 30
是在data.table main_dt 中添加一列,其中包括 start 和之间的年数量。结束。所以类似的东西,但是有一个引用错误。
main_dt[, Amount := lookup_dt[ Year >= start & Year <= end, sum(Amount)]]
Warning messages:
1: In Year >= Start : longer object length is not a multiple of shorter object length
2: In Year <= End : longer object length is not a multiple of shorter object length
如果金额仅取决于一个变量,我可以做:
main_dt[, GreaterAmount := lapply(Start, function(x) return (lookup_dt[Year >= x, sum(Amount)]))]
要
> main_dt
Start End Amount GreaterAmount
1: 1 2 30 60
2: 2 2 30 50
:可以,但是
- 我需要应用一个实际上取决于 start 和的函数 end
- 为每个行应用一个函数,会迅速减慢计算过程。矢量化解决方案将不胜感激。
期待任何建议!
谢谢你!
马克斯
I have been running into a computation time issue when working with large datasets and applying functions that use multiple columns of the data.table as inputs. Here is a simplified example:
require(data.table)
main_dt <- data.table(Start=(1:2), End=c(2,2))
lookup_dt <- data.table(Year = 1:3, Amount = c(10,20,30))
So:
> main_dt
Start End
1: 1 2
2: 2 2
And
> lookup_dt
Year Amount
1: 1 10
2: 2 20
3: 3 30
What I want is to add a column in data.table main_dt that includes the sum of Amounts of Years between Start and End. So something like this, but there is a referencing error.
main_dt[, Amount := lookup_dt[ Year >= start & Year <= end, sum(Amount)]]
Warning messages:
1: In Year >= Start : longer object length is not a multiple of shorter object length
2: In Year <= End : longer object length is not a multiple of shorter object length
If the Amount would depend only on one variable, I could do:
main_dt[, GreaterAmount := lapply(Start, function(x) return (lookup_dt[Year >= x, sum(Amount)]))]
to get:
> main_dt
Start End Amount GreaterAmount
1: 1 2 30 60
2: 2 2 30 50
Which would be ok, but
- I need to apply a function that actually depends on both Start and End
- Applying a function for every row individually slows down computation process rapidly. A vectorized solution would be highly appreciated.
Looking forward to any suggestions!
Thank you!
Markus
发布评论
评论(1)
首先,加入列应该是同一类,因此我们可以将
main_dt $ end
转换为整数,或main_df $ start
和lookup_dt $年
到数字。我将选择第一个:从这里,我们可以进行加入分配:
如果您对
data.table
有点熟悉,请注意,.sd
实际上是main_dt
的内容,因此lookup_dt [.sd,...]
有效地是“ main_dt fuld of lookup_dt” 。从那里,on =
应该是正常的,sum(量)
是您要汇总的。这里唯一介绍的新事物是使用by = .easti
,这可能会令人困惑。为此链接:First, the join columns should be the same class, so we can either convert
main_dt$End
to integer, ormain_df$Start
andlookup_dt$Year
to numeric. I'll choose the first:From here, we can do a joining-assignment:
If you're somewhat familiar with
data.table
, note that.SD
referenced is actually the contents ofmain_dt
, solookup_dt[.SD,...]
is effectively "main_dt left join lookup_dt". From there, theon=
should be normal, andsum(Amount)
is what you want to aggregate. The only new thing introduced here is the use ofby=.EACHI
, which can be confusing; some links for that: