如何避免循环
大家好, 我是 R 新手。
我有两个面板数据文件,其中包含“id”、“date”和“ret”列,
文件 A 的数据比文件 B 更多, 但我主要处理文件 B 数据。
“id”和“date”的组合是唯一的标识符。
有没有一种优雅的方式来查找 B 中的每个(id,日期),我需要从文件 A 中获取过去 10 天的 ret,并将它们存储回 B 中?
我天真的做法是循环 B 中的所有行,
for i in 1:length(B) {
B$past10d[i] <- prod(1+A$ret[which(A$id == B$id[i] & A$date > B$date[i]-10 & A$date < B$date[i])])-1
}
但循环需要很长时间。
真的很欣赏你的想法。
非常感谢。
HI All,
I'm new to R.
I have two panel data files, with columns "id", "date" and "ret"
file A has a lot more data than file B,
but i'm primarily working with file B data.
Combination of "id" and "date" is unqiue indentifier.
Is there an elegent way of looking up for each (id, date) in B, I need to get the past 10 days ret from file A, and store them back into B?
my naive way of doing it is to loop for all rows in B,
for i in 1:length(B) {
B$past10d[i] <- prod(1+A$ret[which(A$id == B$id[i] & A$date > B$date[i]-10 & A$date < B$date[i])])-1
}
but the loops takes forever.
Really appreciate your thoughts.
Thank you very much.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
你尝试过吗?合并?
“通过公共列或行名称合并两个数据框,或者执行其他版本的数据库连接操作。”
此外,如果您持续使用复合 PK 或任何其他内容作为唯一标识符,我建议使用一些本地 MySQL / PostgreSQL (RMySQL / RPostgreSQL) 数据库。对我来说,SQL 重新排列数据然后使用视图中的 data.frames 比循环容易得多。
Did you try ?merge ?
"Merge two data frames by common columns or row names, or do other versions of database join operations. "
Besides I suggest to use a little local MySQL / PostgreSQL (RMySQL / RPostgreSQL) database if you continously sport composite PKs or whatsoever as unique identifiers. To me SQL rearranging of data and afterwards using data.frames from view is a lot easier than looping.
我认为关键是向量化并使用
%in%
运算符对数据帧A
进行子集化。而且,我知道,价格不是随机数,但我不想编写随机游走代码...我使用paste
创建了股票日期索引,但我确信您可以使用来自plm
库中pdata.frame
的索引,这是我找到的最好的面板数据索引。I think the key is to vectorize and use the
%in%
operator to subset data frameA
. And, I know, prices are not random numbers, but I didn't want to code a random walk... I created a stock-date index usingpaste
, but I'm sure you could use the index frompdata.frame
in theplm
library, which is the best I've found for panel data.这更快吗? (我假设 B$id 和 B$date 的组合是一个唯一标识符,不会在任何地方复制 - 由您的代码暗示)
Is this any faster? (I am assuming the combination of B$id and B$date is a unique identifier not replicated anywhere - implied by your code)
如果您没有在 A 和 B 中复制数据,那么 rbind 是最简单的解决方案。
If you haven't got data that is replicated in both A and B, then
rbind
is the simplest solution.一般来说,您应该避免在 R 中循环。如果您的代码对向量进行操作,速度会快得多。
我会按照 ran2 的建议使用合并。您可以设置
all.x = T
(或all.y
或all
)来获取其中一个(或另一个或两者)的所有行- 数据框。这很快,并且通常会自行计算出要匹配的字段。否则,您需要指定by.x
(以及by.y
或by
)作为查找字段。听起来您可能需要自己创建这个字段(根据约翰的评论)。然后您可以按日期进行过滤。
In general, you ought to avoid looping in R. It's much quicker if your code operates on vectors.
I would use merge, as suggested by ran2. You can set
all.x = T
(orall.y
orall
) to get all the rows from one (or other or both) - data frames. This is quick and will typically work-out which fields to match by itself. Otherwise you'll need to specifyby.x
(andby.y
orby
) as a lookup field. By the sounds of it you may need to create this field yourself (as per John's comment).You can then filter by date.
鉴于您遇到内存问题,也许首先减少 A 可能会有所帮助。首先,删除无关的 id。
完全减少A数据集仍然想抢占更多内存。如果不存储一些变量,这是不可能的。尽管如此,我希望我们可以通过删除低于绝对最小值和高于绝对最大值的每个日期来摆脱它。
当然,如果不通过 id 进行限定,我们就无法获得 A 的最小版本,但希望它足够小。
现在运行我最初提出的代码,看看是否仍然有内存错误
Given that you're having memory issues perhaps paring down A first might help. First, get rid of extraneous ids.
Reducing the A dataset completely still wants to grab more memory. It's not possible without storing some variables. Nevertheless, we can get rid of a bunch of it I'm hoping by lopping off every date below our absolute minimum and above our absolute maximum.
Of course, by not qualifying this by id we haven't get the smallest version of A possible but hopefully it's enough smaller.
Now run the code I first proposed and see if you still have a memory error
这个答案建立在理查兹的答案基础上,但更有针对性。
关键思想是构建一个 id 日期组合向量来进行比较。这发生在第二个代码块中。
我的解决方案使用 data.table 包,但应该对 data.frame 进行一些语法更改。但使用 data.table 包具有关键列的优势。
如果您仍然遇到问题,您可以将此方法与约翰的第二个答案和第一个作物 A 配对。
This answer builds on Richards answer but is more targeted to the question.
Key idea is to build one vector of id date combinations to compare against. This happens in the second code block.
My solution uses the data.table package but should with some syntax changes work with a data.frame. But using the data.table package has the advantage of keycolumns.
If you still have trouble you can pair this approach with john's second answer and first crop A.