plyr in R is very slow when merging
I am using the plyr package in R to do the following:
- pick a row from table A according to column A and column B
- find the row in table B that has the same values in column A and column B
- copy column C from table B to table A
I added a progress bar to show the progress, but after it reaches 100% the job still seems to be running: my CPU is still occupied by RGUI, and the call just doesn't end.
My table A has about 40000 rows of data, with unique combinations of column A and column B.
I suspect that the "combine" part of plyr's "split-apply-combine" workflow cannot handle these 40000 rows, because the same code finishes for another table with 4000 rows.
Any suggestions for improving the efficiency? Thanks.
UPDATE
Here is my code:
library(plyr)

# `filename` is a data frame whose "table_name" column holds the names of the tables to process
for (loop.filename in seq_len(nrow(filename))) {
  print("infection source merge")
  print(filename[loop.filename, "table_name"])
  temp <- get(filename[loop.filename, "table_name"])
  temp1 <- ddply(temp,
                 c("HOSP_NO", "REF_DATE"),
                 function(df) {
                   # Look up the case definitions in abcde for this HOSP_NO / REF_DATE pair
                   temp.infection.source <- abcde[abcde[, "Case_Number"] == unique(df[, "HOSP_NO"]) &
                                                    abcde[, "Reference_Date"] == unique(df[, "REF_DATE"]),
                                                  "Case_Definition"]
                   if (length(temp.infection.source) == 0) {
                     temp.infection.source <- "NIL"
                   } else if (length(unique(temp.infection.source)) > 1) {
                     temp.infection.source <- "MULTIPLE"
                   } else {
                     temp.infection.source <- unique(temp.infection.source)
                   }
                   data.frame(df, INFECTION_SOURCE = temp.infection.source)
                 },
                 .progress = "text")
  # Write the merged table back under its original name
  assign(filename[loop.filename, "table_name"], temp1)
}
Comments (1)
If I understood correctly what you're trying to achieve, this should do what you want, pretty quickly, and without too much memory overhead.
This applies only if the key combinations are unique. If they're not, you'll have to take care of that first. Without the data it's impossible to know exactly what you're trying to achieve in your complete function, but you should be able to port the logic given here to your own case.
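One way to port that logic, sketched here in base R and not necessarily the code this answer originally proposed: collapse the lookup table to one row per key pair first, then do a single vectorised merge() instead of one subset per group. The toy tables, the name tableA, and the helper collapse_source are made up for illustration; only abcde and the column names come from the question. The sketch also assumes the key columns have the same type in both tables.

```r
# Toy stand-ins for the real tables (values are invented for illustration)
tableA <- data.frame(
  HOSP_NO  = c("H1", "H2", "H3"),
  REF_DATE = c("2011-01-01", "2011-01-02", "2011-01-03"),
  stringsAsFactors = FALSE
)
abcde <- data.frame(
  Case_Number     = c("H1", "H2", "H2"),
  Reference_Date  = c("2011-01-01", "2011-01-02", "2011-01-02"),
  Case_Definition = c("community", "hospital", "community"),
  stringsAsFactors = FALSE
)

# Step 1: make the key combinations unique in the lookup table.
# Keys with more than one distinct definition are labelled "MULTIPLE".
collapse_source <- function(x) {  # hypothetical helper, not from the question
  u <- unique(x)
  if (length(u) > 1) "MULTIPLE" else u
}
lookup <- aggregate(Case_Definition ~ Case_Number + Reference_Date,
                    data = abcde, FUN = collapse_source)
names(lookup)[names(lookup) == "Case_Definition"] <- "INFECTION_SOURCE"

# Step 2: one left join on both keys replaces the per-group subsetting.
result <- merge(tableA, lookup,
                by.x = c("HOSP_NO", "REF_DATE"),
                by.y = c("Case_Number", "Reference_Date"),
                all.x = TRUE, sort = FALSE)

# Step 3: rows of tableA without a match get "NIL", as in the question's function.
result$INFECTION_SOURCE[is.na(result$INFECTION_SOURCE)] <- "NIL"
```

Note that merge() does not promise to keep the row order of tableA, so reorder afterwards if the order matters.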