Endless function/loop in R: data management
I am trying to restructure a very large data frame (about 12,000 cases). In the old data frame each person is one row with about 250 columns (e.g. Person 1, test A1, test A2, test B, ...). I want all the results of test A (10 A's overall, and 24 items, A-Y, for that person) in one column, so that each person ends up with 24 columns and 10 rows. There is also a fixed part of the data frame before the items A-Y start (personal information such as age, gender, etc.), which I want to keep as it is (fixdata).
The function/loop works for 30 cases (I tried it in advance), but for the 12,000 cases it is still calculating after nearly 24 hours. Any ideas why?
restructure <- function(data, firstcol, numcol, numsets){
  # placeholder first row; it is dropped again at the end
  out <- data.frame(t(rep(0, (firstcol-1) + numcol)))
  names(out) <- names(data[1:(firstcol+numcol-1)])
  for(i in 1:nrow(data)){
    # fixed part (personal information) of person i
    fixdata <- data[i, 1:(firstcol-1)]
    for (j in seq(firstcol, (firstcol-1) + numcol*numsets, by = numcol)){
      # one set of numcol item columns for person i
      flexdata <- data[i, j:(j+numcol-1)]
      tmp <- cbind(fixdata, flexdata)
      names(tmp) <- names(data[1:(firstcol+numcol-1)])
      out <- rbind(out, tmp)  # out grows by one row on every iteration
    }
  }
  out <- out[2:nrow(out),]
  return(out)
}
Thanks in advance!
Comments (3)
Idea why: you rbind to out in each iteration. Each iteration takes longer as out grows, so you have to expect more than linear growth in run time as the data set gets larger.
So, as Andrie says, you can look at melt.
Or you can do it with core R: stack. Then you need to cbind the fixed part to the result yourself (you need to repeat the fixed columns with each = n.var.cols).
A third alternative would be array2df from package arrayhelpers.
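To illustrate the point about rbind, here is a minimal sketch (not the asker's or the answerer's code) that builds one block per set of item columns for all persons at once and binds them a single time, instead of growing out row by row. The toy data frame wide and its column names are assumptions made up for the example; firstcol, numcol and numsets mirror the arguments in the question.

```r
# Toy data (hypothetical): 3 persons, 1 fixed column, 2 sets of 2 items.
wide <- data.frame(id = 1:3,
                   A1 = 1:3, B1 = 4:6,
                   A2 = 7:9, B2 = 10:12)

firstcol <- 2   # first item column
numcol   <- 2   # items per set
numsets  <- 2   # number of sets
starts <- seq(firstcol, by = numcol, length.out = numsets)

# One block per set, covering all persons at once (no loop over persons).
pieces <- lapply(starts, function(j) {
  block <- wide[, c(seq_len(firstcol - 1), j:(j + numcol - 1)), drop = FALSE]
  names(block) <- names(wide)[seq_len(firstcol + numcol - 1)]
  block
})

# Bind once at the end instead of calling rbind inside the loop.
long <- do.call(rbind, pieces)
long <- long[order(long$id), ]   # group each person's rows together
rownames(long) <- NULL
```

The loop now runs numsets times rather than once per person and set, and the result is assembled in a single rbind call.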
I agree with the others: look into the reshape2 and plyr packages. I just want to add a little in another direction. In particular, melt, cast and dcast might help you. It might also help to make use of smart column names.
Besides, if you 'spend' the two dimensions of a data.frame on test number and test type, there is obviously none left for the person. Sure, you can identify persons by an ID, which you could map to an aesthetic when plotting, but depending on what you want to do you might want to store them in a list, so that you end up with a list of persons with a data.frame for every person. I am not sure what you are trying to do, but I hope this helps.
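As a rough illustration of how systematic column names can pay off, the sketch below uses base R's reshape() rather than the reshape2 functions named above, so that it runs without extra packages. The data frame wide and its column names (item letter followed by test number) are assumptions made up for the example.

```r
# Toy data (hypothetical): 2 persons, items A and B, 3 tests each.
wide <- data.frame(id = c("p1", "p2"),
                   A1 = c(1, 11), A2 = c(2, 12), A3 = c(3, 13),
                   B1 = c(4, 14), B2 = c(5, 15), B3 = c(6, 16))

# With sep = "", reshape() splits names like "A1" into item "A" and test 1.
long <- reshape(wide,
                direction = "long",
                varying   = names(wide)[-1],
                sep       = "",
                idvar     = "id",
                timevar   = "test")

# One row per person and test, one column per item.
long <- long[order(long$id, long$test), ]
rownames(long) <- NULL
```

Because the names follow one pattern, the item and test number do not have to be listed by hand.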
Maybe you're not getting on with plyr or the other functions for reshaping the data. How about something more direct and low level? If you currently just have one row that goes A1, A2, A3, ... A10, B1-B10, etc., then extract that lump of columns from your data frame (I'm guessing columns 11-250), make just that section the shape you want, and put the pieces back together.
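The answer's original code is not preserved here. As a minimal sketch of the idea, assuming the item columns are ordered A1, A2, ..., B1, B2, ... as described, each person's row can be given dimensions with lapply and the pieces bound once. The toy data and the names wide, n_tests and n_items are made up for the example.

```r
# Toy data (hypothetical): 2 persons, 1 fixed column, items A and B, 3 tests.
wide <- data.frame(id = c("p1", "p2"),
                   A1 = c(1, 11), A2 = c(2, 12), A3 = c(3, 13),
                   B1 = c(4, 14), B2 = c(5, 15), B3 = c(6, 16))
n_tests <- 3
n_items <- 2

# Extract the measurement block as a numeric matrix (drop the fixed column).
m <- as.matrix(wide[, -1])

# Give each person's row the shape n_tests x n_items (filled column-wise,
# so column 1 holds A1..A3 and column 2 holds B1..B3).
blocks <- lapply(seq_len(nrow(m)), function(i) {
  matrix(m[i, ], nrow = n_tests, ncol = n_items)
})

# Stack the per-person blocks with a single rbind.
long_items <- as.data.frame(do.call(rbind, blocks))
names(long_items) <- c("A", "B")
```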
That's the fastest way to get the bulk of your data into the shape you want. All the lapply function did was add dimension attributes to each row so that they were in the shape you wanted and then return them as a list, which was massaged with the subsequent rows. But now it doesn't have any of your ID information from the main data.frame. You just need to replicate each row of the first 10 columns 10 times. Or you can use the convenience function merge to help with that: make a common column that is already in your first 10 columns one of the columns of the new data.frame and then just merge them.
And now you're done... mostly. You might want to make an extra column that names the new rows.
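The replication step could look roughly like this. It is a sketch under assumptions: fixed stands in for the personal-information columns, long_items for an already reshaped item block with n_tests rows per person in person order, and all names are hypothetical.

```r
# Hypothetical fixed part: one row per person.
fixed <- data.frame(id = c("p1", "p2"), age = c(30, 41))
n_tests <- 3

# Hypothetical reshaped item block: n_tests rows per person, person by person.
long_items <- data.frame(A = c(1, 2, 3, 11, 12, 13),
                         B = c(4, 5, 6, 14, 15, 16))

# Repeat every fixed row n_tests times so it lines up with long_items.
fixed_long <- fixed[rep(seq_len(nrow(fixed)), each = n_tests), , drop = FALSE]
rownames(fixed_long) <- NULL

# Extra column that names the new rows (test number within person).
fixed_long$test <- rep(seq_len(n_tests), times = nrow(fixed))

result <- cbind(fixed_long, long_items)
```

Row-index replication with rep(..., each = ) avoids a merge altogether when both parts are already in the same person order.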
This will be very fast running code. Your data.frame really isn't that big at all and much of the above will execute in a couple of seconds.
This is how you should be thinking of data in R. Not that there aren't convenience functions to handle the bulk of this, but you should work in a way that avoids looping as much as possible. Technically, what happened above only had one loop, the lapply used right at the start. It had very little in that loop as well (loops should be compact when you use them). You're writing scalar code, and that is very, very slow in R, even if you weren't also abusing memory and growing data while doing it. Furthermore, keep in mind that, while you can't always avoid a loop of some kind, you can almost always avoid nested loops, which are one of your biggest problems. (Read this to better understand the problems in this code; you've made most of the big errors in there.)
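To show how far loop avoidance can go, here is a sketch that reshapes the item block with no loop at all, using only dim and aperm on the whole matrix. It is one possible approach, not the answerer's code, and it assumes the columns are ordered A1, A2, ..., B1, B2, ...; the toy data and names are made up for the example.

```r
# Toy data (hypothetical): 2 persons, items A and B, 3 tests each.
wide <- data.frame(id = c("p1", "p2"),
                   A1 = c(1, 11), A2 = c(2, 12), A3 = c(3, 13),
                   B1 = c(4, 14), B2 = c(5, 15), B3 = c(6, 16))
n_tests   <- 3
n_items   <- 2
n_persons <- nrow(wide)

m <- as.matrix(wide[, -1])

# t(m) lists each person's values one after another; view them as a
# test x item x person array.
a <- array(t(m), dim = c(n_tests, n_items, n_persons))

# Reorder to test x person x item, then flatten the first two dimensions
# into rows: rows are (person, test), columns are items.
long_mat <- matrix(aperm(a, c(1, 3, 2)), ncol = n_items)
colnames(long_mat) <- c("A", "B")
```

Everything here operates on whole vectors and arrays, so the cost does not depend on an interpreted loop over persons.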