Endless function/loop in R: data management

Posted on 2024-12-13 13:21:08

I am trying to restructure a large data frame (about 12,000 cases). In the old data frame, one person is one row with about 250 columns (e.g. Person 1, test A1, test A2, test B, ...). I want all the results of test A (A1 to A10 overall, and likewise for each of the 24 items A-Y) for that person in one column, so that one person ends up with 24 columns and 10 rows. There is also a fixed part of the data frame before the items A-Y start (personal information like age, gender etc.), which I want to keep as it is (fixdata).
The function/loop works for 30 cases (I tried it in advance), but for the 12,000 cases it is still calculating, for nearly 24 hours now. Any ideas why?

restructure <- function(data, firstcol, numcol, numsets){
    # dummy first row so that rbind has something to append to (dropped at the end)
    out <- data.frame(t(rep(0, (firstcol-1) + numcol)))
    names(out) <- names(data)[1:(firstcol+numcol-1)]
    for(i in 1:nrow(data)){
        # fixed part (personal information) of person i
        fixdata <- data[i, 1:(firstcol-1)]

        # one block of numcol item columns per set
        for (j in seq(firstcol, (firstcol-1) + numcol*numsets, by = numcol)){
            flexdata <- data[i, j:(j+numcol-1)]
            tmp <- cbind(fixdata, flexdata)
            names(tmp) <- names(out)
            out <- rbind(out, tmp)
        }
    }
    out <- out[2:nrow(out),]
    return(out)
}

Thanks in advance!

Comments (3)

寄风 2024-12-20 13:21:08

The reason: you rbind to out in every iteration. Each rbind copies everything accumulated so far, so every iteration takes longer as out grows, and you have to expect run time to grow faster than linearly with the size of the data set.
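A minimal sketch of one way around the growing rbind, keeping the loop structure of the question but collecting the pieces in a preallocated list and binding them once at the end. The toy data frame and the names toy, starts and pieces are made up for illustration and are not from the question.

```r
# Hypothetical toy data: 3 persons, 1 fixed column, 2 sets of 2 items each.
toy <- data.frame(id = 1:3, a1 = 1:3, b1 = 4:6, a2 = 7:9, b2 = 10:12)
starts <- c(2, 4)                      # first column of each set

# Preallocate a list with one slot per output row instead of growing a data frame.
pieces <- vector("list", nrow(toy) * length(starts))
k <- 0
for (i in seq_len(nrow(toy))) {
  for (j in starts) {
    k <- k + 1
    piece <- toy[i, c(1, j, j + 1)]    # fixed column plus one set of items
    names(piece) <- c("id", "a", "b")
    pieces[[k]] <- piece               # assigning into a list slot is cheap
  }
}

# A single rbind at the end instead of one per iteration.
out <- do.call(rbind, pieces)
rownames(out) <- NULL
```

This still loops over single rows, so it is not the fastest possible version, but it removes the repeated copying of out.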

So, as Andrie says, you can look at melt.

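Or you can do it with core R: stack.
Then you need to cbind the fixed part to the result yourself. You need to repeat the fixed rows n.var.cols times so they line up with the stacked values; since stack concatenates whole columns one after another, that is rep(..., times = n.var.cols) rather than each = n.var.cols.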
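A small sketch of the stack approach for a single test, using a made-up data frame (toy, fix_cols and var_cols are illustrative names). Note that stack concatenates the selected columns one after another, so in this sketch the fixed rows are repeated as a whole block with times.

```r
# Hypothetical toy data: 2 persons, 2 fixed columns, 3 repetitions of test A.
toy <- data.frame(id = c("p1", "p2"), age = c(30, 41),
                  A1 = c(1, 2), A2 = c(3, 4), A3 = c(5, 6))
fix_cols <- c("id", "age")
var_cols <- c("A1", "A2", "A3")

# stack() returns a data frame with columns "values" and "ind";
# values are ordered A1 (all persons), then A2 (all persons), and so on.
long <- stack(toy[, var_cols])

# Repeat the block of fixed rows once per stacked column so the rows line up.
row_idx <- rep(seq_len(nrow(toy)), times = length(var_cols))
result <- cbind(toy[row_idx, fix_cols], long)
rownames(result) <- NULL
```

The ind column records which original column each value came from, which can serve as the repetition number.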

A third alternative would be array2df from the package arrayhelpers.

迷迭香的记忆 2024-12-20 13:21:08

I agree with the others: look into the reshape2 and plyr packages. I just want to add a little in another direction. In particular, melt, cast and dcast might help you. It might also help to make use of smart column names, e.g.:

As <- grep("^testA", names(yourdf))
# returns a vector with the column positions of testA1 through testA10

Besides, if you 'spend' the two dimensions of a data.frame on test number and test type, there is obviously none left for the person. Sure, you can identify persons by an ID, which you could map to an aesthetic when plotting, but depending on what you want to do you might want to store them in a list. You would then end up with a list of persons, with one data.frame per person. I am not sure what you are trying to do, but I hope this helps.
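As a rough illustration of the list idea, base R's split can turn a long data frame into one data.frame per person. The data frame long and its columns are invented for this sketch.

```r
# Hypothetical long-format data: 2 persons, 3 repetitions of item A each.
long <- data.frame(id  = rep(c("p1", "p2"), each = 3),
                   set = rep(1:3, times = 2),
                   A   = c(1, 3, 5, 2, 4, 6))

# One list element per person; each element is a data.frame of that person's rows.
persons <- split(long[, c("set", "A")], long$id)

# Access one person's data by ID.
p2 <- persons[["p2"]]
```

Whether a list or a single long data frame with an ID column is more convenient depends on the later analysis.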

萌逼全场 2024-12-20 13:21:08

Maybe you're not getting along with plyr or the other functions for reshaping data. How about something more direct and low level? If each row currently goes A1, A2, A3, ..., A10, B1-B10, etc., then extract that lump of columns from your data frame (I'm guessing columns 11-250), make that section the shape you want, and put the parts back together.

yDat <- data[, 11:250]
# reshape each person's 240 values into a 10 x 24 matrix (filled column by column)
yDF <- lapply( 1:nrow(data), function(i) matrix(unlist(yDat[i,]), ncol = 24) )
yDF <- do.call(rbind, yDF) #combine the list of matrices returned above into one
yDF <- data.frame(yDF) #get it back into a data.frame
names(yDF) <- LETTERS[1:24] #might as well name the columns

That's the fastest way to get the bulk of your data into the shape you want. All the lapply call did was give each row dimension attributes so that it has the shape you wanted, and return the results as a list, which was then combined in the subsequent lines. But now the result doesn't have any of the ID information from the main data.frame. You just need to replicate each row of the first 10 columns 10 times. Or you can use the convenience function merge to help with that: add a column that already exists in your first 10 columns to the new data.frame as a common key, and then just merge the two.

yInfo <- data[, 1:10]
# assumes one of the first 10 columns is a unique person identifier named ID
yDF$ID <- rep( yInfo$ID, each = 10 )
newDat <- merge(yInfo, yDF)
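The other option mentioned above, replicating each row of the fixed columns 10 times, could look roughly like this. The data frame info and the constant n_sets are stand-ins for the first 10 columns and the 10 repetitions; they are not from the original post.

```r
# Hypothetical fixed part: one row per person.
info <- data.frame(ID = c(101, 102), age = c(30, 41))
n_sets <- 10

# Repeat every row n_sets times in place, so person 1 fills rows 1-10,
# person 2 fills rows 11-20, matching the row order of the reshaped items.
info_long <- info[rep(seq_len(nrow(info)), each = n_sets), ]
rownames(info_long) <- NULL

# Label the repetition within each person while the order is still known.
info_long$condNum <- rep(seq_len(n_sets), times = nrow(info))
```

The result can then be combined with the reshaped item columns using cbind, provided both have the same person-by-person row order.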

And now you're done... mostly. You might want to make an extra column that numbers the new rows within each person:

newDat$condNum <- rep(1:10, nrow(newDat)/10)

This code will run very fast. Your data.frame really isn't that big at all, and much of the above will execute in a couple of seconds.

This is how you should be thinking about data in R. It's not that there aren't convenience functions to handle the bulk of this, but you should write code in a way that avoids looping as much as possible. Technically, the code above has only one loop, the lapply used right at the start, and very little happens inside it (loops should be compact when you do use them). You're writing scalar code, and that is very, very slow in R, even if you weren't also abusing memory by growing data while doing it. Furthermore, keep in mind that while you can't always avoid a loop of some kind, you can almost always avoid nested loops, which are one of your biggest problems.
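To make the point about avoiding nested loops concrete, here is a hedged sketch of the question's function rewritten to loop only over the sets and to handle all persons at once. It assumes the column layout implied by the question's own code (each set is a contiguous block of numcol columns); restructure_fast and the toy data are names invented for this example.

```r
restructure_fast <- function(data, firstcol, numcol, numsets) {
  fix_idx   <- seq_len(firstcol - 1)                    # fixed (personal) columns
  out_names <- names(data)[seq_len(firstcol + numcol - 1)]
  starts    <- seq(firstcol, by = numcol, length.out = numsets)

  # One piece per set, each containing ALL persons: numsets iterations in total.
  pieces <- lapply(starts, function(j) {
    piece <- data[, c(fix_idx, j:(j + numcol - 1)), drop = FALSE]
    names(piece) <- out_names
    piece
  })
  out <- do.call(rbind, pieces)                         # a single rbind

  # Reorder so that each person's rows are adjacent, as in the original loop.
  ord <- order(rep(seq_len(nrow(data)), times = numsets))
  out <- out[ord, , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Hypothetical toy data: 3 persons, 1 fixed column, 2 sets of 2 items each.
toy <- data.frame(id = 1:3, a1 = 1:3, b1 = 4:6, a2 = 7:9, b2 = 10:12)
res <- restructure_fast(toy, firstcol = 2, numcol = 2, numsets = 2)
```

With 10 sets this performs 10 column selections and one rbind regardless of the number of persons, instead of one rbind per person and set.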

(read this to better understand your problems in this code... you've made most of the big errors in there)
