How to replace a for loop in R with optimized functions (lapply?)

Posted 2024-09-19 03:59:33


I have a data frame with one timed event per row. Some rows hold sender events (typeid=1) and the others hold receiver events (typeid=2). I want to calculate the delay (time difference) between sender and receiver.

My data is organized in a data.frame, as the following snapshot shows:

dd[1:10,]
     timeid   valid typeid
1  18,00035 1,00000      1
2  18,00528 0,00493      2
3  18,02035 2,00000      1
4  18,02116 0,00081      2
5  18,04035 3,00000      1
6  18,04116 0,00081      2
7  18,06035 4,00000      1
8  18,06116 0,00081      2
9  18,08035 5,00000      1
10 18,08116 0,00081      2

calc_DelayVIDEO <- function (dDelay ){

        pktProcess <- TRUE
        nLost <- 0
        myDelay <- data.frame(time=-1, delay=-1, jitter=-1, nLost=-1)
        myDelay <- myDelay[-1, ]
        tini <- 0
        tend <- 0
        for (itr in c(1:length(dDelay$timeid))) {
           aRec <- dDelay[itr,]
           if (aRec$typeid == 1){
                tini <- as.numeric(aRec$timeid)
                if (!pktProcess ) {
                   nLost <- (nLost + 1)
                   myprt(paste("Packet Lost at time ", aRec$timeid, " lost= ", nLost, sep=""))
                }

                pktProcess <- FALSE 
           }else if (aRec$typeid == 2){

                tend <- as.numeric(aRec$timeid)
                dd <- tend - tini
                # nrow() gives the last row; length() on a data frame counts columns
                jit <- calc_Jitter(dant=myDelay[nrow(myDelay), 2], dcur=dd)
                myDelay <- rbind(myDelay, c(aRec$timeid, dd, jit, nLost))
                pktProcess <- TRUE
                #myprt(paste("time=", aRec$timeev, " delay=", dd, " Delay Var=", jit, " nLost=", nLost ))
           }
        }
        colnames(myDelay) <- c("time", "delay", "jitter", "nLost")
        return (myDelay)
}

To calculate the delay I use the calc_DelayVIDEO function. However, for data frames with a large number of records (~60000) it takes a long time.

How can I replace the for loop with more optimized R functions?
Can I use lapply for this kind of computation? If so, could you provide an example?

Thanks in advance,


浅笑轻吟梦一曲 2024-09-26 03:59:34

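The usual solution is to think hard about the problem and find something that vectorizes.

If that fails, I sometimes resort to rewriting the loop in C++; the Rcpp package can help with the interface.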
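As a rough sketch of what a vectorized delay calculation could look like for the question's data: each receiver row is matched to the most recent sender row through a running count of sender events. The function name `calc_delay_vec` is hypothetical, and the sketch assumes that `timeid` is already numeric and that `typeid` only takes the values 1 and 2. Jitter and loss counting are left out.

```r
# Hypothetical helper: vectorized sender-to-receiver delay.
# Assumes `timeid` is numeric and `typeid` only takes the values 1 and 2.
calc_delay_vec <- function(timeid, typeid) {
  is_send <- typeid == 1
  is_recv <- typeid == 2
  # Running count of sender rows = index of the most recent sender (0 if none yet).
  send_idx <- cumsum(is_send)
  # Prepend 0 so that "no sender seen yet" maps to tini = 0, as in the loop.
  tini <- c(0, timeid[is_send])[send_idx + 1]
  data.frame(time  = timeid[is_recv],
             delay = (timeid - tini)[is_recv])
}

# Small example using the first rows of the question's data.
timeid <- c(18.00035, 18.00528, 18.02035, 18.02116, 18.04035, 18.04116)
typeid <- c(1L, 2L, 1L, 2L, 1L, 2L)
res <- calc_delay_vec(timeid, typeid)
print(res)
```

Because everything is computed on whole vectors, no result object has to grow inside a loop.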


够运 2024-09-26 03:59:34

The *apply family of functions is not inherently faster than a for loop. In fact, I've worked on problems where for loops were faster than apply because apply used more memory and caused my machine to swap.

I would suggest fully initializing the myDelay object and avoiding rbind (which must reallocate memory on every call):

init <- rep(NA, length(dDelay$timeid))
myDelay <- data.frame(time=init, delay=init, jitter=init, nLost=init)

then replace:

myDelay <- rbind(myDelay, c(aRec$timeid, dd, jit, nLost))

with

myDelay[itr,] <- c(aRec$timeid, dd, jit, nLost)
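A variation on the same idea, as a sketch only: preallocate plain vectors sized by the number of receiver events and build the data frame once at the end. The name `calc_delay_prealloc` is hypothetical, `timeid` is assumed to be numeric already, and jitter is omitted because calc_Jitter is not shown in the question.

```r
# Hypothetical rewrite of the loop with preallocated result vectors.
# Assumes dDelay$timeid is numeric; jitter is left out because
# calc_Jitter is not shown in the question.
calc_delay_prealloc <- function(dDelay) {
  n_out <- sum(dDelay$typeid == 2)   # one output row per receiver event
  time  <- numeric(n_out)
  delay <- numeric(n_out)
  lost  <- numeric(n_out)
  k <- 0
  nLost <- 0
  tini <- 0
  pktProcess <- TRUE
  for (itr in seq_len(nrow(dDelay))) {
    if (dDelay$typeid[itr] == 1) {
      tini <- dDelay$timeid[itr]
      if (!pktProcess) nLost <- nLost + 1   # previous packet never arrived
      pktProcess <- FALSE
    } else if (dDelay$typeid[itr] == 2) {
      k <- k + 1
      time[k]  <- dDelay$timeid[itr]
      delay[k] <- dDelay$timeid[itr] - tini
      lost[k]  <- nLost
      pktProcess <- TRUE
    }
  }
  data.frame(time = time, delay = delay, nLost = lost)
}

# Made-up example with one lost packet (two sender rows in a row).
dDelay <- data.frame(
  timeid = c(18.00, 18.01, 18.02, 18.04, 18.05),
  typeid = c(1L, 2L, 1L, 1L, 2L)
)
out <- calc_delay_prealloc(dDelay)
print(out)
```

Writing into atomic vectors avoids both the repeated rbind copies and the overhead of assigning a row of a data frame on every iteration.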


无畏 2024-09-26 03:59:34

As Dirk said, vectorization will help. For example, move the call to as.numeric out of the loop, since the function works on whole vectors:

dDelay$timeid <- as.numeric(dDelay$timeid)

Other things that may help:

Skip the line aRec <- dDelay[itr,]; you can index into dDelay directly without creating a new variable on every iteration.

Preallocate myDelay, since growing it inside the loop is likely to be a bottleneck. See Joshua's answer for more on this.
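One caveat on the as.numeric step, given that the sample data prints values such as 18,00035 with a comma as the decimal separator: if the column is a factor, as.numeric returns the integer level codes rather than the times. The helper below is a minimal sketch for that case; `to_numeric_comma` is a hypothetical name, and it assumes every value has the form shown in the question.

```r
# Hypothetical helper: convert "18,00035"-style values (factor or character,
# comma as decimal separator) to numeric.
to_numeric_comma <- function(x) {
  as.numeric(sub(",", ".", as.character(x), fixed = TRUE))
}

x <- factor(c("18,00035", "18,00528", "18,02035"))
codes <- as.numeric(x)        # integer codes of the factor levels, not the times
times <- to_numeric_comma(x)  # the actual time values
print(codes)
print(times)
```

If the data comes from a file, reading it with the decimal separator set correctly (for example read.csv2, or read.table with dec = ",") avoids the conversion altogether.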


各自安好 2024-09-26 03:59:34

Another optimization: if I read your code right, you can easily calculate the vector nLost by using:

nLost <-cumsum(dDelay$typeid==1)

outside the loop. You can then simply add it to the data frame at the end, which already saves a lot of time. With your data frame:

> nLost <-cumsum(dd$typeid==1)
> nLost
 [1] 1 1 2 2 3 3 4 4 5 5

Likewise, the times at which the packets were lost can be calculated as:

> dd$timeid[which(dd$typeid==1)]
[1] 18,00035 18,02035 18,04035 18,06035 18,08035

in case you want to report them somewhere too.

For testing, I used:

dd <- structure(list(timeid = structure(1:10, .Label = c("18,00035", 
"18,00528", "18,02035", "18,02116", "18,04035", "18,04116", "18,06035", 
"18,06116", "18,08035", "18,08116"), class = "factor"), valid = structure(c(3L, 
2L, 4L, 1L, 5L, 1L, 6L, 1L, 7L, 1L), .Label = c("0,00081", "0,00493", 
"1,00000", "2,00000", "3,00000", "4,00000", "5,00000"), class = "factor"), 
    typeid = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L)), .Names = c("timeid", 
"valid", "typeid"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
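Note that cumsum(dDelay$typeid==1) counts every sender event, whereas the loop in the question only increments nLost when a sender event arrives before the previous packet has been received. A possible vectorized version of that rule is sketched below. It assumes typeid only takes the values 1 and 2, so that "previous packet not received" means "the previous row is also a sender row"; the example vector is made up.

```r
# Sketch: count a packet as lost when a sender event directly follows
# another sender event. Assumes typeid only takes the values 1 and 2.
typeid <- c(1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L)

prev_type <- c(2L, head(typeid, -1))    # treat the start as "already processed"
lost_now  <- typeid == 1 & prev_type == 1
nLost     <- cumsum(lost_now)           # running loss count for every row

lost_at_recv <- nLost[typeid == 2]                 # value stored with each receiver row
sent_at_recv <- cumsum(typeid == 1)[typeid == 2]   # packets sent so far, for comparison
print(lost_at_recv)
print(sent_at_recv)
```

On data without any losses, such as the ten rows shown in the question, the loss count from this rule stays at zero while the plain cumsum keeps increasing.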

