R code taking too long to run

Posted 2024-12-03 08:33:31

I have the following code running and it's taking a long time to run. How do I know whether it's still doing its job or has gotten stuck somewhere?

noise4<-NULL;
for(i in 1:length(noise3))
{
    if(is.na(noise3[i])==TRUE)
    {
    next;
    }
    else
    {
    noise4<-c(noise4,noise3[i]);
    }
}

noise3 is a vector with 2418233 data points.

━╋う一瞬間旳綻放 2024-12-10 08:33:31

You just want to remove the NA values. Do it like this:

noise4 <- noise3[!is.na(noise3)]

This will be pretty much instant.

Or as Joshua suggests, a more readable alternative:

noise4 <- na.omit(noise3)

Your code was slow because:

It uses an explicit loop, which tends to be slow under the R interpreter.
It reallocates memory on every iteration: noise4 <- c(noise4, noise3[i]) copies the whole of noise4 each time a value is appended.

The memory reallocation is probably the biggest handicap to your code.
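
As a small check of the two one-liners above, here is a sketch on a made-up five-element vector (x is purely illustrative, not the asker's data). One detail worth knowing: na.omit() records the dropped positions in an "na.action" attribute, so the two results hold the same values but are not strictly identical objects.

```r
# Toy vector standing in for noise3 (hypothetical data)
x <- c(1.5, NA, 2.5, NA, 4)

by_index <- x[!is.na(x)]   # logical subsetting
by_omit  <- na.omit(x)     # same values, plus an "na.action" attribute

print(by_index)
print(attr(by_omit, "na.action"))   # positions of the dropped NAs

# Dropping the attributes shows that the values agree
same_values <- identical(by_index, as.vector(by_omit))
print(same_values)
```

If the extra attribute gets in the way later on, wrapping the result in as.vector() as above is one way to remove it.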

够运 2024-12-10 08:33:31

I wanted to illustrate the benefits of pre-allocation, so I tried to run your code... but I killed it after ~5 minutes. I recommend you use noise4 <- na.omit(noise3) as I said in my comments. This code is solely for illustrative purposes.

# Create some random data
set.seed(21)
noise3 <- rnorm(2418233)
noise3[sample(2418233, 100)] <- NA

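noise <- function(noise3) {
  # Pre-allocate the result at its final length (the number of non-NA values)
  noise4 <- vector("numeric", sum(!is.na(noise3)))
  j <- 0  # index of the last slot filled in noise4
  for(i in seq_along(noise3)) {
    if(is.na(noise3[i])) {
      next
    } else {
      # Write through j, not i, so noise4 never grows past its allocated length
      j <- j + 1
      noise4[j] <- noise3[i]
    }
  }
  noise4  # return the filtered vector (a bare for loop returns NULL)
}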

system.time(noise(noise3)) # MUCH less than 5+ minutes
#    user  system elapsed 
#    9.50    0.44    9.94 

# Let's see what we gain from compiling
library(compiler)
cnoise <- cmpfun(noise)
system.time(cnoise(noise3))  # a decent reduction
#    user  system elapsed 
#    3.46    0.49    3.96
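
To see the effect of pre-allocation without waiting minutes, the comparison can be repeated on a much smaller input. This is only a sketch: the function names grow_loop and prealloc_loop and the 20000-element size are my own choices for illustration, and the timings will depend on the machine and R version, so none are quoted here.

```r
# Small synthetic input so that the slow version still finishes quickly
set.seed(21)
x <- rnorm(20000)
x[sample(length(x), 100)] <- NA

# Version 1: grow the result with c(), copying it on every append
grow_loop <- function(v) {
  out <- NULL
  for (i in seq_along(v)) {
    if (!is.na(v[i])) out <- c(out, v[i])
  }
  out
}

# Version 2: allocate the result once and fill it through its own counter
prealloc_loop <- function(v) {
  out <- numeric(sum(!is.na(v)))
  j <- 0
  for (i in seq_along(v)) {
    if (!is.na(v[i])) {
      j <- j + 1
      out[j] <- v[i]
    }
  }
  out
}

print(system.time(res_grow <- grow_loop(x)))
print(system.time(res_pre  <- prealloc_loop(x)))

# Both loops should reproduce the vectorized answer
all_agree <- identical(res_grow, x[!is.na(x)]) &&
  identical(res_pre, x[!is.na(x)])
print(all_agree)
```

The growing version copies the accumulated result on every append, so its cost rises roughly with the square of the input length, which is why the full 2418233-element run takes so long.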

夜声 2024-12-10 08:33:31

The other answers have given you much, much better ways to do the task that you actually set out to achieve (removing NA values in your data), but an answer to the specific question you asked ("how do I know if R is actually working or if it has instead gotten stuck?") is to introduce some output (cat) statements in your loop, as follows:

rpt <- 10000  ## reporting interval
noise4<-NULL;
for(i in 1:length(noise3))
{
    if (i %% rpt == 0) cat(i,"\n")
    if(is.na(noise3[i])==TRUE)
    {
    next;
    }
    else
    {
    noise4<-c(noise4,noise3[i]);
    }
}

If you run this code you can immediately see that it slows down radically as it gets farther into the loop (a consequence of the failure to pre-allocate space) ...
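
As an alternative to printing the counter with cat, base R's utils package has a text progress bar. The sketch below swaps that in on a toy vector; x and kept are made-up names, and the loop only counts non-NA values, so the bar is not slowed by the vector growth discussed above.

```r
# Toy input standing in for noise3 (hypothetical data)
x <- c(rnorm(50000), NA, NA)

kept <- 0                                   # running count of non-NA values
pb <- txtProgressBar(min = 0, max = length(x), style = 3)
for (i in seq_along(x)) {
  if (!is.na(x[i])) kept <- kept + 1
  if (i %% 1000 == 0) setTxtProgressBar(pb, i)   # redraw every 1000 steps
}
setTxtProgressBar(pb, length(x))            # make sure the bar ends at 100%
close(pb)
cat("non-NA values seen:", kept, "\n")
```

Updating the bar only every 1000 iterations keeps the reporting overhead small compared with the work in the loop.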

泅渡 2024-12-10 08:33:31

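Everyone else has given correct ways to solve the same problem, so that you don't have to worry about speed. @BenBolker also gave a good pointer about regular output.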

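Another thing to note is that if you find yourself stuck in a loop, you can break out of it and look up the value of i. Assuming that restarting from that value of i won't do any damage, i.e. using that value twice is not a problem, you can restart from there. Or you can just get the job done the way the others have described.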

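A separate trick: if the loop is slow (and can't be vectorized, or you're not eager to break out of it) and you don't have any reporting, you can still use an external method to see whether R is actually consuming cycles on your computer. On Linux, the top command is your best bet. On Windows, the Task Manager will do the trick (I prefer the SysInternals / Microsoft program Process Explorer). top also exists on Macs, though I believe there are a few other, more popular tools.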

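One more piece of advice: if you have a really long loop to run, I strongly recommend saving the results regularly. I typically create a file with a name like myPrefix_YYYYMMDDHHMMSS.rdat. That way, everything can go wrong and you can still restart your loop from where it left off.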
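
A minimal sketch of that checkpointing idea follows. The helper name save_checkpoint, the 5-step interval, the use of tempdir(), and the toy running total are all assumptions made for illustration; only the file-name pattern comes from the advice above.

```r
# Hypothetical helper: write a timestamped checkpoint named like
# myPrefix_YYYYMMDDHHMMSS.rdat and return its path
save_checkpoint <- function(i, total, dir = tempdir(), prefix = "myPrefix") {
  stamp <- format(Sys.time(), "%Y%m%d%H%M%S")
  path <- file.path(dir, paste0(prefix, "_", stamp, ".rdat"))
  save(i, total, file = path)   # store the loop index and the partial result
  path
}

x <- c(3, NA, 1, 4, NA, 1, 5, 9, 2, 6)    # toy input (hypothetical data)
total <- 0                                 # running sum of the non-NA values
last_path <- NULL
for (i in seq_along(x)) {
  if (!is.na(x[i])) total <- total + x[i]
  if (i %% 5 == 0) last_path <- save_checkpoint(i, total)  # every 5 steps
}

# After a crash, the latest checkpoint says where to resume
restored <- new.env()
load(last_path, envir = restored)
cat("resume from i =", restored$i + 1, "\n")
```

For a real long-running job the interval would be far larger, and checkpoints written within the same second share a file name, so the later one overwrites the earlier one.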

I don't always iterate, but when I do, I use these tricks. Stay speedy, my friends.