Comparing adjacent elements of the same vector (avoiding loops)

Posted on 2024-11-28 11:24:22

I managed to write a for loop to compare letters in the following vector:

bases <- c("G","C","A","T")
test <- sample(bases, replace=T, 20)

test will return

[1] "T" "G" "T" "G" "C" "A" "A" "G" "A" "C" "A" "T" "T" "T" "T" "C" "A" "G" "G" "C"

With the function Comp() I can check whether each letter matches the next one:

Comp <- function(data)
{
    output <- vector()
    for(i in 1:(length(data)-1))
    {
    if(data[i]==data[i+1])
        {
        output[i] <-1
        }
        else
        {
        output[i] <-0
        }
    }
    return(output)
}

This results in:

> Comp(test)
 [1] 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 1 0

This works, but it is very slow for large vectors. Therefore I tried replacing the loop with lapply():

Comp <- function(x,i) if(x[i]==x[i+1]) 1 else 0
unlist(lapply(test, Comp, test))

Unfortunately it is not working (Error in i + 1 : non-numeric argument to binary operator). I am having trouble figuring out how to access the preceding letter in the vector to compare it. Also, the length(data)-1 used to "not compare" the last letter might become a problem.

Thank you all for the help!

Cheers
Lucky

红焚 2024-12-05 11:24:22

Just "lag" test and use ==, which is vectorized. The last element of the result is NA because the final letter has no successor.

bases <- c("G","C","A","T")
set.seed(21)
test <- sample(bases, replace=TRUE, 20)
lag.test <- c(tail(test,-1),NA)
#lag.test <- c(NA,head(test,-1))
test == lag.test
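
If the trailing NA is unwanted, the same idea can be written so that the result has length(test)-1 elements, like the output of Comp(). The following is a minimal sketch, not part of the original answer; the function name comp.vec is hypothetical and chosen only for illustration.

```r
# Hypothetical helper (name is an assumption): vectorized replacement for Comp()
comp.vec <- function(data)
{
    # head(data, -1) drops the last element, tail(data, -1) drops the first,
    # so position i of the comparison is data[i] == data[i+1]
    as.integer(head(data, -1) == tail(data, -1))
}

bases <- c("G","C","A","T")
set.seed(21)
test <- sample(bases, replace=TRUE, 20)
comp.vec(test)
```

Because head() and tail() return empty vectors for inputs of length zero or one, this version should also avoid the length(data)-1 edge case raised in the question.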

Update:

Also, your Comp function is slow because you don't specify the length of output when you initialize it. I suspect you were trying to pre-allocate, but vector() creates a zero-length vector that must be extended on every iteration of your loop. Your Comp function is significantly faster if you change the call vector() to vector(length=NROW(data)-1); that modified version is what the timings below call Comp.prealloc.

set.seed(21)
test <- sample(bases, replace=T, 1e5)
system.time(orig <- Comp(test))
#    user  system elapsed 
#  34.760   0.010  34.884 
system.time(prealloc <- Comp.prealloc(test))
#    user  system elapsed 
#    1.18    0.00    1.19 
identical(orig, prealloc)
# [1] TRUE
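
The answer does not show the definition of Comp.prealloc. A plausible sketch, assuming it is simply Comp with the allocation changed as described above, might look like this. The use of seq_along() for the loop index is an addition of this sketch (the original loop uses 1:(length(data)-1)), made so that a length-one input returns an empty result.

```r
# Assumed definition: Comp with the output vector allocated up front
Comp.prealloc <- function(data)
{
    # vector(length = n) creates a logical vector of n elements in one step
    output <- vector(length = NROW(data) - 1)
    for (i in seq_along(output))
    {
        # the first numeric assignment coerces output to double once;
        # later iterations overwrite elements without growing the vector
        if (data[i] == data[i + 1]) output[i] <- 1 else output[i] <- 0
    }
    return(output)
}

Comp.prealloc(c("T","G","G","A","A","A"))
```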

仅一夜美梦 2024-12-05 11:24:22

As @Joshua wrote, you should of course use vectorization - it is way more efficient.
...But just for reference, your Comp function can still be optimized a bit.

The result of a comparison is TRUE/FALSE, which are glorified versions of 1/0. Also, storing the result as integer instead of numeric uses half the memory.

Comp.opt <- function(data)
{
    output <- integer(length(data)-1L)
    for(i in seq_along(output))
    {
        output[[i]] <- (data[[i]]==data[[i+1L]])
    }
    return(output)
}

...and the speed difference:

> system.time(orig <- Comp(test))
   user  system elapsed 
  21.10    0.00   21.11 
> system.time(prealloc <- Comp.prealloc(test))
   user  system elapsed 
   0.49    0.00    0.49 
> system.time(opt <- Comp.opt(test))
   user  system elapsed 
   0.41    0.00    0.40 
> all.equal(opt, orig) # opt is integer, orig is double
[1] TRUE
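
The memory claim above can be checked with object.size(). This is a small illustrative sketch, not part of the original answer; the variable names are made up for the example.

```r
n <- 1e6
size.double  <- object.size(numeric(n))   # doubles take 8 bytes per element
size.integer <- object.size(integer(n))   # integers take 4 bytes per element
# Ratio of the two sizes; the small fixed header makes it only approximately 0.5
ratio <- as.numeric(size.integer) / as.numeric(size.double)
ratio

# Assigning a logical into a preallocated integer vector keeps the vector integer,
# which is why Comp.opt can store the comparison result directly
out <- integer(3)
out[[2]] <- TRUE
typeof(out)
```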

肩上的翅膀 2024-12-05 11:24:22

Have a look at this:

> x = c("T", "G", "T", "G", "G","T","T","T")
> 
> res = sequence(rle(x)$lengths)-1
> 
> dt = data.frame(x,res)
> 
> dt
  x res
1 T   0
2 G   0
3 T   0
4 G   0
5 G   1
6 T   0
7 T   1
8 T   2

This might work faster.
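
Note that res counts the position within each run rather than giving the 0/1 output of Comp(). A value greater than zero means the element equals its predecessor, so one way to recover the Comp()-style result is sketched below. This conversion is an addition for illustration, not part of the original answer.

```r
x <- c("T", "G", "T", "G", "G", "T", "T", "T")
res <- sequence(rle(x)$lengths) - 1

# res[i] > 0 means x[i] equals x[i-1]; dropping the first element shifts the
# result so that position i answers "does x[i] equal x[i+1]?", as Comp() does
same.as.next <- as.integer(res[-1] > 0)
same.as.next
```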
