Vectorizing a search function containing a loop and an if clause

Published 2024-12-01 02:09:23


I am given two very large data sets, and I've been trying to build a function that finds certain coordinates from one set that satisfy an if clause involving the other data set. My problem is that the function I wrote is very slow, and although I've been reading answers to somewhat similar questions, I haven't managed to make it work.
So if I am given:

> head(CTSS)
    V1     V2     V3
1 chr1 564563 564598 
2 chr1 564620 564649
3 chr1 565369 565404
4 chr1 565463 565541
5 chr1 565653 565697
6 chr1 565861 565922

and

> head(href)
   chr      region    start      end strand nu   gene_id transcript_id
1 chr1 start_codon 67000042 67000044      +  . NM_032291     NM_032291
2 chr1         CDS 67000042 67000051      +  0 NM_032291     NM_032291
3 chr1        exon 66999825 67000051      +  . NM_032291     NM_032291
4 chr1         CDS 67091530 67091593      +  2 NM_032291     NM_032291
5 chr1        exon 67091530 67091593      +  . NM_032291     NM_032291
6 chr1         CDS 67098753 67098777      +  1 NM_032291     NM_032291

For each value in the start column of the href data set, I want to find the first two values in the 3rd column (V3) of the CTSS data set that are smaller than or equal to it, and keep them in a new data frame.
The loop I wrote:

y <- CTSS[order(-CTSS$V3), ]   # pre-sort CTSS by V3 in decreasing order

find_CTSS <- function(x, y) {
    n <- length(x$start)
    foo <- data.frame(matrix(0, n, 6))
    for (i in 1:n) {
        # rows of y with V3 <= start[i]; since y is sorted decreasingly,
        # a[1] and a[2] index the two largest such V3 values
        a <- which(y$V3 <= x$start[i])
        foo[i, ] <- c(x$start[i], x$end[i],  # head(href) shows `end`, not `stop`
                      y$V2[a[1]], y$V3[a[1]],
                      y$V2[a[2]], y$V3[a[2]])
    }
    foo   # return the result rather than only printing it
}


Comments (3)

秋凉 2024-12-08 02:09:23

You provide little data (but see here), so it's a bit hard to benchmark your solution. See whether the solution below meets your needs.

#make some fake data
href <- data.frame(start = runif(10), stop = runif(10), other_col = sample(letters, 10))
CTSS <- data.frame(col1 = runif(100), col2 = runif(100))

# for each row in href (but extract only stop and start columns)
result <- apply(X = href[, c("start", "stop")], MARGIN = 1, FUN = function(x, ctss) {
            criterion <- x["start"] #make a criterion
            #see which values are smaller or equal to this criterion (and sort them)
            extracted <- sort(ctss[ctss$col2 <= criterion, "col2"])
            #extract last and one to last value
            get.values <- extracted[c(length(extracted) - 1, length(extracted))] 
            #put values in data frame
            out <- as.data.frame(matrix(get.values, ncol = 2)) 
            return(out)
        }, ctss = CTSS)

#pancake a list into a data.frame
result <- do.call("rbind", result) 
‘画卷フ 2024-12-08 02:09:23


I see that the main thing you want here is a speedup. Borrowing heavily from Roman Lustrik's code, I can't see any advantage to putting sort inside the apply; that really slows things down. In fact, you want to move as much as possible out of the apply (loop). So the following should run much faster.

#all code using Roman Lustrik's made up data

CTSSs <- CTSS[order(CTSS$col2),] #presort CTSS
result <- lapply(X = href$start, FUN = function(x, ctss) {
    extracted <- ctss$col2[ctss$col2 <= x]
    get.values <- tail(extracted,2)
    out <- matrix(get.values, ncol = 2)
    return(out)}, ctss = CTSSs)
#pancake a list into a data.frame
result <- as.data.frame(do.call("rbind", result))

Or, I could follow the spirit of vectorization further and really get the functions as small as possible.

CTSSs <- CTSS[order(CTSS$col2),] #presort CTSS
extracted <- lapply(href$start, function(x, ctss) {
    ctss$col2[ctss$col2 <= x]}, ctss = CTSSs)
get.values <- lapply(extracted, tail, n = 2 )
result <- t( sapply(get.values, matrix, ncol = 2) )

#convert to a data.frame
result <- as.data.frame(result)

This may be faster, or maybe not in your case, but, should you need to add an intermediate step that could possibly take advantage of the truly vectorized built in functions, say if you want to do math on the values before putting them into a dataframe, then you can easily do that. Also, you'll note that now I can use a sapply/transpose at the matrix stage which will be faster than lapply/rbind. And that's often where the speedup of vectorizing your code comes from, not from just making a big loop with apply around it. (as an aside, it makes it easier to check for errors in each step of your thinking... or maybe that's not an aside?)
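To make that last point concrete, here is a tiny standalone illustration (the `pairs` list is made up, not from the question's data) showing that the sapply/transpose form builds the same matrix as the lapply/rbind form:

```r
# Made-up list of value pairs, standing in for get.values above
pairs <- list(c(1, 2), c(3, 4), c(5, 6))

# sapply simplifies to a matrix in one pass; t() orients one pair per row
a <- t(sapply(pairs, matrix, ncol = 2))

# the lapply/rbind form assembles the same matrix piecewise
b <- do.call(rbind, lapply(pairs, matrix, ncol = 2))

identical(a, b)   # both are the same 3 x 2 matrix
```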

REVISION:

On further reflection I realized this can be completely vectorized. The following code will generate what you want MUCH faster than any of the prior examples. The trick is to use cut() and aggregate() commands.

href <- href[order(href$start),] #just sorted so that the 0 at the beginning makes sense and the labels then match
margin <- cut(CTSS$col2, breaks = c(0,href$start), labels = href$start, right = TRUE)
result <- aggregate(col2 ~ margin, data = CTSS, FUN = function(x) tail(x,2))

You can reformat the result as you wish to get what you want but that should do the meat of it. You might want to change the margin column to numeric so that it matches href$start and use similar code to the sapply in the middle example above to turn the list of pairs of items above into two separate columns. It was the if() statement in the loop or apply statement that was slowing you down before and cut() eliminates that.
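A sketch of that reformatting step (with small made-up vectors, and assuming every bin catches at least two values, so that aggregate() returns a two-column matrix; the column names `nearest`/`second` are mine, not from the answer):

```r
# Made-up stand-ins for href$start and an already sorted CTSS$col2
href <- data.frame(start = c(0.3, 0.6, 0.9))
CTSS <- data.frame(col2 = c(0.1, 0.2, 0.25, 0.4, 0.5, 0.7, 0.8))

margin <- cut(CTSS$col2, breaks = c(0, href$start),
              labels = href$start, right = TRUE)
result <- aggregate(col2 ~ margin, data = CTSS, FUN = function(x) tail(x, 2))

# result$col2 is here a 2-column matrix (every bin caught >= 2 values);
# turn the factor back into numbers and split each pair into two columns
out <- data.frame(start   = as.numeric(as.character(result$margin)),
                  nearest = result$col2[, 2],   # largest value in the bin
                  second  = result$col2[, 1])
out
```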

十二 2024-12-08 02:09:23


I don't know how long I'll devote myself to this, so I'll build forward. I was an APL guy back when this kind of question received one-line answers in the APL Journal. Later I became a C++/STL guy and learned all the same stuff in a new dress code. Sometimes R makes me think that APL mated with PHP.

In this problem the dataframes are a distraction. This is a simple vector search, with some gluing back together.

For the performance-critical vector search, you want findInterval. The vector you search within needs to be sorted. The values you search for can be in any order, but for large lists you want them sorted too.

    V <- sort (runif(10*1000*1000))
    U <- sort (runif(10*1000*1000)) 
    W <- findInterval (U, V) 

This runs in three shakes of a lamb's tail. Now you have pairs of integers: the first column is 1:length(U), and the second, stored in W, is the integer index into V of the largest element of V that does not exceed the corresponding element of U.

    sum(u==sort(u)[sort.int (sort.int (u, index.return=TRUE)$ix, index.return=TRUE)$ix])

OK, there's a contribution from my APL brainstem. The answer is length(u) and demonstrates the inverse sort required for the "glue back together".
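The same inverse-permutation trick in a perhaps more familiar spelling (a minimal sketch, not part of the original answer): order(order(x)) maps each element of sort(x) back to its original position.

```r
x <- c(0.4, 0.1, 0.9, 0.5)
inv <- order(order(x))                  # inverse of the sorting permutation
stopifnot(identical(sort(x)[inv], x))   # the sort is undone exactly
```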

Mind blowing fact: only special cases of the sort function in R return the index vector. In APL, that was the only answer you got from grade up/grade down. But hey, it's not like they got it right the first time.

You'll have to adapt the result of findInterval to pick two elements on the less than side of the match location and you'll have to undo the two sorts to glue back together. I suspect your runtime will either be dominated by the two sorts (for very long lists) or assembling the resulting dataframe (for smaller lists). On my system, sorting a numeric list of length 100*1000*1000 begins to bog.
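A sketch of that adaptation (made-up vectors; `nearest`/`second` are my names, not from the answer), assuming CTSS$V3 has already been sorted ascending:

```r
V3     <- c(5, 10, 20, 30)   # stand-in for sorted CTSS$V3
starts <- c(12, 25, 4)       # stand-in for href$start

# for each start, the index of the largest V3 value <= it (0 if none)
idx <- findInterval(starts, V3)

# pmax() keeps the subscripts legal; ifelse() masks missing hits with NA
nearest <- ifelse(idx >= 1, V3[pmax(idx, 1)], NA)
second  <- ifelse(idx >= 2, V3[pmax(idx - 1, 1)], NA)

cbind(start = starts, nearest = nearest, second = second)
```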

The run time for findInterval sandwiched in between will be a thin slice of lettuce, which reminds me why I wasn't planning to loiter.
