Speeding up the analysis

Posted on 2024-11-04 17:24:59

I have two data frames in R, for example df and dfrefseq.

df<-data.frame( chr =  c("chr1","chr1","chr1","chr4")
    , start = c(843294,4329248,4329423,4932234)
    , stop = c(845294,4329248,4529423,4935234)
    , genenames= c("HTA","OdX","FEA","MGA")
)
dfrefseq<-data.frame( chr =  c("chr1","chr1","chr1","chr2")
    , start = c(843294,4329248,4329423,4932234)
    , stop = c(845294,4329248,4529423,4935234)
    , genenames= c("tra","FGE","FFs","FAA")
)

For each gene in df, I want to find which gene in dfrefseq lies closest to it.
I first selected "chr1" in both data frames.
Then, for the first gene in readschr1, I calculated the distances between the start-start, start-stop, stop-start, and stop-stop sites.
The sum of these four distances is my measure of how close two genes are. My question is: how can I speed up this analysis? So far I have tested only one gene against the reference data frame, but I need to test 2000 genes.

readschr1 <- subset(df,df[,1]=="chr1") 
refseqchr1 <- subset(dfrefseq,dfrefseq[,1]=="chr1") 

names<-list()
read_start_start<-list()
read_start_stop<-list() 
read_stop_start<-list()
read_stop_stop<-list()

for (i in 1:nrow(refseqchr1)) {
    # distances between the first gene in readschr1 and reference gene i
    startstart <- abs(readschr1[1,2] - refseqchr1[i,2])
    startstop  <- abs(readschr1[1,2] - refseqchr1[i,3])
    stopstart  <- abs(readschr1[1,3] - refseqchr1[i,2])
    stopstop   <- abs(readschr1[1,3] - refseqchr1[i,3])
    read_start_start[[i]] <- matrix(startstart)
    read_start_stop[[i]]  <- matrix(startstop)
    read_stop_start[[i]]  <- matrix(stopstart)
    read_stop_stop[[i]]   <- matrix(stopstop)
    names[[i]] <- matrix(refseqchr1[i,4])
}
table <- cbind(names, read_start_start, read_start_stop, read_stop_start, read_stop_stop)

sumtotalcolumns <- as.numeric(table[,2]) + as.numeric(table[,3]) + as.numeric(table[,4]) + as.numeric(table[,5])
test <- cbind(table, sumtotalcolumns)
# test is a matrix, so select the column by name; $ would return NULL here
test1 <- test[order(as.numeric(test[, "sumtotalcolumns"])), ]

Thank you!

Comments (2)

清醇 2024-11-11 17:24:59

The Bioconductor package GenomicRanges is designed to work with this type of data.

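# biocLite() has been retired; current Bioconductor releases install via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GenomicRanges")          # one-time installation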

then

library(GenomicRanges)
gr <- with(df,
           GRanges(factor(chr, levels=paste("chr", 1:4, sep="")),
                   IRanges(start, stop), genenames=genenames))
grrefseq <- with(dfrefseq,
                 GRanges(factor(chr, levels=paste("chr", 1:4, sep="")),
                         IRanges(start, stop), genenames=genenames))

and

> nearest(gr, grrefseq)
[1]  1  2  3 NA
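
The values returned by nearest() are row indices into grrefseq, not gene names. Since grrefseq was built row for row from dfrefseq, the indices can be used to look the names up. The following is a minimal sketch of that lookup; the index vector is hard-coded from the output shown above so that the snippet runs without Bioconductor, and the names `idx` and `nearest_genes` are only illustrative.

```r
# Example data from the question
df <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr4"),
  start = c(843294, 4329248, 4329423, 4932234),
  stop = c(845294, 4329248, 4529423, 4935234),
  genenames = c("HTA", "OdX", "FEA", "MGA"),
  stringsAsFactors = FALSE
)
dfrefseq <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr2"),
  start = c(843294, 4329248, 4329423, 4932234),
  stop = c(845294, 4329248, 4529423, 4935234),
  genenames = c("tra", "FGE", "FFs", "FAA"),
  stringsAsFactors = FALSE
)

# Assumption: this is the result of nearest(gr, grrefseq) printed above.
# In a real session you would write idx <- nearest(gr, grrefseq) instead.
idx <- c(1L, 2L, 3L, NA)

# Indexing with NA yields NA, so genes without a reference gene on the
# same chromosome (here MGA on chr4) simply get a missing value.
nearest_genes <- data.frame(
  gene = df$genenames,
  nearest_refseq = dfrefseq$genenames[idx],
  stringsAsFactors = FALSE
)
print(nearest_genes)
```

Note that nearest() measures the gap between intervals, which is not the same metric as the summed start/stop distances used in the question, so the two approaches may disagree for some genes.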
靖瑶 2024-11-11 17:24:59

You can merge the two separate data.frames into one table and then use vectorized operations. The key to merge is to specify the common column(s) between the data.frames and to tell it what to do with rows that have no match. Specifying all = TRUE returns all rows and fills in NAs where there is no match in the other data.frame, i.e. chr2 and chr4 in this case. Once the data.frames have been merged, it is a simple exercise to subtract the relevant columns from one another and then sum the four columns of interest. I use transform to cut down on the typing needed for the subtraction.

zz <- merge(df, dfrefseq, by = "chr", all = TRUE)

zz <- transform(zz, 
    read_start_start = abs(start.x - start.y)
  , read_start_stop = abs(start.x - stop.y)
  , read_stop_start = abs(stop.x - start.y)
  , read_stop_stop = abs(stop.x - stop.y)
)

zz <- transform(zz,
  sum_total_columns = read_start_start + read_start_stop + read_stop_start + read_stop_stop
  )
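
To make the effect of the merge concrete, here is a small sketch on the example data from the question. It only inspects the shape of the merged table: every df gene is paired with every dfrefseq gene on the same chromosome, and the unmatched chromosomes appear as rows containing NAs.

```r
# Example data from the question
df <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr4"),
  start = c(843294, 4329248, 4329423, 4932234),
  stop = c(845294, 4329248, 4529423, 4935234),
  genenames = c("HTA", "OdX", "FEA", "MGA"),
  stringsAsFactors = FALSE
)
dfrefseq <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr2"),
  start = c(843294, 4329248, 4329423, 4932234),
  stop = c(845294, 4329248, 4529423, 4935234),
  genenames = c("tra", "FGE", "FFs", "FAA"),
  stringsAsFactors = FALSE
)

zz <- merge(df, dfrefseq, by = "chr", all = TRUE)

# chr1 contributes 3 x 3 = 9 pairs; chr2 and chr4 contribute one row each
print(nrow(zz))
print(table(zz$chr))

# Rows for chromosomes present in only one of the two tables
print(zz[!complete.cases(zz), ])
```

Because the row count per chromosome is the product of the gene counts in the two tables, the merged table can become large for thousands of genes, so it may be worth processing one chromosome at a time if memory is tight.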

Here's one approach to get the row with the minimum distance. I'm assuming you want to do this by chr and gene name. I use the plyr package, but I'm sure there are base solutions if you'd prefer one of those. Maybe someone else will chime in with a base solution.

require(plyr)
ddply(zz, c("chr", "genenames.x"), function(x) x[which.min(x$sum_total_columns) ,])
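
One possible base-R variant, offered as a sketch rather than a drop-in replacement for the plyr call: sort the merged table by the summed distance and keep the first row of each chr/gene group. It uses merge with its default all = FALSE, so genes on chromosomes missing from dfrefseq are dropped instead of carried along as NA rows; the name `closest` is only illustrative.

```r
# Example data from the question
df <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr4"),
  start = c(843294, 4329248, 4329423, 4932234),
  stop = c(845294, 4329248, 4529423, 4935234),
  genenames = c("HTA", "OdX", "FEA", "MGA"),
  stringsAsFactors = FALSE
)
dfrefseq <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr2"),
  start = c(843294, 4329248, 4329423, 4932234),
  stop = c(845294, 4329248, 4529423, 4935234),
  genenames = c("tra", "FGE", "FFs", "FAA"),
  stringsAsFactors = FALSE
)

# Pair every gene with every reference gene on the same chromosome
zz <- merge(df, dfrefseq, by = "chr")

# Vectorized sum of the four start/stop distances
zz$sum_total_columns <- with(zz,
  abs(start.x - start.y) + abs(start.x - stop.y) +
  abs(stop.x - start.y) + abs(stop.x - stop.y))

# Sort so the smallest distance comes first within each chr/gene group,
# then keep only the first row of every group
zz <- zz[order(zz$chr, zz$genenames.x, zz$sum_total_columns), ]
closest <- zz[!duplicated(zz[c("chr", "genenames.x")]), ]
print(closest[c("chr", "genenames.x", "genenames.y", "sum_total_columns")])
```

On the example data this pairs HTA with tra, OdX with FGE, and FEA with FFs. Note that under this metric two identical intervals do not score zero unless they have zero length: HTA against tra gives 0 + 2000 + 2000 + 0 = 4000.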