加快分析速度
我在 R 中有 2 个数据帧,例如 df 和 dfrefseq。
df<-data.frame( chr = c("chr1","chr1","chr1","chr4")
, start = c(843294,4329248,4329423,4932234)
, stop = c(845294,4329248,4529423,4935234)
, genenames= c("HTA","OdX","FEA","MGA")
)
dfrefseq<-data.frame( chr = c("chr1","chr1","chr1","chr2")
, start = c(843294,4329248,4329423,4932234)
, stop = c(845294,4329248,4529423,4935234)
, genenames= c("tra","FGE","FFs","FAA")
)
我想检查 df 中的每个基因,dfrefseq 中的女巫基因最接近所选的 df 基因。 我首先在两个数据框中选择“chr1”。 然后我计算了 readschr1 中第一个基因的 start-start start-stop stop-start 和 stop-stop 位点之间的距离。 这些计算的总和说明了有关距离的一切。我的问题是,如何加快分析速度?因为现在我只针对数据框测试了 1 个基因,但我需要测试 2000 个基因。
readschr1 <- subset(df,df[,1]=="chr1")
refseqchr1 <- subset(dfrefseq,dfrefseq[,1]=="chr1")
names<-list()
read_start_start<-list()
read_start_stop<-list()
read_stop_start<-list()
read_stop_stop<-list()
for (i in 1:nrow(refseqchr1)) {
startstart<-abs(readschr1[1,2] - refseqchr1[i,2])
startstop<-abs(readschr1[1,2] - refseqchr1[i,3])
stopstart<-abs(readschr1[1,3] - refseqchr1[i,2])
stopstop<-abs(readschr1[1,3] - refseqchr1[i,3])
read_start_start[[i]]<- matrix(startstart)
read_start_stop[[i]]<- matrix(startstop)
read_stop_start[[i]]<- matrix(stopstart)
read_stop_stop[[i]]<- matrix(stopstop)
names[[i]]<-matrix(refseqchr1[i,4])
}
table<-cbind(names, read_start_start, read_start_stop, read_stop_start, read_stop_stop)
sumtotalcolumns<-as.numeric(table[,2]) + as.numeric(table[,3])+ as.numeric(table[,4]) + as.numeric(table[,5])
test<-cbind(table, sumtotalcolumns)
test1<-test[order(as.vector(test$sumtotalcolumns)), ]
谢谢你!
I have 2 dataframes in R for example df and dfrefseq.
df<-data.frame( chr = c("chr1","chr1","chr1","chr4")
, start = c(843294,4329248,4329423,4932234)
, stop = c(845294,4329248,4529423,4935234)
, genenames= c("HTA","OdX","FEA","MGA")
)
dfrefseq<-data.frame( chr = c("chr1","chr1","chr1","chr2")
, start = c(843294,4329248,4329423,4932234)
, stop = c(845294,4329248,4529423,4935234)
, genenames= c("tra","FGE","FFs","FAA")
)
I want to check for each gene in df witch gene in dfrefseq lies closest to the selected df gene.
I first selected "chr1" in both dataframes.
Then I calculated for the first gene in readschr1 the distance between start-start start-stop stop-start and stop-stop sites.
The sum of this calculations say everything about the distance. My question here is, How can I speed up this analyse? Because now I tested only 1 gene against a dataframe, but I need to test 2000 genes.
readschr1 <- subset(df,df[,1]=="chr1")
refseqchr1 <- subset(dfrefseq,dfrefseq[,1]=="chr1")
names<-list()
read_start_start<-list()
read_start_stop<-list()
read_stop_start<-list()
read_stop_stop<-list()
for (i in 1:nrow(refseqchr1)) {
startstart<-abs(readschr1[1,2] - refseqchr1[i,2])
startstop<-abs(readschr1[1,2] - refseqchr1[i,3])
stopstart<-abs(readschr1[1,3] - refseqchr1[i,2])
stopstop<-abs(readschr1[1,3] - refseqchr1[i,3])
read_start_start[[i]]<- matrix(startstart)
read_start_stop[[i]]<- matrix(startstop)
read_stop_start[[i]]<- matrix(stopstart)
read_stop_stop[[i]]<- matrix(stopstop)
names[[i]]<-matrix(refseqchr1[i,4])
}
table<-cbind(names, read_start_start, read_start_stop, read_stop_start, read_stop_stop)
sumtotalcolumns<-as.numeric(table[,2]) + as.numeric(table[,3])+ as.numeric(table[,4]) + as.numeric(table[,5])
test<-cbind(table, sumtotalcolumns)
test1<-test[order(as.vector(test$sumtotalcolumns)), ]
Thank you!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
Bioconductor 包
GenomicRanges
旨在处理此类数据然后
,
The Bioconductor package
GenomicRanges
is designed to work with this type of datathen
and
您可以将两个单独的 data.frame 合并在一起形成一个表,然后使用矢量化运算。
merge
的关键是指定 data.frame 之间的公共列,并告诉它在出现不匹配的情况时该怎么做。如果其他 data.frame(即本例中的 ch2 和 ch4)中没有匹配项,则指定all = TRUE
将返回所有行并填充 NA。一旦 data.frames 被合并,那么这是一个简单的练习,将不同的列相互减去,然后将感兴趣的四列相加。我使用transform
来减少减法所需的输入。这是获取距离最小的行的一种方法。我假设你想通过 chr 和基因名来做到这一点。我使用
plyr
包,但我确信如果您更喜欢其中之一的话,有基本的解决方案。也许其他人会提出一个基本的解决方案。You can
merge
the two separate data.frames together to form one table and then use vectorized operations. The key tomerge
is to specify the common column(s) between the data.frames and to tell it what to do when there are cases that do not match. Specifyingall = TRUE
will return all rows and fill in NAs if there is no match in the other data.frame, i.e. ch2 and ch4 in this case. Once the data.frames have been merged, then it's a simple exercise in subtracting the different columns from one another and then summing the four columns of interest. I usetransform
to cut down on the typing needed to do the subtraction.Here's one approach get the row with the minimum distance. I'm assuming you want to do this by chr and genenames. I use the
plyr
package, but I'm sure there are base solutions if you'd prefer one of those. Maybe someone else will chime in with a base solution.