R: split a data frame and apply a function to all pairs of rows in each subset
I am new to R and am trying to accomplish the following task efficiently.

I have a data.frame, `x`, with columns `start`, `end`, `val1`, `val2`, `val3`, `val4`. The rows are sorted by `start`.

For each `start`, I first have to find all the entries in `x` that share the same `start`. Because the rows are ordered, they will be consecutive. If a particular `start` occurs only once, I ignore it. Then, for the entries that have the same `start`, let's say that for one particular `start` there are 3 entries, as shown below:
Entries for start=10:

    start end val1 val2 val3 val4
       10  25    8    9    0    0
       10  55   15  200    4    9
       10  30    4    8    0    1
Then, I have to take 2 rows at a time and perform a `fisher.test` on the 2x4 matrix built from `val1` to `val4`. That is,
    row1:row2 => fisher.test(matrix(c(8,15,9,200,0,4,0,9), nrow=2))
    row1:row3 => fisher.test(matrix(c(8,4,9,8,0,0,0,1), nrow=2))
    row2:row3 => fisher.test(matrix(c(15,4,200,8,4,0,9,1), nrow=2))
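The three pairings above can be generated rather than written out by hand. Below is a minimal sketch in base R using `combn`, run on the three example rows only. The names `grp`, `val_cols`, `pairs`, and `p_values` are made up for illustration and do not come from the question.

```r
# Example rows for start = 10, copied from the question
grp <- data.frame(
  start = c(10, 10, 10),
  end   = c(25, 55, 30),
  val1  = c(8, 15, 4),
  val2  = c(9, 200, 8),
  val3  = c(0, 4, 0),
  val4  = c(0, 9, 1)
)
val_cols <- c("val1", "val2", "val3", "val4")  # hypothetical helper name

# combn(3, 2) returns a 2 x 3 matrix; its columns are the pairs (1,2), (1,3), (2,3)
pairs <- combn(nrow(grp), 2)

# For each pair, take the two rows of counts as a 2 x 4 matrix and test it
p_values <- apply(pairs, 2, function(idx) {
  m <- as.matrix(grp[idx, val_cols])
  fisher.test(m)$p.value
})
names(p_values) <- paste0("row", pairs[1, ], ":row", pairs[2, ])
```

`as.matrix(grp[idx, val_cols])` puts one data-frame row in each matrix row, which is the same layout as `matrix(c(8,15,9,200,0,4,0,9), nrow=2)` for the first pair.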
The code I wrote does this the traditional way, with for-loops. I was wondering if this could be vectorized or improved in any way.
    f_start = as.factor(x$start)   # convert start to factor to get count
    tab_f_start = table(f_start)   # tabulate to access the count per start

    o_start1 = NULL
    o_end1 = NULL
    o_start2 = NULL
    o_end2 = NULL
    p_val = NULL

    for (i in 1:length(tab_f_start)) {
      # check if there are more than 1 entries with same start
      if (tab_f_start[i] > 1) {
        # get all rows for current start
        cur_entry = x[x$start == as.integer(names(tab_f_start[i])), ]
        # loop over all combinations to obtain p-values
        ctr = tab_f_start[i]
        for (j in 1:(ctr-1)) {
          for (k in (j+1):ctr) {
            # store start and end values separately
            # (index into cur_entry, not x, so j and k refer to the current group)
            o_start1 = c(o_start1, cur_entry$start[j])
            o_end1 = c(o_end1, cur_entry$end[j])
            o_start2 = c(o_start2, cur_entry$start[k])
            o_end2 = c(o_end2, cur_entry$end[k])
            # construct matrix
            m1 = c(cur_entry$val1[j], cur_entry$val1[k])
            m2 = c(cur_entry$val2[j], cur_entry$val2[k])
            m3 = c(cur_entry$val3[j], cur_entry$val3[k])
            m4 = c(cur_entry$val4[j], cur_entry$val4[k])
            m = matrix(c(m1,m2,m3,m4), nrow=2)
            # fisher.test() returns a list; keep only the p-value
            p_val = c(p_val, fisher.test(m)$p.value)
          }
        }
      }
    }

    result = data.frame(o_start1, o_end1, o_start2, o_end2, p_val)
Thank you!
Comments (1)
As @Ben Bolker suggested, you can use the `plyr` package to do this compactly. The first step is to create a wider data frame that contains the desired row pairs. The row pairs are generated using the `combn` function. The second step is to extract the `p.value` from applying `fisher.test` to each row pair.
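The two steps can be sketched as follows. This is an illustrative reconstruction, not the answerer's code: it uses base R (`split`, `lapply`, `rbind`) in place of `plyr` so that it runs without extra packages. The sample data, the helper `make_pairs`, and the `_1`/`_2` column suffixes are all assumptions made for the example.

```r
# Hypothetical input: the question's start = 10 rows plus a start that occurs once
x <- data.frame(
  start = c(10, 10, 10, 20),
  end   = c(25, 55, 30, 40),
  val1  = c(8, 15, 4, 3),
  val2  = c(9, 200, 8, 5),
  val3  = c(0, 4, 0, 2),
  val4  = c(0, 9, 1, 1)
)
val_cols <- paste0("val", 1:4)

# Step 1: for one start group, build a wide data frame with one row per row pair
make_pairs <- function(d) {
  if (nrow(d) < 2) return(NULL)              # ignore starts that occur only once
  idx <- combn(nrow(d), 2)                   # each column is one (i, j) pair
  a <- d[idx[1, ], c("start", "end", val_cols), drop = FALSE]
  b <- d[idx[2, ], c("end", val_cols), drop = FALSE]
  names(a)[-1] <- paste0(names(a)[-1], "_1") # columns of the first row in the pair
  names(b)     <- paste0(names(b), "_2")     # columns of the second row in the pair
  cbind(a, b)
}
pieces <- lapply(split(x, x$start), make_pairs)
wide   <- do.call(rbind, unname(Filter(Negate(is.null), pieces)))

# Step 2: apply fisher.test to each row pair and keep only the p-value
m1 <- as.matrix(wide[paste0(val_cols, "_1")])
m2 <- as.matrix(wide[paste0(val_cols, "_2")])
wide$p_value <- vapply(seq_len(nrow(wide)), function(i) {
  fisher.test(rbind(m1[i, ], m2[i, ]))$p.value
}, numeric(1))

result <- wide[c("start", "end_1", "end_2", "p_value")]
```

With `plyr` installed, the `split`/`lapply`/`rbind` part of step 1 could likely be replaced by a single `ddply(x, "start", make_pairs)` call. The sketch still calls `fisher.test` once per pair, so the gain is mainly in compactness and in not growing vectors inside nested loops.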