R: Split a data frame and apply a function to all pairs of rows in each subset

Posted on 2024-11-10 12:27:57

I am new to R and am trying to accomplish the following task efficiently.

I have a data.frame, x, with columns start, end, val1, val2, val3, val4. The rows are sorted by start.

For each start, I first have to find all the entries in x that share the same start. Because the rows are ordered, these entries are consecutive. If a particular start occurs only once, I ignore it. For the entries that share a start, let's say one particular start has 3 entries, as shown below:

entries for start=10

start end val1 val2 val3 val4
   10  25    8    9    0    0
   10  55   15  200    4    9
   10  30    4    8    0    1

Then, I have to take 2 rows at a time and perform a fisher.test on the 2x4 matrices of val1:4. That is,

row1:row2 => fisher.test(matrix(c(8,15,9,200,0,4,0,9), nrow=2))
row1:row3 => fisher.test(matrix(c(8,4,9,8,0,0,0,1), nrow=2))
row2:row3 => fisher.test(matrix(c(15,4,200,8,4,0,9,1), nrow=2))
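
For illustration, here is a minimal sketch of those three tests with the row pairs enumerated by combn rather than written out by hand. The names grp, vals, pairs and p_values are only placeholders chosen for this example, and the data is the start=10 group shown above.

```r
# The three start=10 entries from the example above
grp <- data.frame(start = 10, end = c(25, 55, 30),
                  val1 = c(8, 15, 4), val2 = c(9, 200, 8),
                  val3 = c(0, 4, 0),  val4 = c(0, 9, 1))

# Keep only the count columns as a numeric matrix (one row per entry)
vals <- as.matrix(grp[, c("val1", "val2", "val3", "val4")])

# Each column of 'pairs' holds the two row indices of one pair:
# (1,2), (1,3), (2,3)
pairs <- combn(nrow(grp), 2)

# vals[p, ] is the 2x4 matrix for that pair; keep only the p-value
p_values <- apply(pairs, 2, function(p) fisher.test(vals[p, ])$p.value)
```

Here vals[c(1, 2), ] is the same matrix as matrix(c(8,15,9,200,0,4,0,9), nrow=2), so each element of p_values corresponds to one of the three calls listed above.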

The code I wrote does this the traditional way, with for-loops. I was wondering if it could be vectorized or improved in any way.

tab_start = table(x$start)  # number of rows for each start value
o_start1 = NULL
o_end1   = NULL
o_start2 = NULL
o_end2   = NULL
p_val    = NULL
for (i in seq_along(tab_start)) {
    # check if there is more than 1 entry with the same start
    if (tab_start[i] > 1) {
        # get all rows for the current start
        cur_entry = x[x$start == as.integer(names(tab_start)[i]), ]
        ctr = nrow(cur_entry)
        # loop over all pairs of rows to obtain p-values
        for (j in 1:(ctr-1)) {
            for (k in (j+1):ctr) {
                # store start and end values separately
                # (index cur_entry, not x, so j and k stay within this start)
                o_start1 = c(o_start1, cur_entry$start[j])
                o_end1   = c(o_end1, cur_entry$end[j])
                o_start2 = c(o_start2, cur_entry$start[k])
                o_end2   = c(o_end2, cur_entry$end[k])
                # construct the 2x4 matrix, filled column by column
                m1 = c(cur_entry$val1[j], cur_entry$val1[k])
                m2 = c(cur_entry$val2[j], cur_entry$val2[k])
                m3 = c(cur_entry$val3[j], cur_entry$val3[k])
                m4 = c(cur_entry$val4[j], cur_entry$val4[k])
                m = matrix(c(m1,m2,m3,m4), nrow=2)
                # keep only the p-value, not the whole test object
                p_val = c(p_val, fisher.test(m)$p.value)
            }
        }
    }
}
result = data.frame(o_start1, o_end1, o_start2, o_end2, p_val)

Thank you!

情未る 2024-11-17 12:27:58

As @Ben Bolker suggested, you can use the plyr package to do this compactly. The first step is to create a wider data frame that contains the desired row pairs. The row pairs are generated using the combn function:

set.seed(1)
x <- data.frame( start = c(1,2,2,2,3,3,3,3), 
                 end = 1:8, 
                 v1 = sample(8), v2 = sample(8), v3 = sample(8), v4 = sample(8))

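library(plyr)
# For each start: skip singletons, otherwise put every pair of rows side by side
# (columns prefixed a. and b.); [, -1] drops the grouping column ddply adds
z <- ddply(x, .(start), function(d) if (nrow(d) == 1) NULL
                                      else {
                                        row_pairs <- combn(nrow(d),2)
                                        cbind( a = d[ row_pairs[1,], ],
                                               b = d[ row_pairs[2,], ] )
                                      })[, -1]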
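
To make the first step easier to follow, here is a small sketch of what combn returns for a group of 3 rows. The first row of the result gives the index of the "a" row and the second row gives the index of the "b" row of each pair.

```r
# All ways to choose 2 of 3 row indices; one pair per column
row_pairs <- combn(3, 2)
row_pairs
#      [,1] [,2] [,3]
# [1,]    1    1    2
# [2,]    2    3    3
```

Indexing d[row_pairs[1, ], ] and d[row_pairs[2, ], ] therefore yields two data frames with one row per pair, which cbind then places side by side.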

The second step is to extract the p.value from applying the fisher.test to each row-pair:

result <- ddply(z, .(a.start, a.end, b.start, b.end), 
                function(d) 
                     fisher.test(matrix(unlist( d[, -c(1,2,7,8) ]), 
                                        nrow=2, byrow=TRUE))$p.value  )


> result
  a.start a.end b.start b.end         V1
1       2     2       2     3 0.33320784
2       2     2       2     4 0.03346192
3       2     3       2     4 0.84192284
4       3     5       3     6 0.05175017
5       3     5       3     7 0.65218289
6       3     5       3     8 0.75374989
7       3     6       3     7 0.34747011
8       3     6       3     8 0.10233072
9       3     7       3     8 0.52343422
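
If you would rather not depend on plyr, the same two steps can be sketched in base R with split, combn and lapply. This is only one possible arrangement, not the method used above: pairwise_fisher, val_cols and result_base are names made up for this example, and the data frame is the same toy x as before.

```r
set.seed(1)
x <- data.frame(start = c(1, 2, 2, 2, 3, 3, 3, 3),
                end   = 1:8,
                v1 = sample(8), v2 = sample(8), v3 = sample(8), v4 = sample(8))

val_cols <- c("v1", "v2", "v3", "v4")

# Hypothetical helper: all pairwise Fisher tests within one group of rows
pairwise_fisher <- function(d) {
  if (nrow(d) < 2) return(NULL)        # a start that occurs once is ignored
  vals  <- as.matrix(d[, val_cols])    # counts only, one row per entry
  pairs <- combn(nrow(d), 2)           # one column per row pair
  data.frame(
    a.start = d$start[pairs[1, ]], a.end = d$end[pairs[1, ]],
    b.start = d$start[pairs[2, ]], b.end = d$end[pairs[2, ]],
    p.value = apply(pairs, 2, function(p) fisher.test(vals[p, ])$p.value)
  )
}

# Split by start, test each group, and stack the per-group results
result_base <- do.call(rbind, lapply(split(x, x$start), pairwise_fisher))
rownames(result_base) <- NULL
```

For this x, result_base has 9 rows: 3 pairs from the three rows with start 2 and 6 pairs from the four rows with start 3. The work per pair is still one fisher.test call, so the gain over the nested loops is mainly in readability and in not growing vectors with c() inside the loop.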
