Aggregate rows in a large matrix by rowname

Posted on 2024-12-16 12:01:32

I would like to aggregate the rows of a matrix by adding the values in rows that have the same rowname. My current approach is as follows:

> M
  a b c d
1 1 1 2 0
1 2 3 4 2
2 3 0 1 2
3 4 2 5 2
> index <- as.numeric(rownames(M))
> M <- cbind(M,index)
> Dfmat <- data.frame(M)
> Dfmat <- aggregate(. ~ index, data = Dfmat, sum)
> M <- as.matrix(Dfmat)
> rownames(M) <- M[,"index"]
> M <- subset(M, select= -index)
> M
   a b c d
 1 3 4 6 2
 2 3 0 1 2
 3 4 2 5 2

The problem with this approach is that I need to apply it to a number of very large matrices (up to 1,000 rows and 30,000 columns). In these cases the computation time is very high (the same problem occurs when using ddply). Is there a more efficient way to do this? Does it help that the original input matrices are DocumentTermMatrix objects from the tm package? As far as I know, they are stored in a sparse matrix format.

Comments (3)

自此以后,行同陌路 2024-12-23 12:01:32

Here's a solution using by and colSums, but it requires some fiddling due to the default output of by.

M <- matrix(1:9,3)
rownames(M) <- c(1,1,2)
t(sapply(by(M,rownames(M),colSums),identity))
  V1 V2 V3
1  3  9 15
2  3  6  9
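
The fiddling is needed because by returns a list-like object of class "by" with one colSums vector per group, not a matrix. The following is a small sketch of what that step does, with do.call(rbind, ...) shown as one possible alternative way to stack the pieces. The variable names pieces, a and b are illustrative and not part of the original answer.

```r
M <- matrix(1:9, 3)
rownames(M) <- c(1, 1, 2)

# by() coerces the matrix to a data frame and returns one
# colSums vector per distinct row name
pieces <- by(M, rownames(M), colSums)

# Idiom from the answer: sapply() binds the per-group vectors
# as columns, so t() is needed to get one row per group
a <- t(sapply(pieces, identity))

# Possible alternative: stack the per-group vectors directly as rows
b <- do.call(rbind, pieces)
```

Both a and b should hold one row per distinct row name; neither variant is expected to be fast on large matrices, since by still splits the data into data frames.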

獨角戲 2024-12-23 12:01:32

Matrix.utils now provides an aggregate function, aggregate.Matrix. It does what you want in a single line of code and is about 10x faster than the combineByRow solution and 100x faster than the by solution:

N <- 10000

m <- matrix( runif(N*100), nrow=N)
rownames(m) <- sample(1:(N/2),N,replace=TRUE)

> microbenchmark(a<-t(sapply(by(m,rownames(m),colSums),identity)),b<-combineByRow(m),c<-aggregate.Matrix(m,row.names(m)),times = 10)
Unit: milliseconds
                                                  expr        min         lq       mean     median         uq        max neval
 a <- t(sapply(by(m, rownames(m), colSums), identity)) 6000.26552 6173.70391 6660.19820 6419.07778 7093.25002 7723.61642    10
                                  b <- combineByRow(m)  634.96542  689.54724  759.87833  732.37424  866.22673  923.15491    10
                c <- aggregate.Matrix(m, row.names(m))   42.26674   44.60195   53.62292   48.59943   67.40071   70.40842    10

> identical(as.vector(a),as.vector(c))
[1] TRUE

EDIT: Frank is right, rowsum is somewhat faster than any of these solutions. Consider one of the other functions only if you are working with a Matrix object, especially a sparse one, or if you need an aggregation other than sum.
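
For reference, a minimal sketch of the rowsum approach mentioned in the edit, applied to the small matrix from the question. rowsum is part of base R; the variable name agg is illustrative only, and no timing claims are made for this snippet.

```r
# Rebuild the example matrix from the question
M <- matrix(c(1, 2, 3, 4,    # column a
              1, 3, 0, 2,    # column b
              2, 4, 1, 5,    # column c
              0, 2, 2, 2),   # column d
            nrow = 4,
            dimnames = list(c("1", "1", "2", "3"),
                            c("a", "b", "c", "d")))

# Sum all rows that share a row name; the result has one row per
# distinct name, sorted by name because reorder = TRUE is the default
agg <- rowsum(M, group = rownames(M))
agg
```

The result is an ordinary matrix with the group names as row names, so none of the data frame conversion steps from the question are needed.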

寄人书 2024-12-23 12:01:32

The answer by James works as expected, but is quite slow for large matrices. Here is a version that avoids creating new objects:

combineByRow <- function(m) {
    m <- m[ order(rownames(m)), ]

    ## keep track of previous row name
    prev <- rownames(m)[1]
    i.start <- 1
    i.end <- 1

    ## cache the rownames -- profiling shows that it takes
    ## forever to look at them
    m.rownames <- rownames(m)
    stopifnot(all(!is.na(m.rownames)))


    ## go through matrix in a loop, as we need to combine some unknown
    ## set of rows
    for (i in 2:(1+nrow(m))) {

        curr <- m.rownames[i]

        ## if we found a new row name (or are at the end of the matrix),
        ## combine all rows and mark invalid rows
        if (prev != curr || is.na(curr)) {

            if (i.start < i.end) {
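                ## NOTE: this takes the column-wise maximum of the group;
                ## for the sum asked for in the question, use
                ## colSums(m[i.start:i.end,]) instead
                m[i.start,] <- apply(m[i.start:i.end,], 2, max)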
                m.rownames[(1+i.start):i.end] <- NA
            }

            prev <- curr
            i.start <- i
        } else {
            i.end <- i
        }
    }

    m[ which(!is.na(m.rownames)),]    
}

Testing shows that it is about 10x faster than the answer using by (2 vs. 20 seconds in this example):

N <- 10000

m <- matrix( runif(N*100), nrow=N)
rownames(m) <- sample(1:(N/2),N,replace=TRUE)

start <- proc.time()
m1 <- combineByRow(m)
print(proc.time()-start)

start <- proc.time()
m2 <- t(sapply(by(m,rownames(m),function(x) apply(x, 2, max)),identity))
print(proc.time()-start)

all(m1 == m2)
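
For aggregations other than sum (such as the max used above), one possible vectorised sketch is to split the row indices by row name and apply the function per group. The helper name group_apply is hypothetical and not part of any answer here, and this variant has not been benchmarked against combineByRow.

```r
# Hypothetical helper: apply `fun` column-wise within each group of
# rows that share a row name (assumes a numeric matrix with row names)
group_apply <- function(m, fun) {
  # List of row indices, one element per distinct row name
  groups <- split(seq_len(nrow(m)), rownames(m))
  # One result vector per group, bound as columns, then transposed
  t(vapply(groups,
           function(idx) apply(m[idx, , drop = FALSE], 2, fun),
           numeric(ncol(m))))
}

# Example matrix from the question
M <- matrix(c(1, 2, 3, 4,  1, 3, 0, 2,  2, 4, 1, 5,  0, 2, 2, 2),
            nrow = 4,
            dimnames = list(c("1", "1", "2", "3"),
                            c("a", "b", "c", "d")))

maxed  <- group_apply(M, max)   # column-wise maximum per row name
summed <- group_apply(M, sum)   # should agree with rowsum(M, rownames(M))
```

Using drop = FALSE keeps single-row groups as matrices, so apply behaves the same for groups of any size.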