Aggregate rows in a large matrix by rowname
I would like to aggregate the rows of a matrix by adding the values in rows that have the same rowname. My current approach is as follows:
> M
  a b c d
1 1 1 2 0
1 2 3 4 2
2 3 0 1 2
3 4 2 5 2
> index <- as.numeric(rownames(M))
> M <- cbind(M,index)
> Dfmat <- data.frame(M)
> Dfmat <- aggregate(. ~ index, data = Dfmat, sum)
> M <- as.matrix(Dfmat)
> rownames(M) <- M[,"index"]
> M <- subset(M, select= -index)
> M
  a b c d
1 3 4 6 2
2 3 0 1 2
3 4 2 5 2
The problem with this approach is that I need to apply it to a number of very large matrices (up to 1,000 rows and 30,000 columns). In these cases the computation time is very high (the same problem occurs when using ddply). Is there a more efficient way to arrive at the solution? Does it help that the original input matrices are DocumentTermMatrix objects from the tm package? As far as I know, they are stored in a sparse matrix format.
Answers (3)
Here's a solution using `by` and `colSums`, but it requires some fiddling due to the default output of `by`.
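
The `by` plus `colSums` idea can be sketched roughly as below. This is a minimal illustration on the example matrix from the question, not necessarily the exact code from the answer; the variable names `pieces` and `agg_by` are made up for the example. The "fiddling" is the step that binds the per-group results back into a matrix.

```r
# Example matrix from the question; repeated row names mark rows to merge.
M <- matrix(c(1, 1, 2, 0,
              2, 3, 4, 2,
              3, 0, 1, 2,
              4, 2, 5, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("1", "1", "2", "3"), c("a", "b", "c", "d")))

# by() splits the rows by row name and applies colSums() to each group.
# The result is a list-like "by" object, not a matrix.
pieces <- by(M, rownames(M), colSums)

# Bind the per-group sums back together into one matrix (one row per group).
agg_by <- do.call(rbind, pieces)
print(agg_by)
```

Because `by` converts the matrix to a data frame and subsets it once per group, this gets slow as the number of columns grows.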
There is now an aggregate function in `Matrix.utils`. It can accomplish what you want with a single line of code, and it is about 10x faster than the `combineByRow` solution and 100x faster than the `by` solution.
EDIT: Frank is right, `rowsum` is somewhat faster than any of these solutions. You would want to consider one of the other functions only if you were using a `Matrix`, especially a sparse one, or if you were performing an aggregation other than `sum`.
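
For reference, the base R `rowsum` route mentioned in the edit looks roughly like this on the question's example matrix. The second half is only a sketch for the sparse case: it assumes the Matrix package is installed and uses an indicator-matrix product (`fac2sparse` followed by `%*%`), which is a swapped-in technique for illustration and not the `Matrix.utils` implementation.

```r
# Example matrix from the question.
M <- matrix(c(1, 1, 2, 0,
              2, 3, 4, 2,
              3, 0, 1, 2,
              4, 2, 5, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("1", "1", "2", "3"), c("a", "b", "c", "d")))

# Base R: sum the rows of M within each group of identical row names.
agg_rowsum <- rowsum(M, group = rownames(M))
print(agg_rowsum)

# Sparse sketch (assumption: the Matrix package is available).
if (requireNamespace("Matrix", quietly = TRUE)) {
  library(Matrix)
  S <- Matrix(M, sparse = TRUE)       # sparse copy of the data
  G <- fac2sparse(rownames(M))        # 0/1 indicator matrix, groups x rows
  agg_sparse <- G %*% S               # each result row is the sum of one group
  print(agg_sparse)
}
```

The product keeps everything sparse, which matters when most entries are zero, as is typical for document-term data.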
The answer by James works as expected, but is quite slow for large matrices. A version that avoids creating new objects is considerably faster: testing shows that it is about 10x faster than the answer using `by` (2 vs. 20 seconds in the test run).
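
A minimal sketch in the same spirit is shown below. It is not the answerer's original function: the helper name `sum_rows_by_name` and the benchmark sizes are hypothetical. The idea is to cache the row names once, allocate the result matrix a single time, and add each input row into its group's slot, so no data frame or per-group list is built.

```r
# Hypothetical helper, for illustration only.
sum_rows_by_name <- function(m) {
  keys <- rownames(m)                  # cache row names once
  stopifnot(!is.null(keys), !anyNA(keys))
  groups <- sort(unique(keys))
  slot <- match(keys, groups)          # result row each input row adds into
  out <- matrix(0, nrow = length(groups), ncol = ncol(m),
                dimnames = list(groups, colnames(m)))
  for (i in seq_len(nrow(m))) {
    out[slot[i], ] <- out[slot[i], ] + m[i, ]
  }
  out
}

# Example matrix from the question.
M <- matrix(c(1, 1, 2, 0,
              2, 3, 4, 2,
              3, 0, 1, 2,
              4, 2, 5, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("1", "1", "2", "3"), c("a", "b", "c", "d")))
print(sum_rows_by_name(M))

# Rough timing on made-up data; actual numbers depend on machine and sizes.
set.seed(1)
big <- matrix(rpois(1000 * 300, 1), nrow = 1000,
              dimnames = list(as.character(sample(1:200, 1000, replace = TRUE)),
                              NULL))
print(system.time(fast <- sum_rows_by_name(big)))
print(system.time(slow <- do.call(rbind, by(big, rownames(big), colSums))))
```

For plain sums on a dense matrix, `rowsum(big, rownames(big))` is still the simpler choice; a loop like this is mainly useful when the per-group operation is something `rowsum` does not cover.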