Is there a better way to do hierarchical clustering in R?
I would like to do hierarchical clustering by row and then by column. I came up with this total hack of a solution:
#! /path/to/my/Rscript --vanilla
args <- commandArgs(TRUE)
mtxf.in <- args[1]
clusterMethod <- args[2]
mtxf.out <- args[3]
mtx <- read.table(mtxf.in, as.is=T, header=T, stringsAsFactors=T)
mtx.hc <- hclust(dist(mtx), method=clusterMethod)
mtx.clustered <- as.data.frame(mtx[mtx.hc$order,])
mtx.c.colnames <- colnames(mtx.clustered)
rownames(mtx.clustered) <- mtx.clustered$topLeftColumnHeaderName
mtx.clustered$topLeftColumnHeaderName <- NULL
mtx.c.t <- as.data.frame(t(mtx.clustered), row.names=names(mtx))
mtx.c.t.hc <- hclust(dist(mtx.c.t), method=clusterMethod)
mtx.c.t.c <- as.data.frame(mtx.c.t[mtx.c.t.hc$order,])
mtx.c.t.c.t <- as.data.frame(t(mtx.c.t.c))
mtx.c.t.c.t.colnames <- as.vector(names(mtx.c.t.c.t))
names(mtx.c.t.c.t) <- mtx.c.colnames[as.numeric(mtx.c.t.c.t.colnames) + 1]
write.table(mtx.c.t.c.t, file=mtxf.out, sep='\t', quote=F, row.names=T)
The variables mtxf.in and mtxf.out represent the input matrix and clustered output matrix files, respectively. The variable clusterMethod is one of the hclust methods, such as single, average, etc.
As an example input, here's a data matrix:
topLeftColumnHeaderName col1 col2 col3 col4 col5 col6
row1 0 3 0 0 0 3
row2 6 6 6 6 6 6
row3 0 3 0 0 0 3
row4 6 6 6 6 6 6
row5 0 3 0 0 0 3
row6 0 3 0 0 0 3
Running this script, I lose my top-left corner element from mtxf.in. Here's the output that comes out of this script:
col5 col4 col1 col3 col2 col6
row6 0 0 0 0 3 3
row5 0 0 0 0 3 3
row1 0 0 0 0 3 3
row3 0 0 0 0 3 3
row2 6 6 6 6 6 6
row4 6 6 6 6 6 6
My questions: In addition to looking for a way to preserve the original structure of the input matrix file, I also don't know how much memory this consumes or whether there are faster and cleaner, more "R"-like ways for doing this.
Is it really this hard to cluster by rows and columns in R? Are there constructive ways to improve this script? Thanks for your advice.
Comments (2)
Once you have your data cleaned (i.e. removed the first column), this really just requires three lines of code:
Clean data (assign row names from first column, then remove first column):
Cluster and reorder:
Results:
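As a minimal sketch of the clean, cluster, and reorder steps above (not necessarily the answerer's exact code), assuming the question's example matrix is supplied inline through the text argument and that average linkage is used:

```r
# Example matrix from the question, supplied inline (assumption: the real
# data would be read from a file instead).
txt <- "topLeftColumnHeaderName col1 col2 col3 col4 col5 col6
row1 0 3 0 0 0 3
row2 6 6 6 6 6 6
row3 0 3 0 0 0 3
row4 6 6 6 6 6 6
row5 0 3 0 0 0 3
row6 0 3 0 0 0 3"
mtx <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

# Clean data: move the first column into the row names, then drop it.
rownames(mtx) <- mtx[[1]]
mtx <- as.matrix(mtx[, -1])

# Cluster rows and columns separately, then reorder by each dendrogram's order.
row.order <- hclust(dist(mtx), method = "average")$order
col.order <- hclust(dist(t(mtx)), method = "average")$order
mtx.clustered <- mtx[row.order, col.order]

print(mtx.clustered)
```

Because many rows and columns in this example are identical, the order within tied groups depends on how hclust breaks ties, so the exact row and column order may differ from the output shown in the question.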
I'll be honest, I'm not totally clear on why you're doing some of the stuff you're doing, so it's entirely possible I've misunderstood what you're looking for. If I'm way off base, let me know and I'll delete this answer.
But I suspect that your life will be much easier (and your results actually correct) if you read your data in using row.names = 1 to indicate that the first column actually contains row names. For example:

Since you tagged this code-review, I guess I'll also point out that stylistically, many R folks prefer not to use T and F for booleans since they can be masked, while TRUE and FALSE cannot.
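To illustrate the row.names = 1 suggestion, here is a minimal sketch that also writes the top-left header back out. It is one possible approach, not the answerer's original code. The temporary files, the helper cluster.order, and the choice of average linkage are all assumptions made for the example.

```r
# Write the question's example matrix to a temporary file (hypothetical input;
# in the original script the path comes from the command line).
infile <- tempfile(fileext = ".txt")
writeLines(c(
  "topLeftColumnHeaderName col1 col2 col3 col4 col5 col6",
  "row1 0 3 0 0 0 3",
  "row2 6 6 6 6 6 6",
  "row3 0 3 0 0 0 3",
  "row4 6 6 6 6 6 6",
  "row5 0 3 0 0 0 3",
  "row6 0 3 0 0 0 3"
), infile)

# row.names = 1 turns the first column into row names, so every remaining
# column is numeric and dist() only sees data.
mtx <- as.matrix(read.table(infile, header = TRUE, row.names = 1))

# Hypothetical helper: dendrogram leaf order for the rows of a matrix.
cluster.order <- function(m, method = "average") {
  hclust(dist(m), method = method)$order
}
mtx.clustered <- mtx[cluster.order(mtx), cluster.order(t(mtx))]

# Recover the corner label from the input header and write the row names
# as an ordinary first column so the label survives in the output.
corner <- strsplit(readLines(infile, n = 1), "[ \t]+")[[1]][1]
out <- data.frame(rownames(mtx.clustered), mtx.clustered, check.names = FALSE)
names(out)[1] <- corner

outfile <- tempfile(fileext = ".txt")
write.table(out, file = outfile, sep = "\t", quote = FALSE, row.names = FALSE)
```

Writing the row names as a regular column with row.names = FALSE keeps the output in the same shape as the input file. The sketch spells out TRUE and FALSE throughout, in line with the style point above.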