Is there a better way to do hierarchical clustering in R?

Posted on 2024-12-08 11:55:00

I would like to do hierarchical clustering by row and then by column. I came up with this total hack of a solution:

#! /path/to/my/Rscript --vanilla
args <- commandArgs(TRUE)
mtxf.in <- args[1]
clusterMethod <- args[2]
mtxf.out <- args[3]

mtx <- read.table(mtxf.in, as.is=T, header=T, stringsAsFactors=T)

mtx.hc <- hclust(dist(mtx), method=clusterMethod)
mtx.clustered <- as.data.frame(mtx[mtx.hc$order,])
mtx.c.colnames <- colnames(mtx.clustered)
rownames(mtx.clustered) <- mtx.clustered$topLeftColumnHeaderName
mtx.clustered$topLeftColumnHeaderName <- NULL
mtx.c.t <- as.data.frame(t(mtx.clustered), row.names=names(mtx))
mtx.c.t.hc <- hclust(dist(mtx.c.t), method=clusterMethod)
mtx.c.t.c <- as.data.frame(mtx.c.t[mtx.c.t.hc$order,])
mtx.c.t.c.t <- as.data.frame(t(mtx.c.t.c))
mtx.c.t.c.t.colnames <- as.vector(names(mtx.c.t.c.t))
names(mtx.c.t.c.t) <- mtx.c.colnames[as.numeric(mtx.c.t.c.t.colnames) + 1]

write.table(mtx.c.t.c.t, file=mtxf.out, sep='\t', quote=F, row.names=T)

The variables mtxf.in and mtxf.out represent the input matrix and clustered output matrix files, respectively. The variable clusterMethod is one of the hclust methods, such as single, average, etc.

As an example input, here's a data matrix:

topLeftColumnHeaderName col1    col2    col3    col4    col5    col6
row1    0       3       0       0       0       3
row2    6       6       6       6       6       6
row3    0       3       0       0       0       3
row4    6       6       6       6       6       6
row5    0       3       0       0       0       3
row6    0       3       0       0       0       3

Running this script, I lose my top-left corner element from mtxf.in. Here's the output that comes out of this script:

col5    col4    col1    col3    col2    col6
row6    0       0       0       0       3       3
row5    0       0       0       0       3       3
row1    0       0       0       0       3       3
row3    0       0       0       0       3       3
row2    6       6       6       6       6       6
row4    6       6       6       6       6       6

My questions: in addition to looking for a way to preserve the original structure of the input matrix file, I also don't know how much memory this consumes, or whether there are faster, cleaner, more "R"-like ways of doing this.

Is it really this hard to cluster by rows and columns in R? Are there constructive ways to improve this script? Thanks for your advice.

昨迟人 2024-12-15 11:55:00

Once you have your data cleaned (i.e. removed the first column), this really just requires three lines of code:

Clean data (assign row names from first column, then remove first column):

dat <- mtx  # the data frame read from mtxf.in in the question's script
rownames(dat) <- dat[, 1]
dat <- dat[, -1]

Cluster and reorder:

row.order <- hclust(dist(dat))$order
col.order <- hclust(dist(t(dat)))$order

dat[row.order, col.order]

Results:

     col5 col4 col1 col3 col2 col6
row6    0    0    0    0    3    3
row5    0    0    0    0    3    3
row1    0    0    0    0    3    3
row3    0    0    0    0    3    3
row2    6    6    6    6    6    6
row4    6    6    6    6    6    6
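
The question's script also takes a clusterMethod argument, which the three lines above leave at the hclust default. A minimal sketch of how the same reordering could be wrapped in a function that passes the method through; the name cluster_rows_cols and the inline example data frame are hypothetical, not part of the original post:

```r
# Hypothetical helper (not from the original post): reorder the rows and
# columns of a numeric data frame by hierarchical clustering.
cluster_rows_cols <- function(dat, method = "complete") {
  row.order <- hclust(dist(dat), method = method)$order
  col.order <- hclust(dist(t(dat)), method = method)$order
  # drop = FALSE keeps a data frame even if only one column remains
  dat[row.order, col.order, drop = FALSE]
}

# Example data reproducing the matrix from the question
dat <- data.frame(
  col1 = c(0, 6, 0, 6, 0, 0),
  col2 = c(3, 6, 3, 6, 3, 3),
  col3 = c(0, 6, 0, 6, 0, 0),
  col4 = c(0, 6, 0, 6, 0, 0),
  col5 = c(0, 6, 0, 6, 0, 0),
  col6 = c(3, 6, 3, 6, 3, 3),
  row.names = paste0("row", 1:6)
)

clustered <- cluster_rows_cols(dat, method = "average")
print(clustered)
```

Because only the index vectors from hclust are kept and the data frame is subset once, no intermediate transposed copies of the data need to be stored beyond the ones dist() builds internally.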

晨光如昨 2024-12-15 11:55:00

I'll be honest: I'm not totally clear on why you're doing some of the things you're doing, so it's entirely possible I've misunderstood what you're looking for. If I'm way off base, let me know and I'll delete this answer.

But I suspect that your life will be much easier (and your results actually correct) if you read your data in using row.names = 1 to indicate that the first column actually contains the row names. For example:

#Read the data in
d1 <- read.table(textConnection("topLeftColumnHeaderName col1    col2    col3    col4    col5    col6
 row1    0       3       0       0       0       3
 row2    6       6       6       6       6       6
 row3    0       3       0       0       0       3
 row4    6       6       6       6       6       6
 row5    0       3       0       0       0       3
 row6    0       3       0       0       0       3"),
   sep = "", as.is = TRUE, header = TRUE,
   stringsAsFactors = TRUE, row.names = 1)

#So d1 looks like this: 
d1
     col1 col2 col3 col4 col5 col6
row1    0    3    0    0    0    3
row2    6    6    6    6    6    6
row3    0    3    0    0    0    3
row4    6    6    6    6    6    6
row5    0    3    0    0    0    3
row6    0    3    0    0    0    3

#Simple clustering based on rows 
clus1 <- hclust(dist(d1))
d2 <- d1[clus1$order,]
d2
     col1 col2 col3 col4 col5 col6
row6    0    3    0    0    0    3
row5    0    3    0    0    0    3
row1    0    3    0    0    0    3
row3    0    3    0    0    0    3
row2    6    6    6    6    6    6
row4    6    6    6    6    6    6

#Now cluster on columns and display the result 
clus2 <- hclust(dist(t(d2)))
t(t(d2)[clus2$order,])
     col5 col4 col1 col3 col2 col6
row6    0    0    0    0    3    3
row5    0    0    0    0    3    3
row1    0    0    0    0    3    3
row3    0    0    0    0    3    3
row2    6    6    6    6    6    6
row4    6    6    6    6    6    6
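
Reading with row.names = 1 discards the name of the first column, which is the top-left element the question wants to keep. One possible way to carry it through to the output file is sketched below. It assumes the header line has the same number of fields as the data rows, as in the example input; the names id.name, out and out.file are illustrative only:

```r
input <- "topLeftColumnHeaderName col1 col2 col3 col4 col5 col6
row1 0 3 0 0 0 3
row2 6 6 6 6 6 6
row3 0 3 0 0 0 3
row4 6 6 6 6 6 6
row5 0 3 0 0 0 3
row6 0 3 0 0 0 3"

# Read the first column as ordinary data so its header name is still available
raw <- read.table(text = input, header = TRUE, stringsAsFactors = FALSE)
id.name <- names(raw)[1]        # remember the top-left header
d <- raw[, -1]
rownames(d) <- raw[[1]]

# Cluster rows and columns, then reorder in one step
res <- d[hclust(dist(d))$order, hclust(dist(t(d)))$order]

# Put the row names back as a regular first column under the original header
out <- cbind(setNames(data.frame(rownames(res)), id.name), res)

out.file <- tempfile(fileext = ".tsv")  # stand-in for the real output path
write.table(out, file = out.file, sep = "\t", quote = FALSE, row.names = FALSE)
header.line <- readLines(out.file, n = 1)
print(header.line)
```

If a blank top-left cell is acceptable, write.table(res, file, sep = "\t", quote = FALSE, col.names = NA) keeps the header aligned with the data columns without rebuilding the first column.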

Since you tagged this code-review, I'll also point out that, stylistically, many R folks prefer not to use T and F for booleans, since they can be masked, while TRUE and FALSE cannot.
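
A short illustration of that masking risk, written as a sketch for this thread rather than taken from the original answer: T is an ordinary variable that can be reassigned, whereas TRUE is a reserved word.

```r
T <- 0                   # legal: T is just a variable that defaults to TRUE
masked <- isTRUE(T)      # T now holds 0, so this is no longer true
rm(T)                    # remove the mask; the base definition is visible again
restored <- isTRUE(T)

# TRUE is a reserved word, so trying to assign to it raises an error
attempt <- tryCatch(
  eval(parse(text = "TRUE <- 0")),
  error = function(e) "error"
)

print(c(masked = masked, restored = restored))
print(attempt)
```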
