How do you compare the "similarity" between two dendrograms (in R)?
I have two dendrograms which I wish to compare to each other in order to find out how "similar" they are. But I don't know of any method to do so (let alone code to implement it, say, in R).
Any leads?
UPDATE (2014-09-13):
Since asking this question, I have written an R package called dendextend, for the visualization, manipulation and comparison of dendrograms. This package is on CRAN and comes with a detailed vignette. It includes functions such as cor_cophenetic, cor_bakers_gamma and Bk / Bk_plot, as well as a tanglegram function for visually comparing two trees.
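A minimal sketch of how the functions named in the update might be called. The data set (the built-in USArrests), the linkage methods and the variable names are assumptions chosen for illustration, and the dendextend calls only run if the package is installed.

```r
# Two dendrograms over the same observations, built with different linkages
# (USArrests and the linkage choices are illustrative assumptions)
d     <- dist(USArrests[1:10, ])
dend1 <- as.dendrogram(hclust(d, method = "complete"))
dend2 <- as.dendrogram(hclust(d, method = "average"))

if (requireNamespace("dendextend", quietly = TRUE)) {
  # Correlation between the two cophenetic distance matrices
  r_coph  <- dendextend::cor_cophenetic(dend1, dend2)
  # Baker's Gamma correlation between the two trees
  r_gamma <- dendextend::cor_bakers_gamma(dend1, dend2)
  print(c(cophenetic = r_coph, bakers_gamma = r_gamma))
  # Visual comparison (opens a plot):
  # dendextend::tanglegram(dend1, dend2)
}
```

Both values are correlations, so they lie between -1 and 1, and comparing a tree with itself gives 1.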
Comments (6)
Comparing dendrograms is not quite the same as comparing hierarchical clusterings, because the former includes the lengths of branches as well as the splits, but I also think that's a good start. I would suggest you read E. B. Fowlkes & C. L. Mallows (1983). "A Method for Comparing Two Hierarchical Clusterings". Journal of the American Statistical Association 78 (383): 553–584 (link).
Their approach is based on cutting the trees at each level k, getting a measure Bk that compares the groupings into k clusters, and then examining the Bk vs k plots. The measure Bk is based upon looking at pairs of objects and seeing whether they fall into the same cluster or not.
I am sure that one can write code based on this method, but first we would need to know how the dendrograms are represented in R.
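As a rough sketch of the Bk idea, assuming both trees are hclust objects built on the same observations, one can cut each tree into k clusters with cutree and compute the Fowlkes-Mallows index from the contingency table of the two labelings. The helper names fm_index and bk_curve, and the USArrests data, are made up for this illustration; this is not the code from the paper.

```r
# Fowlkes-Mallows index for two labelings of the same objects.
# fm_index and bk_curve are hypothetical helper names for this sketch.
fm_index <- function(labels1, labels2) {
  tab <- table(labels1, labels2)      # contingency table m_ij
  n   <- sum(tab)
  Tk  <- sum(tab^2) - n               # pairs placed together in both clusterings
  Pk  <- sum(rowSums(tab)^2) - n      # pairs placed together in clustering 1
  Qk  <- sum(colSums(tab)^2) - n      # pairs placed together in clustering 2
  Tk / sqrt(Pk * Qk)
}

# Bk for a range of k: cut both trees into k clusters and compare
bk_curve <- function(hc1, hc2, ks) {
  sapply(ks, function(k) fm_index(cutree(hc1, k), cutree(hc2, k)))
}

d   <- dist(USArrests[1:20, ])        # illustrative data
hc1 <- hclust(d, method = "complete")
hc2 <- hclust(d, method = "single")
ks  <- 2:10
bk  <- bk_curve(hc1, hc2, ks)

plot(ks, bk, type = "b", xlab = "k", ylab = "Bk")
```

Values of Bk near 1 mean the two trees give nearly the same grouping at that k. Note that this only uses the splits, not the branch lengths.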
As you know, dendrograms arise from hierarchical clustering, so what you are really asking is how to compare the results of two hierarchical clustering runs. There are no standard metrics I know of, but I would look at the number of clusters found and compare membership similarity between like clusters. Here is a good overview of hierarchical clustering that my colleague wrote on clustering Scotch whiskies.
Have a look at this page:
I also asked a similar question here.
It seems we can use cophenetic correlation to measure the similarity between two dendrograms. But there seems to be no function for this purpose in R currently.
EDIT (2014-09-18):
The cophenetic function in the stats package can calculate the cophenetic dissimilarity matrix, and the correlation can then be calculated using the cor function. As @Tal has pointed out, the as.dendrogram function returns the tree with a different order, which will cause wrong results if we calculate the correlation based on the dendrogram results, as shown in the example for the cor_cophenetic function in the dendextend package.
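A small base R sketch of the cophenetic correlation, written so that the leaf order does not matter: the cophenetic distances are turned into full matrices and the second matrix is re-indexed by the labels of the first before correlating. The helper name cophenetic_cor and the USArrests data are assumptions for illustration; this is not the dendextend implementation.

```r
# Two clusterings of the same observations (illustrative data and linkages)
d   <- dist(USArrests[1:10, ])
hc1 <- hclust(d, method = "complete")
hc2 <- hclust(d, method = "average")

# cophenetic_cor is a hypothetical helper name for this sketch
cophenetic_cor <- function(tree1, tree2) {
  m1 <- as.matrix(cophenetic(tree1))
  m2 <- as.matrix(cophenetic(tree2))
  labs <- rownames(m1)
  stopifnot(setequal(labs, rownames(m2)))
  m2 <- m2[labs, labs]                 # put both matrices in the same label order
  cor(m1[lower.tri(m1)], m2[lower.tri(m2)])
}

r_hclust <- cophenetic_cor(hc1, hc2)
# Same trees as dendrogram objects: the alignment step keeps the result consistent
r_dend   <- cophenetic_cor(as.dendrogram(hc1), as.dendrogram(hc2))
print(c(hclust = r_hclust, dendrogram = r_dend))
```

Aligning by label is the step that guards against the ordering problem described above, since the two cophenetic matrices are compared entry by entry.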
If you have access to the underlying distance matrix that generated each dendrogram (you probably do if you generated the dendrograms in R), couldn't you just use the correlation between the corresponding values of the two matrices? I know this doesn't address the letter of what you asked, but it's a good solution to the spirit of what you asked.
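A minimal sketch of that suggestion, assuming both distance matrices are dist objects computed on the same observations in the same order. The data and the two distance methods are illustrative choices only.

```r
# Two distance matrices over the same observations (illustrative choices)
x  <- scale(USArrests)
d1 <- dist(x, method = "euclidean")
d2 <- dist(x, method = "manhattan")

# A dist object stores the lower triangle, so the entries line up pairwise
r <- cor(as.vector(d1), as.vector(d2))
print(r)
```

The entries of a distance matrix are not independent of each other, so a p-value from an ordinary correlation test would not be reliable here; a permutation-based approach would be needed for that.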
Take a look at this page that has lots of information about software that deals with trees, including dendrograms. I noticed several tools that deal with tree comparison, although I haven't personally used any of them yet. There are a number of references cited there also.
There is a rich body of literature on tree distance metrics in the phylogenetics community that seems to have been neglected from the computer science perspective. See dist.topo in the ape package for two tree distance metrics and several citations (Penny and Hendy 1985, Kuhner and Felsenstein 1994) which consider the similarity of tree partitions, and also the Robinson-Foulds metric, which has an R implementation in the phangorn package.
One problem is that these metrics don't have a fixed scale, so they are only useful in the cases of 1) tree comparison or 2) comparison to some generated baseline, perhaps via permutation tests similar to what Tal has done with Baker's Gamma in his fantastic dendextend package.
If you have hclust or dendrogram objects generated from R hierarchical clustering, using as.phylo from the ape package will convert your dendrograms to phylogenetic trees for use in these functions.
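A short sketch of the conversion route, assuming the ape package is installed (the block is skipped otherwise). The data and linkage methods are illustrative, and only the default PH85 method of dist.topo is shown.

```r
# Two hclust trees over the same observations (illustrative data and linkages)
d   <- dist(USArrests[1:10, ])
hc1 <- hclust(d, method = "complete")
hc2 <- hclust(d, method = "single")

if (requireNamespace("ape", quietly = TRUE)) {
  # Convert the hclust objects to phylogenetic trees
  tr1 <- ape::as.phylo(hc1)
  tr2 <- ape::as.phylo(hc2)
  # Topological distance based on tree partitions (default method "PH85")
  d_same <- as.numeric(ape::dist.topo(tr1, tr1))
  d_diff <- as.numeric(ape::dist.topo(tr1, tr2))
  print(c(same = d_same, different = d_diff))
}
```

A tree compared with itself gives 0. Because the scale is not fixed, the value for two different trees is mainly meaningful relative to other comparisons or to a permutation baseline, as noted above.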