How do I apply hierarchical or k-means cluster analysis using R?



I want to apply a hierarchical cluster analysis with R. I am aware of the hclust() function but not how to use this in practice; I'm stuck with supplying the data to the function and processing the output.

I would also like to compare the hierarchical clustering with that produced by kmeans(). Again I am not sure how to call this function or use/manipulate the output from it.

My data are similar to:

## dummy data
require(MASS)
set.seed(1)
dat <- data.frame(mvrnorm(100, mu = c(2,6,3), 
                          Sigma = matrix(c(10,   2,   4,
                                            2,   3, 0.5,
                                            4, 0.5,   2), ncol = 3)))


Answer (posted 2024-11-07 by 莳間冲淡了誓言ζ):


For hierarchical cluster analysis, take a good look at ?hclust and run its examples. Alternative functions are in the cluster package that comes with R; k-means clustering is available in the function kmeans() and also in the cluster package (a short sketch using the cluster package follows the dummy data below).

A simple hierarchical cluster analysis of the dummy data you show would be done as follows:

## dummy data first
require(MASS)
set.seed(1)
dat <- data.frame(mvrnorm(100, mu = c(2,6,3), 
                          Sigma = matrix(c(10,   2,   4,
                                            2,   3, 0.5,
                                            4, 0.5,   2), ncol = 3)))
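
As an aside, the cluster-package alternatives mentioned above could be tried in much the same way. A rough sketch (agnes() is the agglomerative hierarchical routine, pam() a k-medoids analogue of k-means):

## sketch only: cluster-package counterparts of hclust() and kmeans()
library(cluster)
ag <- agnes(scale(dat), method = "average")  ## agglomerative hierarchical clustering
pm <- pam(scale(dat), k = 3)                 ## partitioning around medoids (k-medoids)

Both returned objects have their own print() and plot() methods; the rest of this answer sticks with hclust() and kmeans().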

Compute the dissimilarity matrix using Euclidean distances (you can use whatever distance you want)

dij <- dist(scale(dat, center = TRUE, scale = TRUE))
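
Purely as an illustration, the same computation with Manhattan rather than Euclidean distances would be

## illustrative only: any metric accepted by dist() could be used here
dij.man <- dist(scale(dat, center = TRUE, scale = TRUE), method = "manhattan")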

Then cluster them, say using the group average hierarchical method

clust <- hclust(dij, method = "average")
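
Other linkage methods are selected via the method argument in the same way; for example (illustrative only)

## illustrative alternatives: complete linkage and Ward's method
clust.complete <- hclust(dij, method = "complete")
clust.ward     <- hclust(dij, method = "ward.D2")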

Printing the result gives us:

R> clust

Call:
hclust(d = dij, method = "average")

Cluster method   : average 
Distance         : euclidean 
Number of objects: 100

but that simple output belies a complex object that needs further functions to extract or use the information contained therein:

R> str(clust)
List of 7
 $ merge      : int [1:99, 1:2] -12 -17 -40 -30 -73 -23 1 -52 -91 -45 ...
 $ height     : num [1:99] 0.0451 0.0807 0.12 0.1233 0.1445 ...
 $ order      : int [1:100] 84 14 24 67 46 34 49 36 41 52 ...
 $ labels     : NULL
 $ method     : chr "average"
 $ call       : language hclust(d = dij, method = "average")
 $ dist.method: chr "euclidean"
 - attr(*, "class")= chr "hclust"
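
Those components can be pulled out with $ in the usual way; as a small illustration, the merge heights are often useful when deciding where to cut the tree:

## the heights at which successive merges happen
range(clust$height)
tail(sort(clust$height))  ## the few largest (last) merges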

The dendrogram can be generated using the plot() method (hang gets the labels at the bottom of the dendrogram, along the x-axis, and cex just shrinks all the labels to 70% of normal)

plot(clust, hang = -0.01, cex = 0.7)

[figure: dendrogram of the average-linkage clustering]
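
If it helps to see the grouping on the tree itself, rect.hclust() can draw boxes around a chosen number of groups on an existing dendrogram plot (a small illustrative addition; it must be called after plot()):

## highlight a 3-group solution on the current dendrogram
plot(clust, hang = -0.01, cex = 0.7)
rect.hclust(clust, k = 3, border = "red")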

Say we want a 3-cluster solution, cut the dendrogram to produce 3 groups and return the cluster memberships

R> cutree(clust, k = 3)
  [1] 1 2 1 2 2 3 2 2 2 3 2 2 3 1 2 2 2 2 2 2 2 2 2 1 2 3 2 1 1 2 2 2 2 1 1 1 1
 [38] 2 2 2 1 3 2 2 1 1 3 2 1 2 2 1 2 1 2 2 3 1 2 3 2 2 2 3 1 3 1 2 2 2 3 1 2 1
 [75] 1 2 3 3 3 3 1 3 2 1 2 2 2 1 2 2 1 2 2 2 2 2 3 1 1 1

That is, cutree() returns a vector the same length as the number of observations clustered, the elements of which contain the group ID to which each observation belongs. The membership is the ID of the group into which each observation falls when the dendrogram is cut at a stated height or, as done here, at the appropriate height to provide the stated number of groups.
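
As a quick illustration, you can tabulate the resulting group sizes, or cut at a stated height instead of asking for a fixed number of groups (the h value below is arbitrary, chosen only to show the syntax):

table(cutree(clust, k = 3))  ## how many observations fall in each group
cutree(clust, h = 2)         ## cut the tree at height 2 instead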

Perhaps that gives you enough to be going on with?

For k-means, we would do this

set.seed(2) ## *k*-means uses a random start
klust <- kmeans(scale(dat, center = TRUE, scale = TRUE), centers = 3)
klust

which gives

> klust
K-means clustering with 3 clusters of sizes 41, 27, 32

Cluster means:
           X1          X2          X3
1  0.04467551  0.69925741 -0.02678733
2  1.11018549 -0.01169576  1.16870206
3 -0.99395950 -0.88605526 -0.95177110

Clustering vector:
  [1] 3 1 3 2 2 3 1 1 1 1 2 1 1 3 2 3 1 2 1 2 2 1 1 3 2 1 1 3 3 1 2 2 1 3 3 3 3
 [38] 1 2 2 3 1 2 2 3 3 1 2 3 2 1 3 1 3 2 2 1 3 2 1 2 1 1 1 3 1 3 2 1 2 1 3 1 3
 [75] 3 1 1 1 1 1 3 1 2 3 1 1 1 3 1 1 3 2 2 1 2 2 3 3 3 3

Within cluster sum of squares by cluster:
[1] 47.27597 31.52213 42.15803
 (between_SS / total_SS =  59.3 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"

Here we get some information about the components in the object returned by kmeans(). The $cluster component will yield the membership vector, comparable to the output we saw earlier from cutree():

R> klust$cluster
  [1] 3 1 3 2 2 3 1 1 1 1 2 1 1 3 2 3 1 2 1 2 2 1 1 3 2 1 1 3 3 1 2 2 1 3 3 3 3
 [38] 1 2 2 3 1 2 2 3 3 1 2 3 2 1 3 1 3 2 2 1 3 2 1 2 1 1 1 3 1 3 2 1 2 1 3 1 3
 [75] 3 1 1 1 1 1 3 1 2 3 1 1 1 3 1 1 3 2 2 1 2 2 3 3 3 3
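
To actually compare the two solutions, a simple cross-tabulation of the two membership vectors is often enough. The group labels themselves are arbitrary in both methods, so agreement shows up as a few large cells rather than as a clean diagonal:

## cross-classify hierarchical vs k-means memberships
table(hclust = cutree(clust, k = 3), kmeans = klust$cluster)

In practice you might also give kmeans() something like nstart = 25 so that it keeps the best of several random starts rather than relying on a single one.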

In both instances, notice that I also scale (standardise) the data to allow each variable to be compared on a common scale. With data measured in different "units" or on different scales (as here with different means and variances) this is an important data processing step if the results are to be meaningful or not dominated by the variables that have large variances.
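
As a quick illustration of why that matters for these data, compare the spread of the raw columns with the scaled ones:

apply(dat, 2, sd)          ## the raw standard deviations differ noticeably
apply(scale(dat), 2, sd)   ## after scaling, every column has sd = 1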
