In a document clustering process, as a data pre-processing step, I first applied singular value decomposition to obtain U, S and Vt, and then, by choosing a suitable number of singular values, I truncated Vt, which now gives me good document-document correlations, from what I read here. I am now performing clustering on the columns of the matrix Vt to cluster similar documents together, and for this I chose k-means. The initial results looked acceptable to me (with k = 10 clusters), but I wanted to dig a bit deeper into choosing the value of k itself. To determine the number of clusters k in k-means, it was suggested that I look at cross-validation.

Before implementing it, I wanted to figure out whether there is a built-in way to achieve this using numpy or scipy. Currently, I am performing k-means by simply using the function from scipy:
import numpy
from scipy.linalg import svd
from scipy.cluster.vq import whiten, kmeans2

# Preprocess the data and compute the SVD
U, S, Vt = svd(A)  # A is the TF-IDF representation of the original term-document matrix

# Obtain the document vectors from the truncated Vt; each column of Vt
# corresponds to a document. The 50 is the threshold obtained after
# examining a scree plot of S.
docvectors = numpy.transpose(Vt[0:50, :])

# Prepare the data and run k-means with 10 clusters
whitened = whiten(docvectors)
res, idx = kmeans2(whitened, 10, iter=20)
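As an aside, rather than eyeballing the scree plot, the truncation rank could also be picked programmatically from the singular values; a minimal sketch (the choose_rank helper and the 0.9 energy threshold are illustrative assumptions, not part of my actual pipeline):

import numpy as np

# Hypothetical helper: smallest rank whose singular values retain a given
# fraction of the total squared "energy" of S (0.9 is an arbitrary choice)
def choose_rank(S, energy=0.9):
    cumulative = np.cumsum(S ** 2) / np.sum(S ** 2)
    return int(np.searchsorted(cumulative, energy)) + 1

rank = choose_rank(S)        # might land near the 50 used above
docvectors = Vt[:rank, :].T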
Assuming my methodology is correct so far (please correct me if I am missing some step), at this stage, what is the standard way of using this output to perform cross-validation? Any references/implementations/suggestions on how this would apply to k-means would be greatly appreciated.
To run k-fold cross-validation, you need some measure of quality to optimize for. This could be either a classification measure such as accuracy or F1, or a specialized one such as the V-measure.

Even the clustering quality measures that I know of need a labeled dataset ("ground truth") to work; the difference from classification is that only part of your data needs to be labeled for the evaluation, while the k-means algorithm can make use of all the data to determine the centroids, and thus the clusters.

The V-measure and several other scores are implemented in scikit-learn, as well as generic cross-validation code and a "grid search" module that optimizes according to a specified evaluation measure using k-fold CV. Disclaimer: I'm involved in scikit-learn development, though I didn't write any of the code mentioned.
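A minimal sketch of how such an evaluation could be wired up by hand (this is not the grid-search module itself; whitened comes from the question, while labels and labeled_idx, the ground-truth classes for a labeled subset of the documents, are hypothetical names):

from scipy.cluster.vq import kmeans2
from sklearn.metrics import v_measure_score

# Score a range of k values against the partially labeled subset;
# the V-measure only needs the assignments of the labeled documents
scores = {}
for k in range(2, 21):
    _, idx = kmeans2(whitened, k, iter=20)
    scores[k] = v_measure_score(labels, idx[labeled_idx])

best_k = max(scores, key=scores.get)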
Indeed, to do traditional cross-validation with the F1-score or V-measure as the scoring function, you would need some labeled data as ground truth. But in that case, you could just count the number of classes in the ground-truth dataset and use that as your optimal value for k, hence there is no need for cross-validation.

Alternatively, you could use a cluster stability measure as an unsupervised performance evaluation and do some kind of cross-validation procedure on top of it. However, this is not yet implemented in scikit-learn, even though it's still on my personal todo list.

You can find additional info on this approach in the following answer on metaoptimize.com/qa. In particular, you should read Clustering Stability: An Overview by Ulrike von Luxburg.
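To give a flavor of one such stability procedure, here is a hand-rolled sketch (not scikit-learn code): cluster two random subsamples, then compare their assignments on the shared points with the adjusted Rand index, which is permutation-invariant; the stability helper and all its parameters are illustrative assumptions.

import numpy as np
from scipy.cluster.vq import kmeans2
from sklearn.metrics import adjusted_rand_score

def stability(data, k, n_pairs=10, frac=0.8):
    n = data.shape[0]
    scores = []
    for _ in range(n_pairs):
        # Draw two overlapping random subsamples and cluster each one
        a = np.random.choice(n, int(frac * n), replace=False)
        b = np.random.choice(n, int(frac * n), replace=False)
        _, idx_a = kmeans2(data[a], k, iter=20)
        _, idx_b = kmeans2(data[b], k, iter=20)
        # Compare the two labelings on the points both subsamples share
        shared = np.intersect1d(a, b)
        pos_a = {v: i for i, v in enumerate(a)}
        pos_b = {v: i for i, v in enumerate(b)}
        la = [idx_a[pos_a[s]] for s in shared]
        lb = [idx_b[pos_b[s]] for s in shared]
        scores.append(adjusted_rand_score(la, lb))
    return np.mean(scores)

# Higher mean ARI across subsample pairs = more stable clustering at that k
best_k = max(range(2, 21), key=lambda k: stability(whitened, k))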
Here they use withinss to find an optimal number of clusters: "withinss" is an attribute of the kmeans object returned (in R), and it can be used to find a minimum "error".
https://www.statmethods.net/advstats/cluster.html
This formula isn't exactly it, but I'm working on one myself. The model would still change every time, but it would at least be the best model out of a bunch of iterations.
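The linked recipe is written in R; a rough scipy translation of the same elbow idea could look like this (purely a sketch: scipy's kmeans returns the mean distortion, which plays a role similar to R's withinss, and whitened is assumed from the question):

import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans

# kmeans (unlike kmeans2) also returns the mean distortion: the average
# Euclidean distance from each point to its nearest centroid
ks = range(2, 21)
distortions = [kmeans(whitened, k, iter=20)[1] for k in ks]

# Look for the "elbow" where the distortion stops dropping sharply
plt.plot(list(ks), distortions, marker="o")
plt.xlabel("k")
plt.ylabel("mean distortion")
plt.show()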