In a document clustering process, as a data pre-processing step, I first applied singular value decomposition to obtain U, S and Vt, and then, by choosing a suitable number of singular values, I truncated Vt, which now gives me good document-document correlations, from what I read here. I am now performing clustering on the columns of the matrix Vt to cluster similar documents together, and for this I chose k-means. The initial results looked acceptable to me (with k = 10 clusters), but I wanted to dig a bit deeper into choosing the value of k itself. To determine the number of clusters k in k-means, it was suggested that I look at cross-validation.

Before implementing it, I wanted to figure out whether there is a built-in way to achieve this using numpy or scipy. Currently, I am performing k-means by simply using the function from scipy:
import numpy
from scipy.linalg import svd
from scipy.cluster.vq import whiten, kmeans2

# Preprocess the data and compute the SVD
U, S, Vt = svd(A)  # A is the TF-IDF representation of the original term-document matrix

# Obtain the document vectors from the truncated Vt; each column of Vt
# corresponds to a document. The 50 is the threshold obtained after
# examining a scree plot of S.
docvectors = numpy.transpose(Vt[0:50, :])

# Prepare the data and run k-means with 10 clusters
whitened = whiten(docvectors)
res, idx = kmeans2(whitened, 10, iter=20)
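As an aside, rather than eyeballing the scree plot, the truncation rank could also be picked programmatically from the singular values; a minimal sketch (the choose_rank helper and the 0.9 energy threshold are illustrative assumptions, not part of my actual pipeline):

import numpy as np

# Hypothetical helper: smallest rank whose singular values retain a given
# fraction of the total squared "energy" of S (0.9 is an arbitrary choice)
def choose_rank(S, energy=0.9):
    cumulative = np.cumsum(S ** 2) / np.sum(S ** 2)
    return int(np.searchsorted(cumulative, energy)) + 1

rank = choose_rank(S)        # might land near the 50 used above
docvectors = Vt[:rank, :].T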
Assuming my methodology is correct so far (please correct me if I am missing some step), at this stage, what is the standard way of using this output to perform cross-validation? Any references/implementations/suggestions on how this would apply to k-means would be greatly appreciated.
To run k-fold cross-validation, you need some measure of quality to optimize for. This could be either a classification measure such as accuracy or F1, or a specialized one such as the V-measure.

Even the clustering quality measures that I know of need a labeled dataset ("ground truth") to work; the difference from classification is that only part of your data needs to be labeled for the evaluation, while the k-means algorithm can make use of all the data to determine the centroids, and thus the clusters.

The V-measure and several other scores are implemented in scikit-learn, as well as generic cross-validation code and a "grid search" module that optimizes according to a specified evaluation measure using k-fold CV. Disclaimer: I'm involved in scikit-learn development, though I didn't write any of the code mentioned.
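A minimal sketch of how such an evaluation could be wired up by hand (this is not the grid-search module itself; whitened comes from the question, while labels and labeled_idx, the ground-truth classes for a labeled subset of the documents, are hypothetical names):

from scipy.cluster.vq import kmeans2
from sklearn.metrics import v_measure_score

# Score a range of k values against the partially labeled subset;
# the V-measure only needs the assignments of the labeled documents
scores = {}
for k in range(2, 21):
    _, idx = kmeans2(whitened, k, iter=20)
    scores[k] = v_measure_score(labels, idx[labeled_idx])

best_k = max(scores, key=scores.get)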
Indeed, to do traditional cross-validation with the F1-score or V-measure as the scoring function, you would need some labeled data as ground truth. But in that case, you could just count the number of classes in the ground-truth dataset and use that as your optimal value for k, hence there is no need for cross-validation.

Alternatively, you could use a cluster stability measure as an unsupervised performance evaluation and do some kind of cross-validation procedure on top of it. However, this is not yet implemented in scikit-learn, even though it's still on my personal todo list.

You can find additional info on this approach in the following answer on metaoptimize.com/qa. In particular, you should read Clustering Stability: An Overview by Ulrike von Luxburg.
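To give a flavor of one such stability procedure, here is a hand-rolled sketch (not scikit-learn code): cluster two random subsamples, then compare their assignments on the shared points with the adjusted Rand index, which is permutation-invariant; the stability helper and all its parameters are illustrative assumptions.

import numpy as np
from scipy.cluster.vq import kmeans2
from sklearn.metrics import adjusted_rand_score

def stability(data, k, n_pairs=10, frac=0.8):
    n = data.shape[0]
    scores = []
    for _ in range(n_pairs):
        # Draw two overlapping random subsamples and cluster each one
        a = np.random.choice(n, int(frac * n), replace=False)
        b = np.random.choice(n, int(frac * n), replace=False)
        _, idx_a = kmeans2(data[a], k, iter=20)
        _, idx_b = kmeans2(data[b], k, iter=20)
        # Compare the two labelings on the points both subsamples share
        shared = np.intersect1d(a, b)
        pos_a = {v: i for i, v in enumerate(a)}
        pos_b = {v: i for i, v in enumerate(b)}
        la = [idx_a[pos_a[s]] for s in shared]
        lb = [idx_b[pos_b[s]] for s in shared]
        scores.append(adjusted_rand_score(la, lb))
    return np.mean(scores)

# Higher mean ARI across subsample pairs = more stable clustering at that k
best_k = max(range(2, 21), key=lambda k: stability(whitened, k))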
Here they use withinss to find an optimal number of clusters: "withinss" is an attribute of the kmeans object returned (in R), and it can be used to find a minimum "error".
https://www.statmethods.net/advstats/cluster.html
This formula isn't exactly it, but I'm working on one myself. The model would still change every time, but it would at least be the best model out of a bunch of iterations.
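The linked recipe is written in R; a rough scipy translation of the same elbow idea could look like this (purely a sketch: scipy's kmeans returns the mean distortion, which plays a role similar to R's withinss, and whitened is assumed from the question):

import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans

# kmeans (unlike kmeans2) also returns the mean distortion: the average
# Euclidean distance from each point to its nearest centroid
ks = range(2, 21)
distortions = [kmeans(whitened, k, iter=20)[1] for k in ks]

# Look for the "elbow" where the distortion stops dropping sharply
plt.plot(list(ks), distortions, marker="o")
plt.xlabel("k")
plt.ylabel("mean distortion")
plt.show()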