Data clustering algorithms
What is the most popular text clustering algorithm that deals with large dimensions and huge datasets and is fast?
I am getting confused after reading so many papers and so many approaches... now I just want to know which one is used most, to have a good starting point for writing a document clustering application.
5 Answers
To deal with the curse of dimensionality you can try to determine the blind sources (i.e. topics) that generated your dataset. You could use Principal Component Analysis or Factor Analysis to reduce the dimensionality of your feature set and to compute useful indices. PCA is what is used in Latent Semantic Indexing, since the underlying SVD can be shown to be equivalent to PCA. :)
Remember that you can lose interpretability when you take the principal components of your dataset or its factors, so you may want to go the Non-Negative Matrix Factorization route. (And here is the punch line: K-Means is a particular case of NNMF!) In NNMF the dataset can be explained just by its additive, non-negative components.
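For concreteness, here is a minimal sketch of both routes in Python, assuming scikit-learn is available; the toy corpus and the choice of two components are purely illustrative, not part of the answer above:

```python
# Sketch: reduce a tf-idf document matrix via truncated SVD (the LSA route)
# and via NMF (the additive, non-negative route). Toy data, assumed setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, NMF

docs = [
    "solar panels convert sunlight into electricity",
    "wind turbines generate renewable power",
    "the team won the championship game",
    "the striker scored twice in the final",
]

X = TfidfVectorizer().fit_transform(docs)   # high-dimensional, sparse, non-negative

# LSA route: truncated SVD of the tf-idf matrix (PCA-like; components may be negative).
lsa_docs = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# NNMF route: additive, non-negative document-topic weights, easier to interpret.
nmf_docs = NMF(n_components=2, random_state=0).fit_transform(X)

print(lsa_docs.shape, nmf_docs.shape)       # (4, 2) (4, 2)
```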
There is no one-size-fits-all approach. Hierarchical clustering is always an option. If you want distinct groups formed out of the data, you can go with K-means clustering (it is also supposedly less computationally intensive).
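As a rough illustration of the two options named here (scikit-learn assumed; synthetic blob data rather than real documents, chosen only to keep the sketch self-contained):

```python
# Sketch: K-means vs. agglomerative (hierarchical) clustering on the same matrix.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

X, _ = make_blobs(n_samples=200, n_features=20, centers=3, random_state=0)

# K-means: k distinct, non-overlapping groups; roughly linear work per iteration.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical: builds a full merge tree, here cut at 3 clusters.
hc_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

print(km_labels[:10], hc_labels[:10])
```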
The two most popular document clustering approaches are hierarchical clustering and k-means. k-means is faster, as it is linear in the number of documents, as opposed to hierarchical clustering, which is quadratic but generally believed to give better results. Each document in the dataset is usually represented as an n-dimensional vector (n is the number of words), with the magnitude of the dimension corresponding to each word equal to its term frequency-inverse document frequency (tf-idf) score. The tf-idf score reduces the importance of high-frequency words in the similarity calculation. Cosine similarity is often used as the similarity measure.
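A small sketch of that representation, assuming Python with scikit-learn (the three toy documents are made up for illustration):

```python
# Sketch: tf-idf document vectors and their pairwise cosine similarities.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["machine learning for text",
        "text mining and machine learning",
        "cooking recipes for beginners"]

X = TfidfVectorizer().fit_transform(docs)   # one n-dimensional tf-idf vector per document
S = cosine_similarity(X)                    # 3 x 3 matrix of pairwise similarities
print(S.round(2))                           # docs 0 and 1 score much higher than 0 and 2
```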
A paper comparing experimental results between hierarchical clustering and bisecting k-means (a cousin algorithm of k-means) can be found here.
The simplest approaches to dimensionality reduction in document clustering are: a) throw out all rare and highly frequent words (say, those occurring in less than 1% or more than 60% of documents: these cutoffs are somewhat arbitrary, and you need to try different ranges for each dataset to see the impact on results); b) stopping: throw out all words in a stop list of common English words (lists can be found online); and c) stemming, i.e. removing suffixes to leave only word roots. The most common stemmer is the one designed by Martin Porter; implementations in many languages can be found here. Usually this will reduce the number of unique words in a dataset to a few hundred or low thousands, and further dimensionality reduction may not be required. Otherwise, techniques like PCA can be used.
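A hedged sketch of those three steps, assuming scikit-learn and NLTK's Porter stemmer are available (the `tokenize` helper and the exact cutoffs are illustrative, not a standard recipe):

```python
# Sketch: frequency cutoffs (a), stop-word removal (b), and Porter stemming (c)
# wired into a single TfidfVectorizer.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def tokenize(text):
    # b) drop common English stop words, then c) Porter-stem the remaining tokens
    return [stemmer.stem(t) for t in text.lower().split() if t not in ENGLISH_STOP_WORDS]

vectorizer = TfidfVectorizer(
    tokenizer=tokenize,
    token_pattern=None,  # silence the unused token_pattern warning
    min_df=0.01,         # a) drop words appearing in fewer than 1% of documents
    max_df=0.60,         # a) drop words appearing in more than 60% of documents
)
# Then use vectorizer.fit_transform(corpus) as usual to get the reduced matrix.
```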
I would stick with k-medoids, since you can compute the distance from any point to any other point at the beginning of the algorithm. You only need to do this once, and it saves you time, especially when there are many dimensions. The algorithm works by choosing as the center of a cluster the actual data point closest to the rest of that cluster (the medoid), rather than a centroid calculated as the average of the points belonging to that cluster. Therefore all possible distance calculations are already done for you in this algorithm.
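Here is a minimal NumPy sketch of that idea: all pairwise distances are computed once up front and reused in every iteration. The `kmedoids` function, its parameters, and the toy data are illustrative, not a library call:

```python
# Sketch: plain k-medoids over a precomputed distance matrix D.
import numpy as np

def kmedoids(D, k, n_iter=100, seed=0):
    """D: (n, n) precomputed distance matrix; returns (medoid indices, labels)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)        # assign each point to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # new medoid = member minimizing total distance to the other members
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(D[:, medoids], axis=1)

# Toy usage: 5 points on a line; for documents, D could instead be
# 1 - cosine similarity of tf-idf vectors, still computed only once.
pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1])
D = np.abs(pts[:, None] - pts[None, :])
print(kmedoids(D, k=2))
```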
In the case where you aren't looking for semantic text clustering (I can't tell from your original question whether this is a requirement), try using the Levenshtein distance and building a similarity matrix with it. From this, you can use k-medoids to cluster and subsequently validate your clustering with silhouette coefficients. Unfortunately, Levenshtein can be quite slow, but there are ways to speed it up using thresholds and other methods.
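A rough sketch of that pipeline, with a plain dynamic-programming Levenshtein distance and scikit-learn's `silhouette_score` on a precomputed distance matrix; the toy strings and the hand-assigned labels are only illustrative (in practice the labels would come from k-medoids on `D`):

```python
# Sketch: Levenshtein distance matrix plus silhouette validation.
import numpy as np
from sklearn.metrics import silhouette_score

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

texts = ["apple pie", "apple tart", "stock market", "stock markets"]
n = len(texts)
D = np.array([[levenshtein(texts[i], texts[j]) for j in range(n)]
              for i in range(n)], dtype=float)

# Suppose a clustering (e.g. k-medoids on D) produced these labels:
labels = np.array([0, 0, 1, 1])
print(silhouette_score(D, labels, metric="precomputed"))
```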
Another way to deal with the curse of dimensionality would be to find "contrasting sets": conjunctions of attribute-value pairs that are more prominent in one group than in the rest. You can then use those contrasting sets as dimensions, either in lieu of the original attributes or together with a restricted number of them.