Mahout K-means clustering gives me results of "0 belongs to cluster 1.0: []"

Posted 2024-11-28 15:11:08

I ran the K-means clustering algorithm against a set of sequence files. However, the generated result looks like this:

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

The program I am using is adapted from NewsKMeansClustering.java, the example given in chapter 9 of Mahout in Action.

Could you let me know why I am getting this kind of result? Is it caused by a specific parameter setting requirement, or by something else?

The core clustering code in this program is:

// Canopy: t1 = 250, t2 = 120
CanopyDriver.run(vectorsFolder, canopyCentroids, new EuclideanDistanceMeasure(),
        250, 120, false, false);

// K-means: convergenceDelta = 0.01, maxIterations = 20
KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids, "clusters-0"),
        clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, false);
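
For context, lines like "0 belongs to cluster 1.0: []" are presumably produced by the read-back loop at the end of the NewsKMeansClustering.java listing, which iterates the clusteredPoints output written by KMeansDriver. In Mahout 0.5 each record there pairs an IntWritable cluster id with a WeightedVectorWritable (a weight plus the point's vector), so the "1.0" appears to be the weight and the empty "[]" the document vector itself. The sketch below is a minimal version of such a loop; the directory layout, part-file name, and print order are assumptions and may differ from your copy of the listing.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable; // package location as of Mahout 0.5

// clusterOutput is the same Path that was passed to KMeansDriver.run above
void dumpClusteredPoints(Configuration conf, Path clusterOutput) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    // runClustering=true makes KMeansDriver write point assignments under clusteredPoints
    Path points = new Path(clusterOutput, "clusteredPoints/part-m-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, points, conf);
    IntWritable clusterId = new IntWritable();
    WeightedVectorWritable point = new WeightedVectorWritable();
    while (reader.next(clusterId, point)) {
        // WeightedVectorWritable prints its weight and vector, e.g. "1.0: []" for an empty vector
        System.out.println(clusterId + " belongs to cluster " + point);
    }
    reader.close();
}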


Comments (2)

来世叙缘 2024-12-05 15:11:08

I ran into the same issue using Mahout 0.5.
I think the problem is that the normPower parameter is used in both functions, so normalization ends up being applied twice.
Try code similar to this:

DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
                outputDir, conf, minSupport, maxNGramSize,
                minLLRValue,
                -1.0f, // no normalization at the term-frequency stage
                logNormalize, numReducers, chunkSize,
                sequentialAccessOutput, namedVector);
TFIDFConverter.processTfIdf(vectorOutput, new Path(outputDir, "tfidf"),
                conf, chunkSize, minDf,
                maxDFPercent, normPower, // normalization is applied only here
                logNormalize, sequentialAccessOutput, namedVector,
                numReducers);

After that I stopped having problems with empty clusters.
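
As a quick sanity check for this fix, it can help to dump the tf-idf vectors before clustering and verify that they are non-empty. Below is a rough sketch under the assumption that the output of processTfIdf lands in a tfidf-vectors subdirectory; the countEmptyVectors helper, the example path, and the part-file name are illustrative, not taken from the answer.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

// tfidfVectors would be something like new Path(outputDir, "tfidf/tfidf-vectors/part-r-00000")
// for the listing above; the exact part-file name depends on the Hadoop/Mahout version.
void countEmptyVectors(Configuration conf, Path tfidfVectors) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, tfidfVectors, conf);
    Text docId = new Text();                 // seq2sparse-style output keys documents by name
    VectorWritable vec = new VectorWritable();
    int empty = 0, total = 0;
    while (reader.next(docId, vec)) {
        total++;
        if (vec.get().getNumNondefaultElements() == 0) {
            empty++;                         // no non-zero entries: nothing for k-means to work with
        }
    }
    reader.close();
    System.out.println(empty + " of " + total + " tf-idf vectors are empty");
}

If most vectors come back empty, the problem is upstream of the clustering itself.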

音盲 2024-12-05 15:11:08

I had this problem. As a newbie it was very difficult to solve. However, in my case, I realised that the T1 and T2 values for the canopy clustering were only valid for the provided Reuters data (and the Euclidean norm). I had used my own document data, which seemed to have an inherently different distribution of distances between document vectors. So I did some rudimentary analysis and then re-estimated T1 and T2 from my own data. Then things worked. See also my post at...

How to pick the T1 and T2 threshold values for Canopy Clustering?

Hope this helps.
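
For the "rudimentary analysis" step, one simple approach is to sample pairwise distances between your own document vectors with the same distance measure Canopy will use, then pick T2 and T1 (T2 < T1) from that distribution. The sketch below only illustrates that idea and is not the code from the linked post; the sampleDistances helper, the sample size, and the percentile choices are all assumptions for the example.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// vectorsFile: one part file of the vectors that will later be fed to CanopyDriver
void sampleDistances(Configuration conf, Path vectorsFile, int pairs) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, vectorsFile, conf);
    Text key = new Text();
    VectorWritable value = new VectorWritable();
    List<Vector> sample = new ArrayList<Vector>();
    while (sample.size() < 1000 && reader.next(key, value)) {
        sample.add(value.get().clone());     // clone in case the writable reuses its vector
    }
    reader.close();
    if (sample.size() < 2) {
        return;                              // not enough vectors to compare
    }

    EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();
    Random rand = new Random(42);
    List<Double> distances = new ArrayList<Double>();
    for (int i = 0; i < pairs; i++) {
        Vector a = sample.get(rand.nextInt(sample.size()));
        Vector b = sample.get(rand.nextInt(sample.size()));
        distances.add(measure.distance(a, b));
    }
    Collections.sort(distances);
    // Look at the spread and pick T2 below T1 inside it; the percentiles printed here are
    // only a starting point, since suitable values depend entirely on the data.
    System.out.println("20th percentile: " + distances.get(distances.size() / 5));
    System.out.println("median:          " + distances.get(distances.size() / 2));
    System.out.println("80th percentile: " + distances.get(distances.size() * 4 / 5));
}

With those numbers in hand, the T1 and T2 arguments to CanopyDriver.run can be chosen so that both sit inside the observed range of distances for your data.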
