Mahout K-means clustering gives me results of "0 belongs to cluster 1.0: []"

Posted 2024-11-28 15:11:08

I ran the K-means clustering algorithm against a set of sequence files. However, the generated result looks like this:

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

The program I am using is adapted from NewsKMeansClustering.java, the example given in chapter 9 of Mahout in Action.

Could you let me know why I am getting this kind of result? Is it caused by a specific parameter setting requirement, or by something else?

The core clustering code in this program is:

// Canopy: t1 = 250, t2 = 120
CanopyDriver.run(vectorsFolder, canopyCentroids, new EuclideanDistanceMeasure(),
        250, 120, false, false);

// K-means: convergenceDelta = 0.01, maxIterations = 20
KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids, "clusters-0"),
        clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, false);
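
For context, lines like "0 belongs to cluster 1.0: []" are presumably produced by the read-back loop at the end of the NewsKMeansClustering.java listing, which iterates the clusteredPoints output written by KMeansDriver. In Mahout 0.5 each record there pairs an IntWritable cluster id with a WeightedVectorWritable (a weight plus the point's vector), so the "1.0" appears to be the weight and the empty "[]" the document vector itself. The sketch below is a minimal version of such a loop; the directory layout, part-file name, and print order are assumptions and may differ from your copy of the listing.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable; // package location as of Mahout 0.5

// clusterOutput is the same Path that was passed to KMeansDriver.run above
void dumpClusteredPoints(Configuration conf, Path clusterOutput) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    // runClustering=true makes KMeansDriver write point assignments under clusteredPoints
    Path points = new Path(clusterOutput, "clusteredPoints/part-m-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, points, conf);
    IntWritable clusterId = new IntWritable();
    WeightedVectorWritable point = new WeightedVectorWritable();
    while (reader.next(clusterId, point)) {
        // WeightedVectorWritable prints its weight and vector, e.g. "1.0: []" for an empty vector
        System.out.println(clusterId + " belongs to cluster " + point);
    }
    reader.close();
}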


Comments (2)

来世叙缘 2024-12-05 15:11:08

I ran into the same issue using Mahout 0.5.
I think the problem is that the normPower parameter is used in both functions, so normalization ends up being applied twice.
Try code similar to this:

DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
                outputDir, conf, minSupport, maxNGramSize,
                minLLRValue,
                -1.0f, // no normalization at the term-frequency stage
                logNormalize, numReducers, chunkSize,
                sequentialAccessOutput, namedVector);
TFIDFConverter.processTfIdf(vectorOutput, new Path(outputDir, "tfidf"),
                conf, chunkSize, minDf,
                maxDFPercent, normPower, // normalization is applied only here
                logNormalize, sequentialAccessOutput, namedVector,
                numReducers);

After that I stopped having problems with empty clusters.
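
As a quick sanity check for this fix, it can help to dump the tf-idf vectors before clustering and verify that they are non-empty. Below is a rough sketch under the assumption that the output of processTfIdf lands in a tfidf-vectors subdirectory; the countEmptyVectors helper, the example path, and the part-file name are illustrative, not taken from the answer.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

// tfidfVectors would be something like new Path(outputDir, "tfidf/tfidf-vectors/part-r-00000")
// for the listing above; the exact part-file name depends on the Hadoop/Mahout version.
void countEmptyVectors(Configuration conf, Path tfidfVectors) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, tfidfVectors, conf);
    Text docId = new Text();                 // seq2sparse-style output keys documents by name
    VectorWritable vec = new VectorWritable();
    int empty = 0, total = 0;
    while (reader.next(docId, vec)) {
        total++;
        if (vec.get().getNumNondefaultElements() == 0) {
            empty++;                         // no non-zero entries: nothing for k-means to work with
        }
    }
    reader.close();
    System.out.println(empty + " of " + total + " tf-idf vectors are empty");
}

If most vectors come back empty, the problem is upstream of the clustering itself.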

音盲 2024-12-05 15:11:08

I had this problem. As a newbie it was very difficult to solve. However, in my case, I realised that the T1 and T2 values for the canopy clustering were only valid for the provided Reuters data (and the Euclidean norm). I had used my own document data, which seemed to have an inherently different distribution of distances between document vectors. So I did some rudimentary analysis and then re-estimated T1 and T2 from my own data. Then things worked. See also my post at...

How to pick the T1 and T2 threshold values for Canopy Clustering?

Hope this helps.
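
For the "rudimentary analysis" step, one simple approach is to sample pairwise distances between your own document vectors with the same distance measure Canopy will use, then pick T2 and T1 (T2 < T1) from that distribution. The sketch below only illustrates that idea and is not the code from the linked post; the sampleDistances helper, the sample size, and the percentile choices are all assumptions for the example.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// vectorsFile: one part file of the vectors that will later be fed to CanopyDriver
void sampleDistances(Configuration conf, Path vectorsFile, int pairs) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, vectorsFile, conf);
    Text key = new Text();
    VectorWritable value = new VectorWritable();
    List<Vector> sample = new ArrayList<Vector>();
    while (sample.size() < 1000 && reader.next(key, value)) {
        sample.add(value.get().clone());     // clone in case the writable reuses its vector
    }
    reader.close();
    if (sample.size() < 2) {
        return;                              // not enough vectors to compare
    }

    EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();
    Random rand = new Random(42);
    List<Double> distances = new ArrayList<Double>();
    for (int i = 0; i < pairs; i++) {
        Vector a = sample.get(rand.nextInt(sample.size()));
        Vector b = sample.get(rand.nextInt(sample.size()));
        distances.add(measure.distance(a, b));
    }
    Collections.sort(distances);
    // Look at the spread and pick T2 below T1 inside it; the percentiles printed here are
    // only a starting point, since suitable values depend entirely on the data.
    System.out.println("20th percentile: " + distances.get(distances.size() / 5));
    System.out.println("median:          " + distances.get(distances.size() / 2));
    System.out.println("80th percentile: " + distances.get(distances.size() * 4 / 5));
}

With those numbers in hand, the T1 and T2 arguments to CanopyDriver.run can be chosen so that both sit inside the observed range of distances for your data.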
