Mahout 上的 K 均值返回非独占簇

发布于 2024-11-07 10:23:31 字数 985 浏览 5 评论 0原文

在我的数据中,我有一些喜欢列表的用户,我已将这些喜欢转储到每个用户的单独文件中,并希望将它们聚集起来。除了输出在多个集群中具有相同的点之外,一切都正常。我的理解是 k 均值应该是排他性的。我认为问题可能出在我转储数据的方式上。我还暂时放弃了所有没有空格的喜欢,直到我可以编写自定义标记器。这是我正在运行的内容(来自 ruby​​ 脚本)。

system("#{MAHOUT_CMD} seqdirectory -c UTF-8 -i data/users -o data/kmeans/converted")
system("#{MAHOUT_CMD} seq2sparse -i data/kmeans/converted -o data/kmeans/vectors")
system("#{MAHOUT_CMD} kmeans -i data/kmeans/vectors/tfidf-vectors -c data/kmeans/initial_clusters -o data/kmeans/kmeans_clusters -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 0.1 -k 20 -x 20")

last_cluster_folder = Dir["data/kmeans/kmeans_clusters/*"].last.gsub("data/kmeans/kmeans_clusters/", "")

system("#{MAHOUT_CMD} clusterdump -s data/kmeans/kmeans_clusters/#{last_cluster_folder}/ -d data/kmeans/vectors/dictionary.file-0 -dt sequencefile -o data/kmeans/clusters.txt -n 1000")

输出列出了每个集群中的“热门术语”,但是每个集群中都出现了许多类似的术语(尽管权重不同)。 clusterdumper 的输出是正常的吗,我需要根据权重找出每个单词属于哪个簇吗?

谢谢

In my data I have users with a list of likes, I've dumped these likes into individual files for each user and would like to cluster them. Everything is working except the output has the same likes in multiple clusters. My understanding is k-means should be exclusive. I figure the problem is perhaps with how I am dumping the data. I have also dumped all of the likes without spaces for the time being until I can write a custom tokenizer. Here's what I'm running (from a ruby script).

system("#{MAHOUT_CMD} seqdirectory -c UTF-8 -i data/users -o data/kmeans/converted")
system("#{MAHOUT_CMD} seq2sparse -i data/kmeans/converted -o data/kmeans/vectors")
system("#{MAHOUT_CMD} kmeans -i data/kmeans/vectors/tfidf-vectors -c data/kmeans/initial_clusters -o data/kmeans/kmeans_clusters -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 0.1 -k 20 -x 20")

last_cluster_folder = Dir["data/kmeans/kmeans_clusters/*"].last.gsub("data/kmeans/kmeans_clusters/", "")

system("#{MAHOUT_CMD} clusterdump -s data/kmeans/kmeans_clusters/#{last_cluster_folder}/ -d data/kmeans/vectors/dictionary.file-0 -dt sequencefile -o data/kmeans/clusters.txt -n 1000")

The output lists the "top terms" in each cluster, however many of the likes occur in each cluster (though with different weights). Is the normal output for clusterdumper, do I need to find out what cluster each word belongs to by its weight?

Thanks

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

夏了南城 2024-11-14 10:23:31

Mahout 可能只是执行近似 k 均值。另外,可能存在与多个簇具有相同距离的对象。

然而,您应该能够仅采用 k 方法,然后进行 1-最近邻分类以获得每个对象的唯一结果(这对于并行化来说很简单并且非常快)。

Mahout probably is only doing approximate k-means. Plus, there might be objects that have the same distance to more than one cluster.

You should however be able to just take the k means, and then do a 1-nearest-neighbor classification to get a unique result for each objects (this is trivial to parallelize and very fast).

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文