K-means on Mahout returns non-exclusive clusters
In my data I have users, each with a list of likes. I've dumped these likes into a separate file per user and would like to cluster them. Everything works, except that the output has the same likes in multiple clusters. My understanding is that k-means should be exclusive. I suspect the problem is in how I am dumping the data. I have also dumped all of the likes without spaces for the time being, until I can write a custom tokenizer. Here's what I'm running (from a Ruby script).
system("#{MAHOUT_CMD} seqdirectory -c UTF-8 -i data/users -o data/kmeans/converted")
system("#{MAHOUT_CMD} seq2sparse -i data/kmeans/converted -o data/kmeans/vectors")
system("#{MAHOUT_CMD} kmeans -i data/kmeans/vectors/tfidf-vectors -c data/kmeans/initial_clusters -o data/kmeans/kmeans_clusters -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 0.1 -k 20 -x 20")
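# Pick the folder with the highest iteration number; a plain string sort would rank clusters-9 after clusters-10.
last_cluster_folder = File.basename(Dir["data/kmeans/kmeans_clusters/*"].max_by { |path| File.basename(path)[/\d+/].to_i })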
system("#{MAHOUT_CMD} clusterdump -s data/kmeans/kmeans_clusters/#{last_cluster_folder}/ -d data/kmeans/vectors/dictionary.file-0 -dt sequencefile -o data/kmeans/clusters.txt -n 1000")
The output lists the "top terms" in each cluster, but many of the likes occur in every cluster (though with different weights). Is this the normal output for clusterdumper? Do I need to work out which cluster each word belongs to from its weight?
Thanks
Comments (1)
Mahout is probably only doing approximate k-means. In addition, some objects may have the same distance to more than one cluster.
You should, however, be able to just take the k means and then do a 1-nearest-neighbor classification to get a unique result for each object (this is trivial to parallelize and very fast).
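The 1-nearest-neighbor step could be sketched in plain Ruby roughly as follows. This is only an illustration under assumptions: the centroids and user vectors are made-up sparse hashes of term to weight, the helper names `squared_distance` and `nearest_centroid` are hypothetical, and reading the real centroids and TF-IDF vectors out of Mahout's sequence files is not shown. Note that the term "music" has a weight in both toy centroids, yet each user still ends up in exactly one cluster.

```ruby
# Sketch: assign each user vector to its single nearest centroid (1-NN).
# Vectors are sparse hashes of term => weight; all data below is invented.

# Squared Euclidean distance between two sparse vectors.
# Terms missing from a vector count as weight 0.0.
def squared_distance(a, b)
  (a.keys | b.keys).sum { |term| (a.fetch(term, 0.0) - b.fetch(term, 0.0))**2 }
end

# Returns the id of the closest centroid. min_by yields one element even
# when distances tie, so every point gets exactly one cluster.
def nearest_centroid(point, centroids)
  centroids.min_by { |_id, centroid| squared_distance(point, centroid) }.first
end

# Hypothetical centroids: "music" carries weight in both of them.
centroids = {
  0 => { "music" => 0.9, "hiking" => 0.1 },
  1 => { "music" => 0.2, "cooking" => 0.8 }
}

# Hypothetical user vectors.
users = {
  "alice" => { "music" => 1.0 },
  "bob"   => { "cooking" => 1.0, "music" => 0.3 }
}

# Map every user to exactly one cluster id.
assignments = users.transform_values { |vector| nearest_centroid(vector, centroids) }
puts assignments.inspect
```

Each user is handled independently of the others, which is why this step splits easily across workers.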