How to do document clustering with Mahout and Lucene?
I read that I can create Mahout vectors from a Lucene index, and that those vectors can then be used with the Mahout clustering algorithms.
http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text
I would like to apply the k-means clustering algorithm to the documents in my Lucene index, but it is not clear to me how to apply this algorithm (or hierarchical clustering) to extract meaningful clusters from these documents.
This page http://cwiki.apache.org/confluence/display/MAHOUT/k-Means says that the algorithm accepts two input directories: one for the data points and one for the initial clusters. Are my data points the documents? How do I "declare" that these are my documents (or their vectors) so they can simply be taken and clustered?
Sorry in advance for my poor grammar.
Thank you.
Answers (3)
If you have the vectors, you can run KMeansDriver; its command-line help lists the options it takes.
Update: Copy the result directory from HDFS to the local filesystem, then use the ClusterDumper utility to get each cluster and the list of documents in that cluster.
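A minimal sketch of that workflow, assuming the Mahout 0.x command-line drivers (all paths and parameter values below are placeholders, and option names can differ between Mahout versions):

    # Run k-means on the document vectors. -k samples 20 random vectors as
    # initial centroids, so no hand-built initial-clusters directory is needed.
    bin/mahout kmeans \
      -i /user/me/vectors/tfidf-vectors \
      -c /user/me/kmeans/initial-clusters \
      -o /user/me/kmeans/output \
      -k 20 \
      -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
      -x 10 \
      -cl

    # Copy the result directory from HDFS to the local filesystem,
    # where ClusterDumper can read it.
    hadoop fs -get /user/me/kmeans/output ./kmeans-output

The -cl flag tells the driver to also write the final point-to-cluster assignments (the clusteredPoints directory), which is what ClusterDumper uses to list the documents in each cluster.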
A pretty good howto is here:
Integrating Apache Mahout with Apache Lucene
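The step that howto covers, turning a Lucene index directly into Mahout vectors, looks roughly like this (a sketch only: the index path and field names are placeholders, and the flags follow Mahout's lucene.vector utility, which may differ in your version):

    # Convert a Lucene index into Mahout vectors, one vector per document.
    # --field is the indexed text field to vectorize, --idField labels each vector.
    bin/mahout lucene.vector \
      --dir /path/to/lucene/index \
      --output /user/me/vectors/lucene-vectors \
      --field body \
      --idField id \
      --dictOut /user/me/vectors/dictionary.txt \
      --weight TFIDF \
      --norm 2

Note that the field you vectorize generally has to have been indexed with term vectors enabled, otherwise the utility has nothing to read.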
@maiky
You can read more about reading the output and using the clusterdump utility on this page -> https://cwiki.apache.org/confluence/display/MAHOUT/Cluster+Dumper
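For illustration, a clusterdump invocation might look like the following (paths are placeholders, and the exact flag names vary between Mahout releases, e.g. older versions use --seqFileDir instead of -i):

    # Dump the final clusters in readable form: top terms per cluster and,
    # via -p, the list of points (documents) assigned to each cluster.
    bin/mahout clusterdump \
      -i ./kmeans-output/clusters-*-final \
      -p ./kmeans-output/clusteredPoints \
      -d /user/me/vectors/dictionary.txt \
      -dt text \
      -n 10 \
      -o clusters-dump.txt

The -d/-dt pair points at the dictionary created when the vectors were built, so the dump shows actual terms instead of term indexes.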