How can I cluster the documents in a Lucene index with Mahout?

Posted on 2024-08-13 16:02:47

I have read that I can create Mahout vectors from a Lucene index, and that these vectors can then be fed to the Mahout clustering algorithms:
http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text
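
From that page, my understanding is that the conversion is done with the lucene.vector utility that ships with Mahout, roughly like the sketch below. The index path, field names, and output paths are my own placeholders, and I may have some flags wrong, so please correct me:

    # convert a Lucene index into a SequenceFile of Mahout vectors
    bin/mahout lucene.vector \
      --dir /path/to/lucene/index \
      --field body \              # indexed text field to vectorise (placeholder name)
      --idField id \              # stored field whose value names each vector (placeholder)
      --output vectors \
      --dictOut dictionary.txt \  # term-id -> word mapping, needed later to read results
      --norm 2                    # L2-normalise the vectors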

I would like to apply the k-means clustering algorithm to the documents in my Lucene index, but it is not clear to me how to apply this algorithm (or hierarchical clustering) to extract meaningful clusters from these documents.

This page http://cwiki.apache.org/confluence/display/MAHOUT/k-Means
says that the algorithm accepts two input directories: one for the data points and one for the initial clusters. Are my data points the documents? How do I "declare" that these are my documents (or their vectors), and simply take them and run the clustering?

Sorry in advance for my poor grammar.

Thank you

爺獨霸怡葒院 2024-08-20 16:02:47

If you have the vectors, you can run KMeansDriver. Here is its help output:

Usage:
 [--input <input> --clusters <clusters> --output <output> --distance <distance>
--convergence <convergence> --max <max> --numReduce <numReduce> --k <k>
--vectorClass <vectorClass> --overwrite --help]
Options
  --input (-i) input                The Path for input Vectors. Must be a
                                    SequenceFile of Writable, Vector
  --clusters (-c) clusters          The input centroids, as Vectors.  Must be a
                                    SequenceFile of Writable, Cluster/Canopy.
                                    If k is also specified, then a random set
                                    of vectors will be selected and written out
                                    to this path first
  --output (-o) output              The Path to put the output in
  --distance (-m) distance          The Distance Measure to use.  Default is
                                    SquaredEuclidean
  --convergence (-d) convergence    The threshold below which the clusters are
                                    considered to be converged.  Default is 0.5
  --max (-x) max                    The maximum number of iterations to
                                    perform.  Default is 20
  --numReduce (-r) numReduce        The number of reduce tasks
  --k (-k) k                        The k in k-Means.  If specified, then a
                                    random selection of k Vectors will be
                                    chosen as the Centroid and written to the
                                    clusters output path.
  --vectorClass (-v) vectorClass    The Vector implementation class name.
                                    Default is SparseVector.class
  --overwrite (-w)                  If set, overwrite the output directory
  --help (-h)                       Print out help
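
For example, assuming the Lucene-derived vectors sit in a vectors directory on HDFS (a placeholder name, as are initial-clusters and kmeans-output), a run might look like the sketch below. It uses only flags from the help above; whether you launch it through the bin/mahout driver script, as here, or via hadoop jar with the KMeansDriver class depends on your Mahout version:

    # vectors          : SequenceFile of <Writable, Vector>, e.g. produced from the Lucene index
    # initial-clusters : seeded with k randomly chosen vectors because --k is given
    bin/mahout kmeans \
      --input vectors \
      --clusters initial-clusters \
      --output kmeans-output \
      --k 20 \
      --convergence 0.5 \
      --max 20 \
      --overwrite

Because --k is passed, you do not have to prepare the initial-clusters directory yourself: as the --clusters help text above says, the driver first writes a random selection of k vectors to that path and uses them as the starting centroids.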

Update: copy the result directory from HDFS to the local filesystem, then use the ClusterDumper utility to get each cluster and the list of documents in that cluster.
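
A rough sketch of that step, with caveats: the clusterdump flags below are the ones I know of, but the directory names written inside kmeans-output (the final clusters-N directory and the clustered-points directory) vary across Mahout versions, so list the output directory first and adjust the paths:

    # pull the k-means results out of HDFS
    hadoop fs -get kmeans-output .

    # dump each cluster's top terms and member documents, using the
    # dictionary written at vector-creation time to turn term ids back into words
    bin/mahout clusterdump \
      --seqFileDir kmeans-output/clusters-20 \          # final iteration's clusters (version-dependent name)
      --pointsDir kmeans-output/clusteredPoints \       # per-cluster document lists, if your version wrote them
      --dictionary dictionary.txt \
      --output clusters.txt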

恍梦境° 2024-08-20 16:02:47

@maiky
You can read more about reading the output and using the clusterdump utility on this page -> https://cwiki.apache.org/confluence/display/MAHOUT/Cluster+Dumper
