How can I cluster the documents in a Lucene index with Mahout?

Posted on 2024-08-13 16:02:47

I have read that I can create Mahout vectors from a Lucene index, and that these vectors can then be fed to the Mahout clustering algorithms:
http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text
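
From that page, my understanding is that the conversion is done with the lucene.vector utility that ships with Mahout, roughly like the sketch below. The index path, field names, and output paths are my own placeholders, and I may have some flags wrong, so please correct me:

    # convert a Lucene index into a SequenceFile of Mahout vectors
    bin/mahout lucene.vector \
      --dir /path/to/lucene/index \
      --field body \              # indexed text field to vectorise (placeholder name)
      --idField id \              # stored field whose value names each vector (placeholder)
      --output vectors \
      --dictOut dictionary.txt \  # term-id -> word mapping, needed later to read results
      --norm 2                    # L2-normalise the vectors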

I would like to apply the k-means clustering algorithm to the documents in my Lucene index, but it is not clear to me how to apply this algorithm (or hierarchical clustering) to extract meaningful clusters from these documents.

This page http://cwiki.apache.org/confluence/display/MAHOUT/k-Means
says that the algorithm accepts two input directories: one for the data points and one for the initial clusters. Are my data points the documents? How do I "declare" that these are my documents (or their vectors), and simply take them and run the clustering?

Sorry in advance for my poor grammar.

Thank you

爺獨霸怡葒院 2024-08-20 16:02:47

If you have the vectors, you can run KMeansDriver. Here is its help output:

Usage:
 [--input <input> --clusters <clusters> --output <output> --distance <distance>
--convergence <convergence> --max <max> --numReduce <numReduce> --k <k>
--vectorClass <vectorClass> --overwrite --help]
Options
  --input (-i) input                The Path for input Vectors. Must be a
                                    SequenceFile of Writable, Vector
  --clusters (-c) clusters          The input centroids, as Vectors.  Must be a
                                    SequenceFile of Writable, Cluster/Canopy.
                                    If k is also specified, then a random set
                                    of vectors will be selected and written out
                                    to this path first
  --output (-o) output              The Path to put the output in
  --distance (-m) distance          The Distance Measure to use.  Default is
                                    SquaredEuclidean
  --convergence (-d) convergence    The threshold below which the clusters are
                                    considered to be converged.  Default is 0.5
  --max (-x) max                    The maximum number of iterations to
                                    perform.  Default is 20
  --numReduce (-r) numReduce        The number of reduce tasks
  --k (-k) k                        The k in k-Means.  If specified, then a
                                    random selection of k Vectors will be
                                    chosen as the Centroid and written to the
                                    clusters output path.
  --vectorClass (-v) vectorClass    The Vector implementation class name.
                                    Default is SparseVector.class
  --overwrite (-w)                  If set, overwrite the output directory
  --help (-h)                       Print out help
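
For example, assuming the Lucene-derived vectors sit in a vectors directory on HDFS (a placeholder name, as are initial-clusters and kmeans-output), a run might look like the sketch below. It uses only flags from the help above; whether you launch it through the bin/mahout driver script, as here, or via hadoop jar with the KMeansDriver class depends on your Mahout version:

    # vectors          : SequenceFile of <Writable, Vector>, e.g. produced from the Lucene index
    # initial-clusters : seeded with k randomly chosen vectors because --k is given
    bin/mahout kmeans \
      --input vectors \
      --clusters initial-clusters \
      --output kmeans-output \
      --k 20 \
      --convergence 0.5 \
      --max 20 \
      --overwrite

Because --k is passed, you do not have to prepare the initial-clusters directory yourself: as the --clusters help text above says, the driver first writes a random selection of k vectors to that path and uses them as the starting centroids.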

Update: copy the result directory from HDFS to the local filesystem, then use the ClusterDumper utility to get each cluster and the list of documents in that cluster.
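
A rough sketch of that step, with caveats: the clusterdump flags below are the ones I know of, but the directory names written inside kmeans-output (the final clusters-N directory and the clustered-points directory) vary across Mahout versions, so list the output directory first and adjust the paths:

    # pull the k-means results out of HDFS
    hadoop fs -get kmeans-output .

    # dump each cluster's top terms and member documents, using the
    # dictionary written at vector-creation time to turn term ids back into words
    bin/mahout clusterdump \
      --seqFileDir kmeans-output/clusters-20 \          # final iteration's clusters (version-dependent name)
      --pointsDir kmeans-output/clusteredPoints \       # per-cluster document lists, if your version wrote them
      --dictionary dictionary.txt \
      --output clusters.txt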

恍梦境° 2024-08-20 16:02:47

@maiky
You can read more about reading the output and using the clusterdump utility on this page -> https://cwiki.apache.org/confluence/display/MAHOUT/Cluster+Dumper
