Identifying documents from Mahout clustering results

Posted on 2024-09-27 20:22:37, 214 characters, 6 views, 0 comments

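I am using Mahout to cluster text documents that are indexed with Solr.

I used the "text" field of the documents to build the vectors. I then ran Mahout's k-means driver to do the clustering and dumped the results with the clusterdumper utility.

I am having trouble understanding the dumper's output. I can see the clusters, each described by its term vectors, but how do I extract the documents from these clusters? The result I want is the list of input documents that fall into each cluster.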


有木有妳兜一样 2024-10-04 20:22:37

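I ran into this problem too. The cluster dumper writes out all of the cluster data, including the points assigned to each cluster. You have two options:

Modify the ClusterDumper.printClusters() method so that it does not print all the terms and weights. I have some code like: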



    // Fragment from inside ClusterDumper.printClusters(). The variables value,
    // clusterCount, writer, dictionary, numTopFeatures and clusterIdToPoints
    // are assumed to come from the surrounding method and class.
    String clusterInfo = String.format("Cluster %d (%d) with %d points.\n",
            value.getId(), clusterCount, value.getNumPoints());
    writer.write(clusterInfo);
    writer.write('\n');

    // list the top terms of the cluster center
    if (dictionary != null) {
        String topTerms = getTopFeatures(value.getCenter(), dictionary, numTopFeatures);
        writer.write("\tTop Terms: ");
        writer.write(topTerms);
        writer.write('\n');
    }

    // list all the points in the cluster as "weight: name" pairs
    List<WeightedVectorWritable> points = clusterIdToPoints.get(value.getId());
    if (points != null) {
        writer.write("\tCluster points:\n\t");
        for (WeightedVectorWritable point : points) {
            writer.write(String.valueOf(point.getWeight()));
            writer.write(": ");

            // only named vectors carry the document name
            if (point.getVector() instanceof NamedVector) {
                writer.write(((NamedVector) point.getVector()).getName() + " ");
            }
        }
        writer.write('\n');
    }

If possible, do some grep magic on the dump output to strip out all the information about terms and weights.
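As a rough illustration of where the first option leads, the text written by the modified printClusters() above can be read back into a map from cluster id to document names. This is only a sketch: the class name ClusterDumpParser is hypothetical, it parses the format produced by the modified code above (not the stock ClusterDumper output), and it assumes every point is a NamedVector whose name contains no whitespace.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Mahout.
public class ClusterDumpParser {

    // Header line written by the modified printClusters():
    // "Cluster <id> (<count>) with <n> points."
    private static final Pattern HEADER =
            Pattern.compile("^Cluster (-?\\d+) \\(-?\\d+\\) with -?\\d+ points\\.$");

    // One "<weight>: <name>" pair. Assumption: names contain no whitespace.
    private static final Pattern POINT = Pattern.compile("(\\S+): (\\S+)");

    public static Map<Integer, List<String>> parse(List<String> lines) {
        Map<Integer, List<String>> docsByCluster = new LinkedHashMap<>();
        Integer currentId = null;
        boolean expectPoints = false;
        for (String line : lines) {
            Matcher header = HEADER.matcher(line);
            if (header.matches()) {
                // a new cluster starts here
                currentId = Integer.valueOf(header.group(1));
                docsByCluster.put(currentId, new ArrayList<>());
                expectPoints = false;
            } else if (line.equals("\tCluster points:")) {
                // the next line holds all points of the current cluster
                expectPoints = true;
            } else if (expectPoints && currentId != null) {
                Matcher point = POINT.matcher(line);
                while (point.find()) {
                    docsByCluster.get(currentId).add(point.group(2));
                }
                expectPoints = false;
            }
            // every other line (blank lines, top terms) is ignored
        }
        return docsByCluster;
    }

    public static void main(String[] args) {
        // Made-up sample in the format of the modified dumper above.
        List<String> sample = List.of(
                "Cluster 12 (0) with 2 points.",
                "",
                "\tTop Terms: ",
                "\tCluster points:",
                "\t1.0: doc-3 1.0: doc-7 ");
        System.out.println(parse(sample)); // prints {12=[doc-3, doc-7]}
    }
}
```

The names recovered this way are whatever was stored in each NamedVector, so they only identify the Solr documents if the vectors were named with the document ids when they were created.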
