Interpreting the output of mahout clusterdumper

Posted on 2024-11-03 15:06:56

I ran a clustering test on crawled pages (more than 25K docs; a personal data set), and then ran clusterdump on the result:

$MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-1/ --output clusteranalyze.txt

The output of the cluster dumper shows 25 elements of the form "VL-xxxxx{...}":

VL-24130{n=1312 c=[0:0.017, 10:0.007, 11:0.005, 14:0.017, 31:0.016, 35:0.006, 41:0.010, 43:0.008, 52:0.005, 59:0.010, 68:0.037, 72:0.056, 87:0.028, ... ] r=[0:0.442, 10:0.271, 11:0.198, 14:0.369, 31:0.421, ... ]}
...
VL-24868{n=311 c=[0:0.042, 11:0.016, 17:0.046, 72:0.014, 96:0.044, 118:0.015, 135:0.016, 195:0.017, 318:0.040, 319:0.037, 320:0.036, 330:0.030, ... ] r=[0:0.740, 11:0.287, 17:0.576, 72:0.239, 96:0.549, 118:0.273, ...]}

How should I interpret this output?

In short: I am looking for the document ids that belong to a particular cluster.

What is the meaning of :

  • VL-x ?
  • n=y c=[z:z', ...]
  • r=[z'':z''', ...]

Does 0:0.017 mean that "0" is the id of a document that belongs to this cluster?

I have already read on the mahout wiki pages what CL, n, c and r mean. But can someone please explain them to me better, or point me to a resource where they are explained in a bit more detail?

Sorry if I am asking stupid questions, but I am a newbie with apache mahout and am using it as part of my course assignment on clustering.

Comments (4)

执手闯天涯 2024-11-10 15:06:57

I think you need to read the source code -- download from http://mahout.apache.org. VL-24130 is just a cluster identifier for a converged cluster.

冰魂雪魄 2024-11-10 15:06:56
  1. By default, kmeans clustering uses WeightedVector, which does not include the data point name. So you will want to make the sequence file yourself using NamedVector. There is a one-to-one correspondence between the number of seq files and the number of map tasks, so if your map capacity is 12, you want to chop your data into 12 pieces when making the seq files.
    NamedVector:

    vector = new NamedVector(new SequentialAccessSparseVector(Cardinality),arrField[0]);
    
  2. Basically, you need to download the clusteredPoints directory from HDFS and write your own code to output the results. Here is the code that I wrote to output the cluster membership of each point.

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.clustering.WeightedVectorWritable;
    import org.apache.mahout.math.NamedVector;

    public class ClusterOutput {

        /**
         * @param args args[0]: local folder holding the downloaded clusteredPoints part files;
         *             args[1]: output file of "vectorName<TAB>clusterId" lines;
         *             args[2]: output file of "clusterId count" lines
         */
        public static void main(String[] args) {
            try {
                Configuration conf = new Configuration();
                // The part files were downloaded from HDFS, so read them from the local file system.
                FileSystem fs = FileSystem.getLocal(conf);
                File[] files = new File(args[0]).listFiles();
                if (files == null) {
                    throw new IOException(args[0] + " is not a readable directory");
                }
                // Number of points assigned to each cluster id.
                Map<String, Integer> clusterIds = new HashMap<String, Integer>(5000);

                try (BufferedWriter bw = new BufferedWriter(new FileWriter(new File(args[1])))) {
                    for (File file : files) {
                        // Only the part-m-* files contain clustered points.
                        if (file.getName().indexOf("part-m") < 0) {
                            continue;
                        }
                        SequenceFile.Reader reader =
                                new SequenceFile.Reader(fs, new Path(file.getAbsolutePath()), conf);
                        try {
                            IntWritable key = new IntWritable();                          // cluster id
                            WeightedVectorWritable value = new WeightedVectorWritable();  // clustered point
                            while (reader.next(key, value)) {
                                // This cast fails unless the input vectors were written as NamedVector (see step 1).
                                NamedVector vector = (NamedVector) value.getVector();
                                String clusterId = key.toString();
                                bw.write(vector.getName() + "\t" + clusterId + "\n");
                                Integer count = clusterIds.get(clusterId);
                                clusterIds.put(clusterId, count == null ? 1 : count + 1);
                            }
                        } finally {
                            reader.close();
                        }
                    }
                }

                // Write the per-cluster point counts.
                try (BufferedWriter bw = new BufferedWriter(new FileWriter(new File(args[2])))) {
                    for (Map.Entry<String, Integer> entry : clusterIds.entrySet()) {
                        bw.write(entry.getKey() + " " + entry.getValue() + "\n");
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
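
Since the question asks for the document ids of a particular cluster, the "vectorName<TAB>clusterId" lines written to args[1] can be inverted into one list of names per cluster. The following is only a sketch using the plain Java standard library; the class name ClusterMembers and the sample document names are made up for illustration, and it assumes the tab-separated line format produced by the program above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of Mahout): groups "vectorName<TAB>clusterId" lines by cluster. */
final class ClusterMembers {

    /** Returns clusterId -> vector names, with clusters sorted by id string and names in input order. */
    static Map<String, List<String>> group(List<String> lines) {
        Map<String, List<String>> members = new TreeMap<>();
        for (String line : lines) {
            // The cluster id is the last tab-separated field; names may themselves contain tabs.
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // not a membership line, skip it
            }
            String name = line.substring(0, tab);
            String clusterId = line.substring(tab + 1).trim();
            members.computeIfAbsent(clusterId, k -> new ArrayList<>()).add(name);
        }
        return members;
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for the contents of the args[1] file.
        List<String> sample = Arrays.asList(
                "page-001.html\t24130",
                "page-002.html\t24868",
                "page-003.html\t24130");
        for (Map.Entry<String, List<String>> e : group(sample).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

In practice the lines would come from the file itself, for example via `java.nio.file.Files.readAllLines`, and the list for one cluster id is then a single map lookup.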
    
最佳男配角 2024-11-10 15:06:56

To complete the answer:

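  • VL-x: the identifier of the cluster
  • n=y: the number of elements in the cluster
  • c=[z:z', ...]: the centroid of the cluster. Each z is the index of a
    dimension and z' is the centroid's weight in that dimension, so in
    0:0.017 the "0" is a dimension index, not a document id
  • r=[z'':z''', ...]: the radius of the cluster, given per dimension in the
    same index:value form.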
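
To make the reading above concrete, here is a minimal sketch that splits one line of the dump into its identifier, n, centroid and radius. It uses only the Java standard library, and the class name ClusterDumpLine is made up for illustration. It assumes the sparse index:value layout shown in the question; other Mahout versions or vector types may print something different, and entries elided with "..." are simply skipped.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Mahout): parses one "VL-x{n=y c=[...] r=[...]}" line. */
final class ClusterDumpLine {
    // Assumed layout, taken from the sample output in the question.
    private static final Pattern LINE =
            Pattern.compile("^(\\S+?)\\{n=(\\d+)\\s+c=\\[(.*?)\\]\\s+r=\\[(.*?)\\]\\}$");

    final String id;                      // e.g. "VL-24130"
    final int n;                          // number of points in the cluster
    final Map<Integer, Double> centroid;  // dimension index -> centroid weight
    final Map<Integer, Double> radius;    // dimension index -> radius

    private ClusterDumpLine(String id, int n,
                            Map<Integer, Double> centroid, Map<Integer, Double> radius) {
        this.id = id;
        this.n = n;
        this.centroid = centroid;
        this.radius = radius;
    }

    static ClusterDumpLine parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized cluster line: " + line);
        }
        return new ClusterDumpLine(m.group(1), Integer.parseInt(m.group(2)),
                parseVector(m.group(3)), parseVector(m.group(4)));
    }

    /** Parses "0:0.017, 10:0.007, ..." and ignores anything that is not an index:value pair. */
    static Map<Integer, Double> parseVector(String body) {
        Map<Integer, Double> entries = new LinkedHashMap<>();
        for (String part : body.split(",")) {
            String entry = part.trim();
            int colon = entry.indexOf(':');
            if (colon < 0) {
                continue; // "..." or an empty fragment
            }
            entries.put(Integer.parseInt(entry.substring(0, colon).trim()),
                    Double.parseDouble(entry.substring(colon + 1).trim()));
        }
        return entries;
    }

    public static void main(String[] args) {
        ClusterDumpLine c = parse(
                "VL-24130{n=1312 c=[0:0.017, 10:0.007, ... ] r=[0:0.442, 10:0.271, ... ]}");
        System.out.println(c.id + " has " + c.n + " points; centroid weight in dimension 0 is "
                + c.centroid.get(0));
    }
}
```

Note that nothing in such a line names a document: the keys of both maps are dimension indices, which is why the membership of each point has to be read from clusteredPoints instead.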

More info here:
https://mahout.apache.org/users/clustering/cluster-dumper.html
