How to maintain data entry ids in Mahout k-means clustering

Posted on 2024-12-21 22:57:21 · 693 words · 1 view · 0 comments

I'm using Mahout to run k-means clustering, and I have a problem identifying the data entries after clustering. For example, I have 100 data entries:

id      data
0       0.1 0.2 0.3 0.4
1       0.2 0.3 0.4 0.5
...     ...
100     0.2 0.4 0.4 0.5

After clustering, I need to get the ids back from the clustering result to see which point belongs to which cluster, but there seems to be no way to maintain the id.

In the official Mahout example of clustering synthetic control data, only the data is fed to Mahout, without ids, like this:

28.7812 34.4632 31.3381 31.2834 28.9207 ...
...
24.8923 25.741  27.5532 32.8217 27.8789 ...

and the clustering result only has the cluster id and the point values:

VL-539{n=38 c=[29.950, 30.459, ...
   Weight:  Point:
   1.0: [28.974, 29.026, 31.404, 27.894, 35.985...
   2.0: [24.214, 33.150, 31.521, 31.986, 29.064

No point id exists in this output. Does anyone have an idea how to maintain a point id when doing Mahout clustering? Thank you very much!


Comments (3)

合久必婚 2024-12-28 22:57:21

To achieve that, I use NamedVectors.

As you know, before clustering your data you have to vectorize it.

This means that you have to transform your data into Mahout vectors, because that is the
kind of data the clustering algorithms work with.

The vectorization process depends on the nature of your data, i.e. vectorizing text is not the same as
vectorizing numerical values.

Your data seems easy to vectorize, since each row only has an ID and 4 numerical values.

You could write a Hadoop job that takes your input data, for example as a CSV file,
and outputs a SequenceFile with your data already vectorized.

Then you apply the Mahout clustering algorithms to this input, and the ID (vector name) of each vector is kept in the clustering results.
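To make the idea of a named vector concrete before looking at the Hadoop job, here is a minimal sketch that uses only the Java standard library. `NamedPoint` is a hypothetical stand-in for Mahout's NamedVector, not a Mahout class, and the tab-separated "id, then values" line layout is an assumption about the input file.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a named vector: a row ID plus its numeric values. */
final class NamedPoint {
    final String id;
    final double[] values;

    NamedPoint(String id, double[] values) {
        this.id = id;
        this.values = values;
    }

    /** Parses one line of the assumed form "id<TAB>v1<TAB>v2...". */
    static NamedPoint parse(String line) {
        String[] parts = line.trim().split("\t");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Expected an id and at least one value: " + line);
        }
        double[] values = new double[parts.length - 1];
        // Field 0 is the id, so field i maps to vector index i - 1.
        for (int i = 1; i < parts.length; i++) {
            values[i - 1] = Double.parseDouble(parts[i].trim());
        }
        return new NamedPoint(parts[0], values);
    }

    @Override
    public String toString() {
        return id + ":" + Arrays.toString(values);
    }

    public static void main(String[] args) {
        // prints 0:[0.1, 0.2, 0.3, 0.4]
        System.out.println(NamedPoint.parse("0\t0.1\t0.2\t0.3\t0.4"));
    }
}
```

The point of the sketch is only that the ID travels with the values as one object, so whatever happens to the vector during clustering, its name is still attached afterwards.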

An example job to vectorize your data could be implemented with the following classes:

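// DenseVectorizationDriver.java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.VectorWritable;

public class DenseVectorizationDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.printf("Usage: %s [generic options] <input> <output>\n", getClass().getSimpleName());
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }
        Job job = new Job(getConf(), "Create Dense Vectors from CSV input");
        job.setJarByClass(DenseVectorizationDriver.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(DenseVectorizationMapper.class);
        job.setReducerClass(DenseVectorizationReducer.class);

        // The output is a SequenceFile of (LongWritable, VectorWritable) pairs.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(VectorWritable.class);

        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    // Entry point so the driver can be launched with "hadoop jar".
    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new DenseVectorizationDriver(), args));
    }
}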


// DenseVectorizationMapper.java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class DenseVectorizationMapper extends Mapper<LongWritable, Text, LongWritable, VectorWritable> {
    /*
     * This mapper class takes the input from a CSV file whose fields are separated by TAB and emits
     * the same key it receives (useless in this case) and a NamedVector as value.
     * The "name" of the NamedVector is the ID of each row.
     */
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String line = value.toString();
        String[] lineParts = line.split("\t", -1);
        String id = lineParts[0];

        // You should add some checks here to make sure this record is well formed.

        // Field 0 is the ID, so the vector holds lineParts.length - 1 values.
        Vector vector = new DenseVector(lineParts.length - 1);
        for (int i = 1; i < lineParts.length; i++) {
            // Field i goes to vector index i - 1, so no value is skipped or shifted.
            vector.set(i - 1, Double.parseDouble(lineParts[i]));
        }

        // Wrap the vector so that it carries the row ID as its name.
        context.write(key, new VectorWritable(new NamedVector(vector, id)));
    }
}


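// DenseVectorizationReducer.java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.VectorWritable;

public class DenseVectorizationReducer extends Reducer<LongWritable, VectorWritable, LongWritable, VectorWritable> {
    /*
     * This reducer simply writes the output without doing any computation.
     * Maybe it would be better to define this hadoop job without reduce phase.
     */
    @Override
    public void reduce(LongWritable key, Iterable<VectorWritable> values, Context context) throws IOException, InterruptedException {

        // Write every value: with several input files, two rows can share the same
        // key (their byte offset), and taking only the first one would drop rows.
        for (VectorWritable value : values) {
            context.write(key, value);
        }
    }
}

旧城空念 2024-12-28 22:57:21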

Your request is often overlooked by programmers who are not themselves practitioners, unfortunately. I do not know how to do it in Mahout (so far), but I started with Apache commons-math, which includes a K-means implementation with the same defect. I adapted it so that your request is satisfied. You will find it here:
http://code.google.com/p/noolabsimplecluster/
Additionally, don't forget to normalize the data (linearly) to the interval [0..1]; otherwise the features with the largest ranges dominate the distances and the clustering result will be garbage!
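The linear normalization mentioned above could look roughly like the following sketch. It is plain Java with no Mahout dependency; the class name `MinMaxNormalizer` is made up for illustration, and it assumes all rows have the same number of columns.

```java
import java.util.Arrays;

/** Hypothetical helper: rescales each column linearly to the interval [0, 1]. */
final class MinMaxNormalizer {

    static double[][] normalize(double[][] rows) {
        if (rows.length == 0) {
            return new double[0][];
        }
        int dims = rows[0].length; // assumption: every row has this many columns
        double[] min = new double[dims];
        double[] max = new double[dims];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);

        // First pass: find the minimum and maximum of every column.
        for (double[] row : rows) {
            for (int d = 0; d < dims; d++) {
                min[d] = Math.min(min[d], row[d]);
                max[d] = Math.max(max[d], row[d]);
            }
        }

        // Second pass: map each value to (x - min) / (max - min).
        double[][] result = new double[rows.length][dims];
        for (int r = 0; r < rows.length; r++) {
            for (int d = 0; d < dims; d++) {
                double range = max[d] - min[d];
                // A constant column has no spread; map it to 0 to avoid dividing by zero.
                result[r][d] = range == 0.0 ? 0.0 : (rows[r][d] - min[d]) / range;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] scaled = normalize(new double[][] {{0, 10}, {5, 20}, {10, 30}});
        // prints [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
        System.out.println(Arrays.deepToString(scaled));
    }
}
```

Because the rows keep their order, the ids can be normalized alongside the data simply by leaving the id column out of the array and reattaching it by row index afterwards.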

少女情怀诗 2024-12-28 22:57:21

The clusteredPoints directory produced by kmeans contains this mapping.
Please note that you need to run kmeans with the -cl option to get this data.
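Once the (cluster id, point id) pairs have been read out of that directory, turning them into a "which points are in which cluster" view is a simple grouping step. The sketch below shows only that step in plain Java; reading the SequenceFile itself needs Hadoop and Mahout classes that are not shown here, and `ClusterMembership` and the sample ids are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: groups point ids by the cluster they were assigned to. */
final class ClusterMembership {

    /** clusterIds[i] is the cluster assigned to the point whose id is pointIds[i]. */
    static Map<Integer, List<String>> group(int[] clusterIds, String[] pointIds) {
        if (clusterIds.length != pointIds.length) {
            throw new IllegalArgumentException("Both arrays must have the same length");
        }
        // TreeMap keeps the clusters sorted by id, which makes the output stable.
        Map<Integer, List<String>> members = new TreeMap<>();
        for (int i = 0; i < clusterIds.length; i++) {
            members.computeIfAbsent(clusterIds[i], k -> new ArrayList<>()).add(pointIds[i]);
        }
        return members;
    }

    public static void main(String[] args) {
        // Made-up assignments: points "0" and "2" in cluster 539, point "1" in cluster 12.
        Map<Integer, List<String>> members =
                group(new int[] {539, 12, 539}, new String[] {"0", "1", "2"});
        // prints {12=[1], 539=[0, 2]}
        System.out.println(members);
    }
}
```

This only works if the vectors fed to kmeans carried an id in the first place, for example as named vectors; otherwise the mapping contains cluster ids and point values but nothing to identify the rows.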
