How to maintain data entry ids in Mahout K-means clustering
I'm using Mahout to run k-means clustering, and I have a problem identifying the data entries after clustering. For example, I have 100 data entries:
id data
0 0.1 0.2 0.3 0.4
1 0.2 0.3 0.4 0.5
... ...
100 0.2 0.4 0.4 0.5
After clustering, I need to get the id back from the cluster result to see which point belongs to which cluster, but there seems to be no way to maintain the id.
In the official Mahout example of clustering synthetic control data, only the data is fed to Mahout, without ids, like
28.7812 34.4632 31.3381 31.2834 28.9207 ...
...
24.8923 25.741 27.5532 32.8217 27.8789 ...
and the cluster result only has the cluster id and the point values:
VL-539{n=38 c=[29.950, 30.459, ...
Weight: Point:
1.0: [28.974, 29.026, 31.404, 27.894, 35.985...
2.0: [24.214, 33.150, 31.521, 31.986, 29.064
There is no point id in the output. Does anyone have an idea how to maintain a point id when doing Mahout clustering? Thank you very much!
Comments (3)
To achieve that, I use NamedVectors.
As you know, before clustering your data you have to vectorize it. This means transforming your data into Mahout vectors, because that is the kind of data the clustering algorithms work with.
The vectorization process depends on the nature of your data; vectorizing text is not the same as vectorizing numerical values.
Your data seems easy to vectorize, since each entry only has an ID and 4 numerical values.
You could write a Hadoop job that takes your input data, for example as a CSV file, and outputs a SequenceFile with your data already vectorized.
Then you apply the Mahout clustering algorithms to this input, and the ID (vector name) of each vector is kept in the clustering results.
An example job to vectorize your data could be implemented with the following classes:
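The classes from the original answer are not included here. As a stand-in, here is a minimal sketch of only the parsing step such a job would need: splitting one input line into its id and its numeric values. The class and method names (`IdVectorParser`, `IdVector`, `parseLine`) are hypothetical, and the code deliberately uses only the Java standard library. It assumes the id is the first token and that tokens are separated by commas or whitespace. With Mahout on the classpath, the resulting pair would typically be wrapped as a `NamedVector` around a `DenseVector` and written to the SequenceFile through a `VectorWritable`.

```java
import java.util.Arrays;

public class IdVectorParser {

    /** Holds one input row: the point id plus its numeric features. */
    public static final class IdVector {
        public final String id;
        public final double[] values;

        IdVector(String id, double[] values) {
            this.id = id;
            this.values = values;
        }
    }

    /**
     * Parses a line such as "1 0.2 0.3 0.4 0.5" or "1,0.2,0.3,0.4,0.5".
     * Assumption: the first token is the id, all remaining tokens are numbers.
     */
    public static IdVector parseLine(String line) {
        String[] tokens = line.trim().split("[,\\s]+");
        if (tokens.length < 2) {
            throw new IllegalArgumentException(
                "expected an id and at least one value: " + line);
        }
        double[] values = new double[tokens.length - 1];
        for (int i = 1; i < tokens.length; i++) {
            values[i - 1] = Double.parseDouble(tokens[i]);
        }
        // With Mahout available, this pair would be wrapped as
        //   new NamedVector(new DenseVector(values), tokens[0])
        // so that the id travels with the vector as its name.
        return new IdVector(tokens[0], values);
    }

    public static void main(String[] args) {
        IdVector v = parseLine("1 0.2 0.3 0.4 0.5");
        System.out.println(v.id + " -> " + Arrays.toString(v.values));
    }
}
```

In a real Hadoop job this logic would sit inside the mapper, with the id also usable as the output key.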
Your request is often overlooked by programmers who are not themselves practitioners, unfortunately. I do not know how to do it in Mahout (so far), but I started with Apache Commons Math, which includes a K-means implementation with the same defect. I adapted it so that your request is satisfied. You will find it here:
http://code.google.com/p/noolabsimplecluster/
Additionally, don't forget to normalize the data (linearly) to the interval [0..1]; otherwise any clustering algorithm will produce garbage!
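A minimal sketch of the linear normalization mentioned above, using only the Java standard library. The class name `MinMaxNormalizer` is hypothetical, and scaling each column (feature) independently is an assumption; the comment does not say whether scaling should be per feature or over the whole data set. Constant columns are mapped to 0 to avoid dividing by zero.

```java
public class MinMaxNormalizer {

    /**
     * Returns a copy of data where every column is rescaled linearly
     * so that its minimum becomes 0 and its maximum becomes 1.
     * Assumption: all rows have the same number of columns.
     */
    public static double[][] normalize(double[][] data) {
        int rows = data.length;
        if (rows == 0) {
            return new double[0][];
        }
        int cols = data[0].length;

        // Find the minimum and maximum of every column.
        double[] min = new double[cols];
        double[] max = new double[cols];
        java.util.Arrays.fill(min, Double.POSITIVE_INFINITY);
        java.util.Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : data) {
            for (int c = 0; c < cols; c++) {
                min[c] = Math.min(min[c], row[c]);
                max[c] = Math.max(max[c], row[c]);
            }
        }

        // Rescale: (x - min) / (max - min), or 0 for a constant column.
        double[][] result = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                double range = max[c] - min[c];
                result[r][c] = range == 0.0 ? 0.0 : (data[r][c] - min[c]) / range;
            }
        }
        return result;
    }
}
```

The ids are not part of the input matrix here; they would be kept alongside the rows and reattached after scaling.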
The clusteredPoints directory produced by kmeans contains this mapping.
Note that you need to run kmeans with the -cl option to get this data.