Interpreting the output of mahout clusterdumper
I ran a clustering test on crawled pages (more than 25K documents; a personal data set) and then ran clusterdump:
$MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-1/ --output clusteranalyze.txt
The output of the cluster dumper contains 25 elements of the form "VL-xxxxx{}":
VL-24130{n=1312 c=[0:0.017, 10:0.007, 11:0.005, 14:0.017, 31:0.016, 35:0.006, 41:0.010, 43:0.008, 52:0.005, 59:0.010, 68:0.037, 72:0.056, 87:0.028, ... ] r=[0:0.442, 10:0.271, 11:0.198, 14:0.369, 31:0.421, ... ]}
...
VL-24868{n=311 c=[0:0.042, 11:0.016, 17:0.046, 72:0.014, 96:0.044, 118:0.015, 135:0.016, 195:0.017, 318:0.040, 319:0.037, 320:0.036, 330:0.030, ... ] r=[0:0.740, 11:0.287, 17:0.576, 72:0.239, 96:0.549, 118:0.273, ... ]}
How should I interpret this output?
In short: I am looking for the document ids that belong to a particular cluster.
What is the meaning of:
- VL-x ?
- n=y c=[z:z', ...]
- r=[z'':z''', ...]
Does 0:0.017 mean that "0" is the id of a document belonging to this cluster?
I have already read on the Mahout wiki pages what CL, n, c and r mean. But can someone explain them better, or point to a resource where they are explained in more detail?
Sorry if these are stupid questions, but I am a newbie with Apache Mahout and am using it as part of my course assignment on clustering.
Comments (4)
I think you need to read the source code; it can be downloaded from http://mahout.apache.org.
VL-24130 is just a cluster identifier for a converged cluster.
You can use mahout clusterdump
https://cwiki.apache.org/MAHOUT/cluster-dumper.html
By default, kmeans clustering uses WeightedVector, which does not include the data point name. So you will want to make a sequence file yourself using NamedVector. There is a one-to-one correspondence between the number of seq files and the number of map tasks, so if your mapping capacity is 12, you want to chop your data into 12 pieces when making the seq files.
NamedVector:
Basically, you need to download the clusteredPoints from your HDFS system and write your own code to output the results. Here is the code that I wrote to output the cluster point membership.
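The answerer's own code is not shown here. As a rough stand-in, the grouping step might look like the sketch below. It assumes the clusteredPoints data has already been exported to plain text with one `cluster_id<TAB>document_name` pair per line; that intermediate format, the function name `group_members` and the sample file names are all hypothetical, not something Mahout produces by itself.

```python
from collections import defaultdict


def group_members(lines):
    """Group document names by cluster id.

    Assumes each non-empty line is 'cluster_id<TAB>document_name'
    (a hypothetical export format, not Mahout's native output).
    """
    members = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        cluster_id, doc_name = line.split("\t", 1)
        members[int(cluster_id)].append(doc_name)
    return dict(members)


# Made-up sample data for illustration only.
sample = [
    "24130\tpage-0001.html",
    "24868\tpage-0002.html",
    "24130\tpage-0003.html",
]

for cluster_id, docs in sorted(group_members(sample).items()):
    print("VL-%d: %s" % (cluster_id, ", ".join(docs)))
```

The document names are only available if the input vectors were written as NamedVector, as described above; otherwise there is nothing to put in the second column.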
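To complete the answer:
- VL-x is the identifier of the cluster (VL marks a converged cluster, CL one that has not converged)
- n=y is the number of points assigned to the cluster
- c=[z:z', ...] is the centroid of the cluster, the z' values being the weights of the different dimensions z
- r=[z'':z''', ...] is the radius of the cluster, again one value per dimension
So 0:0.017 is dimension 0 with weight 0.017, not a document id.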
More info here:
https://mahout.apache.org/users/clustering/cluster-dumper.html
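To make the structure of a dump line concrete, here is a minimal parsing sketch. It assumes the sparse `index:value` layout shown in the question, with numeric indices; dumps made with a dictionary (terms instead of indices) or with dense vectors would need a different parser. The names `parse_cluster_line` and `parse_sparse` are illustrative and not part of Mahout.

```python
import re

# Matches one line such as: VL-24130{n=1312 c=[0:0.017, ...] r=[0:0.442, ...]}
CLUSTER_LINE = re.compile(
    r"^(?P<kind>VL|CL)-(?P<id>\d+)\{n=(?P<n>\d+)\s+"
    r"c=\[(?P<c>[^\]]*)\]\s+r=\[(?P<r>[^\]]*)\]\s*\}"
)


def parse_sparse(text):
    """Turn '0:0.017, 10:0.007, ...' into {0: 0.017, 10: 0.007}.

    Items without a colon (such as the '...' elision) are skipped.
    """
    vector = {}
    for item in text.split(","):
        item = item.strip()
        if ":" not in item:
            continue
        index, weight = item.split(":", 1)
        vector[int(index)] = float(weight)
    return vector


def parse_cluster_line(line):
    """Split a dump line into identifier, size, centroid and radius."""
    match = CLUSTER_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a cluster line: %r" % line[:40])
    return {
        "id": int(match.group("id")),
        "converged": match.group("kind") == "VL",
        "n": int(match.group("n")),                   # number of points
        "centroid": parse_sparse(match.group("c")),   # dimension -> weight
        "radius": parse_sparse(match.group("r")),     # dimension -> radius
    }


line = "VL-24130{n=1312 c=[0:0.017, 10:0.007, 11:0.005, ... ] r=[0:0.442, 10:0.271, ... ]}"
cluster = parse_cluster_line(line)
print(cluster["id"], cluster["n"], cluster["centroid"][10])
```

The keys of the centroid and radius dictionaries are dimension indices, so nothing in such a line identifies individual documents; membership has to come from the clusteredPoints data.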