How do I write the clustering results from mclust to a file?
I'm using the mclust library for R ( http://www.stat.washington.edu/mclust ) to do some experimental EM-based GMM clustering. The package is great and seems to generally find very good clusters for my data.
The problem is that I don't really know R at all, and while I have managed to muddle through the clustering process based on the help() contents and the extensive readme, I cannot for the life of me figure out how to write out the actual cluster results to file. I am using the following absurdly simple script to perform the clustering,
myData <- read.csv("data.csv", sep=",", header=FALSE)
attach(myData)
myBIC <- mclustBIC(myData)
mySummary <- summary( myBIC, data=myData )
at which point I have cluster results and a summary. The data in data.csv is just a list of multi-dimensional points, one per line. So each line looks like 'x,y,z' (in the case of 3 dimensions).
If I use 2d points (e.g. just the x and y vals) I can then use the internal plot function to get a very pretty graph that plots the original points and color codes each point based on the cluster it was assigned to. So I know all the info is somewhere in 'myBIC', but the docs and help don't seem to provide any insight as to how to print out this data!
I want to print out a new file based on the results I believe are encoded in myBIC. Something like,
CLUST x, y, z
1 1.2, 3.4, 5.2
1 1.2, 3.3, 5.2
2 5.5, 1.3, 1.3
3 7.1, 1.2, -1.0
3 7.2, 1.2, -1.1
and then - hopefully - also print out the parameters/centroids of the individual gaussians/clusters that the clustering process found.
Surely this is an absurdly easy thing to do and I'm just too ignorant of R to figure it out...
EDIT: I seem to have gotten a little bit further along. Doing the following prints out a somewhat cryptic matrix,
> mySummary$classification
[1] 1 1 2 1 3
[6] 1 1 1 3 1
[12] 1 2 1 3 1
[18] 1 3
which upon reflection I realized is actually the list of samples and their classifications. I guess it is not possible to write this directly via the write command, but a bit more experimentation in the R console led me to realize that I can do this:
> newData <- mySummary$classification
> write.csv( newData, file="class.csv" )
and that the result actually looks pretty nice!
$ head class.csv
"","x"
"1",1
"2",2
"3",2
where the first column apparently matches the index for the input data, and the second column describes the assigned class identity.
The 'mySummary$parameters' object appears to be nested though, and has a bunch of sub-objects corresponding to the individual gaussians and their parameters, etc. The 'write' function fails when I try to just write it out, but individually writing out each sub object name is a bit tedious. Which leads me to a new question: how do I iterate over a nested object in R and print the elements out in a serial fashion to a file descriptor?
I have this 'mySummary$parameters' object. It is composed of several sub-objects like 'mySummary$parameters$variance$sigma', etc. I would like to just iterate over everything and print it all to file in the same way that this is done to the CLI automatically...
Comments (1)
To calculate the actual clustering parameters themselves (mean, variance, which cluster each point belongs to), you need to use Mclust. To do the writing you can use (for example) write.csv.
By default, Mclust calculates the parameters for the optimal model as determined by BIC, so if that's what you want, you can simply call Mclust on your data and keep the result (called myMclust below).
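As a minimal sketch of that call: the synthetic two-blob data below is only an assumption standing in for data.csv, so that the snippet runs on its own.

```r
library(mclust)

# Synthetic stand-in for data.csv (an assumption, for illustration only):
# two well-separated blobs of 3-D points.
set.seed(1)
myData <- data.frame(
  x = c(rnorm(50, 0), rnorm(50, 8)),
  y = c(rnorm(50, 0), rnorm(50, 8)),
  z = c(rnorm(50, 0), rnorm(50, 8))
)

# Fit the default set of models and keep the parameters of the BIC-optimal one.
myMclust <- Mclust(myData)

myMclust$modelName             # name of the selected covariance model
myMclust$G                     # number of clusters in the selected model
head(myMclust$classification)  # cluster label for each input row
```

With real data you would replace the synthetic data frame with the result of read.csv, as in the question.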
myMclust$BIC will contain the BIC results for all the other models (i.e. myMclust$BIC is more or less the same as mclustBIC(myData)).
See the Value: section of ?Mclust to see what other information myMclust has. For example, myMclust$parameters$mean is the mean for each cluster, myMclust$parameters$variance the variance for each cluster, and so on.
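Since the question also asks how to get those nested parameters into a file, here is one possible sketch. The file names and the synthetic data are assumptions, and capture.output is just one way to do it (sink would work too).

```r
library(mclust)

# Synthetic stand-in data (an assumption, for illustration only).
set.seed(1)
myData <- data.frame(x = c(rnorm(50, 0), rnorm(50, 8)),
                     y = c(rnorm(50, 0), rnorm(50, 8)))
myMclust <- Mclust(myData)

# parameters$mean holds one column per cluster, so transpose it to get
# one row per cluster centroid before writing.
write.csv(t(myMclust$parameters$mean), file = "means.csv")

# Dump the whole nested parameters list to a text file, formatted the same
# way the console would print it.
capture.output(print(myMclust$parameters), file = "parameters.txt")
```

The second call avoids having to name every sub-object by hand, at the cost of producing free-form text rather than a CSV.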
myMclust$classification will contain which cluster each point belongs to, calculated for the optimal model. So, to get the output you want, you can put that vector next to myData as a CLUST column and pass the result to write.csv.
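A small sketch of that step, using the five example points from the question. The labels vector is a hand-written stand-in for myMclust$classification and out.csv is a made-up file name, so this runs in base R without fitting anything.

```r
# The five example points from the question.
myData <- data.frame(
  x = c(1.2, 1.2, 5.5, 7.1, 7.2),
  y = c(3.4, 3.3, 1.3, 1.2, 1.2),
  z = c(5.2, 5.2, 1.3, -1.0, -1.1)
)

# Stand-in for myMclust$classification (an assumption for illustration).
labels <- c(1, 1, 2, 3, 3)

# Put the cluster label in front of the original coordinates.
out <- data.frame(CLUST = labels, myData)

# row.names = FALSE drops the row-number column; quote = FALSE leaves the
# column headings unquoted.
write.csv(out, file = "out.csv", row.names = FALSE, quote = FALSE)
```

With a real fit, labels would simply be myMclust$classification.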
A note on write.csv: if you don't put in row.names=FALSE, you'll get an extra column in your CSV containing the row number. Also, quote=FALSE writes your column headings as CLUST,x,y,z, whereas otherwise they'd be "CLUST","x","y","z". It's your choice.
Suppose we wanted to do the same, but with the parameters from a different model that was not optimal. Mclust calculates parameters only for the optimal model by default. To calculate parameters for a particular model (say "EEI"), pass that model's name to Mclust through its modelNames argument, and then proceed as before.
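For instance, restricting the fit to "EEI" might look like the sketch below. The synthetic data and the name myMclustEEI are assumptions for illustration.

```r
library(mclust)

# Synthetic stand-in data (an assumption, for illustration only).
set.seed(1)
myData <- data.frame(x = c(rnorm(50, 0), rnorm(50, 8)),
                     y = c(rnorm(50, 0), rnorm(50, 8)))

# Only the EEI model is fitted; the number of clusters is still chosen by BIC.
myMclustEEI <- Mclust(myData, modelNames = "EEI")

myMclustEEI$modelName             # which model the parameters belong to
head(myMclustEEI$classification)  # cluster labels under that model
```

From here, myMclustEEI$classification and myMclustEEI$parameters can be written out the same way as for the optimal model.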