How do I write clustering results from mclust to a file?

Posted on 2024-12-28 02:13:06 · 1691 words · 3 views · 0 comments

I'm using the mclust library for R ( http://www.stat.washington.edu/mclust ) to do some experimental EM-based GMM clustering. The package is great and seems to generally find very good clusters for my data.

The problem is that I don't really know R at all, and while I have managed to muddle through the clustering process based on the help() contents and the extensive readme, I cannot for the life of me figure out how to write out the actual cluster results to file. I am using the following absurdly simple script to perform the clustering,

myData <- read.csv("data.csv", sep=",", header=FALSE)
attach(myData)
myBIC <- mclustBIC(myData)
mySummary <- summary( myBIC, data=myData )

at which point I have cluster results and a summary. The data in data.csv is just a list of multi-dimensional points, one per line. So each line looks like 'x,y,z' (in the case of 3 dimensions).

If I use 2d points (e.g. just the x and y vals) I can then use the internal plot function to get a very pretty graph that plots the original points and color codes each point based on the cluster it was assigned to. So I know all the info is somewhere in 'myBIC', but the docs and help don't seem to provide any insight as to how to print out this data!

I want to print out a new file based on the results I believe are encoded in myBIC. Something like,

CLUST x, y, z
1 1.2, 3.4, 5.2
1 1.2, 3.3, 5.2
2 5.5, 1.3, 1.3
3 7.1, 1.2, -1.0
3 7.2, 1.2, -1.1

and then - hopefully - also print out the parameters/centroids of the individual gaussians/clusters that the clustering process found.

Surely this is an absurdly easy thing to do and I'm just too ignorant of R to figure it out...

EDIT: I seem to have gotten a little bit further along. Doing the following prints out a somewhat cryptic matrix,

> mySummary$classification
 [1] 1 1 2 1 3
 [6] 1 1 1 3 1
[11] 1 2 1 3 1
[16] 1 3

which upon reflection I realized is actually the list of samples and their classifications. I guess it is not possible to write this directly via the write command, but a bit more experimentation in the R console led me to realize that I can do this:

> newData <- mySummary$classification
> write.csv( newData, file="class.csv" )

and that the result actually looks pretty nice!

 $ head class.csv
"","x"
"1",1
"2",2
"3",2

where the first column apparently matches the index for the input data, and the second column describes the assigned class identity.

The 'mySummary$parameters' object appears to be nested though, and has a bunch of sub-objects corresponding to the individual gaussians and their parameters, etc. The 'write' function fails when I try to just write it out, but individually writing out each sub object name is a bit tedious. Which leads me to a new question: how do I iterate over a nested object in R and print the elements out in a serial fashion to a file descriptor?

I have this 'mySummary$parameters' object. It is composed of several sub-objects like 'mySummary$parameters$variance$sigma', etc. I would like to just iterate over everything and print it all to file in the same way that this is done to the CLI automatically...

Comments (1)

酷炫老祖宗 2025-01-04 02:13:06

To calculate the actual clustering parameters themselves (mean, variance, what cluster each point belongs to), you need to use Mclust.
To do the writing you can use (for example) write.csv.

By default Mclust calculates the parameters based on the most optimal model as determined by BIC, so if that's what you want to do, you can do:

myMclust <- Mclust(myData)

Then myMclust$BIC will contain the BIC results for all the models that were considered (i.e. myMclust$BIC is more-or-less the same as mclustBIC(myData)).

See the Value: section of ?Mclust to see what other information myMclust has. For example, myMclust$parameters$mean holds the mean of each cluster, myMclust$parameters$variance holds the variance parameters of each cluster (such as the covariance matrices in myMclust$parameters$variance$sigma), ...
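As a rough sketch of how a nested object like this could be dumped to a file in one pass (the second thing the question asks about), base R's capture.output() can redirect what print() would show in the console. The params list below is a hypothetical stand-in shaped loosely like myMclust$parameters, not real mclust output, and dump_nested and params.txt are names made up for illustration.

```r
# Hypothetical stand-in for myMclust$parameters (two clusters, two dimensions).
params <- list(
  pro = c(0.4, 0.6),
  mean = matrix(c(0, 0, 5, 5), nrow = 2, dimnames = list(c("x", "y"), NULL)),
  variance = list(
    modelName = "EEI",
    sigma = array(c(1, 0, 0, 1, 1, 0, 0, 1), dim = c(2, 2, 2))
  )
)

# Walk a named, nested list and write every leaf the way the console prints it.
dump_nested <- function(x, con, prefix = "params") {
  for (name in names(x)) {
    item <- x[[name]]
    label <- paste(prefix, name, sep = "$")
    if (is.list(item)) {
      dump_nested(item, con, label)                 # recurse into sub-lists
    } else {
      writeLines(label, con)                        # header naming the element
      writeLines(capture.output(print(item)), con)  # printed value
    }
  }
}

out_file <- file.path(tempdir(), "params.txt")      # assumed output location
con <- file(out_file, open = "w")
dump_nested(params, con)
close(con)
```

If a single unstructured dump is enough, capture.output(print(params), file = out_file) writes the same text the console would show, without the per-element headers.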

Meanwhile, myMclust$classification contains the cluster each point belongs to, calculated for the optimal model.

So, to get the output you want, you can do:

# create some data for example purposes -- you have your read.csv(...) instead.
myData <- data.frame(x=runif(100),y=runif(100),z=runif(100))
# get parameters for most optimal model
myMclust <- Mclust(myData)
# if you wanted to do your summary like before:
mySummary <- summary( myMclust$BIC, data=myData )

# add a column in myData CLUST with the cluster.
myData$CLUST <- myMclust$classification
# now to write it out:
write.csv(myData[,c("CLUST","x","y","z")], # reorder columns to put CLUST first
          file="out.csv",                  # output filename
          row.names=FALSE,                 # don't save the row numbers
          quote=FALSE)                     # don't surround column names in ""

A note on write.csv: if you leave out row.names=FALSE you'll get an extra column in your CSV containing the row numbers. Also, quote=FALSE writes your column headings as CLUST,x,y,z, whereas otherwise they'd be "CLUST","x","y","z". It's your choice.
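To see the effect of those two arguments without creating any files, something like the following could be tried. It is only an illustration: df is a small hypothetical data frame standing in for myData with its CLUST column, and capture.output() is used to collect the lines that write.csv would otherwise print to the console.

```r
# Small hypothetical data frame standing in for myData with a CLUST column.
df <- data.frame(CLUST = c(1, 2), x = c(1.2, 5.5), y = c(3.4, 1.3))

# With no file argument write.csv() prints to the console,
# so capture.output() can collect the lines it would have written.
with_defaults <- capture.output(write.csv(df))
plain <- capture.output(write.csv(df, row.names = FALSE, quote = FALSE))

print(with_defaults)  # quoted header plus a leading row-name column
print(plain)          # bare header, no row-name column
```

Note that quote=FALSE also leaves character values in the data unquoted, which only matters if those values can contain commas.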

Suppose we wanted to do the same, but using the parameters from a different model that was not the optimal one. By default, Mclust calculates parameters only for the optimal model. To calculate parameters for a particular model (say "EEI"), you'd do:

myMclust <- Mclust(myData,modelNames="EEI")

and then proceed as before.
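Put together, "proceeding as before" for a fixed model might look like the sketch below, which also writes out the cluster means that the question asks for. It assumes the mclust package is installed; the random data, the seed, and the output file names (out_EEI.csv, centroids_EEI.csv) are placeholders rather than anything from the original post.

```r
library(mclust)

set.seed(1)  # the example data is random, so fix the seed for repeatable runs
myData <- data.frame(x = runif(100), y = runif(100), z = runif(100))

# Fit only the "EEI" model; the number of clusters is still chosen by BIC.
myMclust <- Mclust(myData, modelNames = "EEI")

# One row per point: cluster label first, then the original coordinates.
out <- cbind(CLUST = myMclust$classification, myData)
write.csv(out, file = file.path(tempdir(), "out_EEI.csv"),
          row.names = FALSE, quote = FALSE)

# parameters$mean has one column per cluster, so transpose it
# to get one row per cluster and one column per dimension.
centroids <- t(myMclust$parameters$mean)
write.csv(centroids, file = file.path(tempdir(), "centroids_EEI.csv"))
```

The files go to tempdir() here only to keep the sketch from cluttering the working directory; a plain file name such as "out.csv" works the same way.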
