How to evaluate clusters?

Posted on 2025-01-01 10:39:35

I am still researching how to evaluate clusters formed by clustering (unsupervised learning).

I tried googling, but the measures I find are too theoretical. It would be great if people could share the mechanisms they use to evaluate the clusters they get. Say I have a Java cluster containing Java EE, Java ME, RMI, JVM, etc., and another cluster, NoSQL, containing something like Neo4j, OrientDB, CouchDB, etc. That would be perfect: my clustering algorithm has given me the most accurate clusters.

However, after training and then testing, I may get, say, MySQL and Oracle under the NoSQL cluster, so I just do a manual/visual interpretation and then retrain or tweak my algorithm to get better clustering.

Now I want to automate this manual inspection of clusters and have a system that gives me the accuracy of the clusters formed. I am looking for something similar to Precision, Recall, NDCG, MAP, etc. used in search. My clusters vary in size and there can be n different clusters, so per-class precision/recall would not be the right thing.
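One way to get a precision/recall-style number that does not depend on cluster sizes or on how many clusters were formed is to count pairs of items instead of individual items. The sketch below is only an illustration under a strong assumption: that a small hand-labeled reference exists for some of the items. The function name `pairwise_scores`, the cluster ids and the `"sql"` reference label for MySQL and Oracle are hypothetical, not taken from the question.

```python
from itertools import combinations


def pairwise_scores(predicted, reference):
    """Pair-counting precision, recall and F1 for a clustering.

    Both arguments map an item to a cluster id or category label.
    A pair of items counts as a true positive when the clustering and
    the reference both put the two items together.
    """
    items = sorted(set(predicted) & set(reference))
    tp = fp = fn = 0
    for a, b in combinations(items, 2):
        same_pred = predicted[a] == predicted[b]
        same_ref = reference[a] == reference[b]
        if same_pred and same_ref:
            tp += 1
        elif same_pred:
            fp += 1  # clustered together, but reference keeps them apart
        elif same_ref:
            fn += 1  # reference groups them, clustering split them
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical hand-labeled reference (the "sql" label is an assumption).
reference = {
    "Java EE": "java", "Java ME": "java", "RMI": "java", "JVM": "java",
    "Neo4j": "nosql", "OrientDB": "nosql", "CouchDB": "nosql",
    "MySQL": "sql", "Oracle": "sql",
}
# Hypothetical algorithm output: MySQL and Oracle landed in the NoSQL cluster.
predicted = {
    "Java EE": 0, "Java ME": 0, "RMI": 0, "JVM": 0,
    "Neo4j": 1, "OrientDB": 1, "CouchDB": 1,
    "MySQL": 1, "Oracle": 1,
}

precision, recall, f1 = pairwise_scores(predicted, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case recall stays perfect, because every pair the reference groups is also grouped by the clustering, while precision drops because the relational databases were merged into the NoSQL cluster.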

Comments (2)

自此以后,行同陌路 2025-01-08 10:39:35

I'm working on a clustering project and so far I have the same question.

Right now I'm using the JavaML library, which has several clustering algorithms built in (in my case I'm using K-means), and this library also has several functions to evaluate these algorithms.

The function I'm using to evaluate the 'quality' of my clusters is the sum of the squared errors of the elements of each cluster. To explain this evaluation method without too much math: the sum of squared errors adds up the squared distance of each element of every cluster to its cluster centroid (in the case of K-means). This is not the perfect, ideal evaluation you would like, one that clearly beats visual comparison (I have the same problem), but at least it is a formal way to identify 'how good your clusters are'. It's cheap, fast, and can give you a general view of your clusters.

You may also want to look into the 'cluster labeling' problem. It's not trivial, but it aims to attack the same problem.

I think the right answer to your question depends on the clustering algorithm you are using and on understanding some of the mathematical theory behind it, because this is not an easy subject :)

Good luck with that!
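The sum of squared errors described in this answer can be sketched in a few lines. This is a plain-Python illustration, not the JavaML implementation; the function name `sum_of_squared_errors` and the sample points are made up for the example, and the centroid is assumed to be the mean of each cluster's members, as in K-means.

```python
def sum_of_squared_errors(points, assignments):
    """Total squared Euclidean distance from each point to its cluster centroid."""
    # Group the points by the cluster id they were assigned to.
    clusters = {}
    for point, cluster_id in zip(points, assignments):
        clusters.setdefault(cluster_id, []).append(point)

    total = 0.0
    for members in clusters.values():
        dims = len(members[0])
        # Centroid = per-dimension mean of the cluster members.
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        for p in members:
            total += sum((p[d] - centroid[d]) ** 2 for d in range(dims))
    return total


# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 3.0), (8.0, 8.0), (10.0, 8.0)]

tight = sum_of_squared_errors(points, [0, 0, 1, 1])  # groups kept together
loose = sum_of_squared_errors(points, [0, 1, 0, 1])  # groups mixed up
print(tight, loose)
```

A lower value means tighter clusters. Since the value generally shrinks as the number of clusters grows, it is most meaningful when comparing runs that use the same number of clusters.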

瞳孔里扚悲伤 2025-01-08 10:39:35

Normally clustering is used as an unsupervised or semi-supervised learning algorithm. Since you mentioned "However after training and then testing I may get say MySQL, ..." I assume that you are using a semi-supervised clustering algorithm for your application.

You can increase the number of input features (or run several experiments while increasing the number of input features) and see how the accuracy of your system changes with the size of the feature vector.

Moreover, you can evaluate different clustering algorithms and select the one that gives the best prediction accuracy.
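Comparing several runs (different algorithms, or different feature-vector sizes) could look roughly like the sketch below. It assumes a small labeled reference is available and uses purity as a stand-in for "accuracy"; the answer does not say which measure to use, so that choice, the names `purity`, `run_a`, `run_b`, and the sample data are all hypothetical.

```python
from collections import Counter


def purity(predicted, reference):
    """Fraction of items that carry the majority reference label of their cluster."""
    by_cluster = {}
    for item, cluster_id in predicted.items():
        by_cluster.setdefault(cluster_id, []).append(reference[item])
    # For each cluster, count how many members share its most common label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in by_cluster.values()
    )
    return majority_total / len(predicted)


# Hypothetical labeled reference and two candidate clusterings, e.g. produced
# by two algorithms or by two different feature-vector sizes.
reference = {"Neo4j": "nosql", "CouchDB": "nosql", "MySQL": "sql", "Oracle": "sql"}
candidates = {
    "run_a": {"Neo4j": 0, "CouchDB": 0, "MySQL": 0, "Oracle": 1},
    "run_b": {"Neo4j": 0, "CouchDB": 0, "MySQL": 1, "Oracle": 1},
}

scores = {name: purity(assignment, reference) for name, assignment in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Purity rewards many tiny clusters (one item per cluster scores 1.0), so it is safest to compare runs that produce a similar number of clusters, or to pair it with a second measure.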
