非基于密度的数据聚类算法
我正在开发一个聚类分析程序,该程序将一组点 S 作为输入,并用它所属的聚类索引来标记每个点。我已经实现了 DBScan 和 OPTICS 算法,它们都按预期工作。 然而,这些算法的结果可能会非常不同,具体取决于 MinPts 和 Epsilon 的初始值。我在网上搜索并阅读了大量有关数据挖掘和聚类分析的论文,但我似乎无法找到一种无需 MinPts 和 Epsilon 来确定一个点是否在这样的聚类中即可分析数据的方法。 我猜想基于密度的聚类分析不是我的情况。
有谁知道或知道我可以使用不需要这种配置的算法吗? 或者只是给我指出正确的方向。欢迎任何帮助。
谢谢!
这是我试图完成的一个学校项目,其中我有一组代表平面上点的二维坐标,我必须确定每个点属于哪个簇。现在我已经使用 OPTICS 完成了这项工作,并且工作正常,但我需要调整 Eps 值,以便我的输出与给定的示例输出相匹配。但由于我没有描述主题中的簇是什么,或者它的特征是什么,所以我不可能仅仅基于点之间的距离,或者给定区域中点的密度。另外,我事先不知道簇的数量,因此我使用 OPTICS 算法。所以在我看来,要么我做得非常错误,要么这个主题中缺少一条重要的信息。 而且,我并不是在寻找任何人做我的作业或给我任何源代码,只是一些想法或指导,因为我几乎迷失了如何获得数据集示例中给出的确切结果(我是也不允许得到任何错误的值,如果我这样做,他们认为该项目是失败的,因此不能使用具有误差范围的算法)。
再次感谢,并对这么长的帖子表示歉意。
I'm working on a cluster analysis program that takes a set of points S as an input and labels each point with that index of the cluster it belong to. I've implemented the DBScan and OPTICS algorithms and they both work as expected.
However, the results of those algorithms can be very different depending on the initial values of MinPts and Epsilon. I've searched all over the net and read lots of papers about data mining and cluster analysis and yet I can't seem to find a way of analyzing the data without needing MinPts and Epsilon to determine if a point is in such a cluster.
I'm guessing density based cluster analysis is not the way to go in my case.
Does anyone have an idea or know about an algorithm I could use that wont require that kind of configuration ?
Or simply point me in the right direction. Any help is welcome.
Thanks!
It's a school project I'm trying to finish, in which I have a set of 2D coords representing points on a plane, and I have to determine what cluster each point belongs to. Now I've done that using OPTICS and it work fine but I need to tweak the Eps value so that my output matches the example outputs I'm given. But since I have no description of what a cluster is in the subject, or what it's characteristics are, there is no way I can base myself solely on the distance between the points, or the density of points in a given region. Also, I do not know the number of clusters in advance, hence my use of the OPTICS algorithm. So in my opinion, either I'm doing it very wrong, either there is a crucial piece of info missing in the subject.
And also, I'm not looking for anyone to do my homework or give me any source code, just some ideas or guidance, since I'm pretty much lost as how get the exact results given in the data set examples (I'm also not allowed to get any wrong values, if I do they consider the project is a failure, so algorithms with error margins can't be used).
Thanks again, and sorry for the long post.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
一般来说,一组点可以通过多种方式分配给簇(例如,它们可以全部分配给一个大簇,或者分为两个或三个),因此您必须有一些参数。
您为什么反对 MinPts 和 Epsilon?如果您不喜欢更改它们时发生的情况,请不要更改它们。严重地。
编辑:
多么奇怪的任务啊!你的聚类必须与他们的完全匹配,没有其他线索?我将假设他们既不是白痴也不是虐待狂,并做出以下猜测:在示例中,存在肉眼可见的“自然”聚类。我说得对吗?如果是这样,那么我们可以通过编程方式设置参数,作为点集中距离的函数。有多少个例子,可以发布一个吗?
编辑:
哈!我就知道!这里有一个规则可以正确地将这种情况划分为簇:找到从任何点到其最近邻居的最大距离,如果任何两个点的距离小于该距离的两倍,则它们属于同一个簇。我敢打赌它也适用于其他情况。
In general a set of points can be assigned to clusters in more than one way (e.g. they can all be assigned to one big cluster, or divided into two or three), so you have to have some parameters.
Why do you object to MinPts and Epsilon? If you don't like what happens when you change them, don't change them. Seriously.
EDIT:
What a bizarre assignment! Your clustering must match theirs perfectly, with no other clues? I will assume that they are neither idiots nor sadists and make the following guess: in the examples, there is a "natural" clustering that is obvious to the eye. Am I right? If so then there is a way we can set the parameters programmatically, as a function of distances in the point set. How many examples are there, and is it possible to post one?
EDIT:
Hah! I knew it! Here is a rule that will correctly divide this case into clusters: find the biggest distance from any point to its nearest neighbor, and if any two points are less than twice that distance apart, they belong to the same cluster. I'll bet it will work on the other cases too.
您可以尝试研究许多其他集群算法。您有概率聚类 (EM)、分区聚类 (KMeans)、层次聚类等等...当然,每种聚类都需要不同类型的配置
另外请务必尝试 Weka,一个开源工具,包含大量机器学习算法(分类、聚类、预处理,...)。我相信它有一个针对所有提到的算法的实现(Java)。
编辑:确定哪种聚类最好的问题非常依赖于领域。这一切都取决于如何在应用程序的上下文中使用集群,这决定了它们的有用程度(此外,您的数据可能有多个自然集群)。
You could try looking into the many other cluster algorithms out there. You have probabilistic clustering (EM), partitional clustering (KMeans), hierarchical clustering, and many others... Of course each requires a different kind of configuration
Also make sure to try Weka, an open source tool containing lots and lots of machine learning algorithms (classification, clustering, pre-processing, ...). I believe it has an implementation (Java) for all those algorithms mentioned.
Edit: The question of determining which clustering is best is very domain-dependent. And it all comes down to how the clusters are put to use in the context of your application that determines how useful they are (Besides there could be more than one natural clustering of your data).