Filtering away nearby points from a list
I half-answered a question about finding clusters of mass in a bitmap. I say half-answered because I left it in a condition where I had all the points in the bitmap sorted by mass, and left it to the reader to filter the list, removing points from the same cluster.
Then when thinking about that step I found that the solution didn't jump out at me like I thought it would. So now I'm asking you guys for help. We have a list of points with masses like so (a Python list of tuples, but you can represent it as you see fit in any language):
[ (6, 2, 6.1580555555555554),
(2, 1, 5.4861111111111107),
(1, 1, 4.6736111111111107),
(1, 4, 4.5938888888888885),
(2, 0, 4.54),
(1, 5, 4.4480555555555554),
(4, 7, 4.4480555555555554),
(5, 7, 4.4059637188208614),
(4, 8, 4.3659637188208613),
(1, 0, 4.3611111111111107),
(5, 8, 4.3342191043083904),
(5, 2, 4.119574829931973),
...
(8, 8, 0.27611111111111108),
(0, 8, 0.24138888888888888) ]
Each tuple is of the form:
(x, y, mass)
Note that the list is sorted here. If your solution prefers to not have them sorted it's perfectly OK.
The challenge, if you recall, is to find the main clusters of mass. The number of clusters is not known. But you know the dimensions of the bitmap. Sometimes several points within a cluster have more mass than the center of the next (in size) cluster. So what I want to do is start from the highest-mass points and remove points in the same cluster (points nearby).
When I tried this I ended up having to walk through parts of the list over and over again. I have a feeling I'm just stupid about it. How would you do it? Pseudo code or real code. Of course, if you can just pick up where I left off in that answer with Python code, it'll be easier for me to experiment with.
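For concreteness, here's roughly the shape of what I'm after: a greedy pass over the mass-sorted list that keeps a point only if it isn't too close to a heavier point already kept. The min_distance radius here is a placeholder; picking it well is part of the problem.

    import math

    def filter_nearby(points, min_distance):
        # points: (x, y, mass) tuples sorted by mass, heaviest first.
        # min_distance: placeholder suppression radius; anything closer
        # than this to an already-kept point is assumed to belong to
        # that point's cluster.
        kept = []
        for x, y, mass in points:
            if all(math.hypot(x - kx, y - ky) >= min_distance
                   for kx, ky, _ in kept):
                kept.append((x, y, mass))
        return kept

This is O(n·k) for k kept points, which is fine for a small bitmap, but it's the one-pass version of the rescanning I keep running into.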
Next step is to figure out how many clusters there really are in the bitmap. I'm still struggling with defining that problem so I might return with a question about it.
EDIT: I should clarify that I know that there's no "correct" answer to this question. And the name of the question is key. Phase one of my clustering is done. I'm in search of a fast, accurate-enough method of filtering away nearby points.
Let me know if you see how I can make the question clearer.
6 Answers
Just so you know, you are asking for a solution to an ill-posed problem: no definitive solution exists. That's fine...it just makes it more fun. Your problem is ill-posed mostly because you don't know how many clusters you want. Clustering is one of the key areas of machine learning and there are quite a few approaches that have been developed over the years.
As Arachnid pointed out, the k-means algorithm tends to be a good one and it's pretty easy to implement. The results depend critically on the initial guess and on the number of desired clusters. To overcome the initial guess problem, it's common to run the algorithm many times with random initializations and pick the best result. You'll need to define what "best" means. One measure would be the mean squared distance of each point to its cluster center. If you want to automatically guess how many clusters there are, you should run the algorithm with a whole range of numbers of clusters. For any good "best" measure, more clusters will always look better than fewer, so you'll need a way to penalize having too many clusters. The MDL (minimum description length) discussion on Wikipedia is a good starting point.
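For example, here's a minimal sketch of that restart strategy, assuming plain k-means on the (x, y) coordinates (mass could also be folded in as a weight) and mean squared distance as the "best" measure:

    import random

    def kmeans(points, k, iters=50):
        # points: list of (x, y) pairs
        centers = random.sample(points, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k),
                              key=lambda i: (p[0] - centers[i][0]) ** 2 +
                                            (p[1] - centers[i][1]) ** 2)
                clusters[nearest].append(p)
            # move each center to its cluster's mean; re-seed empty clusters
            centers = [(sum(x for x, _ in c) / len(c),
                        sum(y for _, y in c) / len(c))
                       if c else random.choice(points)
                       for c in clusters]
        msd = sum(min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers)
                  for px, py in points) / len(points)
        return centers, msd

    def best_kmeans(points, k, restarts=10):
        # several random initializations; keep the lowest mean squared distance
        return min((kmeans(points, k) for _ in range(restarts)),
                   key=lambda r: r[1])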
K-means clustering is basically the simplest mixture model. Sometimes it's helpful to upgrade to a mixture of Gaussians learned by expectation maximization (described in the link just given). This can be more robust than k-means. It takes a little more effort to understand it, but when you do, it's not much harder than k-means to implement.
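If scikit-learn is an option (an assumption on my part; it hasn't come up in the thread), its GaussianMixture class does the EM fit for you, and BIC gives a ready-made penalty for too many clusters, in the same spirit as the MDL idea above:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # `points` is the (x, y, mass) list from the question
    X = np.array([(x, y) for x, y, _ in points])

    # fit mixtures over a range of component counts and keep the one
    # with the lowest BIC (Bayesian information criterion)
    best = min((GaussianMixture(n_components=k, n_init=5).fit(X)
                for k in range(1, 8)),
               key=lambda gm: gm.bic(X))
    labels = best.predict(X)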
There are plenty of other clustering techniques such as agglomerative clustering and spectral clustering. Agglomerative clustering is pretty easy to implement, but choosing when to stop building the clusters can be tricky. If you do agglomerative clustering, you'll probably want to look at kd trees for faster nearest neighbor searches. smacl's answer describes one slightly different way of doing agglomerative clustering using a Voronoi diagram.
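As a sketch of the agglomerative route, SciPy (again assuming it's available) builds the merge tree for you; the cut-off distance t=2.0 below is a made-up stand-in for the tricky "when to stop" choice:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([(x, y) for x, y, _ in points])  # `points` as in the question

    # single-linkage agglomerative clustering, cut at a distance threshold
    Z = linkage(X, method='single')
    labels = fcluster(Z, t=2.0, criterion='distance')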
There are models that can automatically choose the number of clusters for you, such as ones based on Latent Dirichlet Allocation, but they are a lot harder to understand and implement correctly.
You might also want to look at the mean-shift algorithm to see if it's closer to what you really want.
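scikit-learn also ships a mean-shift implementation (assuming that library is acceptable); it estimates the number of clusters on its own, which fits your unknown-cluster-count constraint:

    import numpy as np
    from sklearn.cluster import MeanShift, estimate_bandwidth

    X = np.array([(x, y) for x, y, _ in points])

    # the bandwidth sets the kernel size; estimate_bandwidth guesses a
    # starting value from the data (quantile=0.2 is an arbitrary choice)
    bandwidth = estimate_bandwidth(X, quantile=0.2)
    labels = MeanShift(bandwidth=bandwidth).fit_predict(X)
    print(len(set(labels)), "clusters found")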
It sounds to me like you're looking for the K-means algorithm.
As I mentioned in the comment to your question, the answer depends on whether or not mass can be considered scalar in this context. If so, color-based solutions are probably not going to work, as color is often not taken as being scalar.
For example, if I have a given area with 1 point of high mass, is that the same as having the same area with 10 points of 1/10 the mass? If this is true, mass is not scalar in this context, and I would tend to look at an algorithm used for spatially grouping similar non-scalable values, e.g. Voronoi diagrams.
In this case, where two adjacent Voronoi areas have a close enough match in mass and distance, they can be clustered together. You could repeat this to find all clusters.
If, on the other hand, your mass is scalable, or the mass at an unknown position can be interpolated from surrounding points, I would tend to triangulate and contour the input data, and use the areas between contours to find clusters of similar mass.
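Here's a rough sketch of the Voronoi-merging idea, assuming SciPy is available; mass_tol and dist_tol are hypothetical thresholds standing in for "close enough mass match and distance":

    import numpy as np
    from scipy.spatial import Voronoi

    def voronoi_merge(points, mass_tol=0.5, dist_tol=2.0):
        # points: list of (x, y, mass) tuples
        xy = np.array([(x, y) for x, y, _ in points])
        masses = [m for _, _, m in points]
        vor = Voronoi(xy)

        # union-find over the points
        parent = list(range(len(points)))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        # vor.ridge_points pairs up points whose Voronoi cells share an edge
        for i, j in vor.ridge_points:
            if (abs(masses[i] - masses[j]) <= mass_tol and
                    np.linalg.norm(xy[i] - xy[j]) <= dist_tol):
                parent[find(i)] = find(j)

        clusters = {}
        for i in range(len(points)):
            clusters.setdefault(find(i), []).append(points[i])
        return list(clusters.values())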
This sounds like color quantization, where you reduce the number of colors in an image. One way would be to plot the colors in space, and collapse each cluster to its center (or a weighted average).
The exact name of the algorithm that triggered this memory escapes me; I'll edit the answer if it comes back to me. In the meantime, you should look at color quantization and see if some of those algorithms are useful.
Start with the "Convex Hull" problem. You're also looking for some "convex hull"-like clusters.
Note that "clusters" is vague. You have an average mass across your field. Some points have above average mass, and some below average. How far above average means you've found a cluster? How far apart do nodes have to be to be part of a cluster or a separate cluster?
What's the difference between two mountain peaks and a ridge?
You have to compute a "topography" - joining all points with equal density into regions. This requires that you pick a spot and work your way out from it radially, locating positions where the densities are equal. You can connect those points into regions.
If you picked your initial point wisely, the regions should nest. Picking your starting point is easy because you start at local highs.
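One way to sketch this on a discrete grid (assuming the masses live at integer (x, y) positions, as in the question's data): pick a density level, keep the points above it, and flood-fill the 4-connected regions. Repeating at several levels gives the nesting described above.

    from collections import deque

    def regions_above(points, level):
        # points: list of (x, y, mass); level: hypothetical density cutoff
        grid = {(x, y): m for x, y, m in points}
        above = {p for p, m in grid.items() if m >= level}
        seen, regions = set(), []
        for start in above:
            if start in seen:
                continue
            region, queue = [], deque([start])
            seen.add(start)
            while queue:
                x, y = queue.popleft()
                region.append((x, y))
                for n in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if n in above and n not in seen:
                        seen.add(n)
                        queue.append(n)
            regions.append(region)
        return regions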
Since you are already talking about mass, why not a gravity-based solution? A simple particle system would not need to be super accurate, and you would not have to run it for too long before you could make a much better guess at the number of clusters.
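A rough sketch of that particle system, with made-up constants (dt, merge_dist, and the inverse-square law are all arbitrary illustration choices): each point drifts toward the combined pull of the others, and points that drift together coalesce. Iterate until the count stops changing; the survivors approximate your clusters.

    import math

    def gravity_step(points, dt=0.1, merge_dist=0.5):
        # points: list of [x, y, mass] lists
        moved = []
        for i, (x, y, m) in enumerate(points):
            fx = fy = 0.0
            for j, (ox, oy, om) in enumerate(points):
                if i == j:
                    continue
                dx, dy = ox - x, oy - y
                d2 = dx * dx + dy * dy + 1e-9  # epsilon avoids divide-by-zero
                d = math.sqrt(d2)
                f = om / d2                    # inverse-square attraction
                fx += f * dx / d
                fy += f * dy / d
            moved.append([x + dt * fx, y + dt * fy, m])
        # coalesce particles that have drifted within merge_dist of each other
        merged = []
        for p in moved:
            for q in merged:
                if math.hypot(p[0] - q[0], p[1] - q[1]) < merge_dist:
                    total = p[2] + q[2]
                    q[0] = (q[0] * q[2] + p[0] * p[2]) / total
                    q[1] = (q[1] * q[2] + p[1] * p[2]) / total
                    q[2] = total
                    break
            else:
                merged.append(p)
        return merged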
If you have a better idea of the number of clusters, k-means clustering becomes feasible.