Filtering away nearby points from a list
I half-answered a question about finding clusters of mass in a bitmap. I say half-answered because I left it in a condition where I had all the points in the bitmap sorted by mass, and left it to the reader to filter the list, removing points from the same cluster.
Then when thinking about that step I found that the solution didn't jump out at me like I thought it would. So now I'm asking you guys for help. We have a list of points with masses like so (a Python list of tuples, but you can represent it as you see fit in any language):
[ (6, 2, 6.1580555555555554),
(2, 1, 5.4861111111111107),
(1, 1, 4.6736111111111107),
(1, 4, 4.5938888888888885),
(2, 0, 4.54),
(1, 5, 4.4480555555555554),
(4, 7, 4.4480555555555554),
(5, 7, 4.4059637188208614),
(4, 8, 4.3659637188208613),
(1, 0, 4.3611111111111107),
(5, 8, 4.3342191043083904),
(5, 2, 4.119574829931973),
...
(8, 8, 0.27611111111111108),
(0, 8, 0.24138888888888888) ]
Each tuple is of the form:
(x, y, mass)
Note that the list is sorted here. If your solution prefers to not have them sorted it's perfectly OK.
The challenge, if you recall, is to find the main clusters of mass. The number of clusters is not known. But you know the dimensions of the bitmap. Sometimes several points within a cluster have more mass than the center of the next (in size) cluster. So what I want to do is start from the highest-mass points and remove points in the same cluster (points nearby).
When I tried this I ended up having to walk through parts of the list over and over again. I have a feeling I'm just stupid about it. How would you do it? Pseudo code or real code. Of course, if you can just pick up where I left off in that answer with Python code, it'll be easier for me to experiment with.
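For concreteness, here's roughly the shape of what I'm after: a greedy pass over the mass-sorted list that keeps a point only if it isn't too close to a heavier point already kept. The min_distance radius here is a placeholder; picking it well is part of the problem.

    import math

    def filter_nearby(points, min_distance):
        # points: (x, y, mass) tuples sorted by mass, heaviest first.
        # min_distance: placeholder suppression radius; anything closer
        # than this to an already-kept point is assumed to belong to
        # that point's cluster.
        kept = []
        for x, y, mass in points:
            if all(math.hypot(x - kx, y - ky) >= min_distance
                   for kx, ky, _ in kept):
                kept.append((x, y, mass))
        return kept

This is O(n·k) for k kept points, which is fine for a small bitmap, but it's the one-pass version of the rescanning I keep running into.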
Next step is to figure out how many clusters there really are in the bitmap. I'm still struggling with defining that problem so I might return with a question about it.
EDIT: I should clarify that I know that there's no "correct" answer to this question. And the name of the question is key. Phase one of my clustering is done. I'm in search of a fast, accurate-enough method of filtering away nearby points.
Let me know if you see how I can make the question clearer.
6 Answers
Just so you know, you are asking for a solution to an ill-posed problem: no definitive solution exists. That's fine...it just makes it more fun. Your problem is ill-posed mostly because you don't know how many clusters you want. Clustering is one of the key areas of machine learning and there are quite a few approaches that have been developed over the years.
As Arachnid pointed out, the k-means algorithm tends to be a good one and it's pretty easy to implement. The results depend critically on the initial guess and on the number of desired clusters. To overcome the initial guess problem, it's common to run the algorithm many times with random initializations and pick the best result. You'll need to define what "best" means. One measure would be the mean squared distance of each point to its cluster center. If you want to automatically guess how many clusters there are, you should run the algorithm with a whole range of numbers of clusters. For any good "best" measure, more clusters will always look better than fewer, so you'll need a way to penalize having too many clusters. The MDL (minimum description length) discussion on Wikipedia is a good starting point.
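For example, here's a minimal sketch of that restart strategy, assuming plain k-means on the (x, y) coordinates (mass could also be folded in as a weight) and mean squared distance as the "best" measure:

    import random

    def kmeans(points, k, iters=50):
        # points: list of (x, y) pairs
        centers = random.sample(points, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k),
                              key=lambda i: (p[0] - centers[i][0]) ** 2 +
                                            (p[1] - centers[i][1]) ** 2)
                clusters[nearest].append(p)
            # move each center to its cluster's mean; re-seed empty clusters
            centers = [(sum(x for x, _ in c) / len(c),
                        sum(y for _, y in c) / len(c))
                       if c else random.choice(points)
                       for c in clusters]
        msd = sum(min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers)
                  for px, py in points) / len(points)
        return centers, msd

    def best_kmeans(points, k, restarts=10):
        # several random initializations; keep the lowest mean squared distance
        return min((kmeans(points, k) for _ in range(restarts)),
                   key=lambda r: r[1])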
K-means clustering is basically the simplest mixture model. Sometimes it's helpful to upgrade to a mixture of Gaussians learned by expectation maximization (described in the link just given). This can be more robust than k-means. It takes a little more effort to understand it, but when you do, it's not much harder than k-means to implement.
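If scikit-learn is an option (an assumption on my part; it hasn't come up in the thread), its GaussianMixture class does the EM fit for you, and BIC gives a ready-made penalty for too many clusters, in the same spirit as the MDL idea above:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # `points` is the (x, y, mass) list from the question
    X = np.array([(x, y) for x, y, _ in points])

    # fit mixtures over a range of component counts and keep the one
    # with the lowest BIC (Bayesian information criterion)
    best = min((GaussianMixture(n_components=k, n_init=5).fit(X)
                for k in range(1, 8)),
               key=lambda gm: gm.bic(X))
    labels = best.predict(X)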
There are plenty of other clustering techniques such as agglomerative clustering and spectral clustering. Agglomerative clustering is pretty easy to implement, but choosing when to stop building the clusters can be tricky. If you do agglomerative clustering, you'll probably want to look at kd trees for faster nearest neighbor searches. smacl's answer describes one slightly different way of doing agglomerative clustering using a Voronoi diagram.
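As a sketch of the agglomerative route, SciPy (again assuming it's available) builds the merge tree for you; the cut-off distance t=2.0 below is a made-up stand-in for the tricky "when to stop" choice:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([(x, y) for x, y, _ in points])  # `points` as in the question

    # single-linkage agglomerative clustering, cut at a distance threshold
    Z = linkage(X, method='single')
    labels = fcluster(Z, t=2.0, criterion='distance')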
There are models that can automatically choose the number of clusters for you, such as ones based on Latent Dirichlet Allocation, but they are a lot harder to understand and implement correctly.
You might also want to look at the mean-shift algorithm to see if it's closer to what you really want.
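scikit-learn also ships a mean-shift implementation (assuming that library is acceptable); it estimates the number of clusters on its own, which fits your unknown-cluster-count constraint:

    import numpy as np
    from sklearn.cluster import MeanShift, estimate_bandwidth

    X = np.array([(x, y) for x, y, _ in points])

    # the bandwidth sets the kernel size; estimate_bandwidth guesses a
    # starting value from the data (quantile=0.2 is an arbitrary choice)
    bandwidth = estimate_bandwidth(X, quantile=0.2)
    labels = MeanShift(bandwidth=bandwidth).fit_predict(X)
    print(len(set(labels)), "clusters found")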
It sounds to me like you're looking for the K-means algorithm.
As I mentioned in the comment to your question, the answer depends on whether or not mass can be considered scalar in this context. If so, color-based solutions are probably not going to work, as color is often not taken as being scalar.
For example, if I have a given area with 1 point of high mass, is that the same as having the same area with 10 points of 1/10 the mass? If this is true, mass is not scalar in this context, and I would tend to look at an algorithm used for spatially grouping similar non-scalable values, e.g. Voronoi diagrams.
In this case, where two adjacent Voronoi areas have a close enough match in mass and distance, they can be clustered together. You could repeat this to find all clusters.
If, on the other hand, your mass is scalable, or the mass at an unknown position can be interpolated from surrounding points, I would tend to triangulate and contour the input data, and use the areas between contours to find clusters of similar mass.
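Here's a rough sketch of the Voronoi-merging idea, assuming SciPy is available; mass_tol and dist_tol are hypothetical thresholds standing in for "close enough mass match and distance":

    import numpy as np
    from scipy.spatial import Voronoi

    def voronoi_merge(points, mass_tol=0.5, dist_tol=2.0):
        # points: list of (x, y, mass) tuples
        xy = np.array([(x, y) for x, y, _ in points])
        masses = [m for _, _, m in points]
        vor = Voronoi(xy)

        # union-find over the points
        parent = list(range(len(points)))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        # vor.ridge_points pairs up points whose Voronoi cells share an edge
        for i, j in vor.ridge_points:
            if (abs(masses[i] - masses[j]) <= mass_tol and
                    np.linalg.norm(xy[i] - xy[j]) <= dist_tol):
                parent[find(i)] = find(j)

        clusters = {}
        for i in range(len(points)):
            clusters.setdefault(find(i), []).append(points[i])
        return list(clusters.values())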
This sounds like color quantization, where you reduce the number of colors in an image. One way would be to plot the colors in space, and collapse each cluster to its center (or a weighted average).
The exact name of the algorithm that triggered this memory escapes me; I'll edit the answer if it comes back to me. In the meantime, you should look at color quantization and see if some of those algorithms are useful.
Start with the "Convex Hull" problem. You're also looking for some "convex hull"-like clusters.
Note that "clusters" is vague. You have an average mass across your field. Some points have above average mass, and some below average. How far above average means you've found a cluster? How far apart do nodes have to be to be part of a cluster or a separate cluster?
What's the difference between two mountain peaks and a ridge?
You have to compute a "topography" - joining all points with equal density into regions. This requires that you pick a spot and work your way out from it radially, locating positions where the densities are equal. You can connect those points into regions.
If you picked your initial point wisely, the regions should nest. Picking your starting point is easy because you start at local highs.
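One way to sketch this on a discrete grid (assuming the masses live at integer (x, y) positions, as in the question's data): pick a density level, keep the points above it, and flood-fill the 4-connected regions. Repeating at several levels gives the nesting described above.

    from collections import deque

    def regions_above(points, level):
        # points: list of (x, y, mass); level: hypothetical density cutoff
        grid = {(x, y): m for x, y, m in points}
        above = {p for p, m in grid.items() if m >= level}
        seen, regions = set(), []
        for start in above:
            if start in seen:
                continue
            region, queue = [], deque([start])
            seen.add(start)
            while queue:
                x, y = queue.popleft()
                region.append((x, y))
                for n in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if n in above and n not in seen:
                        seen.add(n)
                        queue.append(n)
            regions.append(region)
        return regions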
Since you are already talking about mass, why not a gravity-based solution? A simple particle system would not need to be super accurate, and you would not have to run it for too long before you could make a much better guess at the number of clusters.
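A rough sketch of that particle system, with made-up constants (dt, merge_dist, and the inverse-square law are all arbitrary illustration choices): each point drifts toward the combined pull of the others, and points that drift together coalesce. Iterate until the count stops changing; the survivors approximate your clusters.

    import math

    def gravity_step(points, dt=0.1, merge_dist=0.5):
        # points: list of [x, y, mass] lists
        moved = []
        for i, (x, y, m) in enumerate(points):
            fx = fy = 0.0
            for j, (ox, oy, om) in enumerate(points):
                if i == j:
                    continue
                dx, dy = ox - x, oy - y
                d2 = dx * dx + dy * dy + 1e-9  # epsilon avoids divide-by-zero
                d = math.sqrt(d2)
                f = om / d2                    # inverse-square attraction
                fx += f * dx / d
                fy += f * dy / d
            moved.append([x + dt * fx, y + dt * fy, m])
        # coalesce particles that have drifted within merge_dist of each other
        merged = []
        for p in moved:
            for q in merged:
                if math.hypot(p[0] - q[0], p[1] - q[1]) < merge_dist:
                    total = p[2] + q[2]
                    q[0] = (q[0] * q[2] + p[0] * p[2]) / total
                    q[1] = (q[1] * q[2] + p[1] * p[2]) / total
                    q[2] = total
                    break
            else:
                merged.append(p)
        return merged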
If you have a better idea of the number of clusters, k-means clustering becomes feasible.