聚类问题

发布于 2024-09-26 15:58:09 字数 171 浏览 6 评论 0原文

我的任务是找到包含特定数据集最多点的 N 个聚类,前提是这些聚类受特定大小的限制。目前,我正在尝试通过将数据插入 kd 树、迭代数据并找到其最近邻居来实现此目的,然后在它们生成的簇不超过限制的情况下合并这些点。我不确定这种方法会给我一个全局解决方案,所以我正在寻找调整它的方法。如果您能告诉我这会属于什么类型的问题,那就太好了。

I've been tasked to find N clusters containing the most points for a certain data set given that the clusters are bounded by a certain size. Currently, I am attempting to do this by plugging in my data into a kd-tree, iterating over the data and finding its nearest neighbor, and then merging the points if the cluster they make does not exceed a limit. I'm not sure this approach will give me a global solution so I'm looking for ways to tweak it. If you can tell me what type of problem this would go under, that'd be great too.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

嘿看小鸭子会跑 2024-10-03 15:58:09

首先查看 scipy.clustering。关键词搜索可以提供有关所使用的不同算法的大量信息。集群是一个很大的领域,有大量的研究和实际应用,并且有许多简单的方法已被发现效果相当好,因此您可能不想从自己开始。

也就是说,聚类算法通常相当容易编程,如果您确实想自己编程,k 均值和凝聚聚类是最喜欢的快速实现方法。

最后,我不确定你关于由一定大小限制的 N 个簇的想法是否是自洽的,但这取决于你所说的“大小”和“簇”的确切含义(单个点是一个簇吗?) 。

更新:

根据下面OP的评论,我认为标准聚类方法不会给出这个问题的最佳解决方案,因为没有可以优化的点之间“距离”的连续度量。尽管在某些情况下它们可能会给出很好的解决方案或近似值。对于聚类方法,我会尝试 k-means,因为该方法的前提是具有固定的 N。

但是,这看起来更像是 覆盖问题,你有 N 个固定大小的矩形,并且你试图用它们覆盖所有的点),但我不'对这些不太了解,所以我将其留给其他人。

Check out scipy.clustering for a start. Key word searches can then give a lot of info on the different algorithms that are used there. Clustering is a big field, with a lot of research and practical applications, and a number of simple approaches that have been found to work fairly well, so you may not want to start by rolling your own.

This said, clustering algorithms are generally fairly easy to program, and if you do want to program your own, k-means and agglomerative clustering are some of the favorites that are quick to do.

Finally, I'm not sure that your idea of exactly N clusters that are bounded by a certain size is self-consistent, but it depends on exactly what you mean by "size" and "cluster" (are single points a cluster?).

Update:

Following the OP's comments below, I think that the standard clustering methods won't give an optimal solution to this problem because there's not a continuous metric for the "distance" between points that can be optimized. Although they may give a good solution or approximation in some cases. For a clustering approach I'd try k-means since the premise of this method is having a fixed N.

But instead of clustering, this seems more like a covering problem (i.e., you have N rectangles of fixed size and you're trying to cover all of the points with them), but I don't know much about these, so I'll leave it to someone else.

甜尕妞 2024-10-03 15:58:09

如果您的簇数是固定的,并且您只想最大化这些簇中的点数,那么我认为贪婪的解决方案会很好:

  • 找到可以包含最大点数的矩形,
  • 删除这些点,
  • 找到下一个矩形
  • ...

那么如何找到包含最多点数的最大面积A的矩形(实际上每个矩形都会有这个面积)?

矩形对于欧几里德距离来说并不常见,在尝试解决这个问题之前,您能否精确确定您是否真的需要矩形或只是对簇大小进行一些限制?圆形/椭圆形可以吗?

编辑
贪婪是行不通的(见下面的评论),它确实需要是矩形......

If your number of clusters is fixed and you only want to maximize the number of points that are in these clusters then I think a greedy solution would be good :

  • find the rectangle that can contains the maximum number of points,
  • remove these points,
  • find the next rectangle
  • ...

So how to find the rectangle of maximum area A (in fact each rectangle will have this area) that contains the maximum number of points ?

A rectangle is not really common for euclidean distance, before trying to solve this, could you precise if you really need rectangle or just some king of limit on the cluster size ? Would a circle/ellipse work ?

EDIT :
greedy will not work (see comment below) and it really need to be rectangles...

夢归不見 2024-10-03 15:58:09

链接文本实际上,我认为这是有两个关键假设确实很容易。

1) 假设“一定大小”,我们可以说“任何簇必须完全包含在半径为 r 的圆内”。

2) 您的所有点都是簇中心的候选“种子”点。

首先计算所有点之间所有小于r的距离。现在仅使用小于 r 的可行边来解决集合覆盖问题。如果任何点的最近邻距离大于 r 距离,则它会形成自己的簇。

link textActually, I think this is really pretty easy with two key assumptions.

1) Assume the by "a certain size" we can say "any cluster must be contained completely within a circle with radius, r".

2) All your points are candidate "seed" points at the center of the cluster.

First calculate all the distances less than r among all points. Now solve a set covering problem using only the feasible edges that are less than r. If any point has a nearest neighbor greater than r distance away, it forms its own cluster.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文