Clustering 2D integer coordinates into sets of at most N points
I have a number of points on a relatively small 2-dimensional grid, which wraps around in both dimensions. The coordinates can only be integers. I need to divide them into sets of at most N points that are close together, where N will be quite a small cut-off, I suspect 10 at most.
I'm designing an AI for a game, and I'm 99% certain using minimax on all the game pieces will give me a usable lookahead of about 1 move, if that. However distant game pieces should be unable to affect each other until we're looking ahead by a large number of moves, so I want to partition the game into a number of sub-games of N pieces at a time. However, I need to ensure I select a reasonable N pieces at a time, i.e. ones that are close together.
I don't care whether outliers are left on their own or lumped in with their least-distant cluster. Breaking up natural clusters larger than N is inevitable, and only needs to be sort-of reasonable. Because this is used in a game AI with limited response time, I'm looking for as fast an algorithm as possible, and willing to trade off accuracy for performance.
Does anyone have any suggestions for algorithms to look at adapting? K-means and relatives don't seem appropriate, as I don't know how many clusters I want to find but I have a bound on how large clusters I want. I've seen some evidence that approximating a solution by snapping points to a grid can help some clustering algorithms, so I'm hoping the integer coordinates makes the problem easier. Hierarchical distance-based clustering will be easy to adapt to the wrap-around coordinates, as I just plug in a different distance function, and also relatively easy to cap the size of the clusters. Are there any other ideas I should be looking at?
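For the wrap-around coordinates mentioned above, the pluggable distance function is straightforward. Here is a minimal sketch, assuming a `WIDTH` x `HEIGHT` torus and Manhattan distance (which matches one-square-per-turn movement); on each axis the distance is the shorter of the direct route and the route that wraps around the edge:

```python
# Wrap-around (toroidal) Manhattan distance on an integer grid.
# WIDTH and HEIGHT are assumed grid dimensions for illustration.
WIDTH, HEIGHT = 20, 20

def torus_distance(a, b):
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    # On a torus, each axis distance is the minimum of going directly
    # or wrapping around the edge of the grid.
    return min(dx, WIDTH - dx) + min(dy, HEIGHT - dy)
```

Any distance-based clustering algorithm below can use this in place of the usual metric.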
I'm more interested in algorithms than libraries, though libraries with good documentation of how they work would be welcome.
EDIT: I originally asked this question when I was working on an entry for the Fall 2011 AI Challenge, which I sadly never got finished. The page I linked to has a reasonably short reasonably high-level description of the game.
The two key points are:
- Each player has a potentially large number of ants
- Every ant is given orders every turn, moving 1 square either north, south, east or west; this means the branching factor of the game is O(4^ants).
In the contest there were also strict time constraints on each bot's turn. I had thought to approach the game by using minimax (the turns are really simultaneous, but as a heuristic I thought it would be okay), but I feared there wouldn't be time to look ahead very many moves if I considered the whole game at once. But as each ant moves only one square each turn, two ants that are N spaces apart by the shortest route cannot possibly interfere with one another until we're looking ahead N/2 moves.
So the solution I was searching for was a good way to pick smaller groups of ants at a time and minimax each group separately. I had hoped this would allow me to search deeper into the move-tree without losing much accuracy. But obviously there's no point using a very expensive clustering algorithm as a time-saving heuristic!
I'm still interested in the answer to this question, though more in what I can learn from the techniques than for this particular contest, since it's over! Thanks for all the answers so far.
The median-cut algorithm is very simple to implement in 2D and would work well here. Your outliers would end up as groups of 1 which you could discard or whatever.
Further explanation requested:
Median cut is a quantization algorithm but all quantization algorithms are special case clustering algorithms. In this case the algorithm is extremely simple: find the smallest bounding box containing all points, split the box along its longest side (and shrink it to fit the points), repeat until the target amount of boxes is achieved.
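The recipe above can be sketched in a few lines of Python. This is a minimal version that ignores the wrap-around for simplicity, and splits until every box holds at most `max_size` points (the questioner's cap on cluster size) rather than stopping at a target number of boxes; either stopping rule is an easy swap:

```python
# Minimal median-cut sketch over 2-D integer points (wrap-around
# ignored for simplicity): repeatedly take a group that is too big,
# split it at the median along its longest bounding-box side, and
# stop once every group has at most max_size points.
def median_cut(points, max_size):
    groups = [list(points)]
    done = []
    while groups:
        g = groups.pop()
        if len(g) <= max_size:
            done.append(g)
            continue
        xs = [p[0] for p in g]
        ys = [p[1] for p in g]
        # Pick the axis with the larger bounding-box extent.
        axis = 0 if max(xs) - min(xs) >= max(ys) - min(ys) else 1
        g.sort(key=lambda p: p[axis])
        mid = len(g) // 2
        groups.append(g[:mid])
        groups.append(g[mid:])
    return done
```

Each split halves a group, so outliers fall out naturally as tiny groups that can be discarded or merged.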
A more detailed description and coded example
Wiki on color quantization has some good visuals and links
Since you are writing a game where (I assume) only a constant number of pieces move between each clustering, you can take advantage of an online algorithm to get constant update times.
The property of not locking yourself to a number of clusters is called Nonstationary, I believe.
This paper seems to have a good algorithm with both of the above two properties: Improving the Robustness of 'Online Agglomerative Clustering Method' Based on Kernel-Induce Distance Measures (You might be able to find it elsewhere as well).
Here is a nice video showing the algorithm in works:
Construct a graph G=(V, E) over your grid, and partition it.
Since you are interested in algorithms rather than libraries, here is a recent paper:
From the text:
So you will set U=10.
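I can only sketch the graph-construction step here, since the partitioning algorithm itself is the cited paper's contribution. This sketch assumes the ants' positions are the vertices and joins every pair with an edge weighted by the wrap-around Manhattan distance (`WIDTH`/`HEIGHT` are assumed grid dimensions):

```python
# Hedged sketch of the graph-construction step only: build a complete
# weighted graph over the piece positions, with wrap-around Manhattan
# distance as the edge weight. The partitioner would then be run on it.
from itertools import combinations

WIDTH, HEIGHT = 20, 20

def torus_distance(a, b):
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, WIDTH - dx) + min(dy, HEIGHT - dy)

def build_graph(points):
    vertices = list(points)
    # Each edge is (weight, u, v); a sparser graph (e.g. only edges
    # under some distance cutoff) would also work and scale better.
    edges = [(torus_distance(u, v), u, v)
             for u, v in combinations(vertices, 2)]
    return vertices, edges
```

The partitioner then takes this graph plus the size bound (U=10 here) as input.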
You can calculate a minimum spanning tree and remove the longest edges. Then you can calculate the k-means. Remove another long edge and calculate the k-means. Rinse and repeat until you have N=10. I believe this algorithm is named single-link k-clustering, and the clusters are similar to Voronoi diagrams:
"The single-link k-clustering algorithm ... is precisely Kruskal's algorithm ... equivalent to finding an MST and deleting the k-1 most expensive edges."
See for example here: https://stats.stackexchange.com/questions/1475/visualization-software-for-clustering
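A minimal sketch of the MST-based split via Kruskal's algorithm, using plain Manhattan distance for brevity (the wrap-around distance could be substituted): union components in order of increasing edge weight and stop when `k` components remain, which is equivalent to building the MST and deleting the k-1 most expensive edges:

```python
# Single-link k-clustering via Kruskal's algorithm: merge the closest
# components first and stop when exactly k components are left.
from itertools import combinations

def single_link_clusters(points, k):
    parent = {p: p for p in points}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    edges = sorted(
        (abs(a[0] - b[0]) + abs(a[1] - b[1]), a, b)
        for a, b in combinations(points, 2)
    )
    components = len(points)
    for _, a, b in edges:
        if components == k:
            break
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            components -= 1

    clusters = {}
    for p in points:
        clusters.setdefault(find(p), []).append(p)
    return list(clusters.values())
```

To respect the size cap instead of a fixed k, one could keep merging only while both components together hold at most N points.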
Consider the case where you only want two clusters. If you run k-means, then you will get two points, and the division between the two clusters is a plane orthogonal to the line between the centres of the two clusters. You can find out which cluster a point is in by projecting it down to the line and then comparing its position on the line with a threshold (e.g. take the dot product between the line and a vector from either of the two cluster centres and the point).
For two clusters, this means that you can adjust the sizes of the clusters by moving the threshold. You can sort the points on their distance along the line connecting the two cluster centres and then move the threshold along the line quite easily, trading off the inequality of the split with how neat the clusters are.
You probably don't have k=2, but you can run this hierarchically, by dividing into two clusters, and then sub-dividing the clusters.
(After comment)
I'm not good with pictures, but here is some relevant algebra.
With k-means we divide points according to their distance from cluster centres, so for a point Xi and two centres Ai and Bi we might be interested in
SUM_i (Xi - Ai)^2 - SUM_i(Xi - Bi)^2
This is SUM_i Ai^2 - SUM_i Bi^2 + 2 SUM_i (Bi - Ai)Xi
So a point gets assigned to either cluster depending on the sign of K + 2(B - A).X - a constant plus the dot product between the vector to the point and the vector joining the two cluster centres. In two dimensions, the dividing line between the points on the plane that end up in one cluster and the points on the plane that end up in the other cluster is a line perpendicular to the line between the two cluster centres. What I am suggesting is that, in order to control the number of points after your division, you compute (B - A).X for each point X and then choose a threshold that divides all points in one cluster from all points in the other cluster. This amounts to sliding the dividing line up or down the line between the two cluster centres, while keeping it perpendicular to the line between them.
Once you have dot products Yi, where Yi = SUM_j (Bj - Aj) Xij, a measure of how closely grouped a cluster is can be written as SUM_i (Yi - Ym)^2, where Ym is the mean of the Yi in the cluster. I am suggesting that you use the sum of these values for the two clusters to tell how good a split you have. To move a point into or out of a cluster and get the new sum of squares without recomputing everything from scratch, note that SUM_i (Si + T)^2 = SUM_i Si^2 + 2T SUM_i Si + T^2, so if you keep track of sums and sums of squares you can work out what happens to a sum of squares when you add or subtract a value from every component, since the mean of the cluster changes when you add a point to it or remove one from it.
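The projection-and-threshold step can be sketched as follows. This is a minimal version that simply cuts after `max_size` points, rather than searching the sorted projections for the threshold that best trades off split inequality against neatness as described above:

```python
# Sketch of the sliding-threshold split: project each point onto the
# direction (B - A) via the dot product Yi = (B - A) . Xi, sort by Yi,
# and cut so that at most max_size points land in the first group.
def split_by_threshold(points, a, b, max_size):
    d = (b[0] - a[0], b[1] - a[1])
    scored = sorted(points, key=lambda p: d[0] * p[0] + d[1] * p[1])
    return scored[:max_size], scored[max_size:]
```

The incremental sum-of-squares update described above would let you slide this cut point cheaply while evaluating the neatness of each candidate split.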