聚类问题到图论语言的翻译
我有一个矩形平面网格,每个单元格分配了一些整数权重。我正在寻找一种算法来识别具有高于平均权重的 3 到 6 个相邻细胞的簇。这些斑点应具有近似圆形的形状。
对于我的情况,不包含簇的单元格的平均权重约为 6,而包含簇的单元格的平均权重约为 6+4,即在 6 左右存在“背景权重”。权重随泊松统计量波动。
对于小背景,贪婪或种子算法表现得相当好,但如果我的簇单元的权重接近背景的波动,即它们倾向于找到一个簇,即使什么也没有,这种情况就会崩溃。另外,我无法对所有可能的设置进行循环搜索,因为我的网格很大(大约 1000x1000),而且我计划经常这样做(10^9 次)。我的印象是,在图论中可能存在解决这个问题的方法。我听说过顶点覆盖和派系,但我不确定如何最好地将我的问题翻译成他们的语言。我知道图论可能存在输入的统计性质问题,但我有兴趣看看那里的算法可以找到什么,即使它们无法识别每个簇。
这里是一个剪辑示例:框架区域每个单元格平均有 10 个条目,所有其他单元格平均有 6 个条目。当然,网格会进一步延伸。
| 8| 8| 2| 8| 2| 3|
| 6| 4| 3| 6| 4| 4|
===========
| 8| 3||13| 7| 11|| 7|
|10| 4||10| 12| 3|| 2|
| 5| 6||11| 6| 8||12|
===========
| 9| 4| 0| 2| 8| 7|
I have a rectangular planar grid, with each cell assigned some integer weight. I am looking for an algorithm to identify clusters of 3 to 6 adjacent cells with higher-than-average weight. These blobs should have approximately circular shape.
For my case the average weight of the cells not containing a cluster is around 6, and that for cells containing a cluster is around 6+4, i.e. there is a "background weight" somewhere around 6. The weights fluctuate with a Poisson statistic.
For small background greedy or seeded algorithms perform pretty well, but this breaks down if my cluster cells have weights close to fluctuations in the background i.e. they will tend to find a cluster even though there is nothing. Also, I cannot do a brute-force search looping through all possible setups because my grid is large (something like 1000x1000) and I plan to do this very often (10^9 times). I have the impression there might exist ways to tackle this in graph theory. I heard of vertex-covers and cliques, but am not sure how to best translate my problem into their language. I know that graph theory might have issues with the statistical nature of the input, but I would be interest to see what algorithms from there could find, even if they cannot identify every cluster.
Here an example clipping: the framed region has on average 10 entries per cell, all other cells have on average 6. Of course the grid extends further.
| 8| 8| 2| 8| 2| 3|
| 6| 4| 3| 6| 4| 4|
===========
| 8| 3||13| 7| 11|| 7|
|10| 4||10| 12| 3|| 2|
| 5| 6||11| 6| 8||12|
===========
| 9| 4| 0| 2| 8| 7|
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
对于图论解决方案,维基百科中有几句话,但你是可能最好的帖子是在 MathOverflow 上。 这个问题可能也有用。
解决这些问题的传统计算方法(考虑到其普遍性可能是最好的)是栅格分析 - 在 GIS 和遥感领域众所周知,因此有许多工具可以提供实现。用于查找最适合您的问题的关键词是栅格、最近邻、重采样和聚类。 GDAL 库通常是其他工具的基础。
例如 http://clusterville.org/spatialtools/index.html
您可以尝试查看GDAL 库和源代码,看看您是否可以在您的情况下使用它,或者看看它是如何实现的。
要检查圆形形状,您可以将剩余值转换为多边形并检查结果特征
http://www.gdal .org/gdal_polygonize.html
For graph theory solutions there are a couple of sentences in wikipedia, but you are probably best posting on MathOverflow. This question might also be useful.
The traditional method (and probably best considering its ubiquity) in computing to solve these is raster analysis - well known in the world of GIS and Remote Sensing, so there are a number of tools that provide implementations. Keywords to use to find the one most suited to your problem would be raster, nearest neighbor, resampling, and clustering. The GDAL library is often the basis for other tools.
E.g. http://clusterville.org/spatialtools/index.html
You could try checking out the GDAL library and source code to see if you can use it in your situation, or to see how it is implemented.
For checking for circular shapes you could convert remaing values to polygons and check the resultant feature
http://www.gdal.org/gdal_polygonize.html
我不确定我是否看到了图论类比,但您可以通过预先计算面积积分来加快速度。这感觉像是一个多尺度的事情。
那么矩形(a,b),(c,d)区域的平均亮度为
如果你有很大的数据,溢出可能不是你的朋友细胞中的数字。
I'm not sure I see a graph theory analogy, but you can speed things up by pre-computing an area integral. This feels like a multi-scale thing.
Then the average brightness of a rectangular (a,b), (c,d) region is
Overflow is probably not your friend if you have big numbers in your cells.
使用并查算法进行聚类?速度非常快。
我猜想该图是通过考虑每对相邻的高值单元格连接而得出的。使用并查找算法查找所有簇,并接受所有超过特定大小的簇,可能也具有形状约束(例如,基于距簇中心的平均平方距离与簇大小)。这是并查找算法的一个微不足道的变体,用于收集您需要的统计信息(计数、x 之和、x^2 之和、y 之和、y^2 之和)。
Use the union-find algorithm for clustering? It's very fast.
I guess the graph would result from considering each pair of neighboring high-valued cells connected. Use the union-find algorithm to find all clusters, and accept all those above a certain size, perhaps with shape constraints too (eg based on average squared distance from cluster center vs cluster size). It's a trivial variation on the union-find algorithm to collect statistics that you would need for this as you go (count, sum of x, sum of x^2, sum of y, sum of y^2).
如果您只是在寻找一种将问题转化为图形问题的方法,那么您可以这样做。
从每个点观察您的所有邻居(这可能是 8 个相邻的方格或 4 个相邻的方格,具体取决于您想要的内容)。寻找具有最大值的邻居,如果它比您大,则将自己连接到该正方形。如果它更小,则不执行任何操作。
之后你将拥有一片森林或可能有一棵树(尽管我认为这不太可能)。这是矩阵到图形的一种可能的转换。我不确定这是否是最有用的翻译。
If you are just looking for a way to translate your problem into a graph problem heres what you could do.
From each point look at all your neighbors (this could be the 8 adjacent squares or the 4 adjacent depending on what you want). Look for the neighbor with the max value, if it is larger than you then connect yourself to this square. if it is smaller then do nothing.
after this you will have a forest or possibly a tree (though i imagine that that is unlikely). This is one possible translation of your matrix to a graph. Im not sure if it is the most useful translation.