How to best do server-side geo-clustering?
I want to do pre-clustering for a set of approx. 500,000 points.
I haven't started yet but this is what I had thought I would do:
- store all points in a localSOLR index
- determine "natural cluster positions" according to some administrative information (big cities for example)
- and then calculate a cluster for each city:
- for each city
  - for each zoom level
    - query the index to get the points contained in a radius around the city (the length of the radius depends on the zoom level)
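The loop above can be sketched as follows; a plain haversine scan stands in for the localSOLR radius query, and the radius schedule (`5000 / 2**zoom` km) is an illustrative assumption, not part of the original plan:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def cluster_around_cities(points, cities, zoom_levels):
    """For each city and zoom level, count the points within a
    zoom-dependent radius (a linear scan standing in for the SOLR query)."""
    clusters = {}
    for city_name, (clat, clon) in cities.items():
        for zoom in zoom_levels:
            radius_km = 5000 / (2 ** zoom)  # assumed radius schedule
            members = [p for p in points
                       if haversine_km(clat, clon, p[0], p[1]) <= radius_km]
            clusters[(city_name, zoom)] = len(members)
    return clusters
```

With a spatial index the inner scan becomes a single radius query per (city, zoom) pair, which is what makes the plan look cheap at first.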
This should be quite efficient because there are only 100 major cities and SOLR queries are very fast. But a little more thinking revealed it was wrong:
1. there may be clusters of points that are more "near" one another than near a city: they should get their own cluster
2. at some zoom levels, some points will not be within the acceptable distance of any city, and so they will not be counted
3. some cities are near one another and therefore, some points will be counted twice (added to both clusters)
There are other approaches:
- examine each point and determine to which cluster it belongs; this eliminates problems 2 and 3 above, but not 1, and is also extremely inefficient
- make a (rectangular) grid (for each zoom level); this works but will result in crazy / arbitrary clusters that don't "mean" anything
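The grid approach is trivial to sketch, which is part of its appeal; the cell-size schedule here is an arbitrary assumption, and nearby points that straddle a cell border end up in separate clusters, which is exactly the "arbitrary clusters" objection:

```python
from collections import defaultdict

def grid_clusters(points, zoom):
    """Bucket (lat, lon) points into a rectangular grid whose cell size
    halves with each zoom level; each non-empty cell becomes one cluster
    positioned at the centroid of its members."""
    cell_deg = 45.0 / (2 ** zoom)  # assumed cell-size schedule
    cells = defaultdict(list)
    for lat, lon in points:
        key = (int(lat // cell_deg), int(lon // cell_deg))
        cells[key].append((lat, lon))
    clusters = []
    for members in cells.values():
        n = len(members)
        clat = sum(p[0] for p in members) / n
        clon = sum(p[1] for p in members) / n
        clusters.append((clat, clon, n))
    return clusters
```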
I guess I'm looking for a general purpose geo-clustering algorithm (or idea) and can't seem to find any.
Edit to answer comment from Geert-Jan
I'd like to build "natural" clusters, yes, and yes I'm afraid that if I use an arbitrary grid, it will not reflect the reality of the data. For example if there are many events that occur around a point that is at or near the intersection of two rectangles, I should get just one cluster but will in fact build two (one in each rectangle).
Originally I wanted to use localSOLR for performance reasons (and because I know it, and have better experience indexing a lot of data into SOLR than loading it in a conventional database); but since we're talking of pre-clustering, maybe performance is not that important (although it should not take days to visualize a result of a new clustering experiment). My first approach of querying lots of points according to a predefined set of "big points" is clearly flawed anyway, the first reason I mentioned being the strongest: clusters should reflect the reality of the data, and not some other bureaucratic definition (they will clearly overlap, sure, but data should come first).
There is a great clusterer for live clustering that has been added to the core Google Maps API: Marker Clusterer. I wonder if anyone has tried to run it "offline": run it for whatever amount of time it needs, and then store the results?
Or is there a clusterer that examines each point, point after point, and outputs clusters with their coordinates and number of points included, and which does this in a reasonable amount of time?
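A clusterer of exactly that shape can be sketched as a single greedy pass: each point either joins the first existing cluster whose running-mean center is close enough, or seeds a new one. This is a simplification (Euclidean distance in degrees, and an O(n·k) inner scan that would need a spatial index at 500,000 points), not Marker Clusterer's actual algorithm:

```python
def greedy_cluster(points, max_dist_deg):
    """One pass over the points: each point joins the first cluster whose
    center lies within max_dist_deg (Euclidean in degrees, for brevity),
    otherwise it seeds a new cluster. Centers are running means."""
    clusters = []  # each entry: [lat_sum, lon_sum, count]
    for lat, lon in points:
        for c in clusters:
            clat, clon = c[0] / c[2], c[1] / c[2]
            if (clat - lat) ** 2 + (clon - lon) ** 2 <= max_dist_deg ** 2:
                c[0] += lat
                c[1] += lon
                c[2] += 1
                break
        else:
            clusters.append([lat, lon, 1])
    # emit (center_lat, center_lon, point_count) per cluster
    return [(c[0] / c[2], c[1] / c[2], c[2]) for c in clusters]
```

Note that the result depends on point order, so for a stable pre-clustering the input would need a deterministic ordering.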
You may want to look into advanced clustering algorithms such as OPTICS.
With a good database index, it should be fairly fast.
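OPTICS itself orders points by reachability distance; a minimal sketch of the simpler fixed-radius special case it generalizes, DBSCAN, shows the density-based idea that would give "natural" clusters. The `eps` and `min_pts` values are illustrative, and the O(n²) neighbor scan is precisely what a good spatial index would replace:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: points density-reachable from a core point (one
    with >= min_pts neighbors within eps) share a cluster id; sparse
    points are labeled -1 as noise. Euclidean distance, O(n^2) scan."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2]

    cluster_id = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # noise (may later become a border point)
            continue
        cluster_id += 1
        labels[i] = cluster_id
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # border point: absorb, don't expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            jn = neighbors(j)
            if len(jn) >= min_pts:  # j is a core point: keep expanding
                seeds.extend(jn)
    return labels
```

Unlike a grid, this merges dense groups regardless of where cell borders would fall, and it never assigns a point to two clusters, which addresses all three problems listed in the question.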