How to best do server-side geo-clustering?
I want to do pre-clustering for a set of approx. 500,000 points.
I haven't started yet but this is what I had thought I would do:
- store all points in a localSOLR index
- determine "natural cluster positions" according to some administrative information (big cities for example)
- and then calculate a cluster for each city:
- for each city
  - for each zoom level
    - query the index to get the points contained in a radius around the city (the length of the radius depends on the zoom level)
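The loop above can be sketched as follows; a plain haversine scan stands in for the localSOLR radius query, and the radius schedule (`5000 / 2**zoom` km) is an illustrative assumption, not part of the original plan:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def cluster_around_cities(points, cities, zoom_levels):
    """For each city and zoom level, count the points within a
    zoom-dependent radius (a linear scan standing in for the SOLR query)."""
    clusters = {}
    for city_name, (clat, clon) in cities.items():
        for zoom in zoom_levels:
            radius_km = 5000 / (2 ** zoom)  # assumed radius schedule
            members = [p for p in points
                       if haversine_km(clat, clon, p[0], p[1]) <= radius_km]
            clusters[(city_name, zoom)] = len(members)
    return clusters
```

With a spatial index the inner scan becomes a single radius query per (city, zoom) pair, which is what makes the plan look cheap at first.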
This should be quite efficient because there are only 100 major cities and SOLR queries are very fast. But a little more thinking revealed it was wrong:
1. there may be clusters of points that are more "near" one another than near a city: they should get their own cluster
2. at some zoom levels, some points will not be within the acceptable distance of any city, and so they will not be counted
3. some cities are near one another and therefore, some points will be counted twice (added to both clusters)
There are other approaches:
- examine each point and determine to which cluster it belongs; this eliminates problems 2 and 3 above, but not 1, and is also extremely inefficient
- make a (rectangular) grid (for each zoom level); this works but will result in crazy / arbitrary clusters that don't "mean" anything
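The grid approach is trivial to sketch, which is part of its appeal; the cell-size schedule here is an arbitrary assumption, and nearby points that straddle a cell border end up in separate clusters, which is exactly the "arbitrary clusters" objection:

```python
from collections import defaultdict

def grid_clusters(points, zoom):
    """Bucket (lat, lon) points into a rectangular grid whose cell size
    halves with each zoom level; each non-empty cell becomes one cluster
    positioned at the centroid of its members."""
    cell_deg = 45.0 / (2 ** zoom)  # assumed cell-size schedule
    cells = defaultdict(list)
    for lat, lon in points:
        key = (int(lat // cell_deg), int(lon // cell_deg))
        cells[key].append((lat, lon))
    clusters = []
    for members in cells.values():
        n = len(members)
        clat = sum(p[0] for p in members) / n
        clon = sum(p[1] for p in members) / n
        clusters.append((clat, clon, n))
    return clusters
```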
I guess I'm looking for a general purpose geo-clustering algorithm (or idea) and can't seem to find any.
Edit to answer comment from Geert-Jan
I'd like to build "natural" clusters, yes, and yes I'm afraid that if I use an arbitrary grid, it will not reflect the reality of the data. For example if there are many events that occur around a point that is at or near the intersection of two rectangles, I should get just one cluster but will in fact build two (one in each rectangle).
Originally I wanted to use localSOLR for performance reasons (and because I know it, and have better experience indexing a lot of data into SOLR than loading it in a conventional database); but since we're talking of pre-clustering, maybe performance is not that important (although it should not take days to visualize a result of a new clustering experiment). My first approach of querying lots of points according to a predefined set of "big points" is clearly flawed anyway, the first reason I mentioned being the strongest: clusters should reflect the reality of the data, and not some other bureaucratic definition (they will clearly overlap, sure, but data should come first).
There is a great clusterer for live clustering that has been added to the core Google Maps API: Marker Clusterer. I wonder if anyone has tried to run it "offline": run it for whatever amount of time it needs, and then store the results?
Or is there a clusterer that examines each point, point after point, and outputs clusters with their coordinates and number of points included, and which does this in a reasonable amount of time?
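A clusterer of exactly that shape can be sketched as a single greedy pass: each point either joins the first existing cluster whose running-mean center is close enough, or seeds a new one. This is a simplification (Euclidean distance in degrees, and an O(n·k) inner scan that would need a spatial index at 500,000 points), not Marker Clusterer's actual algorithm:

```python
def greedy_cluster(points, max_dist_deg):
    """One pass over the points: each point joins the first cluster whose
    center lies within max_dist_deg (Euclidean in degrees, for brevity),
    otherwise it seeds a new cluster. Centers are running means."""
    clusters = []  # each entry: [lat_sum, lon_sum, count]
    for lat, lon in points:
        for c in clusters:
            clat, clon = c[0] / c[2], c[1] / c[2]
            if (clat - lat) ** 2 + (clon - lon) ** 2 <= max_dist_deg ** 2:
                c[0] += lat
                c[1] += lon
                c[2] += 1
                break
        else:
            clusters.append([lat, lon, 1])
    # emit (center_lat, center_lon, point_count) per cluster
    return [(c[0] / c[2], c[1] / c[2], c[2]) for c in clusters]
```

Note that the result depends on point order, so for a stable pre-clustering the input would need a deterministic ordering.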
You may want to look into advanced clustering algorithms such as OPTICS.
With a good database index, it should be fairly fast.
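OPTICS itself orders points by reachability distance; a minimal sketch of the simpler fixed-radius special case it generalizes, DBSCAN, shows the density-based idea that would give "natural" clusters. The `eps` and `min_pts` values are illustrative, and the O(n²) neighbor scan is precisely what a good spatial index would replace:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: points density-reachable from a core point (one
    with >= min_pts neighbors within eps) share a cluster id; sparse
    points are labeled -1 as noise. Euclidean distance, O(n^2) scan."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2]

    cluster_id = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # noise (may later become a border point)
            continue
        cluster_id += 1
        labels[i] = cluster_id
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # border point: absorb, don't expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            jn = neighbors(j)
            if len(jn) >= min_pts:  # j is a core point: keep expanding
                seeds.extend(jn)
    return labels
```

Unlike a grid, this merges dense groups regardless of where cell borders would fall, and it never assigns a point to two clusters, which addresses all three problems listed in the question.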