为什么我在 SQL Server 中的空间搜索比 PostGIS 慢?

发布于 2024-09-14 06:19:20 字数 2755 浏览 1 评论 0原文

我正在努力将一些空间搜索功能从带有 PostGIS 的 Postgres 迁移到 SQL Server,并且我看到一些非常糟糕的性能,即使使用索引也是如此。

我的数据大约有一百万个点,我想找出哪些点在给定的形状内,所以查询看起来像这样:

DECLARE @Shape GEOMETRY = ...
SELECT * FROM PointsTable WHERE Point.STWithin(@Shape) = 1

如果我选择一个相当小的形状,我有时可以获得亚秒级的时间,但是如果我的体型相当大(有时确实如此),我可以完成超过 5 分钟的时间。如果我在 Postgres 中运行相同的搜索,它们总是低于一秒(事实上,几乎所有搜索都低于 200 毫秒)。

我在索引上尝试了几种不同的网格大小(全高、全中、全低),每个对象使用不同的单元格(16、64、256),无论我做什么,时间都保持相当恒定。我想尝试更多的组合,但我什至不知道该往哪个方向走。每个对象有更多单元格?较少的?网格大小的一些奇怪组合?

我查看了我的查询计划,他们总是使用索引,但它根本没有帮助。我什至尝试过不使用索引,结果也没有差多少。

有人可以就此提供任何建议吗?我能找到的一切都表明“我们不能给你任何关于索引的建议,只要尝试一切,也许其中一个会起作用”,但由于创建索引需要 10 分钟,盲目地这样做是一种巨大的时间浪费。

编辑: 我还在 a Microsoft 上发布了此内容论坛。以下是他们要求的一些信息:

我可以获得的最佳工作索引是这个:

CREATE SPATIAL INDEX MapTesting_Location_Medium_Medium_Medium_Medium_16_NDX
    ON MapTesting (Location)
 USING GEOMETRY_GRID
  WITH (
    BOUNDING_BOX = ( -- The extent of our data, data is clustered in cities, but this is about as small as the index can be without missing thousands of points
        XMIN = -12135832,
        YMIN = 4433884,
        XMAX = -11296439,
        YMAX = 5443645),
    GRIDS = (
        LEVEL_1 = MEDIUM,
        LEVEL_2 = MEDIUM,
        LEVEL_3 = MEDIUM,
        LEVEL_4 = MEDIUM),
     CELLS_PER_OBJECT = 256 -- This was set to 16 but it was much slower
  )

我在使用索引时遇到了一些问题,但这是不同的。

对于这些测试,我运行了一个测试搜索(在我的原始帖子中列出的搜索),其中每个索引都带有一个WITH(INDEX(...)) 子句(测试网格大小和每个对象的单元格的各种设置),而一个没有任何提示。我还使用每个索引和相同的搜索形状运行 sp_help_spatial_geometry_index 。上面列出的索引运行速度最快,并且在 sp_help_spatial_geometry_index 中被列为最高效。

运行搜索时,我得到了这些统计数据:

(1 row(s) affected)
Table 'MapTesting'. Scan count 0, logical reads 361142, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'extended_index_592590491_384009'. Scan count 1827, logical reads 8041, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 6735 ms,  elapsed time = 13499 ms.

我还尝试使用随机点作为数据(因为我无法给出我们的真实数据),但事实证明,使用随机数据进行搜索确实很快。这让我们相信我们的问题是网格系统如何处理我们的数据。

我们的数据是整个州的地址,因此有一些非常高密度的区域,但大部分是稀疏数据。我认为问题在于网格大小的设置对两者都不起作用。将网格设置为 HIGH 时,索引会在低密度区域返回太多单元格,而将网格设置为 LOW 时,网格在高密度区域中无用(在 < code>MEDIUM,它没有那么糟糕,但仍然不擅长)。

我能够使用索引,但这没有帮助。每个测试都是在打开“显示实际执行计划”的情况下运行的,并且它始终显示索引。

I'm working on moving some spatial searching capabilities from Postgres with PostGIS to SQL Server and I'm seeing some pretty terrible performance, even with indexes.

My data is around a million points, and I want to find out which of those points are within given shapes, so the query looks something like this:

DECLARE @Shape GEOMETRY = ...
SELECT * FROM PointsTable WHERE Point.STWithin(@Shape) = 1

If I select a fairly small shape, I can sometimes get sub-second times, but if my shape is fairly large (which they sometimes are), I can get times over 5 minutes. If I run the same searches in Postgres, they're always under a second (in fact, almost all are under 200 ms).

I've tried several different grid sizes on my indexes (all high, all medium, all low), different cells per object (16, 64, 256), and no matter what I do the times stay fairly constant. I'd like to try more combinations but I don't even know what direction to go. More cells per object? Less? Some strange combination of grid sizes?

I've looked at my query plans and they're always using the index, it's just not helping at all. I've even tried without the index, and it's not much worse.

Is there any advice anyone can give on this? Everything I can find suggests "we can't give you any advice on indexes, just try everything and maybe one will work", but with it taking 10 minutes to create an index, doing this blindly is a massive waste of time.

EDIT:
I also posted this on a Microsoft forum. Here's some information they asked for on there:

The best working index I could get was this one:

CREATE SPATIAL INDEX MapTesting_Location_Medium_Medium_Medium_Medium_16_NDX
    ON MapTesting (Location)
 USING GEOMETRY_GRID
  WITH (
    BOUNDING_BOX = ( -- The extent of our data, data is clustered in cities, but this is about as small as the index can be without missing thousands of points
        XMIN = -12135832,
        YMIN = 4433884,
        XMAX = -11296439,
        YMAX = 5443645),
    GRIDS = (
        LEVEL_1 = MEDIUM,
        LEVEL_2 = MEDIUM,
        LEVEL_3 = MEDIUM,
        LEVEL_4 = MEDIUM),
     CELLS_PER_OBJECT = 256 -- This was set to 16 but it was much slower
  )

I had some issues getting the index used, but this is different.

For these tests I ran a test search (the one listed in my original post) with a WITH(INDEX(...)) clause for each of my indexes (testing various settings for grid size and cells per object), and one without any hint. I also ran sp_help_spatial_geometry_index using each index and the same search shape. The index listed above ran fastest and also was listed as most efficient in sp_help_spatial_geometry_index.

When running the search I get these statistics:

(1 row(s) affected)
Table 'MapTesting'. Scan count 0, logical reads 361142, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'extended_index_592590491_384009'. Scan count 1827, logical reads 8041, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 6735 ms,  elapsed time = 13499 ms.

I also tried using random points as data (since I can't give out our real data), but it turns out that this search is really fast with random data. This lead us to believe that our problem is how the grid system works with our data.

Our data is addresses across the entire state, so there are a few very high density regions, but mostly sparse data. I think the problem is that no setting for the grid sizes works well for both. With grids set to HIGH, the index returns too many cells in low-density areas, and with grids set to LOW, the grids are useless in high density areas (at MEDIUM, it's not as bad, but still not good at either).

I am able to get the index used, it's just not helping. Every test was run with "show actual execution plan" turned on, and it always shows the index.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(8

鸢与 2024-09-21 06:19:20

我刚刚花了一天时间解决类似的问题。特别是,我们正在进行多边形点类型的查询,其中存在相对较小的多边形集,但每个多边形都很大且复杂。

对于多边形表上的空间索引,解决方案如下:

  1. 使用“几何自动网格”而不是旧的 MMLL 等。这提供了 8 级索引而不是旧的 4 级,并且设置是自动的。并且...
  2. 将“每个对象的单元格”设置为 2000 或 4000。(考虑到默认值为 16,这不容易猜测!)

这产生了巨大的差异。它比默认配置中的空间索引快 10 倍,比根本没有索引快 60 倍。

I've just spent the day on a similar problem. In particular, we are doing a point-in-polygon type of query, where there was a relatively small set of polygons, but each polygon was large and complex.

Solution turned out to be as follows, for the spatial index on the polygon table:

  1. Use "geometry auto grid" instead of the old MMLL etc. This gives 8 levels of indexing instead of the old 4, and the settings are automatic. AND...
  2. Set 'cells per object' to 2000 or 4000. (Not an easy thing to guess, given that the default is 16!)

This made a huge difference. It was 10 times faster than a spatial index in the default configuration, and 60 times faster than no index at all.

潦草背影 2024-09-21 06:19:20

以下是有关 SQL-Server 空间扩展以及如何确保高效使用索引的一些说明:

,如果规划者在解析期间不知道实际的几何形状,则很难制定一个好的计划。作者建议插入 exec sp_executesql:

将:替换

-- does not use the spatial index without a hint
declare @latlonPoint geometry = geometry::Parse('POINT (45.518066 -122.767464)')
select a.id, a.shape.STAsText() 
from zipcodes a 
where a.shape.STIntersects(@latlonPoint)=1
go

为:

-- this does use the spatial index without using a hint
declare @latlonPoint geometry = geometry::Parse('POINT (45.518066 -122.767464)')
exec sp_executesql 
N'select a.id, a.shape.STAsText() 
from zipcodes a 
where a.shape.STIntersects(@latlonPoint)=1', N'@latlonPoint geometry', @latlonPoint
go

Here are some remarks about SQL-Server's spatial extensions and how to ensure that the index is efficiently used:

Apparently, the planner has difficulties to build a good plan if he does not know the actual geometry during parse time. The autor suggest to insert exec sp_executesql:

Replace:

-- does not use the spatial index without a hint
declare @latlonPoint geometry = geometry::Parse('POINT (45.518066 -122.767464)')
select a.id, a.shape.STAsText() 
from zipcodes a 
where a.shape.STIntersects(@latlonPoint)=1
go

with:

-- this does use the spatial index without using a hint
declare @latlonPoint geometry = geometry::Parse('POINT (45.518066 -122.767464)')
exec sp_executesql 
N'select a.id, a.shape.STAsText() 
from zipcodes a 
where a.shape.STIntersects(@latlonPoint)=1', N'@latlonPoint geometry', @latlonPoint
go
说不完的你爱 2024-09-21 06:19:20

我的直觉反应是“因为微软没有费心让它变得更快,因为它不是企业功能”。也许我很愤世嫉俗。

我也不知道你为什么要从 Postgres 迁移出去。

My gut reaction is "because Microsoft hasn't bothered to make it fast, because it's not an Enterprise Feature". Maybe I'm being cynical.

I'm not sure why you're migrating away from Postgres either.

翻了热茶 2024-09-21 06:19:20

我相信 STIntersects 对于使用索引进行了更好的优化,比 STWithin 具有更好的性能,特别是对于较大的形状。

I believe STIntersects is better optimized for using the index would have better performance than STWithin, especially for larger shapes.

酷炫老祖宗 2024-09-21 06:19:20

除了实现效率问题之外,SQL Server 使用 四叉树索引,而 PostGIS 使用 R-tree 索引。

在大多数情况下,R 树是更好的算法,特别是对于具有不同几何尺寸的大型数据集。

Implementation efficency issues aside SQL server uses Quadtree index whereas PostGIS uses R-tree index.

R-tree is for most cases the better algorithm, especially for large datasets with varying geometry size.

无人接听 2024-09-21 06:19:20

您是否正确设置了空间索引?你的边界框正确吗?所有点都在里面吗?在你的情况下,网格的 HHMM 可能效果最好(再次取决于鲍德盒)。

您可以尝试使用 sp_help_spatial_geometry_index 看看出了什么问题吗?
http://msdn.microsoft.com/en-us/library/cc627426。 aspx

尝试使用过滤器操作,并告诉我们您得到的性能数字是多少? (它仅执行主过滤器(使用索引),而不经过辅助过滤器(真正的空间操作))

您的设置有问题。空间确实是一个新功能,但也没有那么糟糕。

Have you set up your spatial index correctly? Is your bounding box correct? Are all points inside? In your case probably HHMM for GRIDS would work the best (depending again on a bouding box).

Can you try to use sp_help_spatial_geometry_index, to see what's wrong?
http://msdn.microsoft.com/en-us/library/cc627426.aspx

Try using Filter operation instead and tell us what perf numbers you get? (it executes only primary filter (use index) without going through secondary filter (true spatial operation))

Something is wrong with your setup. Spatial is indeed new feature but it's not that bad.

赠我空喜 2024-09-21 06:19:20

您可以尝试将其分解为两遍:

  1. 将候选者选择到带有 .Filter() 的临时表中。
  2. 使用 .STWithin() 查询候选者。

例如:(

SELECT * INTO #this FROM PointsTable WHERE Point.Filter(@Shape) = 1
SELECT * FROM #this WHERE Point.STWithin(@Shape) = 1

仅将 SELECT * 替换为您需要减少 I/O 的实际列)

这种微优化不是必需的,但我之前已经看到了不错的性能改进。此外,您还可以通过 (1) 与 (2) 的比率来衡量索引的选择性。

You might try breaking it down into two passes:

  1. select candidates into a temp table w/ .Filter().
  2. query candidates w/ .STWithin().

eg:

SELECT * INTO #this FROM PointsTable WHERE Point.Filter(@Shape) = 1
SELECT * FROM #this WHERE Point.STWithin(@Shape) = 1

(replacing SELECT * with only the actual columns you need to reduce I/O)

This kind of micro-optimization shouldn't be necessary, but I have seen decent performance improvements before. Also, you will be able to gauge how selective your index is by the ratio of (1) to (2).

你的笑 2024-09-21 06:19:20

我不熟悉空间查询,但这可能是参数化查询问题,

尝试使用固定值(使用参数化查询执行缓慢的值)编写查询(不使用参数)并运行它。将时间与参数化版本进行比较。如果它快得多,那么您的问题是参数化查询。

如果上面的速度要快得多,那么我将使用嵌入在字符串中的参数值动态构建您的 sql 字符串,这样您就可以删除导致问题的参数。

I'm not familiar with spatial queries, but it could be a parameterized query problem

try writing a query (without using parameters) with a fixed value (use a value that performs slow with the parameterized query) and run it. Compare the times with the parameterized version. If its much faster, then your problem is parameterized queries.

If the above is much faster, then I would dynamically build your sql string with the parameter values embedded in the string, that way you can remove parameters from causing problems.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文