在处理文本搜索和地理空间数据时如何影响 Postgres 查询分析器

发布于 2024-12-16 10:59:56 字数 2258 浏览 0 评论 0原文

我对以下声明有一个非常严重的性能问题,我自己无法修复。

给定情况

  • 我有一个安装了 Postgis 1.4 的 postgres 8.4 数据库,
  • 我有一个包含约 900 万个条目的地理空间表。该表有一个(postgis)几何列和一个 tsvector 列
  • 我在几何上有一个 GIST 索引,在 vname 列上有一个 VNAME 索引 表
  • 已被分析

我想执行 to_tsquery 在这些几何图形的子集中进行文本搜索,应该可以返回所有受影响的 ID。

要搜索的区域会将 900 万个数据集限制为大约 100.000 个,并且该区域内 ts_query 的结果集很可能会输出 0..1000 个条目。

问题

查询分析器决定首先对 vname 进行位图索引扫描,然后聚合并对几何图形(以及此语句中的其他条件)进行聚合并放置过滤器

查询分析器输出:

Aggregate  (cost=12.35..12.62 rows=1 width=510) (actual time=5.616..5.616 rows=1 loops=1)
->  Bitmap Heap Scan on mxgeom g  (cost=8.33..12.35 rows=1 width=510) (actual time=5.567..5.567 rows=0 loops=1)
     Recheck Cond: (vname @@ '''hemer'' & ''hauptstrasse'':*'::tsquery)
     Filter: (active AND (geom && '0107000020E6100000010000000103000000010000000B0000002AFFFF5FD15B1E404AE254774BA8494096FBFF3F4CC11E40F37563BAA9A74940490200206BEC1E40466F209648A949404DF6FF1F53311F400C9623C206B2494024EBFF1F4F711F404C87835954BD4940C00000B0E7CA1E4071551679E0BD4940AD02004038991E40D35CC68418BE49408EF9FF5F297C1E404F8CFFCB5BBB4940A600006015541E40FAE6468054B8494015040060A33E1E4032E568902DAE49402AFFFF5FD15B1E404AE254774BA84940'::geometry) AND (mandator_id = ANY ('{257,1}'::bigint[])))
     ->  Bitmap Index Scan on gis_vname_idx  (cost=0.00..8.33 rows=1 width=0) (actual time=5.566..5.566 rows=0 loops=1)
           Index Cond: (vname @@ '''hemer'' & ''hauptstrasse'':*'::tsquery)

这会导致大量的 I/O - AFAIK 首先限制几何图形,然后再进行 vname 搜索会更聪明。

尝试的解决方案

为了实现我尝试的所需行为,

  1. 我将 geom @@ AREA 放入子选择中 ->没有更改执行计划
  2. 我创建了一个具有所需区域子集的临时视图 ->没有改变执行计划
  3. 我创建了所需区域的临时表->创建需要 4~6 秒,这使得情况变得更糟。

顺便说一句,很抱歉没有发布实际的查询:我认为如果我这样做,我的老板真的会生我的气,而且我正在寻找更多的理论指导,以便有人解决我的实际查询。请询问您是否需要进一步说明


编辑

理查德的观点非常好:您可以使用width语句实现查询规划器的所需行为。坏处是这个临时表(或 CTE)会弄乱 vname 索引,从而使查询在某些情况下不返回任何内容。

我能够通过使用 to_tsvector() 即时创建新的 vname 来解决此问题,但这成本太高 - 每个查询大约需要 300 - 500 毫秒。

我的解决方案

我放弃了 vname 搜索,转而使用简单的 LIKE('%query_string%') (10-20 毫秒/查询),但这仅在我的给定的环境。 YMMV。

I have a quite serious performance issue with the following statement that i can't fix myself.

Given Situation

  • I have a postgres 8.4 Database with Postgis 1.4 installed
  • I have a geospatial table with ~9 Million entries. This table has a (postgis) geometry column and a tsvector column
  • I have a GIST Index on the geometry and a VNAME Index on the vname column
  • Table is ANALYZE'd

I want to excecute a to_tsquerytext search within a subset of these geometries that should give me all affecteded ids back.

The area to search in will limit the 9 Million datasets to approximately 100.000 and the resultset of the ts_query inside this area will most likely give an output of 0..1000 Entries.

Problem

The query analyzer decides that he wants to do an Bitmap Index Scan on the vname first, and then aggreates and puts a filter on the geometry (and other conditions I have in this statement)

Query Analyzer output:

Aggregate  (cost=12.35..12.62 rows=1 width=510) (actual time=5.616..5.616 rows=1 loops=1)
->  Bitmap Heap Scan on mxgeom g  (cost=8.33..12.35 rows=1 width=510) (actual time=5.567..5.567 rows=0 loops=1)
     Recheck Cond: (vname @@ '''hemer'' & ''hauptstrasse'':*'::tsquery)
     Filter: (active AND (geom && '0107000020E6100000010000000103000000010000000B0000002AFFFF5FD15B1E404AE254774BA8494096FBFF3F4CC11E40F37563BAA9A74940490200206BEC1E40466F209648A949404DF6FF1F53311F400C9623C206B2494024EBFF1F4F711F404C87835954BD4940C00000B0E7CA1E4071551679E0BD4940AD02004038991E40D35CC68418BE49408EF9FF5F297C1E404F8CFFCB5BBB4940A600006015541E40FAE6468054B8494015040060A33E1E4032E568902DAE49402AFFFF5FD15B1E404AE254774BA84940'::geometry) AND (mandator_id = ANY ('{257,1}'::bigint[])))
     ->  Bitmap Index Scan on gis_vname_idx  (cost=0.00..8.33 rows=1 width=0) (actual time=5.566..5.566 rows=0 loops=1)
           Index Cond: (vname @@ '''hemer'' & ''hauptstrasse'':*'::tsquery)

which causes a LOT of I/O - AFAIK It would be smarter to limit the geometry first, and do the vname search after.

Attempted Solutions

To achieve the desired behaviour i tried to

  1. I Put the geom @@ AREA into a subselect -> Did not change the execution plan
  2. I created a temporary view with the desired area subset -> Did not change the execution plan
  3. I created a temporary table of the desired area -> Takes 4~6 seconds to create, so that made it even worse.

Btw, sorry for not posting the actual query: I think my boss would really be mad at me if I did, also I'm looking more for theoretical pointers for someone to fix my actual query. Please ask if you need further clarification


EDIT

Richard had a very good point: You can achieve the desired behaviour of the Query Planner with the width statement. The bad thing is that this temporary Table (or CTE) messes up the vname index, thus making the query return nothing in some cases.

I was able to fix this with creating a new vname on-the-fly with to_tsvector(), but this is (too) costly - around 300 - 500ms per query.

My Solution

I ditched the vname search and went with a simple LIKE('%query_string%') (10-20 ms/query), but this is only fast in my given environment. YMMV.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

情深已缘浅 2024-12-23 10:59:56

tsvector 的统计处理有了一些改进(我认为 PostGIS 也是如此,但我不使用它)。如果您有时间,可能值得再次尝试 9.1 版本,看看它对您有何帮助。

然而,对于这个单一查询,您可能需要查看WITH 构造。

http://www.postgresql.org/docs/8.4/static/queries -with.html

如果将几何部分作为WITH子句,它将首先被评估(保证),然后该结果集将通过以下SELECT进行过滤。但最终可能会变慢,直到你尝试你才会知道。

对 work_mem 进行调整也可能会有所帮助 - 您可以在每个会话中执行此操作(“SET work_mem = ...”),但要小心将其设置得太高 - 并发查询会很快耗尽您的所有 RAM。

http://www.postgresql .org/docs/8.4/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY

There have been some improvements in statistics handling for tsvector (and I think PostGIS too, but I don't use it). If you've got the time, it might be worth trying again with a 9.1 release and see what that does for you.

However, for this single query you might want to look at the WITH construct.

http://www.postgresql.org/docs/8.4/static/queries-with.html

If you put the geometry part as the WITH clause, it will be evaluated first (guaranteed) and then that result-set will filtered by the following SELECT. It might end up slower though, you won't know until you try.

It might be an adjustment to work_mem would help too - you can do this per-session ("SET work_mem = ...") but be careful with setting it too high - concurrent queries can quickly burn through all your RAM.

http://www.postgresql.org/docs/8.4/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文