How to influence the Postgres query analyzer when dealing with text search and geospatial data
I have a quite serious performance issue with the following statement that I can't fix myself.
Given Situation
- I have a Postgres 8.4 database with PostGIS 1.4 installed
- I have a geospatial table with ~9 million entries. This table has a (PostGIS) geometry column and a tsvector column
- I have a GIST index on the geometry and a VNAME index on the vname column
- The table is ANALYZE'd (a rough sketch of this setup follows below)
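The actual schema isn't posted; the following is only a minimal sketch of the setup described above. Everything except the names that appear in the plan further down (mxgeom, geom, vname, gis_vname_idx, mandator_id, active) is a guess.

-- Minimal sketch of the assumed setup, not the real schema.
CREATE TABLE mxgeom (
    id          bigserial PRIMARY KEY,
    mandator_id bigint  NOT NULL,
    active      boolean NOT NULL DEFAULT true,
    name        text,       -- hypothetical raw text the tsvector is built from
    vname       tsvector,   -- full-text search column
    geom        geometry    -- PostGIS 1.4 would usually add this via AddGeometryColumn()
);

CREATE INDEX mxgeom_geom_idx ON mxgeom USING gist (geom);
CREATE INDEX gis_vname_idx   ON mxgeom USING gin  (vname);  -- could just as well be GiST

ANALYZE mxgeom;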
I want to execute a to_tsquery text search within a subset of these geometries that should give me all affected ids back.
The area to search in will limit the 9 million datasets to approximately 100,000, and the resultset of the ts_query inside this area will most likely give an output of 0..1000 entries.
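The real statement can't be posted (see below), but a hypothetical query of roughly this shape illustrates what is meant; mxgeom, vname, geom, active and mandator_id are taken from the plan under Problem, while the aggregate and the area polygon are made up:

SELECT count(*)                 -- the real select list is not shown here
FROM   mxgeom g
WHERE  g.active
AND    g.mandator_id = ANY ('{257,1}'::bigint[])
AND    g.geom && ST_GeomFromText(
           'POLYGON((7.58 51.31, 7.69 51.31, 7.69 51.49, 7.58 51.49, 7.58 51.31))',
           4326)                 -- placeholder area, SRID 4326 as in the plan
AND    g.vname @@ to_tsquery('hemer & hauptstrasse:*');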
Problem
The query analyzer decides that it wants to do a Bitmap Index Scan on the vname first, and then aggregates and puts a filter on the geometry (and the other conditions I have in this statement).
Query Analyzer output:
Aggregate (cost=12.35..12.62 rows=1 width=510) (actual time=5.616..5.616 rows=1 loops=1)
-> Bitmap Heap Scan on mxgeom g (cost=8.33..12.35 rows=1 width=510) (actual time=5.567..5.567 rows=0 loops=1)
Recheck Cond: (vname @@ '''hemer'' & ''hauptstrasse'':*'::tsquery)
Filter: (active AND (geom && '0107000020E6100000010000000103000000010000000B0000002AFFFF5FD15B1E404AE254774BA8494096FBFF3F4CC11E40F37563BAA9A74940490200206BEC1E40466F209648A949404DF6FF1F53311F400C9623C206B2494024EBFF1F4F711F404C87835954BD4940C00000B0E7CA1E4071551679E0BD4940AD02004038991E40D35CC68418BE49408EF9FF5F297C1E404F8CFFCB5BBB4940A600006015541E40FAE6468054B8494015040060A33E1E4032E568902DAE49402AFFFF5FD15B1E404AE254774BA84940'::geometry) AND (mandator_id = ANY ('{257,1}'::bigint[])))
-> Bitmap Index Scan on gis_vname_idx (cost=0.00..8.33 rows=1 width=0) (actual time=5.566..5.566 rows=0 loops=1)
Index Cond: (vname @@ '''hemer'' & ''hauptstrasse'':*'::tsquery)
which causes a LOT of I/O - AFAIK it would be smarter to limit the geometry first, and do the vname search after.
Attempted Solutions
To achieve the desired behaviour I tried the following:
- I put the geom @@ AREA into a subselect -> Did not change the execution plan (a rough sketch of this attempt follows after this list)
- I created a temporary view with the desired area subset -> Did not change the execution plan
- I created a temporary table of the desired area -> Takes 4~6 seconds to create, so that made it even worse.
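For illustration, the subselect attempt presumably had roughly this shape (same guessed names as in the sketches above). Postgres pulls such a simple subquery up into the outer query, which would explain why the plan did not change:

-- Sketch of the subselect attempt; the planner flattens the inner query,
-- so the resulting plan is identical to that of the original statement.
SELECT count(*)
FROM (
    SELECT id, vname
    FROM   mxgeom
    WHERE  active
    AND    mandator_id = ANY ('{257,1}'::bigint[])
    AND    geom && ST_GeomFromText(
               'POLYGON((7.58 51.31, 7.69 51.31, 7.69 51.49, 7.58 51.49, 7.58 51.31))',
               4326)
) AS area
WHERE  area.vname @@ to_tsquery('hemer & hauptstrasse:*');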
Btw, sorry for not posting the actual query: I think my boss would really be mad at me if I did, and I'm looking more for theoretical pointers than for someone to fix my actual query. Please ask if you need further clarification.
EDIT
Richard had a very good point: you can achieve the desired behaviour of the query planner with the WITH statement. The bad thing is that this temporary table (or CTE) messes up the vname index, thus making the query return nothing in some cases.
I was able to fix this by creating a new vname on the fly with to_tsvector(), but this is (too) costly - around 300-500 ms per query.
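A sketch of that workaround, assuming the tsvector can be rebuilt from a raw text column (the name column and the 'german' configuration are guesses):

-- A CTE is an optimization fence in 8.4, so the geometry restriction runs first;
-- the tsvector is then rebuilt on the fly for the remaining rows, which worked
-- but cost roughly 300-500 ms per query.
WITH area AS (
    SELECT id, name
    FROM   mxgeom
    WHERE  active
    AND    mandator_id = ANY ('{257,1}'::bigint[])
    AND    geom && ST_GeomFromText(
               'POLYGON((7.58 51.31, 7.69 51.31, 7.69 51.49, 7.58 51.49, 7.58 51.31))',
               4326)
)
SELECT id
FROM   area
WHERE  to_tsvector('german', name) @@ to_tsquery('german', 'hemer & hauptstrasse:*');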
My Solution
I ditched the vname search and went with a simple LIKE('%query_string%') (10-20 ms/query), but this is only fast in my given environment. YMMV.
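A sketch of that final variant; whether the CTE from above was kept isn't stated, so this only shows the substring match on the hypothetical raw-text column:

-- Area restriction first, then a plain substring match instead of the
-- tsvector lookup (reported at roughly 10-20 ms per query in this setup).
WITH area AS (
    SELECT id, name
    FROM   mxgeom
    WHERE  active
    AND    mandator_id = ANY ('{257,1}'::bigint[])
    AND    geom && ST_GeomFromText(
               'POLYGON((7.58 51.31, 7.69 51.31, 7.69 51.49, 7.58 51.49, 7.58 51.31))',
               4326)
)
SELECT id
FROM   area
WHERE  name LIKE '%hauptstrasse%';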
Answer
There have been some improvements in statistics handling for tsvector (and I think PostGIS too, but I don't use it). If you've got the time, it might be worth trying again with a 9.1 release and see what that does for you.
However, for this single query you might want to look at the WITH construct.
http://www.postgresql.org/docs/8.4/static/queries-with.html
If you put the geometry part as the WITH clause, it will be evaluated first (guaranteed) and then that result set will be filtered by the following SELECT. It might end up slower though; you won't know until you try.
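A sketch of that structure for the query in the question (the polygon is a placeholder; the other names come from the posted plan):

-- WITH acts as an optimization fence in 8.4: the geometry filter in the CTE
-- is evaluated first, and the text search then runs only over that result set.
WITH area AS (
    SELECT id, vname
    FROM   mxgeom
    WHERE  active
    AND    mandator_id = ANY ('{257,1}'::bigint[])
    AND    geom && ST_GeomFromText(
               'POLYGON((7.58 51.31, 7.69 51.31, 7.69 51.49, 7.58 51.49, 7.58 51.31))',
               4326)
)
SELECT id
FROM   area
WHERE  vname @@ to_tsquery('hemer & hauptstrasse:*');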
It might be that an adjustment to work_mem would help too - you can do this per-session ("SET work_mem = ...") but be careful with setting it too high - concurrent queries can quickly burn through all your RAM.
http://www.postgresql.org/docs/8.4/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
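For the per-session route, something along these lines (the 64MB value is only an example to tune against available RAM and concurrency):

SET work_mem = '64MB';  -- applies to the current session only
-- ... run the expensive query ...
RESET work_mem;         -- restore the server default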