How to influence the Postgres query analyzer when dealing with text search and geospatial data
I have a quite serious performance issue with the following statement that I can't fix myself.
Given Situation
- I have a Postgres 8.4 database with PostGIS 1.4 installed
- I have a geospatial table with ~9 million entries. This table has a (PostGIS) geometry column and a tsvector column
- I have a GIST index on the geometry and a VNAME index on the vname column
- The table is ANALYZE'd (a rough sketch of this setup follows below)
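The actual schema isn't posted; the following is only a minimal sketch of the setup described above. Everything except the names that appear in the plan further down (mxgeom, geom, vname, gis_vname_idx, mandator_id, active) is a guess.

-- Minimal sketch of the assumed setup, not the real schema.
CREATE TABLE mxgeom (
    id          bigserial PRIMARY KEY,
    mandator_id bigint  NOT NULL,
    active      boolean NOT NULL DEFAULT true,
    name        text,       -- hypothetical raw text the tsvector is built from
    vname       tsvector,   -- full-text search column
    geom        geometry    -- PostGIS 1.4 would usually add this via AddGeometryColumn()
);

CREATE INDEX mxgeom_geom_idx ON mxgeom USING gist (geom);
CREATE INDEX gis_vname_idx   ON mxgeom USING gin  (vname);  -- could just as well be GiST

ANALYZE mxgeom;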
I want to execute a to_tsquery text search within a subset of these geometries that should give me all affected ids back.
The area to search in will limit the 9 million datasets to approximately 100,000, and the resultset of the ts_query inside this area will most likely give an output of 0..1000 entries.
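The real statement can't be posted (see below), but a hypothetical query of roughly this shape illustrates what is meant; mxgeom, vname, geom, active and mandator_id are taken from the plan under Problem, while the aggregate and the area polygon are made up:

SELECT count(*)                 -- the real select list is not shown here
FROM   mxgeom g
WHERE  g.active
AND    g.mandator_id = ANY ('{257,1}'::bigint[])
AND    g.geom && ST_GeomFromText(
           'POLYGON((7.58 51.31, 7.69 51.31, 7.69 51.49, 7.58 51.49, 7.58 51.31))',
           4326)                 -- placeholder area, SRID 4326 as in the plan
AND    g.vname @@ to_tsquery('hemer & hauptstrasse:*');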
Problem
The query analyzer decides that it wants to do a Bitmap Index Scan on the vname first, and then aggregates and puts a filter on the geometry (and the other conditions I have in this statement).
Query Analyzer output:
Aggregate (cost=12.35..12.62 rows=1 width=510) (actual time=5.616..5.616 rows=1 loops=1)
-> Bitmap Heap Scan on mxgeom g (cost=8.33..12.35 rows=1 width=510) (actual time=5.567..5.567 rows=0 loops=1)
Recheck Cond: (vname @@ '''hemer'' & ''hauptstrasse'':*'::tsquery)
Filter: (active AND (geom && '0107000020E6100000010000000103000000010000000B0000002AFFFF5FD15B1E404AE254774BA8494096FBFF3F4CC11E40F37563BAA9A74940490200206BEC1E40466F209648A949404DF6FF1F53311F400C9623C206B2494024EBFF1F4F711F404C87835954BD4940C00000B0E7CA1E4071551679E0BD4940AD02004038991E40D35CC68418BE49408EF9FF5F297C1E404F8CFFCB5BBB4940A600006015541E40FAE6468054B8494015040060A33E1E4032E568902DAE49402AFFFF5FD15B1E404AE254774BA84940'::geometry) AND (mandator_id = ANY ('{257,1}'::bigint[])))
-> Bitmap Index Scan on gis_vname_idx (cost=0.00..8.33 rows=1 width=0) (actual time=5.566..5.566 rows=0 loops=1)
Index Cond: (vname @@ '''hemer'' & ''hauptstrasse'':*'::tsquery)
which causes a LOT of I/O - AFAIK it would be smarter to limit the geometry first, and do the vname search after.
Attempted Solutions
To achieve the desired behaviour I tried the following:
- I put the geom @@ AREA into a subselect -> Did not change the execution plan (a rough sketch of this attempt follows after this list)
- I created a temporary view with the desired area subset -> Did not change the execution plan
- I created a temporary table of the desired area -> Takes 4~6 seconds to create, so that made it even worse.
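For illustration, the subselect attempt presumably had roughly this shape (same guessed names as in the sketches above). Postgres pulls such a simple subquery up into the outer query, which would explain why the plan did not change:

-- Sketch of the subselect attempt; the planner flattens the inner query,
-- so the resulting plan is identical to that of the original statement.
SELECT count(*)
FROM (
    SELECT id, vname
    FROM   mxgeom
    WHERE  active
    AND    mandator_id = ANY ('{257,1}'::bigint[])
    AND    geom && ST_GeomFromText(
               'POLYGON((7.58 51.31, 7.69 51.31, 7.69 51.49, 7.58 51.49, 7.58 51.31))',
               4326)
) AS area
WHERE  area.vname @@ to_tsquery('hemer & hauptstrasse:*');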
Btw, sorry for not posting the actual query: I think my boss would really be mad at me if I did, and I'm looking more for theoretical pointers than for someone to fix my actual query. Please ask if you need further clarification.
EDIT
Richard had a very good point: you can achieve the desired behaviour of the query planner with the WITH statement. The bad thing is that this temporary table (or CTE) messes up the vname index, thus making the query return nothing in some cases.
I was able to fix this by creating a new vname on the fly with to_tsvector(), but this is (too) costly - around 300-500 ms per query.
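A sketch of that workaround, assuming the tsvector can be rebuilt from a raw text column (the name column and the 'german' configuration are guesses):

-- A CTE is an optimization fence in 8.4, so the geometry restriction runs first;
-- the tsvector is then rebuilt on the fly for the remaining rows, which worked
-- but cost roughly 300-500 ms per query.
WITH area AS (
    SELECT id, name
    FROM   mxgeom
    WHERE  active
    AND    mandator_id = ANY ('{257,1}'::bigint[])
    AND    geom && ST_GeomFromText(
               'POLYGON((7.58 51.31, 7.69 51.31, 7.69 51.49, 7.58 51.49, 7.58 51.31))',
               4326)
)
SELECT id
FROM   area
WHERE  to_tsvector('german', name) @@ to_tsquery('german', 'hemer & hauptstrasse:*');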
My Solution
I ditched the vname search and went with a simple LIKE('%query_string%') (10-20 ms/query), but this is only fast in my given environment. YMMV.
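A sketch of that final variant; whether the CTE from above was kept isn't stated, so this only shows the substring match on the hypothetical raw-text column:

-- Area restriction first, then a plain substring match instead of the
-- tsvector lookup (reported at roughly 10-20 ms per query in this setup).
WITH area AS (
    SELECT id, name
    FROM   mxgeom
    WHERE  active
    AND    mandator_id = ANY ('{257,1}'::bigint[])
    AND    geom && ST_GeomFromText(
               'POLYGON((7.58 51.31, 7.69 51.31, 7.69 51.49, 7.58 51.49, 7.58 51.31))',
               4326)
)
SELECT id
FROM   area
WHERE  name LIKE '%hauptstrasse%';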
Answer
There have been some improvements in statistics handling for tsvector (and I think PostGIS too, but I don't use it). If you've got the time, it might be worth trying again with a 9.1 release and see what that does for you.
However, for this single query you might want to look at the WITH construct.
http://www.postgresql.org/docs/8.4/static/queries-with.html
If you put the geometry part as the WITH clause, it will be evaluated first (guaranteed) and then that result set will be filtered by the following SELECT. It might end up slower though; you won't know until you try.
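A sketch of that structure for the query in the question (the polygon is a placeholder; the other names come from the posted plan):

-- WITH acts as an optimization fence in 8.4: the geometry filter in the CTE
-- is evaluated first, and the text search then runs only over that result set.
WITH area AS (
    SELECT id, vname
    FROM   mxgeom
    WHERE  active
    AND    mandator_id = ANY ('{257,1}'::bigint[])
    AND    geom && ST_GeomFromText(
               'POLYGON((7.58 51.31, 7.69 51.31, 7.69 51.49, 7.58 51.49, 7.58 51.31))',
               4326)
)
SELECT id
FROM   area
WHERE  vname @@ to_tsquery('hemer & hauptstrasse:*');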
It might be that an adjustment to work_mem would help too - you can do this per-session ("SET work_mem = ...") but be careful with setting it too high - concurrent queries can quickly burn through all your RAM.
http://www.postgresql.org/docs/8.4/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
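For the per-session route, something along these lines (the 64MB value is only an example to tune against available RAM and concurrency):

SET work_mem = '64MB';  -- applies to the current session only
-- ... run the expensive query ...
RESET work_mem;         -- restore the server default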