Why is Postgres hashing in this query?

Posted 2024-09-05 18:34:45


I have two tables: A and P. I want to get information out of all rows in A whose id is in a temporary table I created, tmp_ids. However, there is additional information about A in the P table, foo, and I want to get this info as well. I have the following query:

SELECT A.H_id AS hid,
       A.id AS aid,
       P.foo, A.pos, A.size
FROM tmp_ids, P, A
WHERE tmp_ids.id = A.H_id
  AND P.id = A.P_id

I noticed it running slowly, and when I asked Postgres to explain, I saw that it combines tmp_ids with the index I created on A.H_id using a nested loop. However, it hashes all of P before doing a hash join with the result of that first join. P is quite large and I think this is what's taking all the time. Why would it create a hash there? P.id is P's primary key, and A.P_id has an index of its own.

UPDATE: All the datatypes are INTEGER, except A.size which is a DOUBLE PRECISION and P.foo which is VARCHAR. I'm using PostgreSQL version 8.4.

Here is the explain: http://explain.depesz.com/s/WBo
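
For anyone reproducing this, a minimal sketch of the setup and how to capture the plan discussed below (the DDL for tmp_ids is an assumption, not from the original post; the explicit JOIN syntax is equivalent to the comma form above):

CREATE TEMP TABLE tmp_ids (id INTEGER);
-- ... populate tmp_ids ...

-- EXPLAIN ANALYZE executes the query and reports estimated vs. actual
-- row counts for each plan node, which is what the answers below examine.
EXPLAIN ANALYZE
SELECT A.H_id AS hid,
       A.id AS aid,
       P.foo, A.pos, A.size
FROM tmp_ids
JOIN A ON A.H_id = tmp_ids.id
JOIN P ON P.id = A.P_id;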


Comments (3)

饮惑 2024-09-12 18:34:45


The query planner estimated that it would be faster to sequentially read all the data and hash it than to perform an estimated 2100 index scans, with their associated, much more random disk access.
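
The tradeoff described here is governed by the planner's cost constants. A quick way to inspect them (standard PostgreSQL settings, shown as a sketch, not values taken from the question):

-- The planner weighs sequential reads (seq_page_cost, default 1.0) against
-- random reads (random_page_cost, default 4.0); ~2100 index probes at 4x
-- the per-page cost can easily lose to one sequential scan plus a hash.
SELECT name, setting
FROM pg_settings
WHERE name IN ('seq_page_cost', 'random_page_cost', 'effective_cache_size');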

雪化雨蝶 2024-09-12 18:34:45


Without seeing an explain analyze, these kinds of problems usually result from statistics being off, or from random_page_cost or seq_page_cost needing an unusual setting.

It may run better with

set enable_hashjoin = false;
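
A cautious way to try this is to scope the setting to a single transaction, so it cannot leak into other queries (a sketch; SET LOCAL reverts automatically at commit or rollback):

BEGIN;
-- Discourage hash joins for this transaction only; the planner falls back
-- to nested-loop or merge joins where it can.
SET LOCAL enable_hashjoin = false;

EXPLAIN ANALYZE
SELECT A.H_id AS hid, A.id AS aid, P.foo, A.pos, A.size
FROM tmp_ids, P, A
WHERE tmp_ids.id = A.H_id
  AND P.id = A.P_id;

ROLLBACK;  -- nothing was modified; this just discards the setting
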
疏忽 2024-09-12 18:34:45


Your problem is that the optimizer doesn't have the right statistics to determine how many matches "A.H_id = tmp_ids.id" is going to create, which is a common problem with temporary tables: they don't have statistics the way a regular table does. It guesses that 21 rows will match coming out of the "Index Scan using idx_A_handid on A", but there are actually only 3. This is highlighted in the explain output, where the lowest-level up arrow has a 7 next to it, giving the multiplier for how wrong the estimate was.

That error carries forward to where it thinks it has 2100 rows to scan, at which point it might as well do a full sequential scan and hash the results, given that it's likely to touch most blocks in the table.

Had it known correctly that there were only 300 to probe, it might have done something different involving only a subset of the data. You can't expect to get good plans from joins against temporary tables because of their lack of statistics. This may be a case where it's appropriate to nudge the planner toward correct behavior by turning off enable_hashjoin before executing the query.
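
One standard remedy this answer implies but does not spell out (an editor's addition, not part of the original answer): temporary tables are never processed by autovacuum, but you can collect statistics for them manually in the same session before running the query:

-- Give the planner real statistics for the temp table, so it sees the
-- true row count instead of a default guess.
ANALYZE tmp_ids;

EXPLAIN ANALYZE
SELECT A.H_id AS hid, A.id AS aid, P.foo, A.pos, A.size
FROM tmp_ids, P, A
WHERE tmp_ids.id = A.H_id
  AND P.id = A.P_id;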
