Speeding up a grouped query on a table in Postgres

Published 2024-12-29 17:06:31


I have a join table with the following structure:

CREATE TABLE adjectives_friends
(
  adjective_id integer,
  friend_id integer
)
WITH (
  OIDS=FALSE
);
ALTER TABLE adjectives_friends
  OWNER TO rails;


CREATE UNIQUE INDEX index_adjectives_friends_on_adjective_id_and_friend_id
  ON adjectives_friends
  USING btree
  (adjective_id , friend_id );

CREATE UNIQUE INDEX index_adjectives_friends_on_friend_id_and_adjective_id
  ON adjectives_friends
  USING btree
  (friend_id , adjective_id );
ALTER TABLE adjectives_friends CLUSTER ON index_adjectives_friends_on_friend_id_and_adjective_id;

This table contains around ~50 million records.

The adjectives table is a lookup table of ~150 entries. What I would like to do is find the friends that most closely match a list of adjectives. Assume that the maximum number of adjectives a friend has is 10. So, I tried this query:

SELECT count(friend_id) count, friend_id
  FROM adjectives_friends
  where adjective_id in (1,2,3,4,5,6,7,8,9,10)
  group by friend_id
  order by count desc
  limit 100

This takes around ~10 seconds on my dev machine, with query plan

"Limit  (cost=831652.00..831652.25 rows=100 width=4)"
"  ->  Sort  (cost=831652.00..831888.59 rows=94634 width=4)"
"        Sort Key: (count(friend_id))"
"        ->  GroupAggregate  (cost=804185.31..828035.16 rows=94634 width=4)"
"              ->  Sort  (cost=804185.31..811819.81 rows=3053801 width=4)"
"                    Sort Key: friend_id"
"                    ->  Bitmap Heap Scan on adjectives_friends  (cost=85958.72..350003.24 rows=3053801 width=4)"
"                          Recheck Cond: (adjective_id = ANY ('{1,2,3,4,5,6,7,8,9,10}'::integer[]))"
"                          ->  Bitmap Index Scan on index_adjectives_friends_on_adjective_id_and_friend_id  (cost=0.00..85195.26 rows=3053801 width=0)"
"                                Index Cond: (adjective_id = ANY ('{1,2,3,4,5,6,7,8,9,10}'::integer[]))"

The order by is what is killing me, but I don't know of a good way to avoid it. The count can't be precomputed because the adjectives to be selected are completely arbitrary, and there are more than 150-choose-10 combinations. Right now, I think that the best option is to grab the 100 best results on friend creation, save the results, then update them every n time intervals. This would be acceptable, as the adjectives aren't expected to be switched that often, and I don't need the exact 100 best results. But if I could get the query speed down to around 1-2 seconds, that wouldn't be necessary. Any suggestions?
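The precompute-and-refresh idea above could be sketched roughly as follows. This is a hypothetical sketch, not the poster's actual implementation: the cache table, its columns, and the idea of keying on a sorted adjective-id string are all assumptions layered on the original schema.

```sql
-- Hypothetical cache of the top-100 friends for a given adjective set.
-- The set is identified by a key the application computes, e.g. the
-- sorted, comma-joined adjective ids.
CREATE TABLE friend_match_cache
(
  adjective_set_key text        NOT NULL,
  friend_id         integer     NOT NULL,
  match_count       integer     NOT NULL,
  refreshed_at      timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX ON friend_match_cache (adjective_set_key, match_count DESC);

-- Refresh one adjective set every n time intervals (the slow query
-- runs in the background instead of on the request path):
DELETE FROM friend_match_cache WHERE adjective_set_key = '1,2,3,4,5,6,7,8,9,10';

INSERT INTO friend_match_cache (adjective_set_key, friend_id, match_count)
SELECT '1,2,3,4,5,6,7,8,9,10', friend_id, count(*)
  FROM adjectives_friends
 WHERE adjective_id IN (1,2,3,4,5,6,7,8,9,10)
 GROUP BY friend_id
 ORDER BY count(*) DESC
 LIMIT 100;
```

Reads then become a cheap indexed lookup on `friend_match_cache` instead of a 3-million-row aggregate.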


黑凤梨 2025-01-05 17:06:31


I don't think you'll do much better with that query plan. I'll take your word that the count can't be precomputed.

I think your best bet is the following.

If you can use smallint instead of integer, your tables and indexes will be narrower, more will fit into a page, and your queries should run faster. But smallint is a 2-byte integer, ranging from -32768 to +32767. If you need more id numbers than that, smallint won't work.
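As a sketch of that suggestion (assuming, as the question implies, that adjective ids stay well within smallint range; with ~50 million rows, friend_id likely needs to remain an integer):

```sql
-- Hypothetical narrowed join table: the adjectives lookup table has
-- only ~150 entries, so adjective_id fits easily in smallint.
CREATE TABLE adjectives_friends_narrow
(
  adjective_id smallint NOT NULL,  -- 2 bytes instead of 4
  friend_id    integer  NOT NULL   -- keep 4 bytes; ids may exceed 32767
);

CREATE UNIQUE INDEX ON adjectives_friends_narrow (adjective_id, friend_id);
-- Note: Postgres pads tuples for alignment, so the on-disk saving
-- from one smallint column may be smaller than the raw 2 bytes suggests.
```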
