How to efficiently order by the number of matching columns in Postgres?

I have a denormalized table where each column is a tag associated with an object. E.g. this post could have the tags "postgres", "indexing", and "similarity". The table looks something like:

| id | col1     | col2     | col3       |
|----|----------|----------|------------|
| 1  | postgres | indexing | similarity |
| 2  | postgres | foo      | bar        |
| 3  | foo      | bar      | baz        |

If I wanted to find the post most similar to this one, I could do something like:

select *
from mytable
order by (col1 = 'postgres' or col2 = 'postgres' or col3 = 'postgres')::int
  + (col1 = 'indexing' or col2 = 'indexing' or col3 = 'indexing')::int
  + (col1 = 'similarity' or col2 = 'similarity' or col3 = 'similarity')::int desc
limit 10;

However, as far as I know, this can't use any indexes for sorting; it has to do a table scan. When I ran it on a similar table with every column indexed, I got this:

Limit  (cost=35124.06..35125.23 rows=10 width=21) (actual time=204.323..206.350 rows=10 loops=1)
  ->  Gather Merge  (cost=35124.06..132353.15 rows=833334 width=21) (actual time=204.322..206.344 rows=10 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Sort  (cost=34124.04..35165.70 rows=416667 width=21) (actual time=194.794..194.795 rows=10 loops=3)
              Sort Key: ((((((col1 = 'aa'::text) OR (col2 = 'aa'::text) OR (col3 = 'aa'::text)))::integer + (((col1 = 'bb'::text) OR (col2 = 'bb'::text) OR (col3 = 'bb'::text)))::integer) + (((col1 = 'cc'::text) OR (col2 = 'cc'::text) OR (col3 = 'cc'::text)))::integer)) DESC
              Sort Method: top-N heapsort  Memory: 25kB
              Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
              Worker 1:  Sort Method: top-N heapsort  Memory: 25kB
              ->  Parallel Seq Scan on "__testTags__"  (cost=0.00..25120.01 rows=416667 width=21) (actual time=0.016..145.436 rows=333333 loops=3)
Planning Time: 0.139 ms
Execution Time: 206.371 ms

Is there a way to order by the number of matching columns more efficiently? I looked into using a vector embedding, but Postgres doesn't have any good vector or nearest-neighbor support. Also, fulltext search doesn't seem to support ranking by the number of matches.

The only solution I can think of is doing 4 separate queries: get rows with 3 matches, then rows with 2 matches, then rows with 1 match, then rows with no matches. This is much faster than the query above. However, if I want to add more columns, the query will get really complicated.
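
For illustration, here is one possible shape such a tiered query could take. This is only a sketch of what the workaround might look like, assuming the three-column schema and the per-column B-tree indexes mentioned above; the question does not show its exact form. Each tier's predicate is built from plain equality tests (which the per-column indexes can serve), and the outer LIMIT means later tiers need not be scanned once enough rows have been found:

-- Sketch of the tiered workaround: best tier first, stitched together with UNION ALL.
-- Only the first two tiers are shown; a complete version also needs the remaining
-- two-tag pairings, the one-match and zero-match tiers, and exclusion of rows
-- already returned by earlier tiers (hence the complexity noted above).
(
  select *            -- tier 1: all three tags present somewhere in the row
  from mytable
  where (col1 = 'postgres'   or col2 = 'postgres'   or col3 = 'postgres')
    and (col1 = 'indexing'   or col2 = 'indexing'   or col3 = 'indexing')
    and (col1 = 'similarity' or col2 = 'similarity' or col3 = 'similarity')
  limit 10
)
union all
(
  select *            -- tier 2 (one pairing): 'postgres' and 'indexing' present
  from mytable
  where (col1 = 'postgres' or col2 = 'postgres' or col3 = 'postgres')
    and (col1 = 'indexing' or col2 = 'indexing' or col3 = 'indexing')
  limit 10
)
limit 10;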

紅太極 2025-01-21 17:55:40

It is possible to use an index for this purpose, relying on an array type:

CREATE INDEX IF NOT EXISTS col_index ON mytable USING gin ((array[col1,col2,col3]) array_ops) ;

Then the following query can use the index col_index to filter the rows where one or more columns match the search criteria, and then count the exact matches:

SELECT m.match_count, t.*
  FROM mytable AS t
 CROSS JOIN LATERAL
     ( SELECT count(*) AS match_count
         FROM unnest(array['postgres', 'indexing', 'similarity']) WITH ORDINALITY AS a(arr, id)
        WHERE (array[t.col1, t.col2, t.col3])[a.id] = a.arr
     ) AS m
 WHERE array[t.col1, t.col2, t.col3] && array['postgres', 'indexing', 'similarity']

See the test result in dbfiddle.
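
To get the "10 most similar posts" the question asks for, one would presumably add an ORDER BY and LIMIT on top of the same query. A sketch, reusing the names from the answer above:

SELECT m.match_count, t.*
  FROM mytable AS t
 CROSS JOIN LATERAL
     ( SELECT count(*) AS match_count
         FROM unnest(array['postgres', 'indexing', 'similarity']) WITH ORDINALITY AS a(arr, id)
        WHERE (array[t.col1, t.col2, t.col3])[a.id] = a.arr
     ) AS m
 WHERE array[t.col1, t.col2, t.col3] && array['postgres', 'indexing', 'similarity']
 ORDER BY m.match_count DESC   -- rows matching the most tags first
 LIMIT 10;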

能否归途做我良人 2025-01-21 17:55:40

Use a simple inverted index: create and maintain (probably using triggers, left as an exercise for the reader; a sketch follows the table definition) a table that stores each tag together with the id of the source row, e.g.:

create table tag_mytable (
    tag text not null,
    mytable_id int not null references mytable,
    primary key (tag, mytable_id)
)
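
A minimal sketch of what such a maintenance trigger could look like, assuming mytable keeps its id, col1, col2 and col3 columns as in the question. The function and trigger names here are made up for illustration:

-- Keep tag_mytable in sync with mytable's tag columns.
create or replace function sync_tag_mytable() returns trigger as $$
begin
    if tg_op in ('UPDATE', 'DELETE') then
        -- drop the old tag rows for this object
        delete from tag_mytable where mytable_id = old.id;
    end if;
    if tg_op in ('INSERT', 'UPDATE') then
        -- re-insert one row per non-null tag column
        insert into tag_mytable (tag, mytable_id)
        select tag, new.id
        from unnest(array[new.col1, new.col2, new.col3]) as tag
        where tag is not null
        on conflict do nothing;  -- a row repeating the same tag would otherwise hit the PK
    end if;
    return null;  -- AFTER trigger, return value is ignored
end;
$$ language plpgsql;

create trigger mytable_tags_sync
after insert or update or delete on mytable
for each row execute function sync_tag_mytable();  -- PostgreSQL 11+; EXECUTE PROCEDURE on older versions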

You can then find the tag hit count for each mytable row very efficiently:

with hits as (
    select mytable_id
    from tag_mytable
    where tag = $1
    union all
    select mytable_id
    from tag_mytable
    where tag = $2
    union all
    select mytable_id
    from tag_mytable
    where tag = $3
), hits_total as (
    select mytable_id, count(*) as tag_hits
    from hits
    group by 1
)
select t.*
from mytable t
left join hits_total h on h.mytable_id = t.id   -- mytable's key column is "id" in the question's schema
order by coalesce(h.tag_hits, 0) desc           -- most tag hits first; rows with no hits count as zero

This has the added advantage of assigning a hit count of zero to rows without any hits (presumably most of them) without having to compare their data.

Consider not storing the tags in mytable at all and instead storing them only in this new table. That removes the need for triggers and allows any number of tags per mytable row (with only a minor tweak to the query, sketched below).
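
For example, with the tags stored only in tag_mytable, the hit count becomes a single aggregate over an = ANY(...) filter, which the (tag, mytable_id) primary key index can serve. A sketch; the tag list is inlined here for readability but would normally be a single array parameter:

with hits_total as (
    select mytable_id, count(*) as tag_hits
    from tag_mytable
    where tag = any (array['postgres', 'indexing', 'similarity'])
    group by mytable_id
)
select t.*, coalesce(h.tag_hits, 0) as tag_hits
from mytable t
left join hits_total h on h.mytable_id = t.id
order by coalesce(h.tag_hits, 0) desc
limit 10;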
