Fuzzy grouping in Postgres

Posted on 2024-08-09 06:37:35

I have a table with contents that look similar to this:

id | title
------------
1  | 5. foo
2  | 5.foo
3  | 5. foo*
4  | bar
5  | bar*
6  | baz
7  | BAZ

...and so on. I would like to group by the titles and ignore the extra bits. I know Postgres can do this:

SELECT * FROM (
  SELECT regexp_replace(title, '[*.]+$', '') AS title
  FROM table
) AS a
GROUP BY title

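However, that only covers a simple case and would get very unwieldy if I tried to anticipate all the possible variations. So the question is: is there a more general way to do fuzzy grouping than using regexp? Is it even possible, at least without breaking one's back doing it?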

Edit: To clarify, there is no preference for any of the variations, and this is what the table should look like after grouping:

title
------
5. foo
bar
baz

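I.e., the variations would be items that differ by just a few characters or by capitalization, and it doesn't matter which ones are left as long as they're grouped.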



ぇ气 2024-08-16 06:37:35

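For any grouping you need transitive equality, that is, a ~= b and b ~= c must imply a ~= c.

Formulate the rule strictly in words and we'll try to formulate it in SQL.

For instance, which group should foo*bar go to?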
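To illustrate why transitivity matters, here is a minimal sketch that uses edit distance as the "differs by a few characters" rule. It assumes the `fuzzystrmatch` contrib extension can be installed, and the threshold of 1 is an arbitrary choice for illustration, not something from the question.

```sql
-- Assumption: the fuzzystrmatch contrib extension is available.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Treat two titles as "similar" when their edit distance is at most 1
-- (the threshold is a hypothetical choice for this sketch).
SELECT a,
       b,
       levenshtein(a, b)      AS distance,
       levenshtein(a, b) <= 1 AS similar
FROM (VALUES ('bar*', 'bar'),
             ('bar',  'baz'),
             ('bar*', 'baz')) AS pairs (a, b);
```

Under this rule 'bar*' is similar to 'bar' and 'bar' is similar to 'baz', yet 'bar*' is not similar to 'baz'. A plain similarity threshold therefore does not define groups on its own: either bar and baz get merged, or the rule has to be turned into a canonical key that every title maps to.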

Update:

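This query strips all non-alphanumeric characters, uppercases what remains, and returns one title from each resulting group (without an ORDER BY, which title is returned is not guaranteed):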

SELECT  DISTINCT ON (REGEXP_REPLACE(UPPER(title), '[^[:alnum:]]', '', 'g')) title
FROM    (
        VALUES
        (1, '5. foo'),
        (2, '5.foo'),
        (3, '5. foo*'),
        (4, 'bar'),
        (5, 'bar*'),
        (6, 'baz'),
        (7, 'BAZ')
        ) rows (id, title)
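If it helps to see which titles end up together, the same normalization key can be used with GROUP BY instead of DISTINCT ON. This is only a sketch of that variant: the column names `group_key`, `variants`, `titles` and `representative`, and the alias `t`, are made up for illustration.

```sql
-- Same key as above: uppercase the title, then drop every non-alphanumeric character.
SELECT regexp_replace(upper(title), '[^[:alnum:]]', '', 'g') AS group_key,
       count(*)                                              AS variants,
       array_agg(title ORDER BY id)                          AS titles,
       -- Pick the title with the lowest id as the group's representative.
       (array_agg(title ORDER BY id))[1]                     AS representative
FROM (VALUES (1, '5. foo'),
             (2, '5.foo'),
             (3, '5. foo*'),
             (4, 'bar'),
             (5, 'bar*'),
             (6, 'baz'),
             (7, 'BAZ')) AS t (id, title)
GROUP BY group_key;
```

With the sample data this yields three groups, keyed 5FOO, BAR and BAZ. Because every title maps to exactly one key, the grouping is transitive by construction; the trade-off is that it only merges variants the normalization anticipates, such as punctuation, spacing and case.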

