Fuzzy grouping in Postgres

Posted on 2024-08-09 06:37:35

I have a table with contents that look similar to this:

id | title
------------
1  | 5. foo
2  | 5.foo
3  | 5. foo*
4  | bar
5  | bar*
6  | baz
7  | BAZ

…and so on. I would like to group by the titles and ignore the extra bits. I know Postgres can do this:

-- "titles" stands in for the real table name
SELECT * FROM (
  SELECT regexp_replace(title, '[*.]+$', '') AS title
  FROM titles
) AS a
GROUP BY title

However, that's quite simple and would get very unwieldy if I tried to anticipate all the possible variations. So, the question is: is there a more general way to do fuzzy grouping than using regexp? Is it even possible, at least without breaking one's back doing it?

Edit: To clarify, there is no preference for any of the variations, and this is what the table should look like after grouping:

title
------
5. foo
bar
baz

I.e., the variations would be items that differ by just a few characters or by capitalization, and it doesn't matter which ones are left as long as they're grouped.

Comments (2)

ぇ气 2024-08-16 06:37:35

For any grouping, the equality you group by must be transitive, that is: a ~= b, b ~= c => a ~= c.

Formulate the rule precisely in words and we'll try to express it in SQL.

For instance, which group should foo*bar go to?
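
To illustrate why transitivity matters, here is a minimal sketch of a rule that is not transitive: "two titles are equal when they differ by at most one edit". It assumes the fuzzystrmatch contrib extension (which provides levenshtein) can be installed; the threshold of 1 and the string faz are made up for illustration.

```sql
-- Assumption: the fuzzystrmatch contrib extension is available.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Hypothetical rule: two titles are "equal" when their edit distance is <= 1.
SELECT a,
       b,
       levenshtein(a, b)      AS distance,
       levenshtein(a, b) <= 1 AS same_group
FROM (VALUES
        ('bar', 'baz'),   -- one substitution
        ('baz', 'faz'),   -- one substitution
        ('bar', 'faz')    -- two substitutions
     ) AS pairs (a, b);
```

Under this rule bar matches baz and baz matches faz, but bar does not match faz, so the three titles cannot be split into consistent groups. It would also merge bar and baz, which the question wants kept apart. A rule of the form "equal when the normalized forms are identical" is transitive by construction.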

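Update:

This query upper-cases each title, strips all non-alphanumeric characters, and returns the first title (by id) from each resulting group:

SELECT  DISTINCT ON (REGEXP_REPLACE(UPPER(title), '[^[:alnum:]]', '', 'g')) title
FROM    (
        VALUES
        (1, '5. foo'),
        (2, '5.foo'),
        (3, '5. foo*'),
        (4, 'bar'),
        (5, 'bar*'),
        (6, 'baz'),
        (7, 'BAZ')
        ) AS t (id, title)
-- DISTINCT ON keeps the first row per key, so the sort order decides which title survives
ORDER BY REGEXP_REPLACE(UPPER(title), '[^[:alnum:]]', '', 'g'), id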
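
When tuning the normalization it can help to see which titles land in each group. The following is a sketch of a GROUP BY variant using the same expression as the DISTINCT ON query; the column names title_key, representative, variants and members are illustrative, and MIN(title) is just one arbitrary way to pick a survivor.

```sql
-- Sketch: same normalization, but GROUP BY so each group's members are visible.
SELECT  REGEXP_REPLACE(UPPER(title), '[^[:alnum:]]', '', 'g') AS title_key,
        MIN(title)                   AS representative,  -- arbitrary pick, depends on collation
        COUNT(*)                     AS variants,
        ARRAY_AGG(title ORDER BY id) AS members
FROM    (
        VALUES
        (1, '5. foo'),
        (2, '5.foo'),
        (3, '5. foo*'),
        (4, 'bar'),
        (5, 'bar*'),
        (6, 'baz'),
        (7, 'BAZ')
        ) AS t (id, title)
GROUP BY title_key
ORDER BY title_key;
```

With the sample rows this yields three groups, keyed 5FOO, BAR and BAZ. A title such as foo*bar would get the key FOOBAR and therefore form a group of its own.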

勿忘初心 2024-08-16 06:37:35

At some point, you are going to have to define what makes a set of values belong together in a group. If that's too hard, maybe you should prohibit and inhibit the entry of fuzzy data, or, if you must permit it, add a column that contains a sanitized version of the title for use by the grouping operations.
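
One possible way to keep such a sanitized column in sync is a stored generated column. This is a sketch only: it assumes PostgreSQL 12 or later, the table and column names (titles, title_key) are hypothetical, and the normalization is borrowed from the other answer rather than prescribed here.

```sql
-- Hypothetical table; the names titles and title_key are assumptions.
-- Stored generated columns require PostgreSQL 12 or later.
CREATE TABLE titles (
    id        integer PRIMARY KEY,
    title     text NOT NULL,
    -- Sanitized version of the title, recomputed by Postgres on every write.
    title_key text GENERATED ALWAYS AS (
        regexp_replace(upper(title), '[^[:alnum:]]', '', 'g')
    ) STORED
);

-- Optional: index the key if grouping or lookups by it are frequent.
CREATE INDEX titles_title_key_idx ON titles (title_key);

-- Grouping then works on the sanitized column only.
SELECT title_key, min(title) AS title, count(*) AS variants
FROM   titles
GROUP  BY title_key;
```

On older versions, a plain column filled by a trigger or by the application would serve the same purpose.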
