Fuzzy grouping in Postgres
I have a table with contents that look similar to this:
id | title
------------
1 | 5. foo
2 | 5.foo
3 | 5. foo*
4 | bar
5 | bar*
6 | baz
6 | BAZ
…and so on. I would like to group by the titles and ignore the extra bits. I know Postgres can do this:
SELECT * FROM (
    SELECT regexp_replace(title, '[*.]+$', '') AS title
    FROM table
) AS a
GROUP BY title
However, that's quite simple and would get very unwieldy if I tried to anticipate all the possible variations. So, the question is, is there a more general way to do fuzzy grouping than using regexp? Is it even possible, at least without breaking one's back doing it?
Edit: To clarify, there is no preference for any of the variations, and this is what the table should look like after grouping:
title
------
5. foo
bar
baz
I.e., the variations would be items that are different just by a few characters or capitalization, and it doesn't matter which ones are left as long as they're grouped.
Comments (2)
For any grouping you should have transitive equality, that is, a ~= b, b ~= c => a ~= c. Formulate it strictly using words and we'll try to formulate it using SQL. For instance, which group should foo*bar go to?
Update:
This query replaces all non-alphanumerical characters with spaces and returns the first title from each group:
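The query itself is missing from this copy of the answer. What follows is only a minimal sketch of a query in that spirit, not the answerer's original. It assumes a table named titles (hypothetical; the question's placeholder name, table, is a reserved word), inlined here as a CTE so the statement runs on its own, and it gives the second baz row id 7 so that "first" is well defined. Unlike the description above, it deletes non-alphanumeric characters rather than replacing them with spaces, so that '5. foo' and '5.foo' collapse to the same key, and it lowercases the key so that baz and BAZ match.

```sql
-- Sample rows from the question; the table name and the id 7 are assumptions.
WITH titles (id, title) AS (
    VALUES (1, '5. foo'), (2, '5.foo'), (3, '5. foo*'),
           (4, 'bar'),    (5, 'bar*'),
           (6, 'baz'),    (7, 'BAZ')
)
-- DISTINCT ON keeps one row per normalized key; ORDER BY decides which one.
SELECT DISTINCT ON (norm) title
FROM (
    SELECT id,
           title,
           -- Strip everything that is not a letter or digit, then fold case.
           lower(regexp_replace(title, '[^[:alnum:]]', '', 'g')) AS norm
    FROM titles
) AS t
ORDER BY norm, id;  -- lowest id wins within each group
```

On the sample rows this should leave one title per group: 5. foo, bar and baz. The normalization rule is still a definition of equality chosen by hand; a title such as foo*bar would land in a group of its own under the key foobar.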
At some time, you are going to have to define what makes a set of values belong together in a group. If that's too hard, maybe you should prohibit and inhibit the entry of fuzzy data, or if you must permit it, add a column that contains a sanitized version of the title for use by the grouping operations.
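The sanitized-column idea could look roughly like the sketch below. The table name titles and the column name title_norm are hypothetical, and the normalization rule (drop non-alphanumerics, fold to lower case) is only one possible definition of "belongs together".

```sql
-- Assumes an existing table titles(id, title); both names are illustrative.
ALTER TABLE titles ADD COLUMN title_norm text;

-- Fill the sanitized column once for the rows already present.
UPDATE titles
SET title_norm = lower(regexp_replace(title, '[^[:alnum:]]', '', 'g'));

-- Group on the sanitized value and show one representative title per group.
SELECT min(title) AS title,
       count(*)   AS variants
FROM titles
GROUP BY title_norm;
```

New or changed rows would need the same expression applied on write, for example from the application or a trigger. Keeping the rule in one place makes it easy to tighten later without touching the grouping queries.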