Levenshtein on multiple words across multiple columns

Posted on 2025-01-20 16:02:37


I'm trying to make search a bit more friendly and wanted to exploit the Levenshtein distance. This works great in general, but if a value in a column is 25 characters long and the search term is only 3 characters, the distance is too large to be useful. In that case it performs worse than the LIKE method. I solved this by splitting all words into their own rows using regexp_split_to_table. This is nice, but it still doesn't work if the input contains multiple words.
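To make the length problem concrete: the edit distance between two strings can never be smaller than the difference of their lengths, because every surplus character costs at least one insertion or deletion. Writing $d_{\mathrm{lev}}$ for the Levenshtein distance (the notation is mine), the 25-versus-3 case works out as:

```latex
d_{\mathrm{lev}}(a, b) \;\ge\; \bigl|\, |a| - |b| \,\bigr|
\qquad\Longrightarrow\qquad
d_{\mathrm{lev}}(a, b) \;\ge\; |25 - 3| = 22
\quad \text{for } |a| = 25,\ |b| = 3
```

So even a perfect substring match on a 25-character value scores at least 22, which is why comparing word by word looked more promising.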

For example:

Suppose the data looks as follows:

id	col1	col2
1	one two	three
2	two	one
3	horse	tree
4	house	three

Using regexp_split_to_table would transform this to:

id	col
1	one
1	two
1	three
2	two
2	one
3	horse
3	tree
4	house
4	three

If I search for "one tree", I'd like to compare "one" with each word, also compare "tree" with each word, and then order the rows by the sum of both distances.
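To pin down what I mean, here is a rough sketch of the scoring in Python rather than PostgreSQL, just to show the logic. All function names (`levenshtein`, `split_words`, `rank`) are made up for illustration. It also rests on an assumption I haven't stated above: for each search term I take the smallest distance to any word in the row, and then sum those minima per row. In the database, the distance itself would presumably come from the `levenshtein` function of the `fuzzystrmatch` extension instead of the hand-written one here.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def split_words(rows):
    """Stand-in for the regexp_split_to_table step: id -> words from all columns."""
    return {
        row_id: [w for col in cols for w in re.split(r"\s+", col.strip()) if w]
        for row_id, *cols in rows
    }


def rank(rows, query):
    """Assumed scoring: per search term, take the closest word in the row; sum per row."""
    terms = query.split()
    scores = {
        row_id: sum(min(levenshtein(t, w) for w in words) for t in terms)
        for row_id, words in split_words(rows).items()
    }
    # Lowest summed distance first; ties are broken by id.
    return sorted(scores.items(), key=lambda item: (item[1], item[0]))


# The example data from above: (id, col1, col2)
rows = [
    (1, "one two", "three"),
    (2, "two", "one"),
    (3, "horse", "tree"),
    (4, "house", "three"),
]

for row_id, score in rank(rows, "one tree"):
    print(row_id, score)
```

With this scoring, row 1 comes out first with a summed distance of 1 ("one" matches exactly and "tree" is one edit away from "three"), rows 2 and 3 tie at 3, and row 4 is last at 4. Whether taking the per-term minimum is the right aggregation is exactly the part I'm unsure about.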

I have no idea where to start. I also don't know whether this is the best approach (it seems somewhat excessive, but I'm not an expert). Maybe I'm overthinking this. I'd appreciate a hint in the right direction :).
