Levenshtein distance with multiple words across multiple columns
I'm trying to make search a bit more friendly and want to use the Levenshtein distance. This works well for short values, but if a value in a column is 25 characters long and the search term has only 3 characters, the distance is too large to be useful. In this case, it performs worse than the `LIKE` method. I solved this by splitting all words into their own rows using `regexp_split_to_table`. This is nice, but it still doesn't work if I have multiple words as input.
For example, let the data look as follows:
id | col1 | col2 |
---|---|---|
1 | one two | three |
2 | two | one |
3 | horse | tree |
4 | house | three |
Using `regexp_split_to_table` would transform this to:
id | col |
---|---|
1 | one |
1 | two |
1 | three |
2 | two |
2 | one |
3 | horse |
3 | tree |
4 | house |
4 | three |
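
To make the splitting step concrete, here is a minimal sketch in Python (not PostgreSQL) that turns each row into one `(id, word)` pair per whitespace-separated word, roughly what the `regexp_split_to_table` step is meant to produce. The names `rows` and `split_rows` are hypothetical and only serve the illustration; the data is the sample table above.

```python
import re

# Sample rows from the example table: (id, col1, col2).
rows = [
    (1, "one two", "three"),
    (2, "two", "one"),
    (3, "horse", "tree"),
    (4, "house", "three"),
]


def split_rows(rows):
    """Yield one (id, word) pair per whitespace-separated word in any column."""
    for row_id, *columns in rows:
        for value in columns:
            for word in re.split(r"\s+", value.strip()):
                if word:  # skip empty strings from blank columns
                    yield (row_id, word)


words = list(split_rows(rows))
for row_id, word in words:
    print(row_id, word)
```

Each row keeps its `id`, so the words can later be grouped back per row when scoring a search.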
If I search for `one tree`, I'd like to compare `one` with each word, also compare `tree` with each word, and then order by the sum of both distances.
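
One possible reading of this ranking, sketched in Python rather than SQL: for every row, take each search word's smallest Levenshtein distance to any word in that row, add those minimums up, and sort ascending. Taking the minimum per search word is an assumption about what "sum of both distances" should mean; `levenshtein`, `rank`, and `words_by_id` are hypothetical names for this illustration.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


# Words per row id, as produced by splitting the example table.
words_by_id = {
    1: ["one", "two", "three"],
    2: ["two", "one"],
    3: ["horse", "tree"],
    4: ["house", "three"],
}


def rank(search, words_by_id):
    """Return (total_distance, id) pairs, best match first.

    Assumption: each search word counts with its smallest distance
    to any word of the row; the per-word minimums are summed.
    """
    terms = search.split()
    scores = []
    for row_id, words in words_by_id.items():
        total = sum(min(levenshtein(t, w) for w in words) for t in terms)
        scores.append((total, row_id))
    return sorted(scores)


ranking = rank("one tree", words_by_id)
print(ranking)
```

With the sample data, row 1 ranks first: `one` matches exactly and `tree` is one edit away from `three`, giving a total of 1. Ties are broken by `id` here only because of how the tuples sort.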
I have no idea where to start. I also don't know whether this is the best approach (it seems somewhat excessive, but I'm not an expert). Maybe I'm overthinking this. I'd appreciate a hint in the right direction :).