How to create simple fuzzy search with PostgreSQL only?

Posted 2024-12-09 10:33:49

I have a small problem with the search functionality on my RoR-based site. I have many Products, each with a CODE. The code can be any string, such as "AB-123-lHdfj". Currently I use the ILIKE operator to find products:

Product.where("code ILIKE ?", "%" + params[:search] + "%")

It works fine, but it can't find products with codes like "AB123-lHdfj" or "AB123lHdfj".

What should I do about this? Maybe Postgres has some string normalization function, or some other method that could help me?

小清晰的声音 2024-12-16 10:33:49

Postgres provides a module with several string comparison functions such as soundex and metaphone. But you will want to use the levenshtein edit distance function.

Example:

test=# SELECT levenshtein('GUMBO', 'GAMBOL');
 levenshtein
-------------
           2
(1 row)

The 2 is the edit distance between the two words. When you apply this against a number of words and sort by the edit distance result you will have the type of fuzzy matches that you're looking for.
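To make the number less of a black box, here is a minimal Ruby sketch of the classic dynamic-programming edit distance. It only illustrates the idea; it is not the code inside the fuzzystrmatch module, and the method name is chosen just to mirror the SQL function.

```ruby
# Illustrative edit distance: insertions, deletions and substitutions
# each cost 1. Comparison is case-sensitive.
def levenshtein(a, b)
  # prev[j] = distance between the part of a processed so far and b[0, j]
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [
        prev[j] + 1,        # delete a character from a
        curr[j - 1] + 1,    # insert a character into a
        prev[j - 1] + cost  # substitute, or keep a matching character
      ].min
    end
    prev = curr
  end
  prev.last
end

puts levenshtein('GUMBO', 'GAMBOL')             # prints 2
puts levenshtein('AB-123-lHdfj', 'AB123-lHdfj') # prints 1
```

With the codes from the question, one missing dash costs one edit and two missing dashes cost two, so a small threshold would catch both variants.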

Try this query sample: (with your own object names and data of course)

SELECT * 
FROM some_table
WHERE levenshtein(code, 'AB123-lHdfj') <= 3
ORDER BY levenshtein(code, 'AB123-lHdfj')
LIMIT 10

This query says:

Give me the top 10 results from some_table where the edit distance between the code value and the input 'AB123-lHdfj' is at most 3. You will get back rows whose code is within 3 single-character edits of 'AB123-lHdfj', closest matches first.

Note: if you get an error like:

function levenshtein(character varying, unknown) does not exist

Install the fuzzystrmatch extension using:

test=# CREATE EXTENSION fuzzystrmatch;

っ左 2024-12-16 10:33:49

Paul told you about levenshtein(). That's a very useful tool, but it's also very slow with big tables. It has to calculate the Levenshtein distance from the search term for every single row. That's expensive and cannot use an index. The "accelerated" variant levenshtein_less_equal() is faster for long strings, but still slow without index support.

If your requirements are as simple as the example suggests, you can still use LIKE. Just replace any - in your search term with % in the WHERE clause. So instead of:

WHERE code ILIKE '%AB-123-lHdfj%'

Use:

WHERE code ILIKE '%AB%123%lHdfj%'

Or, dynamically:

WHERE code ILIKE '%' || replace('AB-123-lHdfj', '-', '%') || '%'
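On the Rails side, the same replacement could be done before the term reaches the query. The sketch below is one possible way, not the only one: the helper name is made up for illustration, and it assumes the default LIKE escape character (backslash) so that a literal % or _ typed by the user does not act as a wildcard.

```ruby
# Hypothetical helper: build an ILIKE pattern in which every dash of the
# search term may match any run of characters (including none).
def dash_tolerant_like_pattern(term)
  # Escape LIKE metacharacters so user input is matched literally.
  escaped = term.gsub(/[\\%_]/) { |ch| "\\#{ch}" }
  # Then turn each dash into a wildcard and wrap for substring search.
  "%#{escaped.gsub('-', '%')}%"
end

puts dash_tolerant_like_pattern('AB-123-lHdfj') # prints %AB%123%lHdfj%

# Assumed usage with the model from the question:
#   Product.where("code ILIKE ?", dash_tolerant_like_pattern(params[:search]))
```

Note that this only helps when the search term contains the dashes; a search for "AB123lHdfj" would still miss a stored "AB-123-lHdfj".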

% in LIKE patterns stands for 0-n characters. Or use _ for exactly one character. Or use regular expressions for a smarter match:

WHERE code ~* 'AB.?123.?lHdfj'

.? ... 0 or 1 characters

Or:

WHERE code ~* 'AB\-?123\-?lHdfj'

\-? ... 0 or 1 dashes
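Such a pattern could be generated from the search term instead of being written by hand. This is a rough sketch with a made-up helper name. It uses Ruby's Regexp.escape for the literal parts, on the assumption that its escaping is also acceptable to PostgreSQL's regex engine for typical product codes; the two syntaxes are not identical, so check the result against your real data.

```ruby
# Hypothetical helper: turn a search term into a regex source string in
# which every dash is optional.
def dash_optional_regex(term)
  term.split('-').map { |part| Regexp.escape(part) }.join('-?')
end

pattern = dash_optional_regex('AB-123-lHdfj')
puts pattern # prints AB-?123-?lHdfj

# Assumed usage with the model from the question:
#   Product.where("code ~* ?", pattern)
```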

You may want to escape special characters in LIKE or regexp patterns. See:

Escape function for regular expression or LIKE patterns

If your actual problem is more complex and you need something faster, there are various options, depending on your requirements:

There is full text search, of course. But that may be overkill in your case.
A more likely candidate is trigram matching with the additional module pg_trgm. It can be combined with LIKE, ILIKE, ~, or ~* since PostgreSQL 9.1.
Also interesting in this context: the similarity() function or % operator of that module.
Last but not least, you can implement a hand-knit solution with a function to normalize the strings to be searched. For instance, you could transform AB1-23-lHdfj --> ab123lhdfj, save it in an additional column, and search with terms transformed the same way.
Or use an index on the expression instead of the redundant column. (Involved functions must be IMMUTABLE.) Possibly combine that with pg_trgm from above.

Overview of pattern matching techniques:

Pattern matching with LIKE, SIMILAR TO or regular expressions
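The hand-knit normalization could look roughly like this on the application side. Everything named here is an assumption for illustration: the helper name, the rule "lowercase and keep only ASCII letters and digits", the products table name, and the index name. The SQL string is just one way to express the same rule as an expression index.

```ruby
# Hypothetical normalizer: lowercase and drop everything that is not an
# ASCII letter or digit, so dash placement no longer matters.
def normalize_code(code)
  code.downcase.gsub(/[^a-z0-9]/, '')
end

# Assumed SQL counterpart of the same rule, usable as an expression index.
# Table and index names are placeholders.
NORMALIZED_CODE_INDEX_SQL = <<~SQL
  CREATE INDEX products_code_normalized_idx
  ON products (lower(regexp_replace(code, '[^a-zA-Z0-9]', '', 'g')));
SQL

puts normalize_code('AB1-23-lHdfj') # prints ab123lhdfj

# Assumed usage: compare the normalized search term with the same
# expression that the index is built on.
#   Product.where(
#     "lower(regexp_replace(code, '[^a-zA-Z0-9]', '', 'g')) = ?",
#     normalize_code(params[:search])
#   )
```

A plain index like this serves equality lookups on the normalized value. For substring matches on the normalized value, the trigram index mentioned above would be the thing to combine it with.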