Fuzzy search in PHP/MySQL on a small scale
I'm looking to implement fuzzy search for a small PHP/MySQL application. Specifically, I have a database with about 2,400 records (records are added at a rate of about 600 per year, so it's a small database). The three fields of interest are street address, last name and date. I want to be able to search by one of those fields and have tolerance for spelling/character errors. For example, an address of "123 Main Street" should also match "123 Main St", "123 Main St.", "123 Mian St", "123 Man St", "132 Main St", and so on; likewise for name and date.
The main issues I have with answers to other similar questions:
- It's impossible to define synonyms for every possible incorrect spelling, let alone for dates and names.
- Lucene and similar tools seem very heavyweight for such a limited search data set (call it a maximum of 5,000 records, 3 fields per record).
- Just doing something with wildcards doesn't seem logical given all of the possible spelling errors.
Any suggestions? I know it isn't going to be possible to do natively with MySQL, but since the data set is so limited, I'd like to keep it relatively simple... perhaps a PHP class that gets all of the records from the DB, uses some sort of comparison algorithm, and returns the IDs of the similar records?
Thanks,
Jason
Comments (2)
Razzie's answer (or using Damerau–Levenshtein) ranks a list of candidate matches according to their closeness to the search key. (Take care: if the key is "12 Main St", then "13 Main St" has the same edit distance as "12 Moin St", but you might want to rank it low or even exclude it, as with "11 Main St", "22 Main St", etc.)
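To make the ranking caveat concrete, here is a minimal sketch, not a definitive implementation. It uses the optimal string alignment variant of Damerau–Levenshtein (PHP's built-in levenshtein() counts a swap of two adjacent characters as two edits, so "Mian" vs "Main" scores 2 there). The function names, the penalty of 5 and the cut-off of 3 are my own assumptions for illustration, and the house-number rule is just one possible heuristic: a number whose digits are merely reordered ("132" for "123") is tolerated, any other change in the number is penalised.

```php
<?php
// Optimal string alignment distance: insertions, deletions, substitutions
// and swaps of two adjacent characters each cost 1. Works on bytes, so it
// is only a sketch for plain ASCII input.
function osaDistance(string $a, string $b): int
{
    $la = strlen($a);
    $lb = strlen($b);
    $d = [];
    for ($i = 0; $i <= $la; $i++) {
        $d[$i] = [$i];          // first column: delete i characters
    }
    for ($j = 0; $j <= $lb; $j++) {
        $d[0][$j] = $j;         // first row: insert j characters
    }
    for ($i = 1; $i <= $la; $i++) {
        for ($j = 1; $j <= $lb; $j++) {
            $cost = ($a[$i - 1] === $b[$j - 1]) ? 0 : 1;
            $d[$i][$j] = min(
                $d[$i - 1][$j] + 1,          // deletion
                $d[$i][$j - 1] + 1,          // insertion
                $d[$i - 1][$j - 1] + $cost   // substitution or match
            );
            if ($i > 1 && $j > 1
                && $a[$i - 1] === $b[$j - 2]
                && $a[$i - 2] === $b[$j - 1]) {
                // adjacent transposition
                $d[$i][$j] = min($d[$i][$j], $d[$i - 2][$j - 2] + 1);
            }
        }
    }
    return $d[$la][$lb];
}

// Hypothetical scoring rule (an assumption, not from the answer above):
// edit distance plus a fixed penalty when the leading house numbers do
// not consist of the same digits.
function addressScore(string $query, string $candidate, int $numberPenalty = 5): int
{
    $q = strtolower(trim($query));
    $c = strtolower(trim($candidate));
    $score = osaDistance($q, $c);
    if (preg_match('/^\d+/', $q, $mq) && preg_match('/^\d+/', $c, $mc)) {
        $dq = str_split($mq[0]);
        $dc = str_split($mc[0]);
        sort($dq);
        sort($dc);
        if ($dq !== $dc) {
            $score += $numberPenalty;   // a different number, not a swap
        }
    }
    return $score;
}

// Returns [id => score] for candidates within $maxScore, closest first.
function rankAddresses(string $query, array $candidates, int $maxScore = 3): array
{
    $scores = [];
    foreach ($candidates as $id => $text) {
        $score = addressScore($query, $text);
        if ($score <= $maxScore) {
            $scores[$id] = $score;
        }
    }
    asort($scores);   // ascending by score, keys (ids) preserved
    return $scores;
}
```

With these assumed defaults, searching for "12 Main St" keeps "12 Moin St" and "21 Main St" but drops "13 Main St". Tune the penalty and cut-off against your real data.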
But how do you select a list of candidates of a manageable size to rank?
One way is to compute the metaphone value (or values, using double-metaphone) for each word in the strings you're going to search. Save each of these metaphones in another table, along with the id of the row containing the original string. You can then search these metaphone values quickly with LIKE 'key%', where key is the metaphone of a word from the search text.
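A rough sketch of that idea, using PHP's built-in metaphone(). To keep it self-contained the "table" is a plain array and the prefix test stands in for LIKE 'key%'; in MySQL the same rows would live in something like a two-column table (record_id, metaphone) with an index on metaphone. Those table, column and function names are hypothetical. Note that metaphone() only looks at letters, so house numbers are not indexed here and would need separate handling.

```php
<?php
// Metaphone key of every word in $text (letters only, duplicates removed).
function metaphoneKeys(string $text): array
{
    $words = preg_split('/[^A-Za-z]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $keys = [];
    foreach ($words as $word) {
        $key = metaphone($word);
        if ($key !== '') {
            $keys[] = $key;
        }
    }
    return array_values(array_unique($keys));
}

// Build the lookup rows: one row per (word metaphone, record id).
// $records is [id => original string], e.g. fetched from the main table.
function buildMetaphoneIndex(array $records): array
{
    $index = [];
    foreach ($records as $id => $text) {
        foreach (metaphoneKeys($text) as $key) {
            $index[] = ['metaphone' => $key, 'record_id' => $id];
        }
    }
    return $index;
}

// Ids of records where any indexed word's metaphone starts with the
// metaphone of any query word (the in-memory equivalent of LIKE 'key%').
function candidateIds(array $index, string $query): array
{
    $ids = [];
    foreach (metaphoneKeys($query) as $key) {
        foreach ($index as $row) {
            if (strncmp($row['metaphone'], $key, strlen($key)) === 0) {
                $ids[$row['record_id']] = true;
            }
        }
    }
    return array_keys($ids);
}
```

Because any single matching word qualifies a record, a common word such as "St" pulls in many rows. That is fine for a few thousand records, since the candidates still get ranked afterwards, but you could also require several words to match.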
Check out the suggested answer on this thread. It's quite neat and should work nicely for DBs that aren't huge.
If it is a very small database, you could load all the data at once and use an algorithm like Jaro-Winkler for your search. There is an implementation in PHP, which you can find here.
IMHO it works really well. Take a look at an example implementation here. That search uses the same algorithm, and it finds 'Nintedno' without trouble. It also sorts the results for you, based on which result best matches your query.
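In case the linked implementation is not available, here is a self-contained sketch of the same approach: load every value into an array, score each one with Jaro-Winkler, and return the ids sorted by score. This is my own illustration and not the implementation referred to above; the function names and the 0.85 threshold are assumptions. It compares bytes, so multibyte names would need an mb_* based variant.

```php
<?php
// Jaro similarity in [0, 1]; 1.0 means identical strings.
function jaro(string $s1, string $s2): float
{
    $len1 = strlen($s1);
    $len2 = strlen($s2);
    if ($len1 === 0 || $len2 === 0) {
        return ($len1 === $len2) ? 1.0 : 0.0;
    }
    // Characters count as matching only within this distance of each other.
    $window = max(0, intdiv(max($len1, $len2), 2) - 1);
    $matched1 = array_fill(0, $len1, false);
    $matched2 = array_fill(0, $len2, false);
    $matches = 0;
    for ($i = 0; $i < $len1; $i++) {
        $lo = max(0, $i - $window);
        $hi = min($len2 - 1, $i + $window);
        for ($j = $lo; $j <= $hi; $j++) {
            if (!$matched2[$j] && $s1[$i] === $s2[$j]) {
                $matched1[$i] = true;
                $matched2[$j] = true;
                $matches++;
                break;
            }
        }
    }
    if ($matches === 0) {
        return 0.0;
    }
    // Count matched characters that appear in a different order.
    $k = 0;
    $outOfOrder = 0;
    for ($i = 0; $i < $len1; $i++) {
        if (!$matched1[$i]) {
            continue;
        }
        while (!$matched2[$k]) {
            $k++;
        }
        if ($s1[$i] !== $s2[$k]) {
            $outOfOrder++;
        }
        $k++;
    }
    $t = $outOfOrder / 2;   // transpositions
    return ($matches / $len1 + $matches / $len2 + ($matches - $t) / $matches) / 3;
}

// Jaro-Winkler: boosts the Jaro score for a shared prefix of up to 4 chars.
function jaroWinkler(string $s1, string $s2, float $p = 0.1): float
{
    $j = jaro($s1, $s2);
    $max = min(4, strlen($s1), strlen($s2));
    $prefix = 0;
    while ($prefix < $max && $s1[$prefix] === $s2[$prefix]) {
        $prefix++;
    }
    return $j + $prefix * $p * (1 - $j);
}

// $records is [id => value], e.g. every last name loaded from the table.
// Returns [id => score] for values scoring at least $threshold, best first.
function fuzzySearch(string $query, array $records, float $threshold = 0.85): array
{
    $q = strtolower(trim($query));
    $scores = [];
    foreach ($records as $id => $value) {
        $score = jaroWinkler($q, strtolower(trim($value)));
        if ($score >= $threshold) {
            $scores[$id] = $score;
        }
    }
    arsort($scores);   // descending by score, keys (ids) preserved
    return $scores;
}
```

For a few thousand rows, scoring every record on each search should be cheap enough that no pre-filtering is needed; the threshold is the main thing to tune.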