如何在大型字符串数据库中找到字符串的最佳模糊匹配
我有一个字符串数据库(任意长度),其中包含超过一百万个项目(可能更多)。
我需要将用户提供的字符串与整个数据库进行比较,并检索相同的字符串(如果存在),否则返回最接近的模糊匹配(60% 相似度或更好)。 理想情况下,搜索时间应在一秒以下。
我的想法是在根据长度缩小数据库中的候选者范围后,使用编辑距离将每个数据库字符串与搜索字符串进行比较。
但是,由于我需要经常执行此操作,因此我正在考虑构建数据库字符串的索引以保存在内存中并查询索引,而不是直接查询数据库。
关于如何以不同的方式解决这个问题或如何构建内存索引有什么想法吗?
I have a database of strings (arbitrary length) which holds more than one million items (potentially more).
I need to compare a user-provided string against the whole database and retrieve an identical string if it exists or otherwise return the closest fuzzy match(es) (60% similarity or better). The search time should ideally be under one second.
My idea is to use edit distance for comparing each db string to the search string after narrowing down the candidates from the db based on their length.
However, as I will need to perform this operation very often, I'm thinking about building an index of the db strings to keep in memory and query the index, not the db directly.
Any ideas on how to approach this problem differently or how to build the in-memory index?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
这篇论文似乎准确地描述了您想要的内容。
Lucene (http://lucene.apache.org/) 也实现了 Levenshtein 编辑距离。
This paper seems to describe exactly what you want.
Lucene (http://lucene.apache.org/) also implements Levenshtein edit distance.
您没有提到您的数据库系统,但对于 PostrgreSQL,您可以使用以下 contrib 模块: trgm - PostgreSQL 的 Trigram 匹配
You didn't mention your database system, but for PostrgreSQL you could use the following contrib module: trgm - Trigram matching for PostgreSQL
如果您的数据库支持,您应该使用全文搜索。 否则,您可以使用像 lucene 及其各种实现这样的索引器。
If your database supports it, you should use full-text search. Otherwise, you can use an indexer like lucene and its various implementations.
计算 SOUNDEX 哈希(内置于许多 SQL 数据库引擎中)并用它建立索引。
SOUNDEX 是基于单词发音的哈希值,因此同一单词的拼写错误很可能具有相同的 SOUNDEX 哈希值。
然后找到搜索字符串的 SOUNDEX 哈希值,并对其进行匹配。
Compute the SOUNDEX hash (which is built into many SQL database engines) and index by it.
SOUNDEX is a hash based on the sound of the words, so spelling errors of the same word are likely to have the same SOUNDEX hash.
Then find the SOUNDEX hash of the search string, and match on it.
由于数据量很大,当插入记录时,我将计算语音算法的值并将其存储在索引列中,然后将(WHERE 子句)我的选择查询限制在该列的范围内。
Since the amount of data is large, when inserting a record I would compute and store the value of the phonetic algorithm in an indexed column and then constrain (WHERE clause) my select queries within a range on that column.
丹·古斯菲尔德。
A very extensive explanation of relevant algorithms is in the book Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology by Dan Gusfield.
https://en.wikipedia.org/wiki/Levenshtein_distance
Levenshtein算法已在一些DBMS中实现
(例如PostgreSql:http://www.postgresql.org/docs/9.1/static /fuzzystrmatch.html)
https://en.wikipedia.org/wiki/Levenshtein_distance
Levenshtein algorithm has been implemented in some DBMS
(E.g. PostgreSql: http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html)