Approximate search in a database
I have a large database with a list of institutions (universities, hospitals, etc.). The names come from different sources, so the same institution can be spelled in different ways: a name can be misspelled, for example, or words can be shortened ("uni", "univ", or "university").
Given a name that I need to insert into the database, is there a practical way to find out whether this institution is already in the database? This is not a research project, so I am looking for a solution that is reasonably fast.
I am using Django and PostgreSQL, but I suppose that does not matter.
Comments (3)
It sounds like you want to find a value in the database with a small lexical distance from the value you're given. Finding things with prefixes is fairly straightforward, but misspelled words are harder. You might want to read Peter Norvig's post on spell correctors.
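One way to make "small lexical distance" concrete is Levenshtein edit distance. The sketch below is only an illustration: the function names `edit_distance` and `closest_names` and the `max_distance=2` threshold are assumptions, not something from the answer, and a linear scan like this would likely be too slow for a large table without some indexing in front of it.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_names(query, names, max_distance=2):
    """Return (distance, name) pairs within max_distance, nearest first.
    Comparison is case-insensitive; max_distance is an arbitrary cutoff."""
    q = query.lower()
    scored = [(edit_distance(q, name.lower()), name) for name in names]
    return sorted((d, name) for d, name in scored if d <= max_distance)
```

For example, `closest_names("Stanford Univeristy", known_names)` would surface "Stanford University" if it is in the hypothetical `known_names` list, since the swapped letters cost two edits. Plain edit distance does not handle abbreviations such as "univ" well, so those probably need a separate normalization step.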
This is the problem of record linkage. Many databases provide basic methods for this, such as character-level n-gram matching, where a term like "university" is expanded into
["uni", "niv", "ive", "ver", "ers", ...]
for n = 3. The database would index all such n-grams and allow a search with some kind of weighted matching. pg_trgm seems to do exactly this; try it out.
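To show what n-gram matching computes, here is a rough pure-Python sketch that scores two names by the overlap of their character trigrams (a Jaccard ratio). The helper names `ngrams` and `similarity` are made up for this illustration, and the details are assumptions: pg_trgm itself pads each word with spaces before extracting trigrams, so its scores will differ from these.

```python
def ngrams(text, n=3):
    """Set of character n-grams taken from each word of text, lowercased.
    Words no longer than n are kept whole (an assumption of this sketch)."""
    grams = set()
    for word in text.lower().split():
        if len(word) <= n:
            grams.add(word)
        else:
            grams.update(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def similarity(a, b, n=3):
    """Jaccard overlap of the two n-gram sets: 0.0 (disjoint) to 1.0 (identical sets)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)
```

With this scoring, "Univ of Oxford" shares all of its trigrams with "University of Oxford" and none with "Harvard Medical School", so shortened words still match. In PostgreSQL the equivalent work is done inside the database: after `CREATE EXTENSION pg_trgm`, the `similarity()` function and the `%` operator can use a trigram index, and Django exposes this through `TrigramSimilarity` in `django.contrib.postgres.search`.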
You should probably look into using a dedicated search engine. Django-haystack enables you to easily add search engines like Solr, Whoosh or Xapian to your project.