How can I do fuzzy string search without a heavy database?

Posted 2024-07-19 06:44:18


I have a mapping of catalog numbers to product names:

35  cozy comforter
35  warm blanket
67  pillow

and need a search that would find misspelled, mixed names like "warm cmfrter".

We have code using edit-distance (difflib), but it probably won't scale to the 18000 names.

I achieved something similar with Lucene, but as PyLucene only wraps Java that would complicate deployment to end-users.

SQLite doesn't usually have full-text or scoring compiled in.

The Xapian bindings feel like C++ and come with some learning curve.

Whoosh is not yet well-documented but includes an abusable spell-checker.

What else is there?
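A minimal sketch of the difflib approach mentioned above (the catalog names are the example from the question; `n` and `cutoff` are illustrative):

```python
import difflib

# Illustrative catalog names from the question
names = ["cozy comforter", "warm blanket", "pillow"]

# difflib compares the query against every name in turn, so the cost
# grows linearly with catalog size -- hence the scaling concern at 18000 names
matches = difflib.get_close_matches("warm cmfrter", names, n=3, cutoff=0.4)
```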


6 Answers

冬天旳寂寞 2024-07-26 06:44:18


Apparently the only way to make fuzzy comparisons fast is to do less of them ;)

Instead of writing another n-gram search or improving the one in Whoosh we now keep a word index, retrieve all entries that have at least one (correctly spelled) word in common with the query, and use difflib to rank those. Works well enough in this case.
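The scheme described above might be sketched like this (the catalog data is the example from the question; the index structure and ranking details are assumptions):

```python
import difflib
from collections import defaultdict

# Example catalog from the question: (number, product name)
CATALOG = [("35", "cozy comforter"), ("35", "warm blanket"), ("67", "pillow")]

# Inverted word index: word -> set of product names containing it
word_index = defaultdict(set)
for number, name in CATALOG:
    for word in name.lower().split():
        word_index[word].add(name)

def search(query, limit=5):
    # Candidates: entries sharing at least one correctly spelled word
    candidates = set()
    for word in query.lower().split():
        candidates |= word_index.get(word, set())
    # Rank only the candidates with difflib, not the whole catalog
    return sorted(candidates,
                  key=lambda n: difflib.SequenceMatcher(None, query, n).ratio(),
                  reverse=True)[:limit]
```

For "warm cmfrter" the word "warm" alone narrows the candidates to one entry, so difflib only has to rank a handful of strings.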

玩世 2024-07-26 06:44:18


You will get too many false positives with the SOUNDEX implementation. There are only 26,000 (at most) possible SOUNDEX codes.

Though the Metaphone algorithm was designed for English surnames, it works pretty well for misspellings; I used it once in a branch locator and it was quite successful.

Add a field with the Metaphone translation and match against it if no exact match is found. You will still get false positives, but fewer than with other algorithms.

满意归宿 2024-07-26 06:44:18



Nucular has full-text search, but it doesn't support misspelled matches out of the box. You could try adding an additional field to each entry which indexes the SOUNDEX translation of the terms, and then search using the SOUNDEX translation of the user input. I really don't know how well this would work...

Take a look at advas http://advas.sourceforge.net/news.php which has a nice demo comparing various soundex-alike methods:

advas/examples Aaron$ python phonetic_algorithms.py 
                    soundex       metaphone           nyiis      caverphone 
====================================================================================================
 schmidt :             S253           sxmtt          sssnad      SKMT111111
  schmid :             S253            sxmt          sssnad      SKMT111111
 schmitt :             S253            sxmt         sssnatt      SKMT111111
   smith :             S530            sm0h           snatt      SMT1111111
  smythe :             S530           smy0h           snatt      SMT1111111
 schmied :             S253            sxmt         sssnaad      SKMT111111
   mayer :             M600             myr           naaar      MA11111111
   meier :             M600              mr           naaar      MA11111111
....

I don't know if any of them is good for your unnamed language...

記憶穿過時間隧道 2024-07-26 06:44:18


The usual way of computing the edit-distance between two strings is a rather expensive algorithm (if I remember correctly, its time complexity is quadratic). Maybe if you used a different string similarity metric then your problem would go away.

One of my favorite methods for fuzzy string matching is trigram matching. Comparing two strings with this method has linear time complexity, which is much better than the edit distance mentioned above. You can find my Python implementation on Github. There's also a PostgreSQL contrib module for exactly that purpose. It shouldn't be too difficult to adapt it to work with SQLite3.
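A minimal sketch of trigram matching (the padding convention roughly follows PostgreSQL's pg_trgm; scoring via Jaccard similarity of the trigram sets is an assumption, as implementations differ):

```python
def trigrams(s):
    # Pad so that word boundaries produce their own trigrams,
    # roughly as pg_trgm does: two leading spaces, one trailing
    s = "  " + s.lower() + " "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    # Jaccard similarity of the trigram sets -- linear in string length
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

catalog = ["cozy comforter", "warm blanket", "pillow"]
best = max(catalog, key=lambda name: similarity("warm cmfrter", name))
```

The misspelled "cmfrter" still shares trigrams like "ter" with "comforter", while the intact word "warm" pulls "warm blanket" to the top.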

凉薄对峙 2024-07-26 06:44:18


Sybase SQL Anywhere has a free Web edition/Developer Edition and comes with full text indexing/search and a FUZZY operator (and some implementation constraints).

To quote from the documentation:

Specifying 'FUZZY "500 main street"' is equivalent to 
'500 OR mai OR ain OR str OR tre OR ree OR eet'. 

Another approach would be to use scoring on full-text searches.
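The quoted expansion can be reproduced in a few lines, which also makes clear that FUZZY is essentially trigram matching (the handling of words of three letters or fewer is an assumption, inferred from the '500' term staying whole):

```python
def fuzzy_terms(phrase, n=3):
    # Emulate the documented FUZZY expansion: split the phrase into words,
    # keep short words whole, and OR together every n-gram of longer words
    terms = []
    for word in phrase.split():
        if len(word) <= n:
            terms.append(word)
        else:
            terms.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return " OR ".join(terms)
```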

停顿的约定 2024-07-26 06:44:18


Python's sqlite3 module supports registering Python callback functions, and Matthew Barnett's regex module (http://code.google.com/p/mrab-regex-hg/) now supports approximate matching.

So, like this:

import sqlite3
import sys

try:
    import regex  # Matthew Barnett's regex module, with fuzzy-matching support
except ImportError:
    sys.stderr.write("Can't import mrab-regex; see http://pypi.python.org/pypi/regex\n")
    sys.exit(1)

def _sqlite3_regex(expr, item):
    # REGEXP callback for sqlite3: True if the pattern matches the value
    return regex.search(expr, item) is not None

def main():
    ...
    database = sqlite3.connect(dbfile)
    database.create_function("regexp", 2, _sqlite3_regex)
    # {e<=N} is the regex module's fuzzy syntax: allow up to N errors
    pattern = '(?:%s){e<=%d}' % (queriedname, distance)
    print([row for row in database.cursor().execute(
        "SELECT * FROM products WHERE productname REGEXP ?", (pattern,))])