Lightweight fuzzy search library
Can you suggest a lightweight fuzzy text search library?
What I want to do is allow users to find the correct data even when their search terms contain typos.
I could use a full-text search engine like Lucene, but I think that's overkill.
Edit:
To make the question clearer, here is the main scenario for the library:
I have a large list of strings. I want to be able to search this list (something like MSVS's IntelliSense), but it should also be possible to filter the list by a string that is not in it but is close enough to some string that is.
Example:
- Red
- Green
- Blue
When I type 'Gren' or 'Geen' in a text box, I want to see 'Green' in the result set.
The main language of the indexed data will be English.
I think Lucene is too heavy for that task.
Update:
I found one product that matches my requirements: ShuffleText.
Do you know of any alternatives?
Comments (8)
@aku - links to working Soundex libraries are right there at the bottom of the page.
As for Levenshtein distance, the Wikipedia article on it also lists implementations at the bottom.
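For a list that fits in memory, Levenshtein distance can also be written by hand in a few lines. What follows is a minimal sketch, not any particular library's API: `levenshtein`, `fuzzy_filter`, and the `max_distance=1` default are names and values assumed here for illustration, and a linear scan like this is only reasonable for moderately sized lists.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def fuzzy_filter(query, candidates, max_distance=1):
    """Hypothetical helper: keep candidates within max_distance edits,
    closest first. Comparison is case-insensitive."""
    q = query.strip().lower()
    scored = [(levenshtein(q, c.lower()), c) for c in candidates]
    return [c for d, c in sorted(scored) if d <= max_distance]


colors = ["Red", "Green", "Blue"]
print(fuzzy_filter("Gren", colors))
print(fuzzy_filter("Geen", colors))
```

Both typos from the question are one edit away from 'Green', so a threshold of 1 is enough here; longer strings usually need a threshold that scales with their length.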
A powerful, lightweight solution is Sphinx.
It's smaller than Lucene and it supports disambiguation.
It's written in C++, it's fast and battle-tested, it has libraries for every environment, and it's used by large companies such as craigslist.org.
Lucene is very scalable, which means it's good for small applications too. You can create an index in memory very quickly if that's all you need.
For fuzzy searching, you really need to decide which algorithm you'd like to use. For information retrieval, I have used an n-gram technique with Lucene successfully. But that's a special indexing technique, not a "library" in itself.
Without knowing more about your application, it won't be easy to recommend a suitable library. How much data are you searching? What format is the data in? How often is the data updated?
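To make the n-gram idea concrete without involving Lucene at all, here is a rough in-memory sketch. It is not the setup described in this answer: the class name `NGramIndex`, the choice of trigrams, the space padding, the Jaccard score, and the 0.3 threshold are all assumptions picked for illustration.

```python
from collections import defaultdict


def trigrams(text):
    """Split text into overlapping 3-character grams, padded with spaces
    so that the start and end of the word produce grams of their own."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


class NGramIndex:
    """Hypothetical in-memory inverted index from trigram to string ids."""

    def __init__(self, strings):
        self.strings = list(strings)
        self.grams = [trigrams(s) for s in self.strings]
        self.postings = defaultdict(set)
        for idx, grams in enumerate(self.grams):
            for gram in grams:
                self.postings[gram].add(idx)

    def search(self, query, threshold=0.3):
        """Return strings whose Jaccard similarity to the query's trigram
        set is at least threshold, best match first."""
        q = trigrams(query)
        # Only strings sharing at least one trigram can score above zero.
        candidates = set()
        for gram in q:
            candidates |= self.postings.get(gram, set())
        results = []
        for idx in candidates:
            score = len(q & self.grams[idx]) / len(q | self.grams[idx])
            if score >= threshold:
                results.append((score, self.strings[idx]))
        results.sort(key=lambda r: (-r[0], r[1]))
        return [s for _, s in results]


index = NGramIndex(["Red", "Green", "Blue"])
print(index.search("Gren"))
print(index.search("Geen"))
```

The inverted index means a query only touches strings that share at least one trigram with it, which is what lets this approach scale past a plain linear scan. The threshold trades recall for precision and would need tuning on real data.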
Soundex is very 'English' in its encoding. Daitch-Mokotoff works better for many names, especially European (Germanic) and Jewish names. In my UK-centric world, it's what I use.
Wiki here.
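For readers who have not seen a phonetic code before, here is a small sketch of classic American Soundex. It is deliberately not Daitch-Mokotoff, which uses a much larger rule table; the function name `soundex` and the handling of non-letters are assumptions made for this example.

```python
# Consonant groups used by American Soundex; vowels and Y have no code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex code, or "" if word has no letters."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]                # the first letter is kept as-is
    prev = _CODES.get(letters[0], "")  # its code still suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue                   # H and W do not separate equal codes
        code = _CODES.get(ch, "")      # vowels map to "" and reset prev
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]        # pad with zeros, truncate to 4


for name in ("Green", "Gren", "Geen"):
    print(name, soundex(name))
```

On the question's example, 'Gren' gets the same code as 'Green' (G650) but 'Geen' does not (G500), because the dropped 'r' is a coded consonant. A phonetic code alone therefore misses some typos that an edit-distance check would catch.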
I'm not sure how well Lucene is suited to fuzzy searching; a custom library would be a better choice. For example, this search is written in Java and works pretty fast, but it is custom-made for this kind of task:
http://www.softcorporation.com/products/people/
You didn't specify your development platform, but if it's PHP, I suggest you look at the Zend Lucene library:
http://ifacethoughts.net/2008/02/07/zend-brings-lucene-to-php/
http://framework.zend.com/manual/en/zend.search.lucene.html
Since it runs on LAMP, it's far lighter than Lucene on Java, and it can easily be extended to other file types, provided you can find a conversion library or command-line converter. There are lots of OSS solutions around for this.
Try Walnutil. It is based on the Lucene API and integrates with SQL Server and Oracle databases. You can create any type of index and then use it. For simple searches you can use some of the methods from walnutilsoft; for more complicated cases you can use the Lucene API. See the web-based example, which uses indexes created with Walnutil Tools. There are also some code examples written in Java and C# that you can use to build different types of search.
The tool is free.
http://www.walnutilsoft.com/
If you can choose to use a database, I recommend PostgreSQL and its fuzzy string matching functions.
If you can use Ruby, I suggest looking into the amatch library.