Document search on partial words
I am looking for a document search engine (such as Xapian, Whoosh, Lucene, Solr, Sphinx or others) that is capable of searching on partial terms.
For example, when searching for the term "brit", the search engine should return documents containing either "britney" or "britain", or in general any document containing a word matching the pattern *brit*.
Tangentially, I noticed that most engines use TF-IDF (term frequency-inverse document frequency) or its derivatives, which are based on full terms rather than partial terms. Are there any other techniques besides TF-IDF that have been successfully implemented for document retrieval?
Comments (1)
With Lucene you could implement this in several ways:
1.) You can use wildcard queries such as *brit* (you would have to set your query parser to allow leading wildcards).
2.) You can create an additional field containing N-grams of all the terms. This results in larger indexes, but in many cases makes searching faster.
3.) You can use fuzzy search to handle typing mistakes in the query, e.g. someone typed britnei but wanted to find britney.
For wildcard queries and fuzzy search, have a look at the query syntax docs.
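To make the N-gram idea in option 2 more concrete, here is a minimal in-memory sketch in plain Python. It is not Lucene's API: the names `ngrams` and `NGramIndex` are hypothetical, the tokenizer is a plain whitespace split, and trigrams (n=3) are an assumed choice. It only illustrates why an N-gram field lets a query like *brit* avoid scanning every term.

```python
from collections import defaultdict


def ngrams(term, n=3):
    """Return the set of character n-grams of a term."""
    return {term[i:i + n] for i in range(len(term) - n + 1)}


class NGramIndex:
    """Toy inverted index from character n-grams to document ids."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)  # n-gram -> ids of docs containing it
        self.docs = {}                    # doc id -> list of lowercased words

    def add(self, doc_id, text):
        words = text.lower().split()
        self.docs[doc_id] = words
        for word in words:
            for gram in ngrams(word, self.n):
                self.postings[gram].add(doc_id)

    def search(self, fragment):
        fragment = fragment.lower()
        grams = ngrams(fragment, self.n)
        if not grams:
            # Fragment shorter than n: fall back to checking every document.
            candidates = set(self.docs)
        else:
            # A matching document must contain every n-gram of the fragment.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        # Verify candidates, since the n-grams may come from different words
        # (e.g. "brick" and "merit" together contain "bri" and "rit").
        return sorted(
            doc_id for doc_id in candidates
            if any(fragment in word for word in self.docs[doc_id])
        )


index = NGramIndex()
index.add(1, "Britney Spears released a new album")
index.add(2, "Great Britain is an island")
index.add(3, "Paris is the capital of France")
index.add(4, "a brick of merit")
print(index.search("brit"))  # [1, 2]
```

The trade-off is the one mentioned above: every term contributes several postings, so the index grows, but a lookup only touches the posting lists for the fragment's n-grams and then verifies a small candidate set.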