Lucene ngram tokenizer for fuzzy phrase matching
I am trying to implement fuzzy phrase search (to match misspelled words) with Lucene. After reading various blogs, I thought I would try n-gram indexes for fuzzy phrase search.
But I couldn't find an n-gram tokenizer in my Lucene 3.4 JAR library. Has it been deprecated and replaced with something else? Currently I am using StandardAnalyzer, and I get decent results for exact matches of terms.
I have the following two requirements to handle.
1. My index has a document containing the phrase "xyz abc pqr". When I issue the query "abc xyz"~5, I get results. However, my requirement is to get the same document back even when my query contains an extra word, e.g. "abc xyz pqr tst" (I understand the match score will be a little lower). Using proximity, an extra word in the phrase does not work; if I remove the proximity and the double quotes from my query, I get the expected results, but then I also get many false positives, such as documents containing only xyz, only abc, etc. (Both query shapes are sketched below.)
2. In the same example, if somebody misspells the query as "abc xxz", I still want the same document to be returned.
I want to give n-grams a try, but I am not sure they will work as expected.
Any thoughts?
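To make requirement 1 concrete, here are the two query shapes being compared, as a minimal sketch assuming the Lucene 3.4 API (the field name "content" is illustrative, not from the original question):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

// Proximity form, i.e. "abc xyz"~5: every phrase term must occur in the
// document, within a slop of 5 positions. An extra query term such as
// "tst" that is absent from the document makes the whole phrase fail,
// which is why the sloppy phrase stops matching.
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("content", "abc"));
phrase.add(new Term("content", "xyz"));
phrase.setSlop(5);

// Bare-terms form, i.e. abc xyz without quotes: tolerant of extra query
// words, but any single matching term is enough to return a document,
// hence the false positives containing only xyz or only abc.
BooleanQuery loose = new BooleanQuery();
loose.add(new TermQuery(new Term("content", "abc")), BooleanClause.Occur.SHOULD);
loose.add(new TermQuery(new Term("content", "xyz")), BooleanClause.Occur.SHOULD);
```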
Comments (1)
Try to use BooleanQuery and FuzzyQuery, for example:
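A minimal sketch of that combination, assuming the Lucene 3.4 API (the field name "content" and the 0.7f minimum similarity are illustrative choices, not from the original answer):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;

BooleanQuery query = new BooleanQuery();
// Each term becomes a fuzzy clause, so a misspelling such as "xxz"
// can still match "xyz" in the indexed phrase. The second argument is
// the minimum similarity threshold used by Lucene 3.x FuzzyQuery.
query.add(new FuzzyQuery(new Term("content", "abc"), 0.7f), BooleanClause.Occur.MUST);
query.add(new FuzzyQuery(new Term("content", "xxz"), 0.7f), BooleanClause.Occur.MUST);
```

Requiring every clause with Occur.MUST avoids the single-term false positives described in the question; relaxing some clauses to Occur.SHOULD trades precision for recall when the query carries extra words like "tst".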