Lucene.Net/SpellChecker - multi-word/phrase based auto-suggest
I've implemented Lucene.NET on my site and use it to index my products, which are theatre shows, tours and attractions around London.
I want to implement a "Did you mean?" feature for when users misspell product names. It should take the whole product title into account, not just single words. For example,
If the user typed:
Lodnon Eye
I would like to auto-suggest:
London Eye
I assume I need to have the analyzer index each title as if it were a single entity, so that SpellChecker can nearest-match on the phrase as well as on the individual words.
How would I do this?
Comments (3)
There is an excellent blog series here:
I have also found another project called SimpleLucene, which you can use to maintain your Lucene indexes whenever you need to update or delete a document. Read about it here
I recently implemented a phrase auto-suggest system in Lucene.Net.
The Java version of Lucene has a ShingleFilter in one of its contrib folders, which breaks a sentence down into all possible phrase combinations. Unfortunately, Lucene.Net's contrib filters aren't quite there yet, so there is no shingle filter on the .NET side.
However, a Lucene index written in Java can be read by Lucene.Net as long as the versions are the same. So this is what I did:
First, I created a spelling index in Lucene.Net using the SpellChecker.IndexDictionary method, as laid out in the "did you mean" section of Jake Scott's link. Note that this only creates a spelling index of single words, not phrases.
Next, I created a Java app that uses the shingle filter to build phrases from the text I'm searching and saves them in a temporary index.
Finally, I wrote another method in .NET that opens this temporary index and adds each phrase as a document to the spelling index that already contains the single words. The trick is to make sure the documents you add have the same form as the rest of the spelling documents, so I pulled out the methods used in the SpellChecker code in the Lucene.Net project and edited those.
Once you've done that, you can call the SpellChecker.SuggestSimilar method with a misspelled phrase and it will return a valid suggestion.
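To make the shingle step more concrete, here is a minimal sketch in plain Java of what "breaking a sentence into phrase combinations" looks like. It does not use Lucene at all: the class ShingleSketch and its shingles method are hypothetical names for illustration, and the lowercasing and whitespace splitting are assumptions rather than a description of what ShingleFilter does internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: builds word n-grams ("shingles")
// from a title, similar in spirit to the phrases a shingle filter produces.
class ShingleSketch {

    // Returns every run of minSize..maxSize consecutive words in the text.
    // Assumes minSize >= 1 and that words are separated by whitespace.
    static List<String> shingles(String text, int minSize, int maxSize) {
        List<String> result = new ArrayList<>();
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            for (int size = minSize; size <= maxSize && start + size <= words.length; size++) {
                // Join the words in the window [start, start + size) into one phrase.
                result.add(String.join(" ", Arrays.copyOfRange(words, start, start + size)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Each phrase printed here would become one entry in the spelling index.
        System.out.println(shingles("London Eye River Cruise", 2, 3));
    }
}
```

Each phrase produced this way would then be added to the spelling index next to the single words, so that a misspelled phrase has whole-phrase candidates to match against.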
This is probably not the best solution, and I would definitely use the answer suggested by spaceman, but here is another possible approach. Use the KeywordAnalyzer or the KeywordTokenizer on each title. This does not break the title down into separate tokens but keeps it as a single token, so the SuggestSimilar method returns whole titles as suggestions.
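The idea of treating each title as one token can be sketched without Lucene as a nearest-match over whole titles. The following plain Java example is only an illustration under assumptions: TitleSuggestSketch, editDistance and suggest are hypothetical names, and a simple Levenshtein comparison stands in for whatever matching SpellChecker performs on its index.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not Lucene code: compares the user's input against
// whole product titles, the way a single-token (keyword) index would.
class TitleSuggestSketch {

    // Levenshtein edit distance using two rolling rows of the DP table.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest title, or null if no title is within maxDistance edits.
    static String suggest(String input, List<String> titles, int maxDistance) {
        String needle = input.trim().toLowerCase(Locale.ROOT);
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String title : titles) {
            int distance = editDistance(needle, title.toLowerCase(Locale.ROOT));
            if (distance < bestDistance) {
                bestDistance = distance;
                best = title;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> titles = List.of("London Eye", "Tower of London", "London Dungeon");
        System.out.println(suggest("Lodnon Eye", titles, 3));
    }
}
```

One consequence of this design is that the input has to resemble a complete title: a query for a single misspelled word from a long title would be many edits away from it, which is why the phrase-based approach in the previous answer is more flexible.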