I was about to integrate Sphinx-based search into a website, but I've found that there's no built-in support for spelling correction.
People on the web suggest using pspell or other third-party libraries, but the data I'm going to search contains mostly "technical" terms such as brand names, so I don't think general-purpose dictionaries will include them.
On the other hand, Xapian claims to support spelling correction based on the indexed data, which is exactly what I want. Is it worth using Xapian instead? I'm still quite confused about which full-text search engine I should use: Sphinx seems to be quite good but lacks some cool features of Xapian (or maybe Lucene?), while the latter seems to have a smaller community and less documentation.
I think I can handle words missing from the pspell dictionary by adding a custom dictionary, but I'm not sure whether that will impose a noticeable performance penalty. I'm going to use the search system for spotlight search (a separate search via AJAX on every letter entered) on a pretty popular website, so performance matters.
Ideally, I'd like some fields, such as brand names, to take priority over the common dictionary, but I guess that's not really important since most brand names are quite distinct from other words.
Any suggestions on the general design of a custom full-text search engine are welcome too.
Thanks
Comments (2)
Sphinx has no built-in spelling correction, but it can be implemented on top of Sphinx. Only one how-to article about this (by the Sphinx author) can be found, at http://habrahabr.ru/blogs/sphinx/61807 (in Russian; you can use Google Translate to read it. See the second part of the article, titled "Я понял, это намек.")
I implemented that method recently, and it works perfectly!
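The linked article is not reproduced here, so the following is only a minimal sketch of one common way to derive suggestions from the indexed data itself: split every indexed word into character trigrams, look up candidates that share trigrams with the mistyped query, and rank them by edit distance and frequency. Whether this matches the article's method is an assumption. The `Suggester` class, the `word_freqs` input (for example, a word-frequency list exported from your index), and the sample brand names are all hypothetical, and the in-memory dictionary stands in for what would be a separate search index in a real deployment.

```python
from collections import defaultdict


def trigrams(word):
    """Return the set of character trigrams of a word, padded with underscores."""
    padded = "__" + word.lower() + "__"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


class Suggester:
    """Hypothetical in-memory suggester built from indexed words and their frequencies."""

    def __init__(self, word_freqs):
        self.freqs = {word.lower(): freq for word, freq in word_freqs.items()}
        self.index = defaultdict(set)  # trigram -> words containing it
        for word in self.freqs:
            for gram in trigrams(word):
                self.index[gram].add(word)

    def suggest(self, query, max_distance=2):
        """Return the best correction for query, or None if nothing is close enough."""
        query = query.lower()
        if query in self.freqs:
            return query  # already a known word, nothing to correct
        # Candidates are the words sharing at least one trigram with the query.
        candidates = set()
        for gram in trigrams(query):
            candidates |= self.index.get(gram, set())
        best = None
        for word in candidates:
            dist = levenshtein(query, word)
            if dist > max_distance:
                continue
            # Prefer smaller edit distance, then higher frequency, then the word itself.
            key = (dist, -self.freqs[word], word)
            if best is None or key < best:
                best = key
        return best[2] if best else None


if __name__ == "__main__":
    suggester = Suggester({"adidas": 120, "nikon": 80, "nike": 200, "lenovo": 50})
    print(suggester.suggest("addidas"))
    print(suggester.suggest("nikom"))
```

Because the candidate set comes from the indexed vocabulary, brand names and other technical terms are covered without a general-purpose dictionary. For per-keystroke search, the suggestion step would probably be worth running only when the main query returns few or no results.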
Sphinx allows you to use morphology preprocessors and word-form dictionaries. Combined, these two could get you closer to what you want to achieve. You can read more about both topics here: http://sphinxsearch.com/docs/manual-0.9.8.html#conf-morphology and in the sections that follow it.
There are several "flavours" of morphology preprocessors available; choose the one that best fits your needs. The docs also mention the Snowball project, which can be used to add stemmers for languages other than the built-in English and Russian, if needed. The project website: http://snowball.tartarus.org/
Sphinx is a very fast full-text search engine, and using stemmers is unlikely to slow it down enough for you to notice.
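To illustrate what a word-form dictionary does, here is a small Python sketch of the idea, not Sphinx's own implementation. It assumes the plain-text `source > destination` line format that the Sphinx manual describes for word-form files; the function names and the sample entries (including the misspelled brand) are hypothetical.

```python
def parse_wordforms(lines):
    """Parse 'source > destination' lines into a lookup table."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or ">" not in line:
            continue  # skip blank or malformed lines
        source, target = (part.strip().lower() for part in line.split(">", 1))
        if source and target:
            mapping[source] = target
    return mapping


def normalize(text, mapping):
    """Replace each whitespace-separated token with its canonical form, if any."""
    return [mapping.get(token, token) for token in text.lower().split()]


if __name__ == "__main__":
    # Hypothetical entries: one inflected form and one common brand misspelling.
    forms = parse_wordforms(["walks > walk", "addidas > adidas"])
    print(normalize("Addidas walks", forms))
```

The same normalization has to be applied both when indexing and when searching, so that a query and the documents it should match end up with the same tokens. Listing known misspellings this way only covers the variants you anticipate, so it complements a suggestion mechanism rather than replacing it.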