What are the main differences between search engines that should influence the decision as to which one to use to search proprietary data?
What are the main differences between search engines (DtSearch, Lucene.net, Sphinx, Google, etc.) that should influence the decision as to which one to use to search proprietary data?
The data to be searched consists of presentation-free data that is marked up with metadata in the form of name/value pairs. We're not interested in the format-parsing abilities of the various tools. Also, the search results need to be well-structured, presentation-free data that is amenable to aggregating with search results from other (similarly structured) repositories.
Some relevant search engine characteristics that need to inform the decision are listed below. Further suggestions or descriptions of experiences are welcome.
• Cost
• Ease of use
• Can be configured to return specific tags only
• Can 'identify' specific terms and give search results containing them a higher weighting
• Fast (< 0.3 seconds) to return search results or %E6 records/documents
• Supports tags with types (find weather='sunny' but not personality='sunny')
• Supports weightings to give relevancy ranking
• Returns results in ranked order by relevancy
• Supports synonyms
• Supports stemming
• Supports stop words
• Supports spelling corrections
• Amenable to parallelisation of index building (if index based)
• Fast to reindex (if index based)
• Fast to update index (if index based)
• Combine results from multiple indexes (if index based)
• Proximity checks: give higher relevance to words found close together
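To make a few of these requirements concrete (typed tags, weightings, ranked results, and combining results from several repositories), here is a minimal Python sketch. It is not how any of the engines named above works internally; the `Hit` class, the function names, the sample data, and the weights are all hypothetical, and it assumes scores from different repositories are directly comparable, which real engines do not guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    doc_id: str
    score: float
    source: str  # which repository produced the hit


def search_typed(repo_name, docs, tag, value, tag_weights=None):
    """Match on a (tag, value) pair, so weather='sunny' never matches
    personality='sunny'. The score is simply the weight of the matched tag."""
    tag_weights = tag_weights or {}
    return [
        Hit(doc_id, tag_weights.get(tag, 1.0), repo_name)
        for doc_id, tags in docs.items()
        if tags.get(tag) == value
    ]


def merge_ranked(*result_lists):
    """Combine hits from several repositories, highest score first.
    Assumption: scores are comparable across repositories."""
    merged = [hit for hits in result_lists for hit in hits]
    return sorted(merged, key=lambda h: h.score, reverse=True)


# Hypothetical sample data: two repositories of name/value-tagged records.
repo_a = {
    "a1": {"weather": "sunny", "city": "Leeds"},
    "a2": {"personality": "sunny"},
}
repo_b = {"b1": {"weather": "sunny"}, "b2": {"weather": "rain"}}

results = merge_ranked(
    search_typed("A", repo_a, "weather", "sunny", {"weather": 2.0}),
    search_typed("B", repo_b, "weather", "sunny"),
)
print([(h.source, h.doc_id, h.score) for h in results])
```

Because the results are plain structured records rather than rendered pages, merging them with hits from another similarly structured repository is just a sort over one combined list.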
Comments (2)
I like Solr with the DataImportHandler. It supports most of your bullet points and is not too difficult to set up, as long as you don't mind editing some XML configuration files. It's easier than many enterprise-class search engines.
There is nothing wrong with GSA (Google Search Appliance), but for the amount of control that you desire, Solr is a better option.
Lucene/Solr
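As a rough illustration of two of the bullet points (typed tags and returning specific tags only), a Solr query can restrict the match to one field with `q=field:value` and limit the returned fields with `fl`. The Python sketch below only builds such a request URL and does not contact a server. The host, port, core name `mycore`, and field names are assumptions, and the helper `build_select_url` is hypothetical; `q`, `fl`, `rows`, and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, field, value, return_fields, rows=10):
    """Build a Solr /select URL that matches `value` in one field only
    and asks for just the listed fields in the response.
    Note: this sketch does not escape quotes inside `value`."""
    params = {
        "q": f'{field}:"{value}"',       # field-scoped match, e.g. weather:"sunny"
        "fl": ",".join(return_fields),   # return only these fields
        "rows": rows,                    # maximum number of results
        "wt": "json",                    # structured, presentation-free response
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr core; adjust to your own installation.
url = build_select_url(
    "http://localhost:8983/solr/mycore",
    "weather",
    "sunny",
    ["id", "weather", "score"],
)
print(url)
```

By default Solr sorts results by relevance score, highest first, so including `score` in `fl` also exposes the ranking value for later merging with results from other repositories.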
In relation to relevancy, the Google Search Appliance allows a little tweaking. They believe that allowing too much tweaking will give poor relevancy, and I do believe that Google knows relevancy.
It is unlikely that users will find a search engine other than Google easier to use.