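Full-text noise words: the logic behind them
As the title says, what is the logic behind implementing noise words (stop words) in full-text search so that those words are never searched? I mean, what if someone searches for "to be or not to be"? Do no results show up? I would really appreciate it if someone could explain the reasoning, because I am about to disable ft_stopword_file.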
Comments (3)
The reason for these stop words is to keep the full-text index from becoming bloated, which helps both performance and storage. If you index every stop word (that is, if you disable stop-word filtering), full-text search performance will degrade to some extent.
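To make the bloat argument concrete, here is a minimal sketch in Python of a toy inverted index built with and without stop-word filtering. It is not how MySQL stores its full-text index; the `STOPWORDS` set, the sample documents, and the helper names are all made up for illustration.

```python
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOPWORDS = frozenset({"to", "be", "or", "not", "is", "are", "the", "that", "there"})

def build_index(docs, stopwords=frozenset()):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            if word not in stopwords:
                index[word].add(doc_id)
    return index

def posting_count(index):
    """Total number of (word, document) entries stored in the index."""
    return sum(len(doc_ids) for doc_ids in index.values())

# Hypothetical sample documents.
docs = [
    "to be or not to be that is the question",
    "there is not much to be said",
    "the index is not to be trusted",
]

full_index = build_index(docs)                 # stop words kept
filtered_index = build_index(docs, STOPWORDS)  # stop words dropped

print("entries with stop words:   ", posting_count(full_index))
print("entries without stop words:", posting_count(filtered_index))
print("documents listed under 'to':", sorted(full_index["to"]))
```

In this toy data most index entries belong to stop words, and a word like "to" points at every document, so looking it up narrows nothing down. That is the kind of entry the stop-word list is meant to keep out of the index.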
If you disable stop words, performance will drop dramatically. The workaround is either to check in your PHP code whether the search query contains stop words and fall back to a 'LIKE' search for those queries, or simply to use Sphinx as the search engine. The logic behind stop words is to exclude words such as "is", "are", "be", "there" and "not" from searching.
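The fallback idea could look roughly like the sketch below. It is written in Python rather than PHP, and it makes several assumptions: the `STOPWORDS` set is a small illustrative subset, the `articles` table and `body` column are hypothetical, the fallback triggers only when every word in the query is a stop word, and `%s` is the placeholder style used by common MySQL drivers for Python.

```python
import re

# Illustrative subset only (an assumption); use the list your server actually applies.
STOPWORDS = frozenset({"to", "be", "or", "not", "is", "are", "there"})

def build_search(query, table="articles", column="body"):
    """Return (sql, params) for a search. Table and column names are hypothetical."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    if words and all(word in STOPWORDS for word in words):
        # Every term would be ignored by the full-text index,
        # so fall back to a plain substring scan with LIKE.
        escaped = (
            query.replace("\\", "\\\\")  # escape the escape character first
            .replace("%", "\\%")         # then the LIKE wildcards
            .replace("_", "\\_")
        )
        sql = f"SELECT * FROM {table} WHERE {column} LIKE %s"
        return sql, [f"%{escaped}%"]
    # Otherwise let the full-text index do the work.
    sql = f"SELECT * FROM {table} WHERE MATCH({column}) AGAINST (%s)"
    return sql, [query]

sql, params = build_search("to be or not to be")
print(sql, params)

sql, params = build_search("mysql stopword performance")
print(sql, params)
```

The query text is passed as a bound parameter instead of being concatenated into the SQL string. Note that a 'LIKE' with a leading wildcard cannot use an ordinary index, so this path is slow on large tables, and with MyISAM full-text indexes words shorter than ft_min_word_len are skipped as well, so the same fallback may be needed for short words.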
The logic is that these words are so common that they create large index nodes and degrade the system, and they are useless to users anyway, since words like "to" and "be" are so common and carry no context.
A better indexing method would be n-grams, which can find quoted phrases like "to be", but this kind of indexing is pretty rare.
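As a rough illustration of the n-gram idea, the sketch below indexes word bigrams, so that a phrase such as "to be" becomes a single index key. This is one possible reading of "ngrams" here (word-level rather than character-level), and the function names and sample documents are invented for the example.

```python
from collections import defaultdict

def word_ngrams(text, n=2):
    """Split text into overlapping runs of n consecutive words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_ngram_index(docs, n=2):
    """Map each word n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for gram in word_ngrams(text, n):
            index[gram].add(doc_id)
    return index

def phrase_search(index, phrase, n=2):
    """Return ids of documents that contain every n-gram of the phrase.

    For a phrase of exactly n words this is an exact phrase match; for longer
    phrases it yields candidates that may still need a final check.
    """
    grams = word_ngrams(phrase, n)
    if not grams:
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result

# Hypothetical sample documents.
docs = ["to be or not to be", "not to worry", "be that as it may"]
index = build_ngram_index(docs)

print(phrase_search(index, "to be"))   # only the first document has this pair
print(phrase_search(index, "not to"))  # the first two documents have this pair
```

Because the pair "to be" is much rarer than either word alone, its entry stays small even though both words are stop words. The cost is a larger index overall, since every adjacent word pair gets its own key.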