Is there any feature to handle stop words like "the" in Sphinx?

Posted on 2024-12-19 15:54:08

I am currently using Thinking Sphinx for hotel search. I have one item called "Manhattan Club". When I search for "The Manhattan Club", I get no results. This is because the default :all option requires every word to match.

I then tried the :any option (a match on any single word counts as a match). However, this returns a lot of results, and the top hotel is one with lots of "THE" in its description.

I think the only way to improve relevance is to remove all the stop words from the search string. Does Sphinx (or Ruby) have a feature for removing stop words?

Comments (3)

腻橙味 2024-12-26 15:54:08

I don't know exactly how you would do it in Thinking Sphinx, but yes, Sphinx does have stopwords:

http://sphinxsearch.com/docs/current.html#conf-stopwords

It goes in the index definition in your sphinx.conf file. The indexer has tools to help you build a list of common words, which is useful for creating an initial stopwords file:

http://sphinxsearch.com/docs/current.html#ref-indexer
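
To illustrate what "a list of common words" means here, the following is a minimal Ruby sketch that counts word frequencies over a few made-up hotel descriptions and returns the most frequent words. It is not the indexer's own tool and does not read a Sphinx index; the method name `most_common_words` and the sample data are assumptions made purely for illustration.

```ruby
# Hypothetical helper (not part of Sphinx or Thinking Sphinx):
# count how often each word occurs across the documents and
# return the `limit` most frequent words, most frequent first.
def most_common_words(documents, limit)
  counts = Hash.new(0)
  documents.each do |doc|
    # Lowercase the text and count each run of letters as one word.
    doc.downcase.scan(/[[:alpha:]]+/) { |word| counts[word] += 1 }
  end
  # Sort by descending count, breaking ties alphabetically.
  counts.sort_by { |word, count| [-count, word] }.first(limit).map(&:first)
end

# Made-up sample descriptions.
descriptions = [
  "The club is near the park",
  "The hotel has a view of the river"
]

puts most_common_words(descriptions, 1)
```

Words that come out at the top of such a list (here "the") are the usual candidates for an initial stopwords file, which you would then review by hand.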

不醒的梦 2024-12-26 15:54:08

You can set the stopwords file path in config/sphinx.yml - which is organised like config/database.yml (by environment):

development:
  stopwords: "/path/to/stopwords.txt"

For what exactly goes in the stopwords file, Barry's answer has the relevant links.

请你别敷衍 2024-12-26 15:54:08

To remove high-frequency words from a Sphinx index, use the stopwords directive in your index definition:

source my_source
{
   ...
}

index my_index
{
    source = my_source
    path = /path/to/my/index
    ...
    stopwords = /path/to/stopwords/file
}

Here the stopwords file is simply a one-word-per-line list of the words you'd like to remove from your Sphinx index. The indexer will ignore those words and will not add them to the index.
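
Since the question also asks about a Ruby-side option, here is a minimal sketch of stripping stopwords from the search string before it is sent to Sphinx, using the same word-per-line format. The method names `load_stopwords` and `strip_stopwords` are hypothetical, not part of Sphinx or Thinking Sphinx, and the short inline word list stands in for a real stopwords file.

```ruby
require "set"

# Hypothetical helper: parse word-per-line text into a set of
# lowercase stopwords, skipping blank lines.
def load_stopwords(text)
  text.each_line.map { |line| line.strip.downcase }.reject(&:empty?).to_set
end

# Hypothetical helper: drop stopwords from a query string.
# If every word is a stopword, return the query unchanged so the
# search string is never empty.
def strip_stopwords(query, stopwords)
  kept = query.split.reject do |word|
    # Compare case-insensitively and ignore surrounding punctuation.
    stopwords.include?(word.downcase.gsub(/[^[:alnum:]]/, ""))
  end
  kept.empty? ? query : kept.join(" ")
end

# In practice the text would come from File.read on your stopwords file.
stopwords = load_stopwords("the\na\nan\nof\n")

puts strip_stopwords("The Manhattan Club", stopwords)
```

The cleaned string can then be passed to the search call in place of the raw user input. Using the same file for both Sphinx and this helper keeps the two lists from drifting apart.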

Another way to improve relevance is to check whether the morphology option is enabled in your index config. You may also want to experiment with the ranker on a per-query basis.

References:

Pre-generated stopword files: http://astellar.com/2011/12/stopwords-for-sphinx-search/
Morphology: http://sphinxsearch.com/docs/current.html#conf-morphology
