Ignoring apostrophes in a Sphinx index

Posted on 2024-08-18 21:47:26

In my Sphinx config file, I have the following:

ignore_chars: "U+0027"
charset_table: "0..9, a..z, _, A..Z->a..z, U+00C0->a, U+00C1->a,
  U+00C2->a, U+00C3->a, U+00C4->a, U+00C5->a, U+00C7->c, U+00C8->e,
  U+00C9->e, U+00CA->e, U+00CB->e, U+00CC->i, U+00CD->i, U+00CE->i [SNIP]"

(The charset_table entry is from here: http://speeple.com/unicode-maps.txt)

The expected result is that a query for kyles returns all records matching kyles and/or kyle's, since I'm telling Sphinx to exclude ' (single quote/apostrophe) from the index (ab'cd -> abcd). In practice, however, this is not happening.

Comments (1)

别低头,皇冠会掉 2024-08-25 21:47:26

I believe adding it to ignore_chars has the opposite of the desired effect. It tells Sphinx not to split on that character; instead, the word is collapsed around the ignored character. So kyle's becomes kyles instead of kyle and s.
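
To make the difference between ignoring a character and splitting on it concrete, here is a minimal Python sketch. It is a toy model, not Sphinx's actual tokenizer: the function name tokenize and the WORD_CHARS set (a simplified stand-in for charset_table) are assumptions made for illustration.

```python
import string

# Simplified stand-in for charset_table: lowercase ASCII letters, digits, underscore.
WORD_CHARS = set(string.ascii_lowercase + string.digits + "_")


def tokenize(text, ignore_chars=""):
    """Toy tokenizer: drop ignored characters, split on anything not in WORD_CHARS."""
    tokens, current = [], []
    for ch in text.lower():
        if ch in ignore_chars:
            continue  # ignored: removed without ending the current word
        if ch in WORD_CHARS:
            current.append(ch)
        elif current:
            # any other character acts as a separator and ends the word
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(tokenize("kyle's"))                    # apostrophe treated as a separator
print(tokenize("kyle's", ignore_chars="'"))  # apostrophe ignored, word collapses
```

In this model the first call yields the two tokens kyle and s, while the second yields the single token kyles, which is the ab'cd -> abcd behavior the question describes.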

The solution I just tried for this issue, which seems to have worked, was to add s to my list of stopwords (it might need 's in there as well, I can't remember). Sphinx seems to split kyle's into the words kyle and 's. Because match-all mode is on, some documents fail on the match for 's. Adding it to the stopwords seems to have the desired effect.
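
The stopword workaround can be sketched the same way. This is a rough simulation of match-all semantics under stated assumptions, not Sphinx code: the helper names keywords and matches_all are hypothetical, and the tokenizer simply splits on anything other than letters, digits, and underscore.

```python
import re


def keywords(text, stopwords=frozenset()):
    """Toy model: lowercase, split on non-word characters, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9_]+", text.lower()) if t not in stopwords]


def matches_all(query, document, stopwords=frozenset()):
    """Match-all semantics: every query keyword must occur in the document."""
    doc_terms = set(keywords(document, stopwords))
    return all(term in doc_terms for term in keywords(query, stopwords))


doc = "kyle went home"
print(matches_all("kyle's", doc))                   # the stray s keyword has no match
print(matches_all("kyle's", doc, stopwords={"s"}))  # s is dropped, kyle matches
```

In this toy model the query kyle's fails against the document until s is a stopword, which mirrors the failure described above. A query for kyles still would not match a document containing only kyle's here, since kyles and kyle remain different keywords.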

It seems like normal stemming should take care of this, however, so maybe we're both doing something wrong...
