Ignoring apostrophes in the Sphinx index
In my sphinx config file, I have the following:
ignore_chars: "U+0027"
charset_table: "0..9, a..z, _, A..Z->a..z, U+00C0->a, U+00C1->a,
U+00C2->a, U+00C3->a, U+00C4->a, U+00C5->a, U+00C7->c, U+00C8->e,
U+00C9->e, U+00CA->e, U+00CB->e, U+00CC->i, U+00CD->i, U+00CE->i [SNIP]"
(The charset_table entry is from here: http://speeple.com/unicode-maps.txt)
The expected result is that querying kyles will return all records matching kyles and/or kyle's, since I'm telling sphinx to exclude ' (single quote/apos) from the index (ab'cd -> abcd). However, in practice, this is not happening.
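Side note: the colon style above suggests a YAML wrapper (e.g. Thinking Sphinx's sphinx.yml). In a plain sphinx.conf, the same options sit inside an index block and use = instead; a minimal sketch, with the index, source, and path names invented for illustration:

```
index products
{
    source        = products_src
    path          = /var/data/sphinx/products
    ignore_chars  = U+0027
    charset_table = 0..9, a..z, _, A..Z->a..z
}
```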
1 Answer
I believe adding it to the ignore_chars has the opposite of the desired effect. This is telling sphinx not to split on that character; instead it will collapse the word around the characters to be ignored. So, kyle's will become kyles, instead of kyle and s.
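The difference between the two behaviors can be sketched with a toy tokenizer (plain Python standing in for Sphinx's actual tokenization; the function and its parameters are made up for illustration):

```python
import re

def tokenize(text, ignore_chars="'", split=False):
    """Toy model of how Sphinx handles one character class.

    split=False mimics ignore_chars: the character is deleted and the
    surrounding pieces collapse into a single token.
    split=True mimics a character simply absent from charset_table:
    it acts as a word separator, producing two tokens.
    """
    if split:
        # Character acts as a word boundary, like whitespace.
        parts = re.split("[" + re.escape(ignore_chars) + r"\s]+", text.lower())
        return [p for p in parts if p]
    # Character is ignored: removed, and its neighbors are joined.
    cleaned = re.sub("[" + re.escape(ignore_chars) + "]", "", text.lower())
    return cleaned.split()

print(tokenize("kyle's"))              # ignore_chars behavior: ['kyles']
print(tokenize("kyle's", split=True))  # separator behavior: ['kyle', 's']
```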
The solution I just tried for this issue, which seems to have worked, was to add s to my list of stopwords (might need 's in there also, can't remember). Sphinx seems to split kyle's up into the words kyle and 's. Because match-all mode is on, some documents fail on the match for 's. Adding it to the stop words seems to have the desired effect.

It seems like the normal stemming should take care of this, however, so maybe we're both doing something wrong...