Indexing and searching words with symbols such as (++, #, .) using Sphinx
Hi, I have built an index and I need to search for words like "c++", ".net" or "c#", but
no results come back. Here is my config:
source = xxxx
path = /usr/local/etc/sphinx/var/data/xxxx
docinfo = extern
charset_type = utf-8
min_word_len = 1
min_infix_len = 7
stopwords = /usr/local/etc/sphinx/var/stopwords/stop_words_en.txt
I have tried searching with SPH_MATCH_PHRASE and SPH_MATCH_ALL, but nothing useful comes back.
What can I do to make this work?
Thanks
Nik
Comments (2)
You have to configure charset_table so that it includes the symbols +, # and . as valid word characters.
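A minimal sketch of such a charset_table line, for illustration only and not the answerer's exact setting. It assumes the usual digits, Latin letters and underscore as the base table, and adds the three symbols by Unicode code point (U+23 is #, U+2B is +, U+2E is .), which also avoids typing # literally in sphinx.conf. The base table and the placement inside the xxxx index block are assumptions; adapt them to your data.

```
index xxxx
{
    # ... existing settings (source, path, docinfo, ...) stay as they are ...

    # base table (assumed): digits, case-folded Latin letters, underscore
    # added code points: U+23 (pound sign), U+2B (plus), U+2E (dot)
    charset_table = 0..9, A..Z->a..z, _, a..z, U+23, U+2B, U+2E
}
```

charset_table is applied at indexing time, so the index has to be rebuilt before the new tokenization takes effect.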
You can check how words get tokenized with the CALL KEYWORDS statement over the MySQL protocol.
With the config you provided, I get this output:
mysql> CALL KEYWORDS ('c++ .net c# end_of_a_sentence.', 'YOUR_INDEX')
tokenized normalized
c c
net net
c c
end end
of of
a a
sentence sentence
With my addition to your config, the output is:
tokenized normalized
c++ c++
.net .net
c# c#
end end
of of
a a
sentence. sentence.
The downside of having the dot (.) in charset_table is that a word at the end of a sentence is tokenized and indexed together with the dot. For example, in
'The example sentence.'
the word 'sentence' would be tokenized as 'sentence.', so searching for 'sentence' returns nothing.
As tmg_tt says, modifying the charset_table should work. However, you need to escape the # and probably the + in the index definition in sphinx.conf. Escaping works for the pound sign (#), but I have not figured out how to escape +, at least in Sphinx 0.99. I am posting to the sphinx forums about this too.
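The escaping described here could look roughly like the line below. This is a sketch under assumptions rather than the poster's actual line: the backslash form for # is the escape this answer refers to, while writing + as its code point U+2B is an assumed way to avoid a literal plus sign and has not been verified on Sphinx 0.99. The base table (digits, Latin letters, underscore) is also an assumption.

```
# Inside the index definition in sphinx.conf.
# '\#' is the escaped literal pound sign; a bare '#' would start a comment.
# '+' is given as its code point U+2B instead of a literal (assumption).
charset_table = 0..9, A..Z->a..z, _, a..z, \#, U+2B
```

Running CALL KEYWORDS against the rebuilt index is a quick way to confirm whether "c#" and "c++" now survive as single tokens.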