Sphinx - Delimiters
I would like to know whether the Sphinx engine treats any characters as delimiters (the way commas and periods are treated in plain MySQL). I don't want to avoid those characters altogether; I would like to escape them, or at least make sure they don't cause conflicts when performing MATCH operations in full-text searches. I have trouble dealing with them in MySQL by default, and I would prefer not to be forced to replace those delimiters with other characters just to get a good set of results.
Sorry if I'm saying something stupid, but I don't have experience with Sphinx or other complementary (?) search engines.
To give you an example, if I perform a search with
"Passat 2.0 TDI"
MySQL by default would treat the period in this case as a delimiter, and since "2" and "0" are too short to be considered words by default, the results would be a bit messed up.
Is this easy to handle with Sphinx (or another search engine)? I'm open to suggestions.
This is for a large project, with probably more than 500,000 records (not trivial at all).
Cheers!
Comments (2)
You can effectively control which characters are delimiters by specifying the charset_table of a specific Sphinx index.
If you exclude a character from your charset_table, it effectively acts as a delimiter. If you specify it in your charset_table (even a space, as U+0020), it will no longer act as a delimiter and will become part of your token strings.
Each index (which uses one or more Sphinx data sources) can have a different charset_table, for flexibility.
NB: If you want single-character words, you can set min_word_len on each Sphinx index.
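To make the idea concrete, here is a small Python sketch of the rule described above. It is not Sphinx's tokenizer, only an illustration: the tokenize function, its token_chars set (standing in for charset_table) and its min_word_len parameter are hypothetical names, and the minimum length of 4 in the first call is an assumption chosen to mimic the MySQL behaviour described in the question.

```python
import string


def tokenize(text, token_chars, min_word_len=1):
    """Split text into lowercase tokens.

    Any character not in token_chars acts as a delimiter; tokens shorter
    than min_word_len are dropped afterwards.
    """
    tokens, current = [], []
    for ch in text.lower():
        if ch in token_chars:
            current.append(ch)
        elif current:
            # A delimiter ends the token collected so far.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return [t for t in tokens if len(t) >= min_word_len]


# Letters and digits only: the period is NOT a token character.
ALNUM = set(string.ascii_lowercase + string.digits)

query = "Passat 2.0 TDI"

# Period splits "2.0" into "2" and "0"; short words are then discarded.
mysql_like = tokenize(query, ALNUM, min_word_len=4)

# Period added to the token characters and single-character words allowed.
with_period = tokenize(query, ALNUM | {"."}, min_word_len=1)

print(mysql_like)   # ['passat']
print(with_period)  # ['passat', '2.0', 'tdi']
```

With the period treated as a token character, "2.0" survives as a single searchable token, which is the effect you would be aiming for by adding the period to charset_table and keeping min_word_len low.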
This is probably the best section of the documentation to read. Since Sphinx is primarily a full-text engine, it's highly tunable in how it handles phrases and in how you pass them in.