Accent-insensitive sorting in Sphinx

Posted on 2024-07-25 08:59:20

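I am using Sphinx with the Thinking Sphinx plugin to search my data. My database is MySQL.

My data contains accented characters ("á", "é", "ã"), and I want them to be treated as equivalent to their unaccented counterparts ("a", "e", "a") both when searching and when ordering results.

I got searching to work with a charset table (pastie.org/204316): searching for "AGUA" returns "ÁGUA". The ordering of the results, however, is wrong. For example, when searching for "AGUA", "ÁGUA" comes after "MUITA ÁGUA", but I want it sorted as if it were written with "A" rather than "Á".

The only solution I can think of is to index a new column that holds the unaccented text and sort on that, stripping the accents with the MySQL REPLACE function (http://dev.mysql.com/doc/refman/5.4/en/string-functions.html#function_replace). That would mean one REPLACE call for every possible accented character (and there are many), which does not look like a maintainable workaround to me.

Does anyone know a better way to handle this?

Thanks!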

怎言笑 2024-08-01 08:59:20

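Sphinx handles sorting on string fields by storing all the values in a list, sorting the list, and then storing the index of each string as an int attribute. According to the docs, this list is sorted at the byte level, and that is currently not configurable.

Ideally the strings should be sorted differently, depending on the encoding and locale. For instance, if the strings are known to be Russian text in KOI8R encoding, sorting the bytes 0xE0, 0xE1, and 0xE2 should produce 0xE1, 0xE2 and 0xE0, because in KOI8R value 0xE0 encodes a character that is (noticeably) after characters encoded by 0xE1 and 0xE2. Unfortunately, Sphinx does not support that at the moment and will simply sort the strings bytewise.

-- from http://www.sphinxsearch.com/docs/current.html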
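To see why a bytewise sort puts "ÁGUA" after "MUITA ÁGUA", here is a small standalone sketch in plain Ruby. It does not touch Sphinx; it only assumes the strings are UTF-8 encoded and relies on the fact that Ruby's String comparison also works on raw bytes. The sample titles are made up for illustration.

```ruby
# Sample titles (illustrative only). The accented letter is written as an
# escape so the encoding is unambiguous: "\u00C1" is "Á".
titles = ["MUITA \u00C1GUA", "\u00C1GUA", "AGUA FRIA"]

# String#<=> compares the underlying bytes, much like a bytewise sort.
bytewise = titles.sort

# In UTF-8, "Á" is the two bytes 0xC3 0x81, and 0xC3 sorts after "M" (0x4D).
accent_bytes = "\u00C1".bytes.map { |b| format("0x%02X", b) }.join(" ")

puts bytewise.inspect
puts accent_bytes
```

Because the first byte of "Á" is larger than any ASCII letter, every title starting with an accented character ends up after all titles starting with unaccented ones.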

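So there is no easy way to achieve this within Sphinx. A variation on your REPLACE()-based idea would be to add a separate column and populate it from a callback in your model. That lets you do the replacement in Ruby instead of MySQL, which is arguably the more maintainable solution.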

# Save an unaccented copy of your title. Normalise method borrowed from
# http://stackoverflow.com/questions/522715/removing-accents-diacritics-from-string-while-preserving-other-special-chars-tri
class MyModel < ActiveRecord::Base
  before_validation :update_sort_col

  private

  def update_sort_col
    # Decompose accented characters (:kd), then drop everything outside ASCII.
    # Assign through self; a bare `sort_col = ...` would only set a local variable.
    self.sort_col = title.to_s.mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n, '').to_s
  end
end
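The same normalisation can be tried outside Rails. The sketch below is an assumption-laden stand-in for the callback above: `strip_accents` is a hypothetical helper name, and it uses Ruby's built-in `String#unicode_normalize` (Ruby 2.2 and later) rather than `mb_chars.normalize`. The sample titles are made up.

```ruby
# Hypothetical helper, not part of Rails or Thinking Sphinx.
def strip_accents(text)
  text.to_s
      .unicode_normalize(:nfkd)   # "Á" becomes "A" followed by a combining accent
      .gsub(/[^\x00-\x7F]/, "")   # drop combining marks and any other non-ASCII character
end

titles = ["MUITA ÁGUA", "ÁGUA", "AÇÚCAR"]

# Sorting on the stripped key gives the order the question asks for.
sorted = titles.sort_by { |title| strip_accents(title) }

puts strip_accents("ÁGUA")
puts sorted.inspect
```

One caveat of this approach: characters that have no ASCII base letter after decomposition (for example "ø" or "ß") are removed entirely rather than replaced, so a mapping table may still be needed for those.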

丑疤怪 2024-08-01 08:59:20

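You can also use a special index for this, so you do not even need a new column in your database:

indexes "LOWER(title)", :as => :title, :sortable => true

The string is raw SQL, so you can call your REPLACE method inside it.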
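Since the indexed expression is raw SQL, the chain of REPLACE calls does not have to be written by hand. The following is only a sketch: `ACCENT_MAP` and `unaccent_sql` are hypothetical names, the mapping is deliberately tiny, and it assumes LOWER() has already folded the text to lowercase so only lowercase accented characters need entries.

```ruby
# Hypothetical mapping; extend it with the characters your data actually contains.
ACCENT_MAP = { "á" => "a", "ã" => "a", "é" => "e" }.freeze

# Wraps a SQL expression in one REPLACE() call per mapping entry.
def unaccent_sql(expression, map = ACCENT_MAP)
  map.reduce(expression) do |sql, (accented, plain)|
    "REPLACE(#{sql}, '#{accented}', '#{plain}')"
  end
end

puts unaccent_sql("LOWER(title)")
```

The resulting string could then be passed to `indexes` in place of "LOWER(title)". The mapping lives in one Ruby hash, which keeps the list of characters in a single place even though the generated SQL still contains one REPLACE per character.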

何止钟意 2024-08-01 08:59:20

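Just build the index on a lowercase version of the field with the following syntax. It is a very simple and elegant solution for case-insensitive search with Sphinx.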

indexes title, as: :title, sortable: :insensitive
