Rails 3: search that ignores diacritics?

Posted on 2024-12-10 05:43:04

I have a Rails 3 app that contains Article objects. They have a title attribute. Before adding a new article, people are supposed to search to see whether an article with that title already exists.

Today someone reported a duplicate article. Turns out whoever added it had searched for it first, but there was an umlaut over an "o" in the title. They searched without the umlaut using a regular "o" character, didn't find it, and added the duplicate.

I'm doing a simple find on the title attribute with a scope, as below:

scope :search, lambda { |term| where('title like ?', "%#{term}%") }

I'm wondering if there's a simple way to "ignore" diacritics, so that the person could type an "o" and still find an article if the o has an umlaut, and the same for other diacritics.

I've considered creating a search_title attribute and populating it myself on update, replacing the diacritics with their plain equivalents, but that has its own problems. For one, what if someone then searches using the diacritic?

I was hoping there might be an easy solution for this, but I'm not holding out much hope. :-)

Comments (2)

浮世清欢 2024-12-17 05:43:04

I suggest creating a search_title field and storing title.to_ascii_brutal in it (using this plugin: https://github.com/tomash/ascii_tic). Then change your search scope to:

scope :search, lambda { |term| where('search_title like ?', "%#{term.to_ascii_brutal}%") }
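
To illustrate why folding both the stored title and the search term handles the "what if someone types the diacritic" case, here is a minimal sketch in plain Ruby without Rails. It assumes Ruby 2.2 or later for String#unicode_normalize; strip_diacritics is a hypothetical stand-in for the plugin's to_ascii_brutal, the ARTICLES array stands in for the database table, and the downcase step is an added assumption to make matching case-insensitive.

```ruby
# Hypothetical stand-in for the plugin's to_ascii_brutal: decompose each
# character (NFD) and drop the combining marks that carry the diacritics.
def strip_diacritics(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# In-memory stand-in for the articles table: each row keeps the original
# title plus a folded search_title computed when the row is saved.
ARTICLES = ["Motörhead biography", "Plague"].map do |title|
  { title: title, search_title: strip_diacritics(title).downcase }
end

# Fold the search term the same way before comparing, so it does not
# matter whether the user typed the diacritic or not.
def search(term)
  needle = strip_diacritics(term).downcase
  ARTICLES.select { |row| row[:search_title].include?(needle) }
          .map { |row| row[:title] }
end

search("motor") # plain "o" finds the umlauted title
search("motör") # typing the umlaut finds it too
```

Letters that have no Unicode decomposition, such as "ø", pass through this helper unchanged, so a dedicated transliteration library may still be the better choice in a real application.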
谈情不如逗狗 2024-12-17 05:43:04

Yes, a standard way to handle this is to maintain a shadow search field. In addition to converting all the data to ASCII, consider:

  • changing everything to upper case to eliminate case issues,
  • removing all characters that aren't numbers, letters or spaces (punctuation, tabs, etc.),
  • removing "stop words" such as "is", "the", "a", etc. Of course, stop words are language dependent.
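
As a rough sketch of how those clean-up steps might combine into a single shadow-field value, here is a plain Ruby version. It assumes Ruby 2.2 or later for String#unicode_normalize; shadow_key and the tiny STOP_WORDS list are hypothetical names chosen for illustration. Unwanted characters are replaced with a space rather than deleted outright, so that words separated only by a tab or punctuation do not run together.

```ruby
# Illustrative English stop words only; a real list depends on the language.
STOP_WORDS = %w[IS THE A].freeze

# Build the value that would be stored in the shadow search field.
def shadow_key(title)
  # 1. Fold diacritics: decompose (NFD), then drop the combining marks.
  folded = title.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
  # 2. Upper-case, then turn anything that is not A-Z, 0-9 or a space
  #    into a space (this covers punctuation and tabs).
  cleaned = folded.upcase.gsub(/[^A-Z0-9 ]/, " ")
  # 3. Split on whitespace, drop stop words, and rejoin with single spaces.
  (cleaned.split - STOP_WORDS).join(" ")
end

shadow_key("The Plague: a História!")
```

The search term would need to pass through the same function before the LIKE comparison, otherwise the two sides are normalized differently and matches are missed.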

An alternative strategy is to compute and search on the Soundex code of each title (or use a revised version of Soundex). There are Ruby libraries for Soundex, or you can write your own.

Soundex will give you more false positives, so you need to decide whether you'd rather have more false positives or perhaps miss a match (a false negative) because one title was "Plague" and the other was "Plagues".
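
For reference, here is a minimal sketch of classic American Soundex in plain Ruby. The soundex method is hand-written for illustration and is not taken from any particular library; it assumes a single word of ASCII letters, so diacritics would need to be folded first.

```ruby
# Consonant groups used by classic American Soundex.
SOUNDEX_CODES = {
  "B" => "1", "F" => "1", "P" => "1", "V" => "1",
  "C" => "2", "G" => "2", "J" => "2", "K" => "2",
  "Q" => "2", "S" => "2", "X" => "2", "Z" => "2",
  "D" => "3", "T" => "3",
  "L" => "4",
  "M" => "5", "N" => "5",
  "R" => "6"
}.freeze

def soundex(word)
  letters = word.upcase.gsub(/[^A-Z]/, "").chars
  return "" if letters.empty?

  result = letters.first.dup            # the first letter is kept as-is
  previous = SOUNDEX_CODES[letters.first]
  letters.drop(1).each do |ch|
    next if ch == "H" || ch == "W"      # H and W do not separate equal codes
    code = SOUNDEX_CODES[ch]            # nil for vowels and Y
    result << code if code && code != previous
    previous = code                     # a vowel resets the run of equal codes
  end
  result[0, 4].ljust(4, "0")            # truncate or zero-pad to four characters
end

soundex("Robert") # => "R163"
soundex("Rupert") # => "R163"
```

Two different names mapping to the same code is exactly the kind of false positive described above.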

You could also use a real full-text search system, either by turning on MySQL's full-text search or via a separate system.
