当前位置：文江博客话题详情

String Ruby fuzzy-search

如何在 Ruby 中进行模糊子字符串匹配？

发布于 2024-11-08 17:52:26 字数 439 浏览 10 评论 0 原文

我发现了很多关于模糊匹配的链接，将一个字符串与另一个字符串进行比较，看看哪个字符串的相似度得分最高。

我有一个很长的字符串（一个文档）和一个子字符串。子字符串来自原始文档，但已被转换多次，因此可能会引入奇怪的工件，例如这里有一个空格，那里有一个破折号。子字符串将与原始文档中的文本部分匹配 99% 或更多。我无法匹配该字符串来自哪个文档，我试图在该字符串开始的文档中找到索引。

如果由于没有引入随机错误而导致字符串相同，我会使用 document.index(substring) ，但是如果只有一个字符差异，则会失败。

我认为可以通过删除字符串和子字符串中除 az 之外的所有字符进行比较，然后使用压缩字符串时生成的索引将压缩字符串中的索引转换为真实文档中的索引来解决差异。当差异是空格和标点符号时，这种方法效果很好，但一旦有一个字母不同，它就会失败。

文档通常是几页到一百页，子串从几句话到几页。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

烟雨扶苏 2024-11-15 17:52:26

你可以尝试一下匹配。它可以作为 ruby gem 提供，虽然我已经很长时间没有使用模糊逻辑了，但它看起来有你需要的东西。 amatch 的主页是：https://github.com/flori/amatch。

只是无聊地摆弄这个想法，一个完全未经优化且未经测试的解决方案黑客如下：

include 'amatch'

module FuzzyFinder
  def scanner( input )
    out = [] unless block_given?
    pos = 0
    input.scan(/(\w+)(\W*)/) do |word, white|
      startpos = pos
      pos = word.length + white.length
      if block_given?
        yield startpos, word
      else
        out << [startpos, word]
      end
    end
  end

  def find( text, doc )
    index = scanner(doc)
    sstr = text.gsub(/\W/,'')
    levenshtein = Amatch::Levensthtein.new(sstr)
    minlen = sstr.length
    maxndx = index.length
    possibles = []
    minscore = minlen*2
    index.each_with_index do |x, i|
      spos = x[0]
      str = x[1]
      si = i
      while (str.length < minlen)
        i += 1
        break unless i < maxndx
        str += index[i][1]
      end
      str = str.slice(0,minlen) if (str.length > minlen)
      score = levenshtein.search(str)
      if score < minscore
        possibles = [spos]
        minscore = score
      elsif score == minscore
        possibles << spos
      end
    end
    [minscore, possibles]
  end
end

显然，有许多可能的改进，而且可能是必要的！最重要的几点：

处理一次文档并存储
结果，可能在数据库中。
确定字符串的可用长度
进行初步检查、处理
首先针对该初始子字符串
在尝试匹配整个之前
分段。
继上一篇之后，
预先计算起始片段
那个长度。

You could try amatch. It's available as a ruby gem and, although I haven't worked with fuzzy logic for a long time, it looks to have what you need. The homepage for amatch is: https://github.com/flori/amatch.

Just bored and messing around with the idea, a completely non-optimized and untested hack of a solution follows:

include 'amatch'

module FuzzyFinder
  def scanner( input )
    out = [] unless block_given?
    pos = 0
    input.scan(/(\w+)(\W*)/) do |word, white|
      startpos = pos
      pos = word.length + white.length
      if block_given?
        yield startpos, word
      else
        out << [startpos, word]
      end
    end
  end

  def find( text, doc )
    index = scanner(doc)
    sstr = text.gsub(/\W/,'')
    levenshtein = Amatch::Levensthtein.new(sstr)
    minlen = sstr.length
    maxndx = index.length
    possibles = []
    minscore = minlen*2
    index.each_with_index do |x, i|
      spos = x[0]
      str = x[1]
      si = i
      while (str.length < minlen)
        i += 1
        break unless i < maxndx
        str += index[i][1]
      end
      str = str.slice(0,minlen) if (str.length > minlen)
      score = levenshtein.search(str)
      if score < minscore
        possibles = [spos]
        minscore = score
      elsif score == minscore
        possibles << spos
      end
    end
    [minscore, possibles]
  end
end

Obviously there are numerous improvements possible and probably necessary! A few off the top:

Process the document once and store
the results, possibly in a database.
Determine a usable length of string
for an initial check, process
against that initial substring first
before trying to match the entire
fragment.
Following up on the previous,
precalculate starting fragments of
that length.

回复收藏 0 原文

难理解 2024-11-15 17:52:26

一个简单的方法是 fuzzy_match

require 'fuzzy_match'
FuzzyMatch.new(['seamus', 'andy', 'ben']).find('Shamus') #=> seamus

更详细的方法是 < a href="https://github.com/tliff/levenshtein" rel="nofollow">levenshein，计算差异的数量。

require 'levenshtein' 
Levenshtein.distance('test', 'test')    # => 0
Levenshtein.distance('test', 'tent')    # => 1

A simple one is fuzzy_match

require 'fuzzy_match'
FuzzyMatch.new(['seamus', 'andy', 'ben']).find('Shamus') #=> seamus

A more elaborated one (you wouldn't say it from this example though) is levenshein, which computes the number of differences.

require 'levenshtein' 
Levenshtein.distance('test', 'test')    # => 0
Levenshtein.distance('test', 'tent')    # => 1

回复收藏 0 原文

桃扇骨 2024-11-15 17:52:26

您应该查看此处详细介绍的 StrikeAMatch 实现：
一种更好的可变长度字符串相似度排名算法

依靠某种字符串距离（即两个字符串之间的变化数量），这个查看字符对模式。每个字符串中出现的字符对越多，匹配得越好。它对我们的应用程序非常有用，我们在纯文本文件中搜索输入错误/可变长度的标题。

还有一个 gem，它结合了 StrikeAMatch（字符级二元组上的 Dice 系数的实现））和编辑距离来查找匹配： https://github.com/seamusabshere/fuzzy_match

回复收藏 0 原文

残月升风 2024-11-15 17:52:26

这取决于可能出现在子字符串中的工件。在更简单的情况下，它们不是 [az] 的一部分，您可以使用解析子字符串，然后在文档上使用 Regexp#match ：（

document = 'Ulputat non nullandigna tortor dolessi illam sectem laor acipsus.'
substr = "tortor - dolessi _%&#   +illam"

re = Regexp.new(substr.split(/[^a-z]/i).select{|e| !e.empty?}.join(".*"))
md = document.match re
puts document[md.begin(0) ... md.end(0)]
# => tortor dolessi illam

在这里，因为我们不在正则表达式中设置任何括号，我们在 MatchData 的第一个（完全匹配）元素 0 上使用 begin 和 end >

如果您只对起始位置感兴趣，可以使用 =~ 运算符：

start_pos = document =~ re

It depends on the artifacts that can end up in the substring. In the simpler case where they are not part of [a-z] you can use parse the substring and then use Regexp#match on the document:

document = 'Ulputat non nullandigna tortor dolessi illam sectem laor acipsus.'
substr = "tortor - dolessi _%&#   +illam"

re = Regexp.new(substr.split(/[^a-z]/i).select{|e| !e.empty?}.join(".*"))
md = document.match re
puts document[md.begin(0) ... md.end(0)]
# => tortor dolessi illam

(Here, as we do not set any parenthesis in the Regexp, we use begin and end on the first (full match) element 0 of MatchData.

If you are only interested in the start position, you can use =~ operator: