如何在 Ruby 中进行模糊子字符串匹配?

发布于 2024-11-08 17:52:26 字数 439 浏览 0 评论 0 原文

我发现了很多关于模糊匹配的链接,将一个字符串与另一个字符串进行比较,看看哪个字符串的相似度得分最高。

我有一个很长的字符串(一个文档)和一个子字符串。子字符串来自原始文档,但已被转换多次,因此可能会引入奇怪的工件,例如这里有一个空格,那里有一个破折号。子字符串将与原始文档中的文本部分匹配 99% 或更多。我无法匹配该字符串来自哪个文档,我试图在该字符串开始的文档中找到索引。

如果由于没有引入随机错误而导致字符串相同,我会使用 document.index(substring) ,但是如果只有一个字符差异,则会失败。

我认为可以通过删除字符串和子字符串中除 az 之外的所有字符进行比较,然后使用压缩字符串时生成的索引将压缩字符串中的索引转换为真实文档中的索引来解决差异。当差异是空格和标点符号时,这种方法效果很好,但一旦有一个字母不同,它就会失败。

文档通常是几页到一百页,子串从几句话到几页。

I found lots of links about fuzzy matching, comparing one string to another and seeing which gets the highest similarity score.

I have one very long string, which is a document, and a substring. The substring came from the original document, but has been converted several times, so weird artifacts might have been introduced, such as a space here, a dash there. The substring will match a section of the text in the original document 99% or more. I am not matching to see from which document this string is, I am trying to find the index in the document where the string starts.

If the string was identical because no random error was introduced, I would use document.index(substring), however this fails if there is even one character difference.

I thought the difference would be accounted for by removing all characters except a-z in both the string and the substring, compare, and then use the index I generated when compressing the string to translate the index in the compressed string to the index in the real document. This worked well where the difference was whitespace and punctuation, but as soon as one letter is different it failed.

The document is typically a few pages to a hundred pages, and the substring from a few sentences to a few pages.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

烟雨扶苏 2024-11-15 17:52:26

你可以尝试一下匹配。它可以作为 ruby​​ gem 提供,虽然我已经很长时间没有使用模糊逻辑了,但它看起来有你需要的东西。 amatch 的主页是:https://github.com/flori/amatch

只是无聊地摆弄这个想法,一个完全未经优化且未经测试的解决方案黑客如下:

include 'amatch'

module FuzzyFinder
  def scanner( input )
    out = [] unless block_given?
    pos = 0
    input.scan(/(\w+)(\W*)/) do |word, white|
      startpos = pos
      pos = word.length + white.length
      if block_given?
        yield startpos, word
      else
        out << [startpos, word]
      end
    end
  end

  def find( text, doc )
    index = scanner(doc)
    sstr = text.gsub(/\W/,'')
    levenshtein = Amatch::Levensthtein.new(sstr)
    minlen = sstr.length
    maxndx = index.length
    possibles = []
    minscore = minlen*2
    index.each_with_index do |x, i|
      spos = x[0]
      str = x[1]
      si = i
      while (str.length < minlen)
        i += 1
        break unless i < maxndx
        str += index[i][1]
      end
      str = str.slice(0,minlen) if (str.length > minlen)
      score = levenshtein.search(str)
      if score < minscore
        possibles = [spos]
        minscore = score
      elsif score == minscore
        possibles << spos
      end
    end
    [minscore, possibles]
  end
end

显然,有许多可能的改进,而且可能是必要的!最重要的几点:

  1. 处理一次文档并存储
    结果,可能在数据库中。
  2. 确定字符串的可用长度
    进行初步检查、处理
    首先针对该初始子字符串
    在尝试匹配整个之前
    分段。
  3. 继上一篇之后,
    预先计算起始片段
    那个长度。

You could try amatch. It's available as a ruby gem and, although I haven't worked with fuzzy logic for a long time, it looks to have what you need. The homepage for amatch is: https://github.com/flori/amatch.

Just bored and messing around with the idea, a completely non-optimized and untested hack of a solution follows:

include 'amatch'

module FuzzyFinder
  def scanner( input )
    out = [] unless block_given?
    pos = 0
    input.scan(/(\w+)(\W*)/) do |word, white|
      startpos = pos
      pos = word.length + white.length
      if block_given?
        yield startpos, word
      else
        out << [startpos, word]
      end
    end
  end

  def find( text, doc )
    index = scanner(doc)
    sstr = text.gsub(/\W/,'')
    levenshtein = Amatch::Levensthtein.new(sstr)
    minlen = sstr.length
    maxndx = index.length
    possibles = []
    minscore = minlen*2
    index.each_with_index do |x, i|
      spos = x[0]
      str = x[1]
      si = i
      while (str.length < minlen)
        i += 1
        break unless i < maxndx
        str += index[i][1]
      end
      str = str.slice(0,minlen) if (str.length > minlen)
      score = levenshtein.search(str)
      if score < minscore
        possibles = [spos]
        minscore = score
      elsif score == minscore
        possibles << spos
      end
    end
    [minscore, possibles]
  end
end

Obviously there are numerous improvements possible and probably necessary! A few off the top:

  1. Process the document once and store
    the results, possibly in a database.
  2. Determine a usable length of string
    for an initial check, process
    against that initial substring first
    before trying to match the entire
    fragment.
  3. Following up on the previous,
    precalculate starting fragments of
    that length.
难理解 2024-11-15 17:52:26

一个简单的方法是 fuzzy_match

require 'fuzzy_match'
FuzzyMatch.new(['seamus', 'andy', 'ben']).find('Shamus') #=> seamus

更详细的方法是 < a href="https://github.com/tliff/levenshtein" rel="nofollow">levenshein,计算差异的数量。

require 'levenshtein' 
Levenshtein.distance('test', 'test')    # => 0
Levenshtein.distance('test', 'tent')    # => 1

A simple one is fuzzy_match

require 'fuzzy_match'
FuzzyMatch.new(['seamus', 'andy', 'ben']).find('Shamus') #=> seamus

A more elaborated one (you wouldn't say it from this example though) is levenshein, which computes the number of differences.

require 'levenshtein' 
Levenshtein.distance('test', 'test')    # => 0
Levenshtein.distance('test', 'tent')    # => 1
桃扇骨 2024-11-15 17:52:26

您应该查看此处详细介绍的 StrikeAMatch 实现:
一种更好的可变长度字符串相似度排名算法

依靠某种字符串距离(即两个字符串之间的变化数量),这个查看字符对模式。每个字符串中出现的字符对越多,匹配得越好。它对我们的应用程序非常有用,我们在纯文本文件中搜索输入错误/可变长度的标题。

还有一个 gem,它结合了 StrikeAMatch(字符级二元组上的 Dice 系数 的实现) )和编辑距离来查找匹配: https://github.com/seamusabshere/fuzzy_match

You should look at the StrikeAMatch implementation detailed here:
A better similarity ranking algorithm for variable length strings

Instead of relying on some kind of string distance (i.e. number of changes between two strings), this one looks at the character pairs patterns. The more character pairs occur in each string, the better the match. It has worked wonderfully for our application, where we search for mistyped/variable length headings in a plain text file.

There's also a gem which combines StrikeAMatch (an implementation of Dice's coefficient on character-level bigrams) and Levenshtein distance to find matches: https://github.com/seamusabshere/fuzzy_match

残月升风 2024-11-15 17:52:26

这取决于可能出现在子字符串中的工件。在更简单的情况下,它们不是 [az] 的一部分,您可以使用解析子字符串,然后在文档上使用 Regexp#match :(

document = 'Ulputat non nullandigna tortor dolessi illam sectem laor acipsus.'
substr = "tortor - dolessi _%&#   +illam"

re = Regexp.new(substr.split(/[^a-z]/i).select{|e| !e.empty?}.join(".*"))
md = document.match re
puts document[md.begin(0) ... md.end(0)]
# => tortor dolessi illam

在这里,因为我们不在正则表达式中设置任何括号,我们在 MatchData 的第一个(完全匹配)元素 0 上使用 beginend >

如果您只对起始位置感兴趣,可以使用 =~ 运算符:

start_pos = document =~ re

It depends on the artifacts that can end up in the substring. In the simpler case where they are not part of [a-z] you can use parse the substring and then use Regexp#match on the document:

document = 'Ulputat non nullandigna tortor dolessi illam sectem laor acipsus.'
substr = "tortor - dolessi _%&#   +illam"

re = Regexp.new(substr.split(/[^a-z]/i).select{|e| !e.empty?}.join(".*"))
md = document.match re
puts document[md.begin(0) ... md.end(0)]
# => tortor dolessi illam

(Here, as we do not set any parenthesis in the Regexp, we use begin and end on the first (full match) element 0 of MatchData.

If you are only interested in the start position, you can use =~ operator:

start_pos = document =~ re
最偏执的依靠 2024-11-15 17:52:26

我没有使用过它们,但我只是通过在 ruby​​gems.org 中搜索“diff”找到了一些库。所有这些都可以通过 gem 安装。您可能想尝试一下。我自己也很感兴趣,所以如果你已经知道这些或者你尝试过,留下你的评论将会很有帮助。

I have used none of them, but I found some libraries just by doing a search for 'diff' in rubygems.org. All of them can be installed by gem. You might want to try them. I myself is interested, so if you already know these or if you try them out, it would be helpful if you leave your comment.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文