我发现了很多关于模糊匹配的链接,将一个字符串与另一个字符串进行比较,看看哪个字符串的相似度得分最高。
我有一个很长的字符串(一个文档)和一个子字符串。子字符串来自原始文档,但已被转换多次,因此可能会引入奇怪的工件,例如这里有一个空格,那里有一个破折号。子字符串将与原始文档中的文本部分匹配 99% 或更多。我无法匹配该字符串来自哪个文档,我试图在该字符串开始的文档中找到索引。
如果由于没有引入随机错误而导致字符串相同,我会使用 document.index(substring) ,但是如果只有一个字符差异,则会失败。
我认为可以通过删除字符串和子字符串中除 az 之外的所有字符进行比较,然后使用压缩字符串时生成的索引将压缩字符串中的索引转换为真实文档中的索引来解决差异。当差异是空格和标点符号时,这种方法效果很好,但一旦有一个字母不同,它就会失败。
文档通常是几页到一百页,子串从几句话到几页。
I found lots of links about fuzzy matching, comparing one string to another and seeing which gets the highest similarity score.
I have one very long string, which is a document, and a substring. The substring came from the original document, but has been converted several times, so weird artifacts might have been introduced, such as a space here, a dash there. The substring will match a section of the text in the original document 99% or more. I am not matching to see from which document this string is, I am trying to find the index in the document where the string starts.
If the string was identical because no random error was introduced, I would use document.index(substring)
, however this fails if there is even one character difference.
I thought the difference would be accounted for by removing all characters except a-z in both the string and the substring, compare, and then use the index I generated when compressing the string to translate the index in the compressed string to the index in the real document. This worked well where the difference was whitespace and punctuation, but as soon as one letter is different it failed.
The document is typically a few pages to a hundred pages, and the substring from a few sentences to a few pages.
发布评论
评论(5)
你可以尝试一下匹配。它可以作为 ruby gem 提供,虽然我已经很长时间没有使用模糊逻辑了,但它看起来有你需要的东西。 amatch 的主页是:https://github.com/flori/amatch。
只是无聊地摆弄这个想法,一个完全未经优化且未经测试的解决方案黑客如下:
显然,有许多可能的改进,而且可能是必要的!最重要的几点:
结果,可能在数据库中。
进行初步检查、处理
首先针对该初始子字符串
在尝试匹配整个之前
分段。
预先计算起始片段
那个长度。
You could try amatch. It's available as a ruby gem and, although I haven't worked with fuzzy logic for a long time, it looks to have what you need. The homepage for amatch is: https://github.com/flori/amatch.
Just bored and messing around with the idea, a completely non-optimized and untested hack of a solution follows:
Obviously there are numerous improvements possible and probably necessary! A few off the top:
the results, possibly in a database.
for an initial check, process
against that initial substring first
before trying to match the entire
fragment.
precalculate starting fragments of
that length.
一个简单的方法是 fuzzy_match
更详细的方法是 < a href="https://github.com/tliff/levenshtein" rel="nofollow">levenshein,计算差异的数量。
A simple one is fuzzy_match
A more elaborated one (you wouldn't say it from this example though) is levenshein, which computes the number of differences.
您应该查看此处详细介绍的 StrikeAMatch 实现:
一种更好的可变长度字符串相似度排名算法
依靠某种字符串距离(即两个字符串之间的变化数量),这个查看字符对模式。每个字符串中出现的字符对越多,匹配得越好。它对我们的应用程序非常有用,我们在纯文本文件中搜索输入错误/可变长度的标题。
还有一个 gem,它结合了 StrikeAMatch(字符级二元组上的 Dice 系数 的实现) )和编辑距离来查找匹配: https://github.com/seamusabshere/fuzzy_match
You should look at the StrikeAMatch implementation detailed here:
A better similarity ranking algorithm for variable length strings
Instead of relying on some kind of string distance (i.e. number of changes between two strings), this one looks at the character pairs patterns. The more character pairs occur in each string, the better the match. It has worked wonderfully for our application, where we search for mistyped/variable length headings in a plain text file.
There's also a gem which combines StrikeAMatch (an implementation of Dice's coefficient on character-level bigrams) and Levenshtein distance to find matches: https://github.com/seamusabshere/fuzzy_match
这取决于可能出现在子字符串中的工件。在更简单的情况下,它们不是
[az]
的一部分,您可以使用解析子字符串,然后在文档上使用Regexp#match
:(在这里,因为我们不在正则表达式中设置任何括号,我们在
MatchData
的第一个(完全匹配)元素0
上使用begin
和end
>如果您只对起始位置感兴趣,可以使用
=~
运算符:It depends on the artifacts that can end up in the substring. In the simpler case where they are not part of
[a-z]
you can use parse the substring and then useRegexp#match
on the document:(Here, as we do not set any parenthesis in the Regexp, we use
begin
andend
on the first (full match) element0
ofMatchData
.If you are only interested in the start position, you can use
=~
operator:我没有使用过它们,但我只是通过在 rubygems.org 中搜索“diff”找到了一些库。所有这些都可以通过 gem 安装。您可能想尝试一下。我自己也很感兴趣,所以如果你已经知道这些或者你尝试过,留下你的评论将会很有帮助。
I have used none of them, but I found some libraries just by doing a search for 'diff' in
rubygems.org
. All of them can be installed by gem. You might want to try them. I myself is interested, so if you already know these or if you try them out, it would be helpful if you leave your comment.