Ruby: How do I test for similarity between two blocks of text?

Posted on 2024-12-08 20:51:15

So, let's say I have these texts:

Text1:

absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients.

Text2:

zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate

Text3:

When the zerg first arrived in the Koprulu sector, they were unified by their absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate the advanced protoss race, it found useful but undeveloped material in humanity.

Now, the end of Text1 and the beginning of Text2 overlap, so we'd say the text blocks aren't unique. Similarly, Text1 (as well as Text2) can be found inside Text3, so Text3 is also not unique, due to the overlap.

So, my question:

How do I go about writing something that can look at consecutive letters or words and determine uniqueness? Ideally, I'd want such a method to return some value, representing the amount of similarity--maybe the number of matched words over the average of the two text blocks' size. When it returns 0, both texts tested should be completely unique.

Here are some problems I've run into when playing around with Ruby's string methods.

First, I started trying to find the intersection of two strings.

>> a = "nt version, there are no ch"  
>> b = "he current versi"  
>> (a.chars.to_a & b.chars.to_a).join  
=> "nt versihc"  

The problem with the above method is that it keeps each common character only once, in the order it first appears in a, so we lose the original character order and it becomes hard to test uniqueness. In any case, I don't think intersection is the best way to start this similarity comparison. Any number of combinations of words could be present in both texts being compared. So maybe I could build an array of consecutive similarities... but that would require traversing one of the texts once for every phrase length we try.

I guess I really just don't know where to start, and in a way that is efficient and not O(n^too_high).

Comments (5)

红焚 2024-12-15 20:51:15

I believe what you're looking for is the Longest Common Substring problem, i.e. the problem of finding, given two strings, the longest substring they have in common. The link is to the Wikipedia page, which will help you understand the domain and contains a pseudocode example of an algorithm that runs in O(nm) time.

Further, Wikibooks' Algorithm Implementation book has an implementation in Ruby. It includes an lcs_size method that may be all you need. In short, if lcs_size(text1, text2) returns 4 that means text1 and text2 have very little consecutive text in common, probably just a single word, but if it returns, say, 40, they might have an entire sentence in common.
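To give a feel for what such a method does, here is a minimal sketch of the O(nm) dynamic-programming approach. It is not the Wikibooks code; the name lcs_size is borrowed from the text above only for illustration, and the row-by-row layout is my own assumption about how one might write it.

```ruby
# Sketch only: length of the longest run of consecutive characters
# that appears in both strings. Not the Wikibooks implementation.
def lcs_size(a, b)
  b_chars = b.chars
  # prev[j + 1] holds the length of the common run ending at the
  # previous character of a and at b_chars[j].
  prev = Array.new(b_chars.length + 1, 0)
  best = 0
  a.chars.each do |ca|
    curr = Array.new(b_chars.length + 1, 0)
    b_chars.each_with_index do |cb, j|
      next unless ca == cb
      curr[j + 1] = prev[j] + 1          # extend the run that ended one step back
      best = curr[j + 1] if curr[j + 1] > best
    end
    prev = curr
  end
  best
end

puts lcs_size("nt version, there are no ch", "he current versi") # => 8 ("nt versi")
```

Only two rows of the table are kept at a time, so memory stays proportional to the length of the second string.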

Hope that's helpful!

帅哥哥的热头脑 2024-12-15 20:51:15

Here's a Ruby implementation of the Levenshtein distance algorithm. After installing the gem (gem install text), you can use it like this:

require 'rubygems' # only needed on Ruby 1.8
require 'text'     # lowercase: the file name is case-sensitive on most systems

t1 = "absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients."

t2 = "zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate"

# Prints the number of single-character edits needed to turn t1 into t2.
puts Text::Levenshtein.distance(t1,t2)

月野兔 2024-12-15 20:51:15

Your problem is not Ruby; it's the algorithm. You could split each text into words, then run a minimum edit distance algorithm such as Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance) over the two word lists.

The smaller the number, the more similar the texts are.
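A rough sketch of that idea in plain Ruby, with no gems, might look like the following. The helper names words and levenshtein are made up for this example, and the tokenizing regex is an assumption (lowercase ASCII words only).

```ruby
# Hypothetical helper: lowercase the text and pull out the words,
# dropping punctuation. ASCII-only by assumption.
def words(text)
  text.downcase.scan(/[a-z0-9']+/)
end

# Levenshtein distance between two arrays. With arrays of words, one
# "edit" is inserting, deleting, or replacing a whole word.
def levenshtein(a, b)
  prev = (0..b.length).to_a            # distance from an empty prefix of a
  a.each_with_index do |x, i|
    curr = [i + 1]                     # deleting the first i + 1 items of a
    b.each_with_index do |y, j|
      cost = x == y ? 0 : 1
      curr << [prev[j + 1] + 1,        # delete x
               curr[j] + 1,            # insert y
               prev[j] + cost].min     # keep or replace
    end
    prev = curr
  end
  prev.last
end

puts levenshtein(words("the quick brown fox"), words("the quick red fox")) # => 1
```

Keep in mind that edit distance measures how far apart the two texts are as a whole, so two blocks that merely share a passage can still score a large distance.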

执笏见 2024-12-15 20:51:15

This could be improved a lot, but it's an idea:

txt1 = "absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients."
txt2 = "zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate"

# Replace periods and commas with spaces, lowercase, and split into words.
def txt_to_ary(txt)
    txt.gsub(/\.|,/, ' ').downcase.split(/\s+/)
end

# Length, in words, of the longest run of consecutive words found in both arrays.
def longest_match(txt1, txt2)
    longest = 0
    txt1.each_with_index do |w1, i|
        txt2.each_with_index do |w2, j|
            next unless w1 == w2
            k = 0
            # Stop at the end of txt1; otherwise nil == nil would loop forever
            # when both texts end with the same words.
            k += 1 while txt1[i+k] && txt1[i+k] == txt2[j+k]
            longest = k if k > longest
        end
    end
    longest
end

txt1 = txt_to_ary( txt1 )
txt2 = txt_to_ary( txt2 )

puts longest_match(txt1, txt2) #=>12
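
The question asked for a score along the lines of "matched words over the average of the two text blocks' size". One possible way to get there from a longest-run count is sketched below. The names tokenize, longest_run and similarity are invented for this example, and it counts only the single longest shared run rather than every matched word, which is a simplification.

```ruby
# Hypothetical helper: lowercase words without punctuation (ASCII only).
def tokenize(text)
  text.downcase.scan(/[a-z0-9']+/)
end

# Longest run of consecutive words shared by two word arrays.
def longest_run(a, b)
  longest = 0
  a.each_index do |i|
    b.each_index do |j|
      k = 0
      # Bounds checks keep the scan from running past either array.
      k += 1 while i + k < a.length && j + k < b.length && a[i + k] == b[j + k]
      longest = k if k > longest
    end
  end
  longest
end

# Longest shared run divided by the average word count of the two texts:
# 0.0 when no word is shared, 1.0 when the word sequences are identical.
def similarity(text1, text2)
  a = tokenize(text1)
  b = tokenize(text2)
  return 0.0 if a.empty? || b.empty?
  longest_run(a, b) / ((a.length + b.length) / 2.0)
end

puts similarity("one two three four", "three four five six") # => 0.5
```

A text fully contained in a longer one scores below 1.0 with this formula; dividing by the shorter text's length instead would be an alternative if containment should count as a full match.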
〆一缕阳光ご 2024-12-15 20:51:15

The amatch gem is perfect for string comparisons.
