How can I return a Ruby array intersection that keeps duplicate elements? (The bigram problem in Dice's Coefficient)

Posted on 2024-08-07 23:36:57

I'm trying to script Dice's Coefficient, but I'm having a bit of a problem with the array intersection.

def bigram(string)
  # Pad with sentinel characters so the first and last letters form bigrams too.
  # downcase (not downcase!) leaves the caller's string untouched.
  bgstring = "%" + string.downcase + "#"
  # Collect every pair of adjacent characters.
  (0..bgstring.length - 2).map { |i| bgstring[i, 2] }
end

def approx_string_match(teststring, refstring)
  test_bigram = bigram(teststring) #.uniq
  ref_bigram = bigram(refstring)   #.uniq

  # Array#& returns only unique elements, so repeated bigrams are counted once.
  bigram_overlay = test_bigram & ref_bigram

  (2 * bigram_overlay.length.to_f) / (test_bigram.length + ref_bigram.length) * 100
end

The problem is that, because & removes duplicates, I get results like this:

string1 = "Almirante Almeida Almada"
string2 = "Almirante Almeida Almada"

puts approx_string_match(string1, string2)   # prints 76.0

It should return 100, since the two strings are identical.

Calling uniq on both bigram arrays fixes this case, but it loses information, which may produce unwanted matches in the particular dataset I'm working with.

How can I get an intersection with all duplicates included?
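
To see where the 76 comes from, the arithmetic can be reproduced directly. This is a small sketch, not part of the original post: the helper name `bigrams` is made up here and mirrors the question's `bigram` method. The padded string yields 25 bigrams, but only 19 of them are distinct, and Array#& keeps just the distinct ones.

```ruby
# Hypothetical helper mirroring the question's bigram method.
def bigrams(string)
  padded = "%" + string.downcase + "#"
  (0..padded.length - 2).map { |i| padded[i, 2] }
end

grams = bigrams("Almirante Almeida Almada")
total = grams.length               # every bigram, repeats included
distinct = (grams & grams).length  # what Array#& keeps: unique bigrams only

# Dice's Coefficient as computed in the question, with the deduplicated overlap.
score = 2.0 * distinct / (total + total) * 100
puts "total=#{total} distinct=#{distinct} score=#{score}"
```

The gap between `total` and `distinct` is exactly the information that & throws away.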

Comments (3)

北风几吹夏 2024-08-14 23:36:57

As Yuval F said, you should use a multiset. However, there is no multiset in the Ruby standard library; take a look here and here.

If performance is not that critical for your application, you can still do it using Array with a little bit of code.

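def intersect(a, b)
  # Work on a copy so the caller's b is not modified by delete_at.
  remaining = b.dup
  a.each_with_object([]) do |s, result|
    index = remaining.index(s)
    next if index.nil?
    result << s
    # Remove the matched element so it cannot be matched twice.
    remaining.delete_at(index)
  end
end

a = ["al", "al", "lc", "lc", "ld"]
b = ["al", "al", "lc", "ef"]
puts intersect(a, b).inspect   # ["al", "al", "lc"]

凉月流沐 2024-08-14 23:36:57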

Based on this link, I believe you should not use Ruby's sets but rather multisets, so that every bigram is counted as many times as it appears. Maybe you can use this gem for multisets. This should give the correct behavior for recurring bigrams.
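
If adding a gem is not an option, the size of a multiset intersection can be sketched with plain hashes of counts. The snippet below is only an illustration under assumptions: the method name `multiset_overlap_size` is invented here, and `Array#tally` requires Ruby 2.7 or later. For each element, it takes the smaller of its two occurrence counts.

```ruby
# Hypothetical helper: size of the multiset intersection of two arrays.
# For every element of a, add min(occurrences in a, occurrences in b).
def multiset_overlap_size(a, b)
  counts_a = a.tally   # e.g. {"al" => 2, "lc" => 2, "ld" => 1}
  counts_b = b.tally
  counts_a.sum { |item, n| [n, counts_b.fetch(item, 0)].min }
end

a = ["al", "al", "lc", "lc", "ld"]
b = ["al", "al", "lc", "ef"]
puts multiset_overlap_size(a, b)
```

Dice's Coefficient only needs the size of the overlap, so the intersection array itself never has to be built.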

神经大条 2024-08-14 23:36:57

I toyed with this for a while, based on the answer from @pierr, and ended up with the following.

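a = ["al", "al", "lc", "lc", "lc", "lc", "ld"]
b = ["al", "al", "al", "al", "al", "lc", "ef"]
result = []
# Count how many times each element occurs in each array.
h1, h2 = Hash.new(0), Hash.new(0)
a.each { |x| h1[x] += 1 }
b.each { |x| h2[x] += 1 }
# Keep each shared element min(count in a, count in b) times.
h1.each_pair { |key, val| result << [key] * [val, h2[key]].min if h2[key] != 0 }
result.flatten

=> ["al", "al", "lc"]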

This could be a kind of multiset intersection of a and b, but don't take my word for it, because I haven't tested it enough to be sure.
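
One way to gain some confidence in the counting approach is to plug it back into the original problem. The following is a sketch, not the asker's final code: the names `bigrams`, `count_items`, and `dice_percent` are made up for this example. It computes Dice's Coefficient from the min-of-counts overlap and checks that two identical strings now score 100.

```ruby
# Hypothetical helper mirroring the question's bigram method.
def bigrams(string)
  padded = "%" + string.downcase + "#"
  (0..padded.length - 2).map { |i| padded[i, 2] }
end

# Occurrence counts with a default of 0 for missing keys.
def count_items(items)
  items.each_with_object(Hash.new(0)) { |item, counts| counts[item] += 1 }
end

# Dice's Coefficient as a percentage, counting repeated bigrams.
def dice_percent(test_string, ref_string)
  test_grams = bigrams(test_string)
  ref_grams = bigrams(ref_string)
  ref_counts = count_items(ref_grams)
  # Multiset overlap: each bigram contributes the smaller of its two counts.
  overlap = count_items(test_grams).sum { |gram, n| [n, ref_counts[gram]].min }
  2.0 * overlap / (test_grams.length + ref_grams.length) * 100
end

puts dice_percent("Almirante Almeida Almada", "Almirante Almeida Almada")
puts dice_percent("night", "nacht")
```

Because repeated bigrams are no longer collapsed, the numerator and denominator are measured on the same basis, which is what the & version was missing.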
