How to return a Ruby array intersection with duplicate elements? (problem with bigrams in Dice's Coefficient)

Posted on 2024-08-07 23:36:57

I'm trying to script Dice's Coefficient, but I'm having a bit of a problem with the array intersection.

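def bigram(string)
  # Pad with boundary markers so the first and last characters form bigrams too.
  # downcase (not downcase!) leaves the caller's string untouched.
  bgstring = "%" + string.downcase + "#"
  bgarray = []
  0.upto(bgstring.length - 2) do |i|
    bgarray << bgstring[i, 2]
  end
  bgarray
end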

def approx_string_match(teststring, refstring)
  test_bigram = bigram(teststring) #.uniq
  ref_bigram = bigram(refstring)   #.uniq

  bigram_overlay = test_bigram & ref_bigram

  result = (2*bigram_overlay.length.to_f)/(test_bigram.length.to_f+ref_bigram.length.to_f)*100

  return result
end

The problem is that, because & removes duplicates, I get results like this:

string1="Almirante Almeida Almada"
string2="Almirante Almeida Almada"

puts approx_string_match(string1, string2) # => 76.0

It should return 100.
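A rough sketch of where the 76.0 comes from. The bigram method is restated here in a non-mutating form only so the snippet runs on its own; that rewrite is my assumption, not the original code. The padded string yields 25 bigrams but only 19 distinct ones, and & keeps each common element once.

```ruby
# Non-mutating restatement of the bigram method (assumed equivalent).
def bigram(string)
  padded = "%#{string.downcase}#"
  (0..padded.length - 2).map { |i| padded[i, 2] }
end

bigrams = bigram("Almirante Almeida Almada")
overlap = bigrams & bigrams   # Array#& returns each common element only once

puts bigrams.length   # 25
puts overlap.length   # 19

# Same formula as approx_string_match: 2 * 19 / (25 + 25) * 100
puts 2.0 * overlap.length / (bigrams.length + bigrams.length) * 100
```

The six "missing" bigrams are the repeats of "al", "lm", " a" and "da".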

Calling uniq on both bigram arrays fixes this case, but it loses information, which may produce unwanted matches in the particular dataset I'm working with.

How can I get an intersection with all duplicates included?

北风几吹夏 2024-08-14 23:36:57

As Yuval F said, you should use a multiset. However, there is no multiset in the Ruby standard library; take a look here and here.

If performance is not that critical for your application, you can still do it using Array with a little bit of code.

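def intersect(a, b)
  remaining = b.dup   # work on a copy so the caller's array is not modified
  a.each_with_object([]) do |s, result|
    index = remaining.index(s)
    next if index.nil?
    result << s
    remaining.delete_at(index)   # each element of b can be matched only once
  end
end

a = ["al", "al", "lc", "lc", "ld"]
b = ["al", "al", "lc", "ef"]
puts intersect(a, b).inspect   # ["al", "al", "lc"]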

凉月流沐 2024-08-14 23:36:57

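From this link, I believe you should not use Ruby's sets but rather multisets, so that each bigram is counted as many times as it appears. Maybe you can use this gem for multisets. This should give the correct behavior for recurring bigrams.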
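To make the multiset idea concrete, here is a minimal sketch that does not use the gem at all: it stands in plain hashes of counts built with Array#tally (Ruby 2.7 or newer). The method name multiset_intersection is made up for illustration.

```ruby
# Multiset intersection: each element is kept min(count in a, count in b) times.
# multiset_intersection is a hypothetical helper, not part of any library.
# Array#tally requires Ruby 2.7 or newer.
def multiset_intersection(a, b)
  b_counts = b.tally
  a.tally.flat_map do |item, count|
    [item] * [count, b_counts.fetch(item, 0)].min
  end
end

p multiset_intersection(%w[al al lc lc ld], %w[al al lc ef])
# => ["al", "al", "lc"]
```

Unlike &, the result keeps "al" twice because it occurs twice in both arrays, while "lc" is kept once because the second array has only one.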

神经大条 2024-08-14 23:36:57

I toyed with this, based on the answer from @pierr, for a while and ended up with this.

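a = ["al", "al", "lc", "lc", "lc", "lc", "ld"]
b = ["al", "al", "al", "al", "al", "lc", "ef"]
result = []
h1, h2 = Hash.new(0), Hash.new(0)
a.each { |x| h1[x] += 1 }   # count occurrences in a
b.each { |x| h2[x] += 1 }   # count occurrences in b
# keep each element as many times as it occurs in both arrays
h1.each_pair { |key, val| result << [key] * [val, h2[key]].min if h2[key] != 0 }
result.flatten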

=> ["al", "al", "lc"]

This could be a kind of multiset intersection of a and b, but don't take my word for it, because I haven't tested it enough to be sure.
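One way to check it against the original question is to wrap the counting code in a method and plug it into the Dice calculation. This is only a sketch under assumptions: count_intersection is a made-up name, and bigram is restated in a non-mutating form so the snippet is self-contained.

```ruby
# Non-mutating restatement of the question's bigram method (assumed equivalent).
def bigram(string)
  padded = "%#{string.downcase}#"
  (0..padded.length - 2).map { |i| padded[i, 2] }
end

# Same counting idea as the snippet above, wrapped in a hypothetical method.
def count_intersection(a, b)
  h1 = Hash.new(0)
  h2 = Hash.new(0)
  a.each { |x| h1[x] += 1 }
  b.each { |x| h2[x] += 1 }
  # [key] * 0 is an empty array, so elements missing from b drop out
  h1.flat_map { |key, val| [key] * [val, h2[key]].min }
end

def approx_string_match(teststring, refstring)
  test_bigram = bigram(teststring)
  ref_bigram = bigram(refstring)
  overlap = count_intersection(test_bigram, ref_bigram)
  2.0 * overlap.length / (test_bigram.length + ref_bigram.length) * 100
end

puts approx_string_match("Almirante Almeida Almada", "Almirante Almeida Almada")
# prints 100.0
```

With duplicates preserved in the overlap, identical strings score 100 without resorting to uniq.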
