How to return a Ruby array intersection with duplicate elements? (problem with bigrams in Dice's Coefficient)
I'm trying to script Dice's Coefficient, but I'm having a bit of a problem with the array intersection.
    def bigram(string)
      # Pad with boundary markers so the first and last characters get their own bigrams.
      # downcase (not downcase!) leaves the caller's string unmodified.
      bgstring = "%" + string.downcase + "#"
      bgarray = []
      0.upto(bgstring.length - 2) do |i|
        bgarray << bgstring[i, 2]
      end
      bgarray
    end

    def approx_string_match(teststring, refstring)
      test_bigram = bigram(teststring) #.uniq
      ref_bigram = bigram(refstring) #.uniq
      # Array#& keeps each common bigram only once.
      bigram_overlay = test_bigram & ref_bigram
      (2 * bigram_overlay.length.to_f) / (test_bigram.length.to_f + ref_bigram.length.to_f) * 100
    end
The problem is that, because & removes duplicates, I get results like this:
    string1 = "Almirante Almeida Almada"
    string2 = "Almirante Almeida Almada"
    puts approx_string_match(string1, string2)  # => 76.0
It should return 100.
Calling uniq on both bigram arrays fixes this, but there is information loss, which may bring unwanted matches in the particular dataset I'm working with.
How can I get an intersection with all duplicates included?
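To make the behavior concrete, here is a minimal illustration with made-up arrays (the values are illustrative only, not taken from the dataset in question). Array#& returns each common element once, no matter how often it occurs on either side, which is why two identical strings with repeated bigrams score below 100.

```ruby
# Illustrative arrays (hypothetical values): "al" occurs twice in both.
a = %w[al lm al lc]
b = %w[al al lc xx]

# Array#& keeps one copy of each common element, in the order of `a`.
intersection = a & b
p intersection            # => ["al", "lc"]

# Even an array intersected with itself loses its repeats.
same = %w[al al]
p (same & same).length    # => 1
```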
Comments (3)
As Yuval F said, you should use a multiset. However, there is no multiset in the Ruby standard library; take a look here and here. If performance is not critical for your application, you can still do it using Array and a little bit of code.
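The Array-only approach mentioned above could look roughly like the following. This is a sketch under stated assumptions, not the code from the original answer: `multiset_intersection` is a hypothetical helper name, and the sample arrays are made up. Each element of the second array can be matched at most once, so repeats survive only as often as they occur on both sides.

```ruby
# Hypothetical helper (not from the original answer): multiset intersection
# using only Array. Every occurrence in `b` can be consumed at most once.
def multiset_intersection(a, b)
  remaining = b.dup                  # work on a copy so `b` is not modified
  a.each_with_object([]) do |item, result|
    index = remaining.index(item)    # first unmatched occurrence, or nil
    next if index.nil?
    result << item
    remaining.delete_at(index)       # consume the matched occurrence
  end
end

a = %w[al al lc xx]
b = %w[al lc al al]
p multiset_intersection(a, b)        # => ["al", "al", "lc"]
```

The lookup with `index` scans the remaining elements for every item, so the cost grows with the product of the two array lengths. That is usually acceptable for short strings, which matches the caveat about performance.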
Based on this link, I believe you should not use Ruby's sets but rather multisets, so that every bigram is counted as many times as it appears. Maybe you can use this gem for multisets. This should give the correct behavior for recurring bigrams.
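Without any gem, the counting idea can be sketched with a plain Hash. The names `bigrams`, `bigram_counts` and `dice_percent` are hypothetical, and the code assumes Ruby 2.4 or later for `sum`. The size of the multiset intersection is the sum, over all bigrams, of the smaller of the two counts.

```ruby
# Sketch only: helper names are hypothetical, not from the question.
def bigrams(string)
  padded = "%#{string.downcase}#"    # downcase returns a copy; input untouched
  (0..padded.length - 2).map { |i| padded[i, 2] }
end

def bigram_counts(string)
  counts = Hash.new(0)               # missing keys read as zero
  bigrams(string).each { |bg| counts[bg] += 1 }
  counts
end

def dice_percent(test_string, ref_string)
  test_counts = bigram_counts(test_string)
  ref_counts  = bigram_counts(ref_string)
  # Multiset intersection size: smaller count of each shared bigram.
  overlap = test_counts.sum { |bg, n| [n, ref_counts[bg]].min }
  total   = test_counts.values.sum + ref_counts.values.sum
  200.0 * overlap / total
end

puts dice_percent("Almirante Almeida Almada", "Almirante Almeida Almada")  # prints 100.0
```

Because repeated bigrams are counted rather than collapsed, identical strings score 100 without resorting to uniq.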
I toyed with this for a while, based on the answer from @pierr, and ended up with this.
    => ["al", "al", "lc"]
This could be a kind of multiset intersection of a & b, but don't take my word for it because I haven't tested it enough to be sure.