Fuzzy matching two hash tables?

Posted on 2024-08-31 23:10:49

I'm looking for ideas on how best to match two hash tables containing string key/value pairs.

Here's the actual problem I'm facing: I have structured data coming in that is imported into a database. I need to UPDATE records that are already in the DB; however, ANY value in the source can change, so I don't have a reliable ID.

I'm thinking of fuzzy matching two rows, one from the source and one from the DB, and making an "educated" guess as to whether the record should be updated or inserted.

Any ideas would be greatly appreciated.

Solution

The solution is based on Ben Robinson's post. It works pretty well, tolerates small mismatches here and there, and supports custom per-key weights.

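require 'rubygems'
require 'amatch'

class Hash
  # Returns a similarity score between 0.0 and 1.0 for self and +hash+.
  # Only keys present in self are compared. key_weights maps a key to its
  # relative importance (default 1; a weight of 0 skips the key).
  def fuzzy_match(hash, key_weights = {})
    sum_total = 0.0
    sum_weights = 0.0

    each_key do |key|
      weight = key_weights[key] || 1
      next if weight == 0

      # levenshtein_similar comes from amatch and returns a value in 0.0..1.0.
      sum_total += self[key].to_s.levenshtein_similar(hash[key].to_s) * weight
      sum_weights += weight
    end

    # Avoid 0.0 / 0.0 (NaN) when there is nothing to compare.
    return 0.0 if sum_weights.zero?

    sum_total / sum_weights
  end
end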
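To make the update-or-insert decision, the score still has to be compared against a cutoff. Here is a minimal sketch of how that step might look. The `classify_row` helper, the 0.6 threshold, and the sample rows are all hypothetical, and the scorer is a deliberately simple stand-in (fraction of exactly equal fields) so the sketch runs without any gem. In practice the block would call something like `fuzzy_match` instead.

```ruby
# Hypothetical helper: pick the best-scoring DB row for one incoming row and
# decide between UPDATE and INSERT. The scoring block and `threshold` are
# assumptions that need tuning against real data.
def classify_row(source_row, db_rows, threshold: 0.8)
  best_row = nil
  best_score = 0.0

  db_rows.each do |row|
    score = yield(source_row, row)
    if score > best_score
      best_row = row
      best_score = score
    end
  end

  if best_row && best_score >= threshold
    { action: :update, target: best_row, score: best_score }
  else
    { action: :insert, target: nil, score: best_score }
  end
end

# Stand-in scorer: the fraction of fields whose values are exactly equal.
exact_fraction = lambda do |a, b|
  return 0.0 if a.empty?
  a.keys.count { |k| a[k].to_s == b[k].to_s } / a.size.to_f
end

db_rows = [
  { "name" => "Acme Corp", "city" => "Berlin", "phone" => "555-0100" },
  { "name" => "Globex",    "city" => "Paris",  "phone" => "555-0199" }
]

changed = { "name" => "Acme Corp", "city" => "Berlin", "phone" => "555-0111" }
unknown = { "name" => "Initech",   "city" => "Austin", "phone" => "555-0142" }

update_decision = classify_row(changed, db_rows, threshold: 0.6, &exact_fraction)
insert_decision = classify_row(unknown, db_rows, threshold: 0.6, &exact_fraction)
```

With these sample rows, `changed` shares two of three fields with the first DB row and is classified as an update, while `unknown` matches nothing and is classified as an insert. Comparing every source row against every DB row is quadratic, so for larger tables some pre-filtering of candidates would probably be needed.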

Comments (2)

a√萤火虫的光℡ 2024-09-07 23:10:49

I have used the Levenshtein distance to do fuzzy matching recently. I compute the distance between two possibly matching strings and give the match a score that is the inverse of the distance. I then take a weighted average of the scores across the fields to determine a score for the record, so that more important fields count more heavily than less important ones. It was used in a CRM application where leads were coming in from many different sources. The client needed to match these against existing leads/opportunities/clients/resellers etc. It took a bit of adjusting of the thresholds for which score counted as a match and which didn't. In the end we got about a 1% false positive rate, which I think is quite good really.
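For anyone who wants to try this without a gem, here is a rough pure-Ruby sketch of the idea. It is not the code from the CRM project. All method names and the sample records are made up for illustration, and turning the distance into a score as 1 - distance / longer length is one assumption among several possible ways to "invert" a distance.

```ruby
# Classic dynamic-programming Levenshtein distance, keeping only two rows.
def levenshtein(a, b)
  return b.length if a.empty?
  return a.length if b.empty?

  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Assumption: score = 1 - distance / longer length, so 1.0 means identical.
def field_score(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein(a, b).to_f / longest
end

# Weighted average of the per-field scores; fields without an explicit
# weight count as 1, and a weight of 0 skips the field.
def record_score(left, right, weights = {})
  total = 0.0
  weight_sum = 0.0
  left.each_key do |key|
    w = weights.fetch(key, 1)
    next if w.zero?
    total += field_score(left[key].to_s, right[key].to_s) * w
    weight_sum += w
  end
  weight_sum.zero? ? 0.0 : total / weight_sum
end

lead     = { "name" => "Jon Smith",  "email" => "jon@example.com" }
existing = { "name" => "John Smith", "email" => "jon@example.com" }

# The email field is weighted three times as heavily as the name.
score = record_score(lead, existing, "email" => 3)
```

The resulting score is then compared against a threshold, which, as mentioned, has to be tuned by looking at real matches and mismatches.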

入怼 2024-09-07 23:10:49

If you are importing data in SQL Server, SSIS has a fuzzy match task. Try it to see if you like the results. We've found it really helpful in situations like this.
