Fuzzy matching two hash tables?

Posted on 2024-08-31 23:10:49

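I'm looking for ideas on how best to match two hash tables containing string key/value pairs.

Here's the actual problem I'm facing: structured data comes in and is imported into a database. I need to UPDATE records that are already in the DB; however, ANY value in the source may change, so I don't have a reliable ID.

I'm thinking of fuzzy matching two rows, one from the source and one from the DB, and making an "educated" guess as to whether the record should be updated or inserted.

Any ideas would be greatly appreciated.

Solution

The solution is based on Ben Robinson's post. It works pretty well, tolerating small mismatches here and there and supporting custom per-key weights.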

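require 'rubygems'
require 'amatch'

class Hash
  # Returns a similarity score between 0.0 and 1.0 for this hash and +hash+.
  # +key_weights+ maps keys to relative weights; unlisted keys weigh 1,
  # and a weight of 0 excludes the key from the comparison.
  def fuzzy_match(hash, key_weights = {})
    sum_total = 0.0
    sum_weights = 0.0

    each do |key, value|
      weight = key_weights[key] || 1
      next if weight == 0

      # levenshtein_similar (from amatch) yields 0.0..1.0, 1.0 meaning identical
      sum_total += value.to_s.levenshtein_similar(hash[key].to_s) * weight
      sum_weights += weight
    end

    # Avoid 0.0 / 0.0 (NaN) when there are no keys or every weight is 0
    return 0.0 if sum_weights.zero?

    sum_total / sum_weights
  end
end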
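One way the score might feed the update-or-insert decision from the question is sketched below. The helper name `find_update_target`, the thresholds, and the sample rows are hypothetical and not part of the original solution. The scorer is passed in as a block, so `fuzzy_match` could be plugged in there; to keep the sketch runnable without the amatch gem, the usage lines substitute a plain exact-match ratio.

```ruby
# Hypothetical helper: pick the database row that best matches `source`.
# The block receives (source, candidate) and must return a score in 0.0..1.0.
# Returns the best candidate, or nil when no score reaches `threshold`
# (the caller would then INSERT instead of UPDATE).
def find_update_target(source, db_rows, threshold = 0.85)
  best_row = nil
  best_score = 0.0
  db_rows.each do |row|
    score = yield(source, row)
    if score > best_score
      best_row = row
      best_score = score
    end
  end
  best_score >= threshold ? best_row : nil
end

# Stand-in scorer for illustration only: fraction of keys with identical values.
exact_ratio = lambda do |a, b|
  return 0.0 if a.empty?
  a.keys.count { |k| a[k].to_s == b[k].to_s } / a.size.to_f
end

# Made-up sample data.
db_rows = [
  { 'name' => 'Acme Corp', 'city' => 'Boston', 'phone' => '555-0100' },
  { 'name' => 'Globex',    'city' => 'Denver', 'phone' => '555-0199' }
]
source = { 'name' => 'Acme Corp', 'city' => 'Boston', 'phone' => '555-0101' }

target = find_update_target(source, db_rows, 0.6) { |a, b| exact_ratio.call(a, b) }
puts target ? "UPDATE #{target['name']}" : 'INSERT'
```

With the amatch-based method, the block would presumably become `{ |a, b| a.fuzzy_match(b, weights) }`. The threshold has to be tuned against real data.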



a√萤火虫的光℡ 2024-09-07 23:10:49

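I recently did some fuzzy matching using Levenshtein distance. I calculated the distance between two strings that might match and gave the match a score equal to the inverse of that distance. I then took a weighted average of the per-field scores to get a score for the whole record, so that more important fields count for more than less important ones. This was for a CRM application with leads coming in from many different sources, which the client needed to match against existing leads/opportunities/customers/dealers, etc. The score thresholds for match and no-match took some tweaking. In the end we got a false-positive rate of about 1%, which I think is really quite good.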
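The scoring described in this answer could look roughly like the following pure-Ruby sketch. The answer does not give an exact formula, so the `1 / (1 + distance)` scoring, the method names, the weights, and the sample records are all assumptions made for illustration.

```ruby
# Classic dynamic-programming Levenshtein distance (pure Ruby, no gems).
def levenshtein(a, b)
  return b.length if a.empty?
  return a.length if b.empty?

  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Assumed scoring: 1.0 for identical strings, shrinking as the distance grows.
def inverse_distance_score(a, b)
  1.0 / (1 + levenshtein(a.to_s, b.to_s))
end

# Weighted average of per-field scores; fields missing from `weights` count as 1.
def record_score(lead, existing, weights = {})
  total = 0.0
  weight_sum = 0.0
  lead.each_key do |field|
    w = weights.fetch(field, 1)
    total += inverse_distance_score(lead[field], existing[field]) * w
    weight_sum += w
  end
  weight_sum.zero? ? 0.0 : total / weight_sum
end

# Made-up records: the name differs by one character, the email is identical.
lead     = { 'name' => 'Jon Smith',  'email' => 'jon@example.com' }
existing = { 'name' => 'John Smith', 'email' => 'jon@example.com' }
score = record_score(lead, existing, { 'email' => 3 })
puts format('score = %.3f', score)
```

Here the name scores 0.5 (distance 1) and the email 1.0, so with the email weighted three times as heavily the record score is 0.875. Whether that counts as a match depends on the tuned threshold.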
