Fuzzy matching deduplication in less than exponential time?
I have a large database (potentially in the millions of records) with relatively short strings of text (on the order of street address, names, etc).
I am looking for a strategy to remove inexact duplicates, and fuzzy matching seems to be the method of choice. My issue: many articles and SO questions deal with matching a single string against all records in a database. I am looking to deduplicate the entire database at once.
The former would be a linear time problem (comparing a value against a million other values, calculating some similarity measure each time). The latter is an exponential time problem (compare every record's values against every other record's value; for a million records, that's approx 5 x 10^11 calculations vs the 1,000,000 calculations for the former option).
I'm wondering if there is another approach than the "brute-force" method I mentioned. I was thinking of possibly generating a string to compare each record's value against, and then grouping strings that had roughly equal similarity measures, and then running the brute-force method within these groups. I wouldn't achieve linear time, but it might help. Also, if I'm thinking through this properly, this could miss a potential fuzzy match between strings A and B because their similarities to string C (the generated check-string) are very different, even though A and B are very similar to each other.
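For what it's worth, here is a minimal sketch of that bucketing idea, assuming a generic similarity measure (difflib.SequenceMatcher from the standard library is used purely as a stand-in) and an arbitrary bucket width; it illustrates the approach, weakness and all, rather than recommending it:

```python
from collections import defaultdict
from difflib import SequenceMatcher

def similarity(a, b):
    # Stand-in similarity measure; any string metric could go here.
    return SequenceMatcher(None, a, b).ratio()

def bucket_by_check_string(records, check_string, width=0.05):
    """Group records by their (rounded) similarity to one check-string.

    Brute force then only runs inside each bucket.  As noted above, two
    records can be very similar to each other yet score differently
    against the check-string and land in different buckets.
    """
    buckets = defaultdict(list)
    for rec in records:
        buckets[round(similarity(rec, check_string) / width)].append(rec)
    return buckets
```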
Any ideas?
P.S. I realize I may have used the wrong terms for time complexity - it is a concept I have a basic grasp of, but not well enough to drop an algorithm into the proper category on the spot. If I used the terms wrong, I welcome corrections, but hopefully I got my point across at least.
Edit
Some commenters have asked, given fuzzy matches between records, what my strategy was to choose which ones to delete (i.e. given "foo", "boo", and "coo", which would be marked the duplicate and deleted). I should note that I am not looking for an automatic delete here. The idea is to flag potential duplicates in a 60+ million record database for human review and assessment purposes. It is okay if there are some false positives, as long as it is a roughly predictable / consistent amount. I just need to get a handle on how pervasive the duplicates are. But if the fuzzy matching pass-through takes a month to run, this isn't even an option in the first place.
6 Answers
Have a look at http://en.wikipedia.org/wiki/Locality-sensitive_hashing. One very simple approach would be to divide up each address (or whatever) into a set of overlapping n-grams. So STACKOVERFLOW becomes the set {STACKO, TACKO, ACKOV, CKOVE..., RFLOW}. Then use a large hash-table or sort-merge to find colliding n-grams and check collisions with a fuzzy matcher. Thus STACKOVERFLOW and SXACKOVRVLOX will collide because both are associated with the colliding n-gram ACKOV.
A next level up in sophistication is to pick a random hash function - e.g. HMAC with an arbitrary key - and, of the n-grams you find, keep only the one with the smallest hashed value. Then you have to keep track of fewer n-grams, but will only see a match if the smallest hashed value in both cases is ACKOV. There is obviously a trade-off here between the length of the n-gram and the probability of false hits. In fact, what people seem to do is to make n quite small and get higher precision by concatenating the results from more than one hash function in the same record, so you need to get a match in multiple different hash functions at the same time - I presume the probabilities work out better this way. Try googling for "duplicate detection minhash".
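As a rough sketch of the n-gram collision idea (not this answerer's code), assuming difflib.SequenceMatcher as a stand-in fuzzy matcher and HMAC-SHA1 as the keyed hash; the n-gram length, key, and threshold are arbitrary, and min_hashed_ngram illustrates the minhash refinement from the second paragraph:

```python
import hashlib
import hmac
from collections import defaultdict
from difflib import SequenceMatcher

def ngrams(s, n=5):
    s = s.upper().replace(" ", "")
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def min_hashed_ngram(s, key=b"arbitrary-key", n=5):
    # Keep only the n-gram with the smallest keyed hash value
    # (one component of a minhash signature).
    return min(ngrams(s, n),
               key=lambda g: hmac.new(key, g.encode(), hashlib.sha1).digest())

def candidate_pairs(records, n=5):
    # Bucket record indices by their n-grams; only records sharing a
    # bucket ever reach the fuzzy matcher.
    buckets = defaultdict(list)
    for idx, rec in enumerate(records):
        for g in ngrams(rec, n):
            buckets[g].append(idx)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs

records = ["STACKOVERFLOW", "SXACKOVRVLOX", "SERVERFAULT"]
for a, b in candidate_pairs(records):
    if SequenceMatcher(None, records[a], records[b]).ratio() > 0.5:
        print(records[a], "collides with", records[b])
```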
I think you may have mis-calculated the complexity for all the combinations. If comparing one string with all other strings is linear, this means that, due to the small lengths, each comparison is O(1). The process of comparing each string with every other string is not exponential but quadratic, which is not all bad. In simpler terms, you are comparing nC2 or n(n-1)/2 pairs of strings, so it's just O(n^2).
I couldn't think of a way to sort them in order, since you can't write an objective comparator, but even if you did, sorting would take O(n log n) with merge sort; and since you have so many records and probably would prefer to use no extra memory, you would use quicksort, which takes O(n^2) in the worst case - no improvement over the worst-case time of brute force.
You could use a Levenshtein transducer, which "accept[s] a query term and return[s] all terms in a dictionary that are within n spelling errors away from it". Here's a demo.
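The transducer itself works by intersecting a Levenshtein automaton with a dictionary, which avoids scanning every term; purely to illustrate the interface described in the quote, here is a naive sketch that does scan a word list with a plain dynamic-programming edit distance (the function names and the n=1 example are made up):

```python
def edit_distance(a, b):
    """Classic Wagner-Fischer dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def within_n_errors(query, dictionary, n=2):
    """Return all dictionary terms within n spelling errors of the query."""
    return [term for term in dictionary if edit_distance(query, term) <= n]

print(within_n_errors("adress", ["address", "a dress", "redress", "access"], n=1))
```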
Pairwise comparison of all the records is O(N^2), not exponential. There are basically two ways to cut down on that complexity.
The first is blocking, where you only compare records that already have something in common that's easy to compute, like the first three letters or a common n-gram. This is basically the same idea as locality-sensitive hashing. The dedupe python library implements a number of blocking techniques and the documentation gives a good overview of the general approach.
In the worst case, pairwise comparison with blocking is still O(N^2). In the best case it is O(N). Neither the best nor the worst case is really encountered in practice. Typically, blocking reduces the number of pairs to compare by over 99.9%.
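Here is a generic sketch of blocking (not the dedupe library's API), using the first three characters as the blocking key and difflib.SequenceMatcher as a stand-in similarity score; the key and the 0.8 threshold are just illustration values:

```python
from collections import defaultdict
from difflib import SequenceMatcher

def block_key(record):
    # Deliberately cheap blocking key: first three characters, normalized.
    # Real blocking usually combines several such predicates.
    return record.strip().lower()[:3]

def flag_candidates(records, threshold=0.8):
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)

    flagged = []
    for members in blocks.values():
        # Pairwise comparison happens only inside a block.
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                a, b = records[members[i]], records[members[j]]
                score = SequenceMatcher(None, a, b).ratio()
                if score >= threshold:
                    flagged.append((a, b, round(score, 2)))
    return flagged

rows = ["123 Main Street", "123 Main St.", "500 Oak Avenue", "500 Oak Ave"]
print(flag_candidates(rows))
```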
There are some interesting, alternative paradigms for record linkage that are not based on pairwise comparisons. These have better worst-case complexity guarantees. See the work of Beka Steorts and Michael Wick.
I assume this is a one-time cleanup. I think the problem won't be having to do so many comparisons, it'll be having to decide what comparisons are worth making. You mention names and addresses, so see this link for some of the comparison problems you'll have.
It's true you have to do almost 500 billion brute-force compares for comparing a million records against themselves, but that's assuming you never skip any records previously declared a match (i.e., never doing the "break" out of the j-loop in the pseudo-code below).
My pokey E-machines T6532 2.2 GHz manages to do 1.4m seeks and reads per second of 100-byte text file records, so 500 billion compares would take about 4 days. Instead of spending 4 days researching and coding up some fancy solution (only to find I still need another x days to actually do the run), and assuming my comparison routine can't compute and save the keys I'd be comparing, I'd just let it brute-force all those compares while I find something else to do:
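The answer's original pseudo-code isn't reproduced here; as a sketch of the kind of pass it describes, with similarrecs() standing in for whatever fuzzy comparison you plug in:

```python
def brute_force_pass(records, similarrecs):
    """Compare every record against every later record, flagging matches.

    similarrecs(a, b) is the user-supplied fuzzy comparison, returning
    True when two records look like duplicates.
    """
    already_matched = [False] * len(records)
    flagged_pairs = []
    for i in range(len(records)):
        if already_matched[i]:
            continue
        for j in range(i + 1, len(records)):
            if already_matched[j]:
                continue
            if similarrecs(records[i], records[j]):
                already_matched[j] = True
                flagged_pairs.append((i, j))
                # Uncomment to stop at the first match for record i
                # (the "break" out of the j-loop mentioned above):
                # break
    return flagged_pairs
```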
Even if a given run only locates easy-to-define matches, hopefully it reduces the remaining unmatched records to a more reasonable smaller set for which further brute-force runs aren't so time-consuming.
But it seems unlikely similarrecs() can't independently compute and save the portions of a + b being compared, in which case the much more efficient approach is:
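A sketch of that more efficient shape, with fuzzykey() as a placeholder for whatever canonicalizing key similarrecs() could save (the normalization shown is just an example): compute each record's key once, sort on it, and compare neighbours.

```python
def fuzzykey(record):
    # Hypothetical canonicalization: uppercase and keep only alphanumerics.
    return "".join(c for c in record.upper() if c.isalnum())

def flag_by_fuzzykey(records):
    # One key computation per record, one sort, then a single linear pass.
    keyed = sorted((fuzzykey(r), r) for r in records)
    flagged = []
    for (k1, r1), (k2, r2) in zip(keyed, keyed[1:]):
        if k1 == k2:
            flagged.append((r1, r2))
    return flagged
```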
Most databases can do the above in one command line, if you're allowed to invoke your own custom code for computing each record's fuzzykey().
In any case, the hard part is going to be figuring out what makes two records a duplicate, per the link above.
Equivalence relations are particularly nice kinds of matching; they satisfy three properties: reflexivity (a ~ a), symmetry (if a ~ b, then b ~ a), and transitivity (if a ~ b and b ~ c, then a ~ c).
What makes these nice is that they allow you to partition your data into disjoint sets such that each pair of elements in any given set are related by ~. So, what you can do is apply the union-find algorithm to first partition all your data, then pick out a single representative element from each set in the partition; this completely de-duplicates the data (where "duplicate" means "related by ~"). Moreover, this solution is canonical in the sense that no matter which representatives you happen to pick from each partition, you get the same number of final values, and each of the final values are pairwise non-duplicate.
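A minimal union-find sketch, where fuzzy_match is a hypothetical boolean predicate deciding whether two records are related by ~; note that grouping this way silently takes the transitive closure of the relation, which is exactly the issue raised next:

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def partition_records(records, fuzzy_match):
    uf = UnionFind(len(records))
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if fuzzy_match(records[i], records[j]):
                uf.union(i, j)
    clusters = {}
    for i, rec in enumerate(records):
        clusters.setdefault(uf.find(i), []).append(rec)
    # Picking any one representative per cluster de-duplicates the data.
    return list(clusters.values())
```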
Unfortunately, fuzzy matching is not an equivalence relation, since it is presumably not transitive (though it's probably reflexive and symmetric). The result of this is that there isn't a canonical way to partition the data; you might find that any way you try to partition the data, some values in one set are equivalent to values from another set, or that some values from within a single set are not equivalent.
So, what behavior do you want, exactly, in these situations?