Is it possible to compare two tables when there is no common key between them?
I have two tables that I would like to compare for duplicates. These tables are just basic company information fields like name, city, state, etc. The only possibly common field that I can see would be the name column but the names are not quite exact. Is there a way that I can run a comparison between the two using a LIKE statement? I'm also open to any additional suggestions that anyone may have.
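Something along the lines of the following is roughly what I had in mind, in case it helps clarify the question (companies_a, companies_b and the name column are just placeholders for my actual tables):

    SELECT a.name AS name_a,
           b.name AS name_b
    FROM   companies_a AS a
    JOIN   companies_b AS b
           ON b.name LIKE CONCAT('%', a.name, '%');  -- crude substring comparison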
Thanks.
4 Answers
I would try matching using a Double Metaphone algorithm, which is a more sophisticated SOUNDEX-type algorithm.
Here is a MySQL implementation.
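Once such a function is installed, the comparison itself is just a join on the encoded values. A rough sketch, assuming the implementation exposes a stored function (the double_metaphone() name below is purely illustrative, the actual name depends on the implementation you install) and reusing the placeholder tables from the question:

    SELECT a.name, b.name
    FROM   companies_a AS a
    JOIN   companies_b AS b
           ON double_metaphone(a.name) = double_metaphone(b.name);  -- hypothetical stored function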
There are companies who make a good living by selling Data Cleansing products which undertake this kind of fuzzy matching, so it seems improbable that you could solve this with a simple (or even an extremely complicated) LIKE statement.

What you need is something which can compare two strings and return a similarity score, with a score of 100% meaning identical: something like the Jaro-Winkler algorithm. Alternative algorithms include Metaphone (or Double Metaphone) and Soundex(); Soundex() is the crudest of these.

An alternative would be to use a specialist text index. The cool thing about that approach is that we can supply a thesaurus of synonyms to iron out irrelevant differences (INC = INCORPORATED, CO = COMPANY, etc.). Oracle and SQL Server include such a tool, but I'm not familiar with an equivalent in MySQL.
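Without such a tool, a very rough approximation of the thesaurus idea is possible in plain MySQL by folding a few known synonyms into one form with nested REPLACE() calls before comparing. A sketch, again using the placeholder tables from the question and only two substitutions (a real synonym list would be much longer):

    SELECT a.name, b.name
    FROM   companies_a AS a
    JOIN   companies_b AS b
           -- fold each name to one canonical spelling, then compare
           ON REPLACE(REPLACE(UPPER(a.name), 'INCORPORATED', 'INC'), 'COMPANY', 'CO')
            = REPLACE(REPLACE(UPPER(b.name), 'INCORPORATED', 'INC'), 'COMPANY', 'CO');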
SOUNDEX() will help you to a certain extent, but it is far from perfect.

soundex(string1) is expected to equal soundex(string2) even when string1 and string2 are spelled differently. But as I said, it is far from perfect.

As far as I know, there is no existing algorithm which does this perfectly.
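In MySQL the comparison itself is a one-liner; a minimal sketch, assuming the placeholder tables from the question:

    SELECT a.name, b.name
    FROM   companies_a AS a
    JOIN   companies_b AS b
           ON a.name SOUNDS LIKE b.name;  -- shorthand for SOUNDEX(a.name) = SOUNDEX(b.name)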
Well, there's no 100% guaranteed-correct way, no. But you can probably make some progress by transforming all "messy" columns into a more canonical form, e.g. by capitalising everything, trimming leading and trailing spaces and ensuring at most 1 space appears in a row. Also things like changing names of the form "SMITH, JOHN" to "JOHN SMITH" (or vice versa -- just pick a form and go with it). And of course you should make copies of the records, don't change the originals. You can experiment with discarding further information (e.g. "JOHN SMITH" -> "J SMITH") -- you'll find this changes the balance of false positives to false negatives.
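A sketch of that canonicalisation step in MySQL (8.0+ for REGEXP_REPLACE), building a cleaned-up copy rather than touching the originals; the table and column names are assumed from the question:

    CREATE TABLE companies_a_clean AS
    SELECT id,                                            -- assumes some row identifier exists
           REGEXP_REPLACE(TRIM(UPPER(name)),  ' +', ' ') AS name,   -- uppercase, trim, collapse spaces
           REGEXP_REPLACE(TRIM(UPPER(city)),  ' +', ' ') AS city,
           TRIM(UPPER(state))                            AS state
    FROM   companies_a;

And the same again for the second table.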
I would probably take the approach of assigning a similarity score to each pair of records. E.g. if the canonicalised names, addresses and email addresses agree exactly, assign a score of 1000; otherwise, subtract (some multiple of) the Levenshtein distance from 1000 and use that. You'll need to come up with your own scoring scheme by playing around and deciding the relative importance of different types of differences (e.g. a different digit in a phone number is probably more important than a 1-character difference in two people's names). You will then experimentally be able to establish a score above which you can confidently assign a status of "duplicate" to a pair of records, and a lower score above which manual checking is required; below that score, we can confidently say that the 2 records are not duplicates.
The realistic goal here is to reduce the amount of manual duplicate-removal work you'll need to do. You are unlikely to be able to eliminate it entirely, unless all the duplicates were generated through some automatic copying process.
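A very rough version of that scoring idea in plain MySQL (which has no built-in Levenshtein, so that part would need a UDF or application code); the weights are illustrative, and cleaned-up copies of both tables built as in the sketch above are assumed:

    SELECT a.id AS id_a,
           b.id AS id_b,
           CASE WHEN a.name = b.name                   THEN 500    -- exact canonical match
                WHEN SOUNDEX(a.name) = SOUNDEX(b.name) THEN 300    -- sounds similar
                ELSE 0 END
         + CASE WHEN a.city  = b.city  THEN 300 ELSE 0 END
         + CASE WHEN a.state = b.state THEN 200 ELSE 0 END AS score
    FROM   companies_a_clean AS a
    JOIN   companies_b_clean AS b
           ON a.state = b.state          -- cheap blocking key to keep the join manageable
    ORDER  BY score DESC;

Pairs above some experimentally chosen score go straight into the "duplicate" pile, a middle band gets manual review, and everything below is treated as distinct, as described above.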