Data cleanup cannot be fully automated in the general case. Many people try, but it's impossible to detect all the ways that the input data might be malformed. You can automate some percentage of the cases with techniques such as:
Force users to select company names from a list instead of typing them. Of course this is best for single entries, not for bulk uploads.
Compare the SOUNDEX of the input company names to the SOUNDEX of company names already in the database. This is useful for identifying possible matches, but it can also give false positives. So you need a human to review them.
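Soundex is simple enough to sketch directly. Here is a minimal Python implementation of the classic algorithm (the company names are made up for illustration); databases such as MySQL and SQL Server also ship a built-in SOUNDEX() function you could use instead:

```python
def soundex(name: str) -> str:
    """Classic 4-character Soundex code (first letter + three digits)."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    first = name[0].upper()
    digits = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":
            continue  # 'h' and 'w' are skipped without resetting the previous code
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]

# Misspellings of the same company often share a code:
print(soundex("Microsoft"), soundex("Microsft"))  # both M262
```

Equal codes only flag a *candidate* pair; as noted above, a human still has to confirm the match.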
Ultimately, you need to design your software to make it easy for an administrator to "merge" entries (and update any references from other database tables) as they are discovered to be duplicates of one another. There's no elegant way to do this with cascading foreign keys; you just have to write a bunch of UPDATE statements.
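A minimal sketch of that merge step, using sqlite3 and a hypothetical `company` table referenced by an `invoice` table (all table and column names are invented for illustration):

```python
import sqlite3

# Toy schema: a company table and one table that references it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoice (id INTEGER PRIMARY KEY,
                          company_id INTEGER REFERENCES company(id));
    INSERT INTO company VALUES (1, 'Acme Inc'), (2, 'ACME, Incorporated');
    INSERT INTO invoice VALUES (10, 1), (11, 2);
""")

def merge_companies(conn, keep_id, dup_id):
    """Repoint every reference from the duplicate to the surviving row,
    then delete the duplicate -- one UPDATE per referencing table."""
    with conn:  # single transaction: a failure leaves no half-merged state
        conn.execute("UPDATE invoice SET company_id = ? WHERE company_id = ?",
                     (keep_id, dup_id))
        # ...repeat for every other table that references company.id...
        conn.execute("DELETE FROM company WHERE id = ?", (dup_id,))

merge_companies(conn, keep_id=1, dup_id=2)
```

Wrapping the whole merge in one transaction matters: if any of the UPDATEs fails, you don't want to be left with a deleted company that other rows still point at.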
There is a whole class of systems, called Master Data Management (MDM), that tries to do exactly this for different domains such as partners, addresses, and products. These are typically large, full-featured systems; it is not something that can be done properly in an ad-hoc fashion. These things sound easy at first, but they get very difficult very quickly.
Sorry I'm not being more cheery here, but this can quickly turn into a nightmare, similar to trying to solve an NP-complete problem...
LinkedIn does this somehow. However, they don't do batch uploads... Basically, you want to set up some sort of difference calculator that will trigger an action on potential matches.
Dropping words like "Inc", "The", and others is one rule, and then there is pattern matching to catch closely matching words that are misspelled.
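Those two rules can be sketched together in Python: a hypothetical stop-word list plus difflib's built-in similarity ratio (the word list and any review threshold are assumptions you would tune for your own data):

```python
import difflib
import re

# Hypothetical noise words; extend for your data.
STOP_WORDS = {"inc", "incorporated", "the", "llc", "ltd", "corp", "co"}

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and drop legal-suffix noise words."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(w for w in words if w not in STOP_WORDS)

def similarity(a: str, b: str) -> float:
    """0.0-1.0 score on the normalized names; pairs above some tuned
    threshold go to a human review queue rather than being auto-merged."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

print(similarity("The Acme Inc.", "Acme"))        # 1.0 once the noise words are gone
print(similarity("Mircosoft Corp", "Microsoft"))  # high but below 1.0: review it
```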
Not an easy thing to do with batch uploads from a workflow standpoint. You will need a known, approved data dictionary, and then each upload/addition has to be vetted against it. Eventually the number of new additions will dwindle.
I agree that this is not a database issue - it is a workflow issue.
EDIT
I would have an approved list, and then some rules that propagate a potential "good" name to the approved list. How you implement that is left as an exercise for the reader...
FWIW, this has nothing to do with database normalization. This is a data cleanup task.
Do you see what happens when you try to enter a new question on this site? All those previous questions that might be the same?
Probably even that will be insufficient. It's insufficient here.
This schema structure solves the problems you have listed.