How do I standardize a large dataset of user-generated company names?

Posted 2024-07-11 17:37:52

Comments (5)

陈独秀 2024-07-18 17:37:52

FWIW, this has nothing to do with database normalization. This is a data cleanup task.

Data cleanup cannot be fully automated in the general case. Many people try, but it's impossible to detect all the ways that the input data might be malformed. You can automate some percentage of the cases with techniques such as:

  • Force users to select company names from a list instead of typing them. Of course this is best for single entries, not for bulk uploads.
  • Compare the SOUNDEX of the input company names to the SOUNDEX of company names already in the database. This is useful for identifying possible matches, but it can also give false positives. So you need a human to review them (see the SQL sketch just after this list).
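
Here is one way that comparison might look in SQL, assuming a hypothetical uploaded_company staging table for the raw input and a company(id, name) table for the existing names (SOUNDEX() is built into MySQL and SQL Server; other engines differ):

    -- Candidate duplicates for human review: uploaded names whose SOUNDEX
    -- code matches an existing company but whose exact spelling differs.
    SELECT u.name AS uploaded_name,
           c.id   AS existing_id,
           c.name AS existing_name
    FROM   uploaded_company u
    JOIN   company c
      ON   SOUNDEX(u.name) = SOUNDEX(c.name)
    WHERE  u.name <> c.name;

Everything this query returns is only a suggestion; a reviewer still decides whether each pair really is the same company.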

Ultimately, you need to design your software to make it easy for an administrator to "merge" entries (and update any references from other database tables) as they are discovered to be duplicates of one another. There's no elegant way to do this with cascading foreign keys; you just have to write a bunch of UPDATE statements.
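
A hedged sketch of such a merge, assuming company 42 has been found to be a duplicate of company 7, and that invoice and contact are hypothetical tables carrying a company_id foreign key:

    -- Re-point every reference from the duplicate (42) to the survivor (7),
    -- one UPDATE per referencing table, then delete the duplicate row.
    UPDATE invoice SET company_id = 7 WHERE company_id = 42;
    UPDATE contact SET company_id = 7 WHERE company_id = 42;
    DELETE FROM company WHERE id = 42;

In practice you would run these statements inside a single transaction so a half-finished merge never leaves dangling references.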

吾家有女初长成 2024-07-18 17:37:52

There is a whole class of systems called Master Data Management (MDM) that try to do this for different domains, such as partners, addresses, and products. They are typically large, full-featured systems; this is not something that can be done properly in an ad-hoc fashion. These things sound easy at first, but get very difficult very soon.

Sorry I'm not being too cheery here, but this can quickly turn into a nightmare... similar to trying to solve an NP-complete problem.

轮廓§ 2024-07-18 17:37:52

Do you see what happens when you try to enter a new question on this site? All those previous questions that might be the same?

Probably even that will be insufficient. It's insufficient here.

梦一生花开无言 2024-07-18 17:37:52

LinkedIn does this somehow. However, they don't do batch uploads...
Basically, you want to set up some sort of difference calculator that will trigger an action on potential matches.

Dropping words like "Inc", "The", and so on is one rule; then there is pattern matching for closely matching words that are misspelled. One such rule is sketched below.

Batch uploads are not an easy thing to handle from a workflow standpoint. You will need an approved data dictionary of known names, and then each upload/addition has to be vetted. Eventually the number of additions will dwindle.

I agree that this is not a database issue - it is a workflow issue.

EDIT

I would have an approved list, and then some rules that propagate a potential "good" name to the approved list. How you implement that is left as an exercise for the reader...

獨角戲 2024-07-18 17:37:52
company table    
  id
  name

company_synonym table
  company_id
  name

This schema structure solves the problems you have listed.
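
A minimal DDL sketch of that structure; the column types, constraints, and the example lookup are assumptions, not part of the original answer:

    CREATE TABLE company (
        id   INT PRIMARY KEY,
        name VARCHAR(255) NOT NULL           -- the canonical spelling
    );

    CREATE TABLE company_synonym (
        company_id INT NOT NULL REFERENCES company(id),
        name       VARCHAR(255) NOT NULL,    -- a variant spelling seen in input
        PRIMARY KEY (name)                   -- each variant resolves to one company
    );

    -- Resolve a user-supplied spelling to its canonical company, if recorded.
    SELECT c.id, c.name
    FROM   company_synonym s
    JOIN   company c ON c.id = s.company_id
    WHERE  s.name = 'ACME Inc.';

Presumably the workflow is that incoming names are first looked up in company_synonym, and anything that does not resolve goes to an administrator, who either creates a new company row or records the spelling as another synonym of an existing one.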
