Any recommended dedupe software?

Posted 2024-12-01 00:30:57 · 508 words · 1 view · 0 comments

I am looking for dedupe software that is compatible with MS SQL Server. I have a rather large and messy table that contains addresses from all over the world in many different languages. The table is set up to handle dupes as parent/child records, so some functionality for handling a match is required (i.e., not just deleting a dupe).

Edit: Here's the structure:

ParentID | MasterID | PropertyName | Address1 | Address2 | PostalCode | City | StateProvinceCode | CountryCode | PhoneNumber

The MasterID is unique for each record.

ParentID contains the MasterID of the parent record of each entry; the parent record is the one where MasterID = ParentID.

CountryCode is the two-letter ISO country code (not the telephone code).
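
To make the parent/child rule concrete, here is a minimal Python sketch of what "handling a match" could look like when a duplicate is found: instead of deleting a row, its whole group is re-pointed at the surviving parent. The `Record` class and the `is_parent` and `merge_groups` helpers are hypothetical names used only for illustration; they model just the MasterID and ParentID columns and are not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Only the two key columns from the table are modelled here.
    master_id: int
    parent_id: int


def is_parent(rec: Record) -> bool:
    # A parent record is one where MasterID = ParentID.
    return rec.master_id == rec.parent_id


def merge_groups(records: dict, dup_id: int, keep_id: int) -> None:
    """Fold the group containing dup_id into the group containing keep_id.

    Nothing is deleted: every record that shared dup_id's parent is
    re-pointed at keep_id's parent (hypothetical merge policy).
    """
    old_root = records[dup_id].parent_id
    new_root = records[keep_id].parent_id
    for rec in records.values():
        if rec.parent_id == old_root:
            rec.parent_id = new_root


# Usage: record 3 (child of 2) turns out to duplicate record 1.
records = {
    1: Record(master_id=1, parent_id=1),
    2: Record(master_id=2, parent_id=2),
    3: Record(master_id=3, parent_id=2),
}
merge_groups(records, dup_id=3, keep_id=1)
```

After the call, records 2 and 3 both carry ParentID 1, and record 1 is the only row left where MasterID = ParentID.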

Comments (1)

桃气十足 2024-12-08 00:30:57

Address duplicates are notoriously difficult to track down. A single address can be written in about 10 valid ways, which makes exact matching unreliable.

The fact that you have business rules that allow for duplicates some of the time makes me think you might be better off rolling your own piece of software to find unacceptable dupes and remove them.

In the past I have done this with addresses by running each address through a free geocoding service (Google's mapping API, for instance) and looking for points that fall within a certain threshold of each other (10 feet or so). At that point you can determine whether a pair qualifies as an "unacceptable duplicate" and delete it.

To find the distance between coordinates, I would recommend computing the great-circle distance. Good luck!
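
As a rough illustration of the approach described above, the sketch below computes the great-circle distance with the haversine formula and lists pairs of records whose geocoded points fall within a threshold. It assumes the addresses have already been geocoded into (latitude, longitude) pairs keyed by MasterID; the geocoding call itself is not shown. The names `great_circle_m` and `candidate_pairs`, the spherical-Earth radius, and the 3.05 m default (about 10 feet) are illustrative assumptions, not a fixed recipe.

```python
import math
from itertools import combinations

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; a spherical approximation


def great_circle_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def candidate_pairs(points, threshold_m=3.05):
    """Return (MasterID, MasterID) pairs closer than threshold_m.

    points maps MasterID -> (lat, lon). Every pair is compared, so this
    is quadratic in the number of records.
    """
    return [
        (id_a, id_b)
        for (id_a, pa), (id_b, pb) in combinations(sorted(points.items()), 2)
        if great_circle_m(*pa, *pb) <= threshold_m
    ]


# Usage with made-up coordinates: records 1 and 2 are about a metre apart.
points = {
    1: (48.85840, 2.29450),
    2: (48.85841, 2.29451),
    3: (51.50070, -0.12460),
}
pairs = candidate_pairs(points)
```

The pairs returned are only candidates: each one still needs the business-rule check to decide whether it is an unacceptable duplicate. For a large table, comparing every pair gets expensive, so grouping records by CountryCode or PostalCode before comparing would likely be worthwhile.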
