Any recommended dedupe software?
I am looking for some dedupe software that is compatible with MS SQL Server. I have a rather extensive and messy table that contains addresses from all over the world in many different languages. The table is set up to handle dupes as parent/child records, so some functionality to handle a match is required (i.e. not just deleting a dupe).
Edit: Here's the structure
ParentID | MasterID | PropertyName | Address1 | Address2 | PostalCode | City | StateProvinceCode | CountryCode | PhoneNumber
The MasterID is unique for each record.
ParentID contains the MasterID for the parent record of each entry, and the parent record is where the MasterID = ParentID.
CountryCode is the two letter ISO country code (not telephone code).
Comments (1)
Address duplicates are notoriously difficult to track down. There are about 10 valid ways to write one address, so comparing the raw text will miss many of them.
The fact that you have business rules that allow for duplicates some of the time makes me think you might be better off rolling your own piece of software to find unacceptable dupes and remove them.
In the past I have done this with addresses by putting each address through a free geocoding service (Google's mapping API, for instance) and looking for points that are within a certain threshold of each other (10 feet or so). At that point you can determine whether a pair qualifies as an "unacceptable duplicate" and delete it.
To find distances between coordinates I would recommend finding the Great Circle Distance. Good luck!
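As a rough illustration of the distance check, here is a minimal Python sketch using the haversine formula for great-circle distance. It assumes each record has already been geocoded to a latitude/longitude pair in degrees; the function names `great_circle_m` and `candidate_duplicates`, the mean Earth radius constant, and the default threshold of 3.048 m (10 feet) are choices made for this example, not part of any particular library. The pairs it returns are only candidates, so your own business rules would still decide which record becomes the parent.

```python
import math
from itertools import combinations

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (assumed constant)


def great_circle_m(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in metres; inputs are in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def candidate_duplicates(records, threshold_m=3.048):
    """Return (MasterID, MasterID) pairs whose points lie within threshold_m.

    records: iterable of (master_id, lat, lon) tuples.
    Compares every pair, so the cost grows quadratically with table size.
    """
    return [
        (a[0], b[0])
        for a, b in combinations(records, 2)
        if great_circle_m(a[1], a[2], b[1], b[2]) <= threshold_m
    ]


if __name__ == "__main__":
    # Hypothetical geocoded rows: (MasterID, latitude, longitude)
    rows = [
        (1, 48.85840, 2.29450),
        (2, 48.85841, 2.29451),  # about a metre from record 1
        (3, 40.00000, -74.00000),
    ]
    for first, second in candidate_duplicates(rows):
        print(f"MasterID {second} is a candidate child of MasterID {first}")
```

For a large table the all-pairs comparison gets slow; bucketing rows by CountryCode and PostalCode first, and comparing only within each bucket, is one way to keep it manageable.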