使用 SQL Server 2005 模糊匹配可能的重复项的良好 SQL 策略
我想在大型数据库中查找与 COMPANYNAME 和 ADDRESSLINE1 等字段匹配的可能候选重复记录
示例:
对于具有以下 COMPANYNAME 的记录:
- “Acme, Inc.”
我希望我的查询能够吐出具有这些 COMPANYNAME 值的其他记录作为可能的重复项:
- “Acme Corporation”
- “Acme,Incorporated”
- “Acme”
我知道如何进行联接、相关子查询等来执行拉取机制我想要的数据集。我知道这之前已经介绍过。我有兴趣听到关于进行模糊搜索的最佳方法的想法 - 我应该使用全文索引或 soundex 函数还是我在此过程中不知道的其他功能? (我正在使用 SQL Server 2005)
任何帮助表示感谢!
I want to find possible candidate duplicate records in a large database matching on fields like COMPANYNAME and ADDRESSLINE1
Example:
For a record with the following COMPANYNAME:
- "Acme, Inc."
I would like for my query to spit out other records with these COMPANYNAME values as possible dups:
- "Acme Corporation"
- "Acme, Incorporated"
- "Acme"
I know how to do the joins, correlated subqueries, etc. to do the mechanics of pulling the set of data I want. And I know that has been covered on here before. I am interested hearing thoughts on the best way to do the fuzzy searching - should I use full-text indexing or the soundex function or something else that I am unware of for this process? (I am using SQL Server 2005)
Any help is appreciated!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
这当然取决于您的具体要求,但在 SQL 中使用 CONTAINS 使您能够执行邻近搜索以及主题搜索和模糊搜索。
http://www. developer.com/db/article.php/3446891/Understanding-SQL-Server-Full-Text-Indexing.htm
http://msdn.microsoft.com/en-us/library/ms187787(SQL.90).aspx
It will of course depend on your exact requirements, but using CONTAINS in your SQL gives you the ability to carry out proximity searches, as well as thematic and fuzzy searches.
http://www.developer.com/db/article.php/3446891/Understanding-SQL-Server-Full-Text-Indexing.htm
http://msdn.microsoft.com/en-us/library/ms187787(SQL.90).aspx
我建议使用 SSIS 任务定期清理数据。 SSIS 有模糊匹配运算符,并且有第三方提供商提供更强大的组件。有关该主题的一些文章:
如果预算如果允许并且操作的大小是值得的,您甚至可以考虑 MDS 服务器: SQL Server 2008 R2 主数据服务。
I would recommend using an SSIS task to periodically clean up the data. SSIS has fuzzy matching operators, and there are third party providers that offer more powerfull components. Some articles on the topic:
If the budget permits and the size of operation is worth it, you can even consider an MDS server: SQL Server 2008 R2 Master Data Services.
另外,新的 SSIS 数据质量工具包可在 http://www. melissadata.com/dqt/total-data-quality-integration.htm
Also a new SSIS Data Quality Toolkit is available at http://www.melissadata.com/dqt/total-data-quality-integration.htm